## Lab 1: Primer Design for Polymerase Chain Reaction for Tissue Identification

### Goal: To design PCR reactions to identify contents of mystery tissue samples

### Different Primer Strategies
To identify a tissue, we consider whether a primer pair yields a product from a given genome and, if it does, the length of that product.
We were given three options for primer strategies:
1. 2 primers:
   A universal forward primer and a universal reverse primer that yield a different-length product for each animal.
2. 4 primers:
   One universal forward primer that primes in all species,
   plus a unique reverse primer per species, each of which yields a different-length product.
3. 6 primers:
   We could make the primers

Our lab group chose the 4-primer method.
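As a rough sketch of how the 4-primer strategy separates species by product length, the snippet below maps observed band lengths to species. The expected lengths are taken from the length notes left in this lab's code comments (171, 424, 565); reading them as amplicon lengths is an assumption, and `species_from_bands` and `tolerance_bp` are hypothetical names introduced only for illustration.

```python
# Assumed expected product lengths in base pairs. These mirror the notes in the
# lab's code comments; treating them as amplicon lengths is an assumption.
EXPECTED_PRODUCT_BP = {"chicken": 171, "beef": 424, "pork": 565}


def species_from_bands(band_lengths_bp, tolerance_bp=20):
    """Return the sorted species whose expected product is close to an observed band."""
    found = set()
    for band in band_lengths_bp:
        for species, expected in EXPECTED_PRODUCT_BP.items():
            if abs(band - expected) <= tolerance_bp:
                found.add(species)
    return sorted(found)


# Two bands, near 171 bp and near 565 bp
mixed_sample = species_from_bands([170, 560])
print(mixed_sample)
```

Because each reverse primer gives a distinct length, a mixed sample should show one band per species present, provided the bands are far enough apart to resolve on the gel.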

### Finding Our Universal Primer
To find the universal primer, we manually searched the Clustal alignment of the three provided gene sequences for a contiguous 20-nucleotide window with no more than one mismatched position across the three sequences.

We found the following universal substring: "CAAATATCATTCTGAGG-GC".
It starts (1-indexed) at position 409 of the pork and beef gene sequences and at position 412 of the chicken sequence.
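The manual search could also be automated. The sketch below is one possible approach, not the procedure we actually followed: it scans equal-length aligned strings for windows with at most one mismatched column. The function name `conserved_windows` and the toy sequences are hypothetical, and the reported starts are 0-indexed alignment columns rather than 1-indexed positions in each gene.

```python
def conserved_windows(aligned, window=20, max_mismatch_columns=1):
    """Return (start, mismatched_columns) for nearly identical alignment windows.

    `aligned` is a list of equal-length aligned strings, with gaps written as '-'.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    # A column counts as a mismatch if the sequences disagree or any has a gap.
    mismatch = [len(set(col)) > 1 or "-" in col for col in zip(*aligned)]
    hits = []
    for start in range(length - window + 1):
        bad = sum(mismatch[start:start + window])
        if bad <= max_mismatch_columns:
            hits.append((start, bad))
    return hits


# Toy alignment (not the cytochrome b data), scanned with a short window
toy = ["ACGTACGTACGT",
       "ACGTTCGTACGA",
       "ACCTACGTACGT"]
toy_hits = conserved_windows(toy, window=5)
print(toy_hits)
```

Each hit still needs a manual check that the window sits upstream of the species-specific reverse primer sites.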

We chose to write the mismatch position (we allowed one mismatch in the universal primer) as a gap character, '-'. The alternative was to use the most frequent nucleotide at that position, which was 'G', but we chose the gap.
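The two options for the mismatch position can be compared with a small consensus helper. This is a sketch under assumptions: `consensus` is a hypothetical helper, and the three-row window is made up so that two rows carry 'G' and one carries a different base at the mismatch position; the text above only states that 'G' was the most frequent nucleotide there.

```python
from collections import Counter


def consensus(aligned_window, mismatch_symbol=None):
    """Build a consensus string from equal-length aligned rows.

    Columns where all rows agree keep that base. Other columns get
    `mismatch_symbol` if one is given, else the most frequent base.
    """
    out = []
    for col in zip(*aligned_window):
        if len(set(col)) == 1:
            out.append(col[0])
        elif mismatch_symbol is not None:
            out.append(mismatch_symbol)
        else:
            out.append(Counter(col).most_common(1)[0][0])
    return "".join(out)


# Hypothetical window: identical except at one position (the 'A' row is invented)
prefix, suffix = "CAAATATCATTCTGAGG", "GC"
window_rows = [prefix + "G" + suffix, prefix + "G" + suffix, prefix + "A" + suffix]

gap_version = consensus(window_rows, mismatch_symbol="-")
majority_version = consensus(window_rows)
print(gap_version)
print(majority_version)
```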

### Finding Our Unique Primers
Then, to find the 

In [67]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp as mt
universal_primer = "CAAATATCATTCTGAGG-GC" 

# Read every record from the FASTA file into a list
seq_records = list(SeqIO.parse("data/cytochrome_b.fasta", "fasta"))

# Extract the gene sequences from the SeqIO parse
pork_data = seq_records[0].seq
beef_data = seq_records[1].seq
chicken_data = seq_records[2].seq

# diff_letters counts the positions at which two equal-length sequences differ
def diff_letters(a, b):
    return sum(a[i] != b[i] for i in range(len(a)))

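# find_valid_temp_substrings collects every window of length substring_len, starting at or
# after start_index, that contains no 'N' and has a Wallace melting temperature of 52-58 °C
def find_valid_temp_substrings(genome, start_index, substring_len):
    substring_list = []
    # + 1 so that the final window of the genome is included
    for i in range(start_index, len(genome) - substring_len + 1):
        curr_substring = genome[i:i + substring_len]
        # The Wallace rule depends only on base counts, so a window and its reverse
        # complement have the same Tm; the window can be checked directly
        if 'N' not in curr_substring and 52.0 <= mt.Tm_Wallace(curr_substring) <= 58.0:
            substring_list.append(curr_substring)
    return substring_list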

# check_substring_validity checks that substring occurs exactly once in matching_genome and
# differs at 4 or more positions from every same-length window of genome_1 and genome_2
def check_substring_validity(substring, matching_genome, genome_1, genome_2):
    n = len(substring)
    num_matches = 0
    for i in range(len(matching_genome) - n + 1):
        if matching_genome[i:i + n] == substring:
            num_matches += 1
    if num_matches != 1:
        return False
    for other_genome in (genome_1, genome_2):
        for i in range(len(other_genome) - n + 1):
            if diff_letters(other_genome[i:i + n], substring) < 4:
                return False
    return True

# find_primers returns the reverse primer for the first window at or after start_index in
# matching_genome that passes the Tm filter, is unique in matching_genome, and is dissimilar
# to genome_1 and genome_2. It returns "" if no window qualifies.
def find_primers(substring_len, matching_genome, genome_1, genome_2, start_index):
    for candidate in find_valid_temp_substrings(matching_genome, start_index, substring_len):
        if check_substring_validity(candidate, matching_genome, genome_1, genome_2):
            # The reverse primer is the reverse complement of the forward-strand window
            return candidate.reverse_complement()
    return ""

print("Chicken Primer Sequence: ", find_primers(21, chicken_data, beef_data, pork_data, 550))
#chicken len 171
print("=======================================================")
print("Beef Primer Sequence: ", find_primers(21, beef_data, chicken_data, pork_data, 750))
#424
print("=======================================================")
print("Pork Primer Sequence: ", find_primers(21, pork_data, chicken_data, beef_data, 950))
#565

Chicken Primer Sequence:  AGTAATACCTGCGATTGCAAA
Beef Primer Sequence:  ATGCAAATAGGAAGTACCACT
Pork Primer Sequence:  AATAGGCATTGACTTAGTGGT
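The printed sequences are reverse primers, so each one binds where its reverse complement appears on the forward strand. The sketch below shows how an expected product length could be derived from that relationship. It uses a made-up template and made-up primers rather than the cytochrome b data, and `reverse_complement` and `product_length` are hypothetical pure-Python helpers, not Biopython functions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def product_length(template, forward_primer, reverse_primer):
    """Length from the forward primer start to the end of the reverse primer site.

    Returns None if either primer has no exact match on the template.
    """
    fwd_start = template.find(forward_primer)
    if fwd_start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement on the forward strand, downstream of the forward primer.
    rev_start = template.find(reverse_complement(reverse_primer), fwd_start)
    if rev_start == -1:
        return None
    return rev_start + len(reverse_primer) - fwd_start


# Toy template and primers (invented for illustration)
toy_template = "TTACGGATCCAAGCTTGGCATGCAA"
toy_length = product_length(toy_template, "ACGGA", "CATGC")
print(toy_length)
```

This exact-match search would not locate a primer that contains a gap character, so the universal primer as written would need its gap resolved first.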


### Primer Figure
Figure Assignment:
Design a figure to show how your experiment will work with your designed
primers. Consider showing the products expected from each PCR reaction you 
would run. There are no conclusions for this figure, but you may have made
concessions in your design that would cause potential errors. If you are aware
of any potential sources of errors, describe them.

### Experimental Design:

### Potential Sources of Error
One potential source of error is our estimate of the primers' melting temperatures. We used the Wallace ("Rule of Thumb") method to approximate each primer's melting temperature. Biopython provides two other methods that we could have used:
1. Estimating the melting temperature from GC content (`Tm_GC`), which includes salt and mismatch corrections
2. Estimating the melting temperature from nearest-neighbor thermodynamics (`Tm_NN`)

We chose the Wallace method because it allowed the largest number of candidate primer sequences to pass the temperature filter. Had we known that one of the other methods is a better estimator of melting temperature, we would have chosen it over the Wallace method, but we ultimately chose the Wallace method because it is listed as the "Rule of Thumb" method.
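For reference, the Wallace rule estimates the melting temperature as 2 °C per A or T plus 4 °C per G or C. The sketch below applies it to the three reverse primers printed earlier. `wallace_tm` is a hypothetical pure-Python stand-in for `mt.Tm_Wallace`, assuming unambiguous bases only; Biopython's handling of ambiguous bases is not reproduced here.

```python
def wallace_tm(primer):
    """Wallace rule: Tm = 2 * (A + T) + 4 * (G + C), in degrees Celsius."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


reverse_primers = {
    "chicken": "AGTAATACCTGCGATTGCAAA",
    "beef": "ATGCAAATAGGAAGTACCACT",
    "pork": "AATAGGCATTGACTTAGTGGT",
}
wallace_estimates = {name: wallace_tm(seq) for name, seq in reverse_primers.items()}
for name, tm in wallace_estimates.items():
    print(name, tm)
```

All three primers have 13 A/T and 8 G/C bases, so each comes out at 58.0 °C, the upper edge of the 52-58 °C window used in the filter.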