## Lab 1: Primer Design for Polymerase Chain Reaction for Tissue Identification

### Goal: To design PCR reactions to identify contents of mystery tissue samples

### Different Primer Strategies
We were given three options for designing primers to use in PCR to identify the mystery tissue samples:
1. 2 primers:
   We use one universal forward primer and one universal reverse primer that yield a different-length product for each animal.
2. 4 primers:
   We could make one of the primers a universal forward primer that primes in all species.
   We would then have a unique reverse primer per species, each yielding a product of a different length.
3. 6 primers:
   We could make the primers

Our lab group decided to use the 4-primer method.

### Finding Our Universal Primer
To find the universal primer, we manually looked through the Clustal sequence alignment file of the three provided gene sequences for a contiguous substring of length 20 with at most one mismatch among the three gene sequences.

We found the following universal substring: "CAAATATCATTCTGAGG-GC".
The universal substring has a length of 20.
The universal substring starts (1-indexed) at position 408 of the pork and beef gene sequences and at position 411 of the chicken sequence.

We allowed one mismatch in the universal primer and decided to represent the mismatch position as a gap nucleotide ('-'). Another option would have been to use the most frequent nucleotide at that position, which was 'G'.
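The manual search described above could also be automated. The following is a minimal sketch, not the procedure we actually ran: it assumes the alignment rows have already been read into equal-length strings, and the function name `find_conserved_windows` and the toy sequences are hypothetical. A column counts as a mismatch if the sequences disagree or if any of them has a gap there.

```python
def find_conserved_windows(aligned_seqs, window_len=20, max_mismatch_cols=1):
    """Return 0-indexed start columns of windows with few mismatching columns.

    aligned_seqs: equal-length rows of a multiple sequence alignment.
    """
    n_cols = len(aligned_seqs[0])
    if any(len(s) != n_cols for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    # True where the column is not identical across sequences or contains a gap
    mismatch = [len(set(col)) > 1 or "-" in col for col in zip(*aligned_seqs)]

    hits = []
    for start in range(n_cols - window_len + 1):
        if sum(mismatch[start:start + window_len]) <= max_mismatch_cols:
            hits.append(start)
    return hits


# Toy alignment (hypothetical data): columns 3 and 6 differ between rows
toy = ["ACGTACGTTGCA",
       "ACGTACCTTGCA",
       "ACGAACGTTGCA"]
print(find_conserved_windows(toy, window_len=6, max_mismatch_cols=1))
```

With real data, `window_len=20` and `max_mismatch_cols=1` would mirror the criteria we applied by eye; the returned column indices refer to the alignment, so they still need to be converted to positions in each ungapped gene sequence.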

### Finding Our Unique Primers
Then, to find our unique primers, we first loaded the three organisms' gene sequences from our given FASTA file using the Biopython package.

Then, as instructed by the lab handout, we found primer sequences with the following requirements:
1. Primers should be 20-25 bases long.
2. Your PCR products should be between 90 and 1000 bases in length.
3. Your primers should have a melting temperature between 52 and 58 C.
4. Your primers should have less than 95% similarity with other substrings in the other gene sequences.
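One way to read requirement 4 is as a percent identity between a primer and every same-length window of another gene sequence. The sketch below illustrates that reading only; the helper names `percent_identity` and `max_identity_against` and the example primer are hypothetical and are not part of the lab handout.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def max_identity_against(primer, sequence):
    """Highest percent identity of primer against any full-length window of sequence."""
    k = len(primer)
    return max(percent_identity(primer, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))


primer = "ACGTACGTACGTACGTACGTA"   # hypothetical 21-base primer
one_off = primer[:-1] + "C"        # same primer with 1 mismatch
two_off = "T" + primer[1:-1] + "C" # same primer with 2 mismatches

print(round(percent_identity(primer, one_off), 2))  # 20/21 matches
print(round(percent_identity(primer, two_off), 2))  # 19/21 matches
```

Under this reading, a 21-base primer that differs from a window by a single base is still about 95.2% identical, so at least two mismatches are needed to fall below 95%.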

Please look at the code comments below to see which of the different requirements the code accounted for.

In [81]:
from Bio import SeqIO
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq
universal_primer = "CAAATATCATTCTGAGG-GC" 
pork_universal_primer_start = 408
beef_universal_primer_start = 408
chicken_universal_primer_start = 411

seq_records = []
# Load the three organisms' gene sequences from our given FASTA file
for seq_record in SeqIO.parse("data/cytochrome_b.fasta", "fasta"):
	seq_records.append(seq_record)

# Extract the gene sequences from the SeqIO records
pork_data = seq_records[0].seq
beef_data = seq_records[1].seq
chicken_data = seq_records[2].seq

# diff_letters counts the positions at which two equal-length strings differ
def diff_letters(a,b):
    return sum(a[i] != b[i] for i in range(len(a)))

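# find_valid_temp_substrings scans a gene sequence from start_index and returns (substring, 1-indexed start)
# pairs for every substring of length substring_len that contains no 'N' and meets requirement 3.
def find_valid_temp_substrings(gene_sequence, start_index, substring_len): 
	substring_list = []
	# The + 1 includes the final full-length window of the sequence
	for i in range(start_index, len(gene_sequence) - substring_len + 1):
		# The melting temperature is evaluated on the reverse complement, which is the strand used as the primer
		curr_substring = gene_sequence[i:i+substring_len].reverse_complement()
		if 'N' in curr_substring:
			continue
		tm = mt.Tm_Wallace(curr_substring)
		if 52.0 <= tm <= 58.0:
			# Store the forward-strand substring together with its 1-indexed start position
			substring_list.append((curr_substring.reverse_complement(), i + 1))
	return substring_list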

# check_substring_validity returns True only if exactly one window of the substring's own gene sequence has fewer
# than `threshold` mismatches with it, and no window of the other two gene sequences does (requirement 4).
def check_substring_validity(substring, threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2):
	num_matches = 0
	# Compare full-length windows only; a truncated window at the end would be scored on fewer letters
	for i in range(len(matching_gene_sequence) - len(substring) + 1):
		if(diff_letters(matching_gene_sequence[i:i+len(substring)], substring) < threshold):
			num_matches += 1
	if num_matches != 1:
		return False
	for i in range(len(gene_sequence_1) - len(substring) + 1):
		if(diff_letters(gene_sequence_1[i:i+len(substring)], substring) < threshold):
			return False
	for i in range(len(gene_sequence_2) - len(substring) + 1):
		if(diff_letters(gene_sequence_2[i:i+len(substring)], substring) < threshold):
			return False
	return True

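# find_primers searches the gene sequence of interest from start_index for substrings of length substring_len that
# meet requirement 3 and are dissimilar to the other two gene sequences. It returns the reverse complement of the
# first valid substring together with its 1-indexed start position, or ("", 0) if none is found.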
def find_primers(substring_len, threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2, start_index):
	substring_list = find_valid_temp_substrings(matching_gene_sequence, start_index, substring_len)
	valid_outputs = []
	for i in range(len(substring_list)):
		if(check_substring_validity(substring_list[i][0], threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2)): 
			valid_outputs.append(substring_list[i])
	if len(valid_outputs) > 0:
		return (valid_outputs[0][0].reverse_complement(), valid_outputs[0][1])
	else:
		return ("", 0)

# By choosing start indices that are more than 100 nucleotides past the universal primer, we ensure
# that each PCR product is longer than the 90-base minimum in requirement 2.
# We also meet requirement 1 by specifying a primer length of 21 bases for all find_primers() calls
chicken_primer_info = find_primers(21, 1, chicken_data, beef_data, pork_data, 550)
beef_primer_info = find_primers(21, 1, beef_data, chicken_data, pork_data, 650)
# beef_primer_info = find_primers(21, 1, beef_data, chicken_data, pork_data, 750)
pork_primer_info = find_primers(21, 1, pork_data, chicken_data,  beef_data, 950)

print("Chicken Primer Sequence: ", chicken_primer_info[0])
print("Chicken Primer Sequence Start Index: ", chicken_primer_info[1]) 
print("Chicken PCR Product Length: ", chicken_primer_info[1]-pork_universal_primer_start)
print("=======================================================")
print("Beef Primer Sequence: ", beef_primer_info[0]) 
print("Beef Primer Sequence Start Index: ", beef_primer_info[1]) 
print("Beef PCR Product Length: ", beef_primer_info[1]-beef_universal_primer_start)
print("=======================================================")
print("Pork Primer Sequence: ", pork_primer_info[0]) 
print("Pork Primer Sequence Start Index: ", pork_primer_info[1]) 
print("Pork PCR Product Length: ", pork_primer_info[1]-pork_universal_primer_start)

Chicken Primer Sequence:  AGTAATACCTGCGATTGCAAA
Chicken Primer Sequence Start Index:  562
Chicken PCR Product Length:  154
Beef Primer Sequence:  TGTCCTTAATGGTATAGTAGG
Beef Primer Sequence Start Index:  665
Beef PCR Product Length:  257
Pork Primer Sequence:  AATAGGCATTGACTTAGTGGT
Pork Primer Sequence Start Index:  954
Pork PCR Product Length:  546


### Primer Figure
![Figure with pork, beef, and chicken primers.](figure.jpeg "Primer Figure")

### Potential Sources of Error
One potential source of error is our estimate of the primers' melting temperatures. We chose the Wallace ("Rule of Thumb") method to approximate the melting temperature of each primer. There are two other methods provided by the Biopython package that we could have used:
1. Estimating the melting temperature from GC content, which includes salt and mismatch corrections
2. Estimating the melting temperature from nearest-neighbor thermodynamics

We chose the Wallace method because it allowed the largest number of primer sequences to be included. If one of the other methods were known to be a better estimator of melting temperature, we would have chosen it over the Wallace method, but we ultimately chose the Wallace method because it is listed as the "Rule of Thumb" method.
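For reference, the Wallace rule estimates the melting temperature as 2 °C for each A or T plus 4 °C for each G or C. The sketch below is a plain-Python restatement of that rule for unambiguous sequences, written for illustration; the function name `wallace_tm` is hypothetical, and the code is not taken from Biopython.

```python
def wallace_tm(primer):
    """Wallace rule of thumb: Tm = 2 * (A + T) + 4 * (G + C), in degrees C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# The three unique primers reported in the output above
primers = {
    "chicken": "AGTAATACCTGCGATTGCAAA",
    "beef": "TGTCCTTAATGGTATAGTAGG",
    "pork": "AATAGGCATTGACTTAGTGGT",
}
for species, seq in primers.items():
    print(species, wallace_tm(seq))
```

Each of the three primers has 8 G/C bases and 13 A/T bases, so all three come out at 58 °C under this rule, which is the upper edge of the 52 to 58 °C window in requirement 3.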

Another potential source of error is that the lengths of the PCR products may not be distinct enough. To get a universal primer of sufficient length, we used the region starting at roughly positions 408-411 of the gene sequences. This may have limited the length differences we could achieve between PCR products, which are 154 bp, 257 bp, and 546 bp.
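To make the spacing between products concrete, the pairwise differences between the three reported product lengths can be computed directly. This is a small illustrative sketch; the helper name `pairwise_gaps` is hypothetical.

```python
from itertools import combinations

# PCR product lengths reported above, in base pairs
product_lengths = {"chicken": 154, "beef": 257, "pork": 546}


def pairwise_gaps(lengths):
    """Absolute length difference for every pair of species."""
    return {(a, b): abs(lengths[a] - lengths[b])
            for a, b in combinations(lengths, 2)}


gaps = pairwise_gaps(product_lengths)
for pair, gap in gaps.items():
    print(pair, gap)
print("smallest gap:", min(gaps.values()))
```

The closest pair is chicken and beef, which differ by 103 bp.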

## Experimental Design

### Controls
Here, we describe some potential negative and positive controls that we will use in our experiment, as well as the expected results from the controls.

#### Negative Controls
1. A petri dish with nothing <br>
   Expected Result: No DNA strands generated
2. A petri dish with water <br>
   Expected Result: No DNA strands generated
3. A petri dish with snake meat <br>
   Expected Result: No DNA strands generated

#### Positive Controls
1. A petri dish with only chicken meat <br>
   Expected Result: Only DNA strands of length 154 bp generated
2. A petri dish with only beef meat <br>
   Expected Result: Only DNA strands of length 257 bp generated
3. A petri dish with only pork meat <br>
   Expected Result: Only DNA strands of length 546 bp generated
4. A petri dish with 1/2 chicken meat, 1/2 beef meat, well-mixed <br>
   Expected Result: 1/2 of DNA strands have length 154 bp, 1/2 of DNA strands have length 257 bp
5. A petri dish with 1/2 chicken meat, 1/2 pork meat, well-mixed <br>
   Expected Result: 1/2 of DNA strands have length 154 bp, 1/2 of DNA strands have length 546 bp
6. A petri dish with 1/2 beef meat, 1/2 pork meat, well-mixed <br>
   Expected Result: 1/2 of DNA strands have length 257 bp, 1/2 of DNA strands have length 546 bp
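The expected results for the controls above follow one rule: each species in the sample contributes its own product length, and anything else contributes nothing. A minimal sketch of that rule is given below; the names `PRODUCT_LENGTH_BP` and `expected_bands` are hypothetical, and the sketch only predicts which lengths should appear, not their proportions.

```python
# PCR product lengths from our primer design, in base pairs
PRODUCT_LENGTH_BP = {"chicken": 154, "beef": 257, "pork": 546}


def expected_bands(species_in_sample):
    """Sorted product lengths expected for a sample containing the given species.

    Species without a designed primer (e.g. snake) are assumed to yield no product.
    """
    return sorted(PRODUCT_LENGTH_BP[s]
                  for s in set(species_in_sample)
                  if s in PRODUCT_LENGTH_BP)


print(expected_bands([]))                   # empty dish or water
print(expected_bands(["snake"]))            # negative control 3
print(expected_bands(["beef"]))             # positive control 2
print(expected_bands(["chicken", "pork"]))  # positive control 5
```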

### Procedure
1. Order DNA primers and universal primers that are described above.
2. Prepare a total of nine petri dishes that all have the controls above and check if there are expected results from the controls. <br>
If the controls do not lead to expected results, find other potential unique DNA primers using our program and repeat Step 1.
3. Prepare five different petri dishes of our sausage meat of interest, using different parts of the sausage and mixing well.
4. Perform the PCR procedure on each of the petri dishes with our ordered DNA primers and universal primers.
5. Analyze the results of the PCR products with respect to the length of each of the sequences yielded from the PCR procedure.
6. Write a conclusion that predicts the composition of the sausage meat based on the PCR results.
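Steps 5 and 6 amount to matching each observed product length to the nearest designed product length. The sketch below shows one possible way to do that; the function name `call_species` and the 20 bp tolerance are assumptions made for illustration, not values from the lab handout, and a real tolerance would depend on how precisely lengths can be read from the gel.

```python
# PCR product lengths from our primer design, in base pairs
PRODUCT_LENGTH_BP = {"chicken": 154, "beef": 257, "pork": 546}


def call_species(observed_bp, tolerance_bp=20):
    """Species whose expected product length is within tolerance_bp of an observed length.

    tolerance_bp is an assumed measurement tolerance, not a value from the lab handout.
    """
    called = set()
    for band in observed_bp:
        for species, expected in PRODUCT_LENGTH_BP.items():
            if abs(band - expected) <= tolerance_bp:
                called.add(species)
    return sorted(called)


print(call_species([150, 550]))  # bands near 154 bp and 546 bp
print(call_species([400]))       # band that matches no designed product
```

Because the closest two designed products differ by about 100 bp, a tolerance of this size does not let one observed band match two species.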