## Lab 1: Primer Design for Polymerase Chain Reaction for Tissue Identification

### Goal: To design PCR reactions to identify contents of mystery tissue samples

### Different Primer Strategies
We consider both whether a primer pair yields a product from a given genome and the length of that product.
We were given three options for creating primer strategies:
1. 2 primers:
   We have a universal forward primer and a universal reverse primer that yield a different-length product for each animal.
2. 4 primers:
   We could make one of the primers a universal forward primer that primes in all species.
   We would then have a unique reverse primer for each species, each yielding a product of a different length.
3. 6 primers:
   We could make the primers

Our lab group chose the 4-primer strategy.

### Finding Our Universal Primer
To find the universal primer, we manually searched the Clustal sequence alignment of the three provided gene sequences for a contiguous 20-base substring with only one mismatched position across the three gene sequences.

We found the following universal substring: "CAAATATCATTCTGAGG-GC".
The universal substring has a length of 20.
It starts (counting from 1, not 0) at position 408 of the pork and beef gene sequences and at position 411 of the chicken sequence.

We represented the mismatch position (the single mismatch we allowed in the universal primer) as a gap character, '-'. The alternative was to use the most frequent nucleotide at that position, which was 'G', but we chose the gap.
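
The manual search described above could also be automated. The following is a minimal sketch, assuming the three sequences are available as equal-length aligned strings with gaps written as '-'. The function name `find_universal_windows` and the toy sequences are hypothetical and are not the lab data. Returned positions are 0-based alignment columns, not the 1-based gene positions reported above.

```python
def find_universal_windows(aligned_seqs, window_len=20, max_mismatch_cols=1):
    """Return (start, consensus) pairs for alignment windows with few mismatch columns.

    aligned_seqs: equal-length strings taken from a multiple alignment.
    A column counts as a mismatch when the sequences do not all share
    the same character in that column.
    """
    aln_len = len(aligned_seqs[0])
    if any(len(s) != aln_len for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(aln_len - window_len + 1):
        # Transpose the window so that each tuple is one alignment column
        columns = list(zip(*(s[start:start + window_len] for s in aligned_seqs)))
        mismatches = sum(1 for col in columns if len(set(col)) > 1)
        if mismatches <= max_mismatch_cols:
            # Mark disagreeing columns with '-', as done for the universal primer
            consensus = "".join(
                col[0] if len(set(col)) == 1 else "-" for col in columns
            )
            hits.append((start, consensus))
    return hits


# Toy aligned sequences (hypothetical, for illustration only)
toy = [
    "ACGTACGTAA",
    "ACGTACCTAA",
    "ACGTACGTAT",
]
for start, consensus in find_universal_windows(toy, window_len=5):
    print(start, consensus)
```

A real run would read the aligned sequences from the Clustal file and keep the default `window_len=20`.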

### Finding Our Unique Primers
Next, to find our unique primers, we first loaded the three organisms' gene sequences from the provided FASTA file using the Biopython package.

Then, as instructed by the lab handout, we found primer sequences with the following requirements:
1. Primers should be 20-25 bases long.
2. Your PCR products should be between 90 and 1000 bases in length.
3. Your primers should have a melting temperature between 52 and 58 °C.
4. Your primers should have less than 95% similarity with other substrings in the other gene sequences.
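
Requirement 4 can be restated as a mismatch count. If similarity is taken to be the fraction of matching positions between a primer and an equal-length window (an assumption; the handout may define it differently), then a 21-base primer with one mismatch is 20/21 ≈ 95.2% similar, so at least two mismatches are needed to fall below 95%. The sketch below illustrates this; the helper names are hypothetical and are not part of our lab code.

```python
def min_mismatches_required(primer_len, max_similarity_pct=95):
    """Smallest mismatch count m such that (primer_len - m) / primer_len
    is strictly below max_similarity_pct percent."""
    # Need 100 * m > (100 - pct) * primer_len; integer arithmetic avoids rounding issues
    return (primer_len * (100 - max_similarity_pct)) // 100 + 1


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_too_similar(primer, other_seq, max_similarity_pct=95):
    """True if any full-length window of other_seq is at least
    max_similarity_pct percent identical to the primer."""
    n = len(primer)
    need = min_mismatches_required(n, max_similarity_pct)
    return any(
        hamming(primer, other_seq[i:i + n]) < need
        for i in range(len(other_seq) - n + 1)
    )


for n in (20, 21, 25):
    print(n, min_mismatches_required(n))
```

Under this reading, every primer length allowed by requirement 1 needs at least two mismatches against each window of the other two gene sequences.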

The code comments below note which of these requirements each part of the code accounts for.

In [72]:
from Bio import SeqIO
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq
universal_primer = "CAAATATCATTCTGAGG-GC"

seq_records = []
# Load the three organisms' gene sequences from our given FASTA file
for seq_record in SeqIO.parse("data/cytochrome_b.fasta", "fasta"):
	seq_records.append(seq_record)

# Extract the gene sequences from the SeqIO records
pork_data = seq_records[0].seq
beef_data = seq_records[1].seq
chicken_data = seq_records[2].seq

# diff_letters counts the positions at which two strings of the same length differ
def diff_letters(a,b):
    return sum (a[i] != b[i] for i in range(len(a)))

# find_valid_temp_substrings scans a gene sequence from start_index for substrings of length substring_len
# that contain no 'N' and meet requirement 3 (Wallace melting temperature between 52 and 58 C).
def find_valid_temp_substrings(gene_sequence, start_index, substring_len): 
	substring_list = []
	for i in range(start_index, len(gene_sequence)-substring_len):
		curr_substring = (gene_sequence[i:i+substring_len]).reverse_complement()
		if('N' not in curr_substring and 52.0 <= mt.Tm_Wallace(curr_substring) and 58.0 >= mt.Tm_Wallace(curr_substring)):
			substring_list.append((curr_substring.reverse_complement()))
	return substring_list

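# check_substring_validity checks that a substring matches exactly one window of its own gene sequence and
# that no window of the other two gene sequences is too similar to it (requirement 4).
# A window counts as a match when it differs from the substring at fewer than `threshold` positions, so
# threshold=1 accepts only exact matches.
# Note: windows at the very end of matching_gene_sequence are shorter than the substring; diff_letters
# compares them against the substring's prefix only.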
def check_substring_validity(substring, threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2):
	num_matches = 0
	for i in range(len(matching_gene_sequence)):
		if(diff_letters(matching_gene_sequence[i:i+len(substring)], substring) < threshold):
			num_matches += 1
	if num_matches != 1:
		return False
	for i in range(len(gene_sequence_1)-len(substring)):
		if(diff_letters(gene_sequence_1[i:i+len(substring)], substring) < threshold):
			return False
	for i in range(len(gene_sequence_2)-len(substring)):
		if(diff_letters(gene_sequence_2[i:i+len(substring)], substring) < threshold):
			return False
	return True

# find_primers searches matching_gene_sequence from start_index onward for substrings of length substring_len
# that pass the checks above against gene_sequence_1 and gene_sequence_2. It returns the reverse complement of
# the first valid substring, or "" if none is found.
def find_primers(substring_len, threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2, start_index):
	substring_list = find_valid_temp_substrings(matching_gene_sequence, start_index, substring_len)
	valid_outputs = []
	for i in range(len(substring_list)):
		if(check_substring_validity(substring_list[i], threshold, matching_gene_sequence, gene_sequence_1, gene_sequence_2)): 
			valid_outputs.append(substring_list[i])
	if len(valid_outputs) > 0:
		return valid_outputs[0].reverse_complement()
	else:
		return ""

# By choosing start indices that are more than 100 nucleotides downstream of the universal primer, we ensure
# that we meet requirement 2.
# We also meet requirement 1 by setting the primer length to 21 bases for all find_primers() calls
print("Chicken Primer Sequence: ", find_primers(21, 1, chicken_data, beef_data, pork_data, 550))
print("=======================================================")
print("Beef Primer Sequence: ", find_primers(21, 1, beef_data, chicken_data, pork_data, 750))
print("=======================================================")
print("Pork Primer Sequence: ", find_primers(21, 1, pork_data, chicken_data, beef_data, 950))

Chicken Primer Sequence:  AGTAATACCTGCGATTGCAAA
Beef Primer Sequence:  TATGCAAATAGGAAGTACCAC
Pork Primer Sequence:  AATAGGCATTGACTTAGTGGT
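
As a quick check, the three primers printed above can be tested against requirements 1 and 3 without Biopython. This is a minimal sketch that applies the Wallace rule, Tm = 4 × (G + C) + 2 × (A + T), directly; the helper name `wallace_tm` is hypothetical, and the sketch assumes the primers contain only A, C, G, and T.

```python
def wallace_tm(primer):
    """Wallace rule of thumb: 4 degrees per G or C, 2 degrees per A or T."""
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 4 * gc + 2 * at


# Primer sequences copied from the output above
reported = {
    "Chicken": "AGTAATACCTGCGATTGCAAA",
    "Beef": "TATGCAAATAGGAAGTACCAC",
    "Pork": "AATAGGCATTGACTTAGTGGT",
}

for name, primer in reported.items():
    tm = wallace_tm(primer)
    length_ok = 20 <= len(primer) <= 25   # requirement 1
    tm_ok = 52 <= tm <= 58                # requirement 3
    print(name, len(primer), tm, length_ok, tm_ok)
```

Each of the three primers has 8 G/C bases and 13 A/T bases, so this estimate places all of them at 58 °C, the upper edge of the allowed window.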


### Primer Figure
Figure Assignment:
Design a figure to show how your experiment will work with your designed
primers. Consider showing the products expected from each PCR reaction you 
would run. There are no conclusions for this figure, but you may have made
concessions in your design that would cause potential errors. If you are aware
of any potential sources of errors, describe them.

CAPTION: TO BE DONE INDIVIDUALLY

### Experimental Design
TO BE DONE INDIVIDUALLY

### Potential Sources of Error
One potential source of error is the estimated melting temperature of the primers. We used the Wallace ("Rule of Thumb") method to approximate it. The Biopython package provides two other methods that we could have used:
1. Estimating the melting temperature from GC content, which can include salt and mismatch corrections
2. Estimating the melting temperature from nearest-neighbor thermodynamics

We chose the Wallace method because it allowed the largest number of primer sequences to be included. Had we known that one of the other methods is a better estimator of melting temperature, we would have preferred it over the Wallace method, but we ultimately chose the Wallace method because it is listed as the "Rule of Thumb" method.

Another potential source of error is that we did not have access to other candidate alignments of the three gene sequences. If another alignment scores equally well and the alignment we were given is not the optimal one, then our PCR products may not differ in length as expected.

A third potential source of error is that the lengths of the PCR products may not differ enough to be told apart.
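
One way to check this is to compute the expected product lengths and the gaps between them. The sketch below uses the universal primer start positions reported above, but the reverse primer binding positions are not recorded in this report, so each one is placed at the search start index passed to find_primers(). That makes the lengths lower bounds only. The `MIN_GAP` value of 50 bases is a hypothetical resolution threshold, not a value from the lab handout.

```python
PRIMER_LEN = 21   # primer length passed to find_primers() in this report
MIN_GAP = 50      # hypothetical minimum length difference in bases (assumption)


def product_length(forward_start, reverse_site_start, reverse_len=PRIMER_LEN):
    """Length from the first base of the forward primer to the last base of
    the reverse primer's binding site (both positions 1-based, same strand)."""
    return reverse_site_start + reverse_len - forward_start


# (universal primer start, assumed reverse site start), both 1-based.
# Reverse sites are placeholders: search start index (0-based) + 1.
designs = {
    "chicken": (411, 550 + 1),
    "beef": (408, 750 + 1),
    "pork": (408, 950 + 1),
}

lengths = {name: product_length(f, r) for name, (f, r) in designs.items()}
ordered = sorted(lengths.values())
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print(lengths)
print("within 90-1000 bases:", all(90 <= n <= 1000 for n in lengths.values()))
print("gaps:", gaps, "all >=", MIN_GAP, ":", all(g >= MIN_GAP for g in gaps))
```

Substituting the actual binding positions of the three reverse primers would give the real product lengths to compare.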