
In [None]:
import random

# Exercise 0: Practicing using GitHub (10 points)

We will be checking for commit messages, and to practice writing them, they will be worth points on this first assignment. To earn full points, *each* team member must submit *at least one* commit message during the process of working on this homework.



### <font color='red'> 10/10 </font> 

# Exercise 1a: Melting temperature calculator (10 pts)

Let's say you want to make your own function for validating the design of a given PCR primer sequence. (Note many tools already exist [online](https://bioinfo.ut.ee/primer3-0.4.0/), but it's so much more satisfying to write your own!) As the first step, we want to calculate the melting temperature of a given primer sequence. We will use the following equation:

$$ T_m = 81.5 + 0.41 \times (\%GC) - 675/N,$$

where $T_m$ is the melting temperature in $^\circ$C, %GC is the GC content (**in percent, not decimal!**), and $N$ is the length of the sequence. Write a function that will take in a DNA sequence, and output the melting temperature.

Note you may want to make use of your `GC_content()` function from last week. You can redefine that function here to call it within the `melting_temp()` function below. 
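
As a minimal sketch of the helper-based approach, the block below assumes `GC_content()` returns the GC content in percent. Its body here is a guess at what last week's function might look like, and `melting_temp_from_gc` is a hypothetical name used only for illustration.

```python
def GC_content(DNA_seq):
    """Return the GC content of DNA_seq as a percentage (0 to 100)."""
    gc_count = DNA_seq.count("G") + DNA_seq.count("C")
    return 100 * gc_count / len(DNA_seq)


def melting_temp_from_gc(DNA_seq):
    """Apply T_m = 81.5 + 0.41 * (%GC) - 675 / N."""
    return 81.5 + 0.41 * GC_content(DNA_seq) - 675 / len(DNA_seq)


# 18 bases, 9 of them G or C: 81.5 + 0.41 * 50 - 675 / 18 = 64.5
print(round(melting_temp_from_gc("GCATCGATCGATTACGAC"), 1))
```

Keeping the percent conversion inside the helper means the `0.41` coefficient is always applied to a number between 0 and 100.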

In [None]:
def melting_temp(DNA_seq):
    GC_Count = 0
    total_seq = len(DNA_seq)
    for letter in DNA_seq:
        if letter == 'G':
            GC_Count += 1
        if letter == 'C':
            GC_Count += 1
    GC = (GC_Count/total_seq)*100
    the_final_ans = (81.5 + .41*GC - 675/total_seq)
    return the_final_ans

In [None]:
# should be 71.8 (with rounding)
melting_temp("ATCGATCAGTTACGATAGCCGAC") 

71.76086956521738

In [None]:
# should be 64.5
melting_temp("GCATCGATCGATTACGAC") 

64.5

### <font color='green'> This looks good. For something simple, this is perfect. But you can also use your same GC_count function from last week in a cell above and just call it in your new function. Also, I'm not sure why they weren't there, but there were no cells testing your function. I added them back in, but I have to take off a point for not having those. </font> 

### <font color='red'> 9/10 </font> 

# Exercise 1b: Primer checker (20 pts)

Let's now develop our full primer checker. For this, we will make reference to [ThermoFisher's guidelines for primer design](https://www.thermofisher.com/blog/behindthebench/pcr-primer-design-tips/). Specifically, your function should make the following checks on a given DNA sequence:

- GC content to be between 40 and 60%. (Again you can use your function from last week.)
- Primer ends in a G or C
- Length between 18 and 30 bases
- Make sure the melting temperature is between 65$^\circ$C and 75$^\circ$C
- Avoid strings of 4 of the same nucleotide (i.e. 'AAAA', 'CCCC', etc.)

You'll want to check each of these rules in turn and `print` output to let the user know if they have violated any (or all!) of these design guidelines. If everything checks out, you can return the melting temperature so you can go about designing your PCR reaction!

Note there are *a lot* of checks to do here. To keep the `primer_checker()` function more manageable, you may want to write some other smaller functions that you can call in this main function. How you approach the code is up to you.  
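
One possible way to split the checks into smaller functions is sketched below. All function names (`gc_percent`, `melting_temp_sketch`, `has_run_of_four`, `primer_problems`) are hypothetical, and the sketch assumes a non-empty, uppercase sequence containing only A, C, G, and T.

```python
def gc_percent(seq):
    """GC content in percent."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def melting_temp_sketch(seq):
    """T_m = 81.5 + 0.41 * (%GC) - 675 / N."""
    return 81.5 + 0.41 * gc_percent(seq) - 675 / len(seq)


def has_run_of_four(seq):
    """True if any base appears four times in a row."""
    return any(base * 4 in seq for base in "ACGT")


def primer_problems(seq):
    """Return a list of violated guidelines (empty if the primer passes)."""
    problems = []
    if not 40 <= gc_percent(seq) <= 60:
        problems.append("GC content outside 40-60%")
    if seq[-1] not in "GC":
        problems.append("does not end in G or C")
    if not 18 <= len(seq) <= 30:
        problems.append("length outside 18-30 bases")
    if not 65 <= melting_temp_sketch(seq) <= 75:
        problems.append("melting temperature outside 65-75 C")
    if has_run_of_four(seq):
        problems.append("run of 4 identical bases")
    return problems


print(primer_problems("ATGCATGCATGCATGCATGC"))  # passes every check: []
```

Returning a list rather than printing inside each check lets the main function decide how to report the problems, and makes each rule easy to test separately.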


In [None]:
def primer_checker(DNA_seq):
    GC_Count = 0
    total_seq = len(DNA_seq)
    final_ans = ''
    # count G and C bases to get the GC content in percent
    for letter in DNA_seq:
        if letter == 'G':
            GC_Count += 1
        if letter == 'C':
            GC_Count += 1
    GC = (GC_Count / total_seq) * 100
    # compute the melting temperature once and reuse it below
    Tm = melting_temp(DNA_seq)
    if GC > 60 or GC < 40:
        final_ans += f'there is a GC Content of {GC:.3f} which is out of range'
    # the primer should end in G or C, so flag it only when it does not
    if DNA_seq[-1] != 'G' and DNA_seq[-1] != 'C':
        final_ans += '- the last base is not G or C'
    if total_seq > 30 or total_seq < 18:
        final_ans += f'- the length is {total_seq} bases which is not between 18 and 30'
    if Tm < 65 or Tm > 75:
        final_ans += f'- the melting temperature is {Tm:.2f} which is not between 65 and 75'
    if 'AAAA' in DNA_seq or 'TTTT' in DNA_seq or 'CCCC' in DNA_seq or 'GGGG' in DNA_seq:
        final_ans += '- there are at least four of the same bases in sequence'
    if final_ans == '':
        return print(f"This sequence works! The melting temperature is {Tm:.2f}'C")
    else:
        return print(f'This sequence does not work because {final_ans}')



### <font color='green'> Great job also returning the GC content or whatever is asked so you know exactly what needs to change rather than just saying "Error".  </font> 

### <font color='red'> 20/20  </font>  

# Exercise 1c: Testing your primer checker (10 pts)

For each of the five rules we put in place for good primer design, test out your function below to demonstrate that it responds appropriately for when the rule is followed and when the rule is broken. 

In [None]:
primer_checker('ATGCATGCATGCATGCATGCT')
primer_checker('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG')

This sequence does not work because - the last base is not G or C
This sequence does not work because there is a GC Content of 1.316 which is out of range- the length is 76 bases which is not between 18 and 30- there are at least four of the same bases in sequence


### <font color='green'> Does not check for melting point outside of range. </font> 

### <font color='red'> 8/10  </font> 
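
To exercise the melting-temperature rule on its own, one option is to pick sequences that satisfy the other four guidelines but land outside 65 to 75 °C. The sketch below is illustrative only: `tm_sketch` is a hypothetical stand-in for `melting_temp()`, and the two sequences were chosen by hand.

```python
def tm_sketch(seq):
    """T_m = 81.5 + 0.41 * (%GC) - 675 / N, with %GC in percent."""
    gc_percent = 100 * (seq.count("G") + seq.count("C")) / len(seq)
    return 81.5 + 0.41 * gc_percent - 675 / len(seq)


# 18 bases, 50% GC, ends in C, no run of four: only T_m is too low
too_cold = "GCATCGATCGATTACGAC"
# 30 bases, 60% GC, ends in C, no run of four: only T_m is too high
too_hot = "GCATC" * 6

print(round(tm_sketch(too_cold), 1))  # 64.5, below 65
print(round(tm_sketch(too_hot), 1))   # 83.6, above 75
```

Short primers push the `675 / N` term up and the melting temperature down, while long GC-rich primers do the opposite, so both ends of the range can be reached without breaking the length or GC rules.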

# Exercise 2: Mutate sequence (20 points)

Last time we wrote code to mutate a single nucleotide to another nucleotide. Now we will expand the functionality to mutagenize an entire DNA sequence at a specified mutation rate. For example, if we have a DNA sequence of length ten and mutagenize at rate 0.1, we would expect *on average* one base pair to get mutated. You may want to reference [Python's documentation about the random module](https://docs.python.org/3/library/random.html) for functions that may help you here.
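
The "on average" claim can be checked with a quick simulation. The sketch below is one possible approach, not the required solution: `mutate_seq_sketch` and its `rng` parameter are hypothetical, and each base is assumed to mutate independently to one of the three other bases.

```python
import random


def mutate_seq_sketch(DNA_seq, mut_rate, rng=random):
    """Mutate each base independently with probability mut_rate."""
    mutated = []
    for base in DNA_seq:
        if rng.random() < mut_rate:
            # pick one of the three other bases
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "ATGCTGGACG"      # length 10
trials = 10_000
total = sum(
    sum(a != b for a, b in zip(seq, mutate_seq_sketch(seq, 0.1, rng)))
    for _ in range(trials)
)
print(total / trials)  # should be close to 1 mutation per sequence
```

Any single run can contain zero, one, or several mutations; only the average over many runs approaches `mut_rate * len(DNA_seq)`.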

In [None]:
def mutate_seq(DNA_seq, mut_rate):
  """Returns a mutated DNA sequence at the mutation rate provided"""
  x = list(DNA_seq)
  # loop over every position in the sequence
  for i in range(len(DNA_seq)):
    # draw a random number in [0, 1) to decide whether this position mutates
    rand_var = random.random()
    # if it mutates, replace the base with one of the three other bases
    if (rand_var < mut_rate) and (DNA_seq[i] == "G"):
      x[i] = random.choice(["A", "T", "C"])
    elif (rand_var < mut_rate) and (DNA_seq[i] == "A"):
      x[i] = random.choice(["G", "T", "C"])
    elif (rand_var < mut_rate) and (DNA_seq[i] == "T"):
      x[i] = random.choice(["G", "A", "C"])
    elif (rand_var < mut_rate) and (DNA_seq[i] == "C"):
      x[i] = random.choice(["G", "T", "A"])

  return x

Make sure to try your code out on a few sequences to see that they do in fact mutate!

In [None]:
mutate_seq("ATGCTGGACGTA",.5)

['A', 'T', 'G', 'G', 'T', 'G', 'T', 'C', 'C', 'C', 'G', 'A']

In [None]:
mutate_seq('TGCATAGCGTATG', .1)

['T', 'T', 'C', 'G', 'T', 'A', 'G', 'C', 'G', 'T', 'T', 'T', 'G']

In [None]:
for i in range(10):
  print(mutate_seq("AAAAAAAAAA", 0.1))

['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'A', 'A', 'A', 'A', 'C', 'A', 'A', 'A']
['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'T', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'T', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'A']
['A', 'A', 'A', 'T', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'C', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
['A', 'A', 'C', 'A', 'G', 'A', 'A', 'A', 'A', 'A']


### <font color='green'>Truly random, rather than just doing mut_rate*length every time you run it, but returning it in a string is much more functional and easier to take and use somewhere else if necessary in this case. </font> 

### <font color='red'> 19/20  </font> 

# Exercise 3: ORF finder (30 pts)

Again building off the code we wrote last time, we will now write an Open Reading Frame (ORF) finder. This means rather than just blindly translating from the start of the provided DNA sequence, we will want to be mindful of start codons and various reading frames. Your function should return a `list` of possible ORFs translated into amino acid sequences, considering all possible start codons in all possible reading frames. 

As an example, your code may output something like:
`["MRPSLRAMHGAVRT*","MHGAVRT*","MAETMLP","MLP"]`, where some ORFs can be a subset of other ORFs, and some ORFs don't have a terminating stop codon because they've reached the end of the provided sequence.
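
The idea of starting a translation at every `ATG`, whatever its reading frame, can be sketched as follows. This is one possible approach under stated assumptions: `find_orfs` and `CODON_TABLE` are hypothetical names, the table is built compactly from the standard genetic code rather than typed out, and the input is assumed to be uppercase A, C, G, and T only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orfs(DNA_seq):
    """Translate from every ATG until a stop codon or the end of the sequence."""
    orfs = []
    start = DNA_seq.find("ATG")
    while start != -1:
        protein = ""
        # step through complete codons only, in the frame set by this ATG
        for i in range(start, len(DNA_seq) - 2, 3):
            aa = CODON_TABLE[DNA_seq[i:i + 3]]
            protein += aa
            if aa == "*":
                break
        orfs.append(protein)
        # resume the search one base later so overlapping ORFs are found
        start = DNA_seq.find("ATG", start + 1)
    return orfs


print(find_orfs("TTTATGGGTATGAGATGATAAGTC"))  # ['MGMR*', 'MR*', 'MIS']
```

Because the search resumes one base after each hit, start codons in all three frames are picked up without looping over the frames explicitly.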

Again, we need to define the dictionary for corresponding codons to amino acids. 

In [None]:
aa_dict = \
{'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S', 'TCC': 'S', 
 'TCA': 'S', 'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*', 
 'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W', 'CTT': 'L', 'CTC': 'L', 
 'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P', 
 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGT': 'R', 'CGC': 'R', 
 'CGA': 'R', 'CGG': 'R', 'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 
 'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 'AAC': 'N', 
 'AAA': 'K', 'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R', 
 'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A', 'GCC': 'A', 
 'GCA': 'A', 'GCG': 'A', 'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E', 
 'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'}

And now you can write your function.

In [None]:
 # YOUR CODE HERE
def translate(DNA_seq):
  string = DNA_seq
  n =  3
  output_list = []
  split_string = [string[i:i+n] for i in range(0, len(string), n)]

  for x in split_string:
    if x in aa_dict:
      if aa_dict[x] == '*':
        output_list.append('*')
        break
      else:
        output_list.append(aa_dict[x])
  final = "".join(output_list)
  return final

In [None]:
def ORF_finder(DNA_seq):
  """Returns a list of all translated ORFs in the provided
  DNA sequence"""
  
  ORF = []
  
  string = DNA_seq
  # find each ATG and translate from there
  while 'ATG' in string:
    start = string.find('ATG')
    new_DNA = string[start:]

    x =  translate(new_DNA)
    string = string[start+1:]

    ORF.append(x)
  
  return ORF

In [None]:
ORF_finder('ATGTTTGGTAATGGATAAGTC')

['MFGNG*', 'MDK']

In [None]:
ORF_finder('TTTATGGGTATGAGATGATAAGTC')

['MGMR*', 'MR*', 'MIS']

In [None]:
ORF_finder("ATGCAGCATCAGATGCATCGACAATGCGACGACAGTCAGCATAGACGCA")

['MQHQMHRQCDDSQHRR', 'MHRQCDDSQHRR', 'MRRQSA*']

### <font color='green'> Well done using a separate function and just calling it in your new one. I've seen lots of unnecessary errors arise when people try to put a function in a function, so keep that up.  </font> 

### <font color='red'> 30/30  </font> 

# Overall: 96/100 Great job!

Test out your function on a few inputs to make sure it works as expected. A few things you may want to check for:
- Does your function find multiple ORFs?
- Does your function find overlapping ORFs?
- Does your function find ORFs in different reading frames?
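
One way to answer the checklist above is to list where each start codon sits and which reading frame it belongs to. The helper below is a hypothetical illustration (`start_codon_frames` is not part of the assignment); it takes the frame to be the start position modulo 3.

```python
def start_codon_frames(DNA_seq):
    """Return (position, frame) for every ATG, where frame = position % 3."""
    hits = []
    pos = DNA_seq.find("ATG")
    while pos != -1:
        hits.append((pos, pos % 3))
        pos = DNA_seq.find("ATG", pos + 1)
    return hits


print(start_codon_frames("TTTATGGGTATGAGATGATAAGTC"))
# [(3, 0), (9, 0), (14, 2)]
```

Here the first two start codons share frame 0, so the second ORF is nested inside the first, while the third start codon lies in frame 2 and gives an ORF in a different reading frame.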



# How long did this take? 

With a new course and new assignments, I want to be conscientious with how much time this course takes. Please let me know how long this took, so I can adjust future homeworks if needed.

# References

If you referenced any external sources for completing this homework, please list them below. (Just the links are fine.)

# Submitting your homework

Please make sure to state what each group member contributed and have each group member "sign off" that they are satisfied with the final submission of this homework.

You will submit this homework (and all subsequent homeworks) via GitHub. Unless you have an approved extension or opt to submit the homework late (with a 10% deduction per day), your homework will be graded based on what is submitted on GitHub at the time of the deadline. So don't forget to push! 

Group Member Contributions:
Shawn completed problem 1 and helped with pseudocode.
Jody completed problem 2 and Maggie helped her debug it.
Maggie completed problem 3.

References: We went to office hours.

Time that this took:
Question 1 took about 40 mins. Question 2 took around 2 hours mainly due to debugging. Question 3 took around 2 hours also due to debugging.
Signed off by:
Maggie Dixon
Shawn Muhr
Jody Romero