# Analyzing Biological Sequences using For Loops and If Statements

This section shows how to analyze biological sequences using for loops, dictionaries, and if statements in python.

#### Prerequisites for this section
* Have Anaconda python installed
* Be familiar with how to run python
* Be familiar with the Central Dogma of Molecular Biology, and how to represent DNA, RNA, and protein sequences using python strings.

* Be familiar with how to use python like a calculator including:
   * assigning variables
   * order of operations
   * multiplication and division
   * basic string operations
   
#### In this section you will learn
* How to use the string count method to measure the composition of a biological sequence
* How to calculate the GC content of a DNA sequence
* How to use 'for loops' to simplify repetitive code
* The principle of DRY (do not repeat yourself) coding
* How to use 'if statements' to let your code handle different conditions


## Using python strings to represent biological sequences

## Using for loops and dictionaries to analyze biological sequences

#### Application: Calculate GC content

The [GC content](https://en.wikipedia.org/wiki/GC-content#Applications) of a DNA or RNA sequence is defined as the percentage of its nucleotides that are G or C. Let's use what we know so far to calculate the GC content of an example DNA sequence.

**Goal**: given a DNA sequence represented as a string, calculate the percentage GC content of that sequence.

**Approach**: we will use the **count** method of python string objects to count how many 'G' and 'C' characters occur in our sequence, use len to get the total number of nucleotides, and then do some arithmetic to convert those counts into GC content.

In [42]:
DNA_sequence = """CTCAGTCAGGCGCTCAGCTCCGTTTCGGTTTCACTTCCGGTGGAGGGCCGCCTCTGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGTGACGGAGGCGCCGCTGCCAGGGGGCGTGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCTGGGCCTCGAGCGCCCGCAGCCCACCTCTCGGGGGCGGGCTCCCGGCGCTAGCAGGGCTGAAGAGAAGATGGAGGAGCTGGTGGTGGAAGTGCGGGGCTCCAATGGCGCTTTCTACAAGGTACTTGGCTCTAGGGCAGGCCCCATCTTCGCCCTTCC"""

total_nucleotides = len(DNA_sequence)

g_count = DNA_sequence.count("G")
c_count = DNA_sequence.count("C")

gc_content = ((g_count + c_count)/total_nucleotides)*100
print(f"The GC content of our sequence is: {gc_content}")


The GC content of our sequence is: 76.28571428571429


So we've defined a new python string called DNA_sequence, and pasted into it the sequence of a human gene - in this case the *FMR1* gene on the X-chromosome. We've used len to count the total nucleotides in that sequence, and two applications of the string count method to count the G and C nucleotides. Then we do a little division to calculate the fraction of Gs and Cs, and multiply by 100 to make it a percentage.
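To reuse this calculation on other sequences, the same steps could be wrapped in a function. The sketch below is one possible way to do that; the function name `gc_content_percent` and the short test sequence are illustrative choices rather than part of the code above, and it assumes a non-empty, uppercase sequence.

```python
def gc_content_percent(sequence):
    """Return the percentage of nucleotides in `sequence` that are G or C.

    Hypothetical helper for illustration; assumes a non-empty, uppercase string.
    """
    g_count = sequence.count("G")
    c_count = sequence.count("C")
    return (g_count + c_count) / len(sequence) * 100


# A short sequence where the answer is easy to check by hand:
# 4 of the 8 nucleotides are G or C.
test_sequence = "ATGCGCTA"
print(gc_content_percent(test_sequence))  # 50.0
```

Checking the code on a sequence with a known answer, as done here, is a quick way to catch mistakes before running it on real data.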



#### Application: Calculate Amino Acid Composition

Now let's do a similar task but for the amino acid composition of the protein produced by the FMR1 gene. There is no direct analog of GC content for amino acids, so we can instead calculate the percentage of each amino acid in the sequence. 

In [43]:
protein_sequence ="""MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDINESDEVEVYSRANEKEPCCWWLAKVRMIKGEFYVIEYAACDATYNEIVTIERLRSVNPNKPATKDTFHKIKLDVPEDLRQMCAKEAAHKDFKKAVGAFSVTYDPENYQLVILSINEVTSKRAHMLIDMHFRSLRTKLSLIMRNEEASKQLESSRQLASRFHEQFIVREDLMGLAIGTHGANIQQARKVPGVTAIDLDEDTCTFHIYGEDQDAVKKARSFLEFAEDVIQVPRNLVGKVIGKNGKLIQEIVDKSGVVRVRIEAENEKNVPQEEGMVPFVFVGTKDSIANATVLLDYHLNYLKEVDQLRLERLQIDEQLRQIGASSRPPPNRTDKEKSYVTDDGQGMGRGSRPYRNRGHGRRGPGYTSGTNSEASNASETESDHRDELSDWSLAPTEEERESFLRRGDGRRRGGGGRGQGGRGRGGGFKGNDDHSRTDNRPRNPREAKGRTTDGSLQIPPVKVVGCARVKIVTRRKRSQTAWMVSNHS"""
print(protein_sequence)

#Let's check which amino acids are in the sequence
all_amino_acids = set(protein_sequence)
print("The following amino acids are in our sequence:",all_amino_acids)
print("Total unique amino acids:",len(all_amino_acids))

alanine_count = protein_sequence.count("A")
print("% Alanine:",alanine_count/len(protein_sequence))
arginine_count = protein_sequence.count("R")
print("% Arginine:",arginine_count/len(protein_sequence))
asparagine_count = protein_sequence.count("N")
print("% Asparagine:",asparagine_count/len(protein_sequence))
aspartic_acid_count = protein_sequence.count("D")
print("% Aspartic acid:",aspartic_acid_count/len(protein_sequence))

#It's getting pretty boring to type these!

MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDINESDEVEVYSRANEKEPCCWWLAKVRMIKGEFYVIEYAACDATYNEIVTIERLRSVNPNKPATKDTFHKIKLDVPEDLRQMCAKEAAHKDFKKAVGAFSVTYDPENYQLVILSINEVTSKRAHMLIDMHFRSLRTKLSLIMRNEEASKQLESSRQLASRFHEQFIVREDLMGLAIGTHGANIQQARKVPGVTAIDLDEDTCTFHIYGEDQDAVKKARSFLEFAEDVIQVPRNLVGKVIGKNGKLIQEIVDKSGVVRVRIEAENEKNVPQEEGMVPFVFVGTKDSIANATVLLDYHLNYLKEVDQLRLERLQIDEQLRQIGASSRPPPNRTDKEKSYVTDDGQGMGRGSRPYRNRGHGRRGPGYTSGTNSEASNASETESDHRDELSDWSLAPTEEERESFLRRGDGRRRGGGGRGQGGRGRGGGFKGNDDHSRTDNRPRNPREAKGRTTDGSLQIPPVKVVGCARVKIVTRRKRSQTAWMVSNHS
The following amino acids are in our sequence: {'D', 'N', 'G', 'W', 'P', 'E', 'M', 'S', 'F', 'H', 'R', 'Q', 'K', 'C', 'A', 'Y', 'V', 'T', 'I', 'L'}
Total unique amino acids: 20
% Alanine: 0.0625
% Arginine: 0.09191176470588236
% Asparagine: 0.04963235294117647
% Aspartic acid: 0.0661764705882353


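The above approach 'works' in the sense that we could, with enough labor, get an answer to our question. (Note that although the labels say %, the values printed above are fractions; each would need to be multiplied by 100 to give a true percentage.) However, if you try retyping the code above and extending it for all the amino acids, you will quickly notice several things: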

* It is very boring to write similar code over and over again
* It is easy to make mistakes that could give the wrong answer. In particular, when you copy the print statement, it is easy to forget to update which count you are dividing by the total length of the sequence. 
* If you want to change the code, you will now have to go back through all 20 print statements and all 20 count commands and change each - without making any mistakes.
* It would be easy to accidentally leave off one of the amino acids.

There must be a better way to do this - after all, most of us didn't get into bioinformatics because we enjoyed repetitive tasks. Our observations while trying to calculate amino acid composition highlight a general principle of coding that you can use in many places in your work. That principle is known as DRY coding.

>#### Coding Style Sidebar: DRY coding
>
>The DRY in DRY coding stands for **Don't Repeat Yourself**. It is a shorthand way of saying that you should find a way to write your code that avoids repetitive statements. It is also a shorthand way of summing up many of our observations above. If code is repetitive, it has many disadvantages:
>* It makes code look more complex (or at least much longer) than it should be
>* It makes code more prone to errors
>* It makes code harder to change and maintain
>
>So based on the principles of DRY coding, we now know that we should tell python once what we want it to do. That is, somehow we want to tell python what it would be easy to tell a friend: for each of the amino acids in our sequence, we want to get a count and convert that count to a percentage.
>

Python for loops can help us to automate this task and do it in just a few lines of code. They can also help with many other repetitive tasks in python.

-----------------------------------------------------------------------------------------------------------------
#### Python: For Loops

If we were leaving a detailed note telling a friend how to solve this problem, we might say something like:

>   For each amino acid in our amino acid sequence, I want you to:
>      * count that amino acid in the sequence
>      * calculate a percentage for that amino acid
>      * write down what you got.
>   
>   Thanks!
   
A for loop in python is actually not too dissimilar from the above note. I'll show how it looks and then we will consider the details of how it works.

In [44]:
amino_acids = set(protein_sequence)
sequence_length = len(protein_sequence)

for amino_acid in amino_acids:
    amino_acid_count = protein_sequence.count(amino_acid)
    amino_acid_percent = (amino_acid_count/sequence_length) * 100
    print(amino_acid,":",round(amino_acid_percent,1))
    
print("Done!")

D : 6.6
N : 5.0
G : 7.9
W : 0.9
P : 4.4
E : 8.5
M : 1.8
S : 6.2
F : 3.5
H : 2.6
R : 9.2
Q : 3.9
K : 6.2
C : 1.1
A : 6.2
Y : 2.6
V : 8.1
T : 4.6
I : 5.0
L : 5.7
Done!


This looks much better. We now calculate amino acid percentages for all the amino acids in only 7 lines of code (including the blank line before the for loop to make the code easier to read). Moreover, we didn't have to type out 40 lines of very repetitive calculations.

So how does this for loop work? Here's what's happening when python executes the loop:

1. Get the first item of the iterable object (here the amino acid 'D' from amino_acids).
2. Assign that item to the loop variable (in our case, set amino_acid = 'D' for this pass through the loop).
3. Run all the *indented* code after the line defining the for loop (this is all considered part of the for loop).
4. Get the next item from the iterable object (here the amino acid 'N' from amino_acids), and repeat from step 2.
5. When the iterable object runs out of items, stop and continue with the unindented code after the for loop (here, print("Done!"), which prints Done! to the screen).

The first line of the loop is 'for amino_acid in amino_acids:'. In the context of our code, amino_acids is the set of all the amino acid letters in our sequence. For loops can loop over any *iterable* object. That means basically any python object that holds other objects, including strings, lists, sets, dictionaries, etc. Python will take our amino_acids object, and pick out the first item in it. In the run shown above that's a 'D' (sets are unordered, so the order in which items are visited can differ from run to run). It will then give that item the name amino_acid (effectively setting amino_acid = 'D'). Next it will execute all the code that is indented after the for loop for that amino acid. Then it will move on to the next item, set amino_acid = 'N', and repeat the process until it runs out of amino acids.

If you want to test out how for loops work for yourself, you can modify the above code to add print statements that report the values of variables throughout the process. For example, right after the first line of the for loop, you might add the code: print("Current amino acid = ",amino_acid). This would then report to you what the value of the variable amino_acid is each time through the loop.
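As a minimal sketch of that idea, the loop below runs over a short string (`short_peptide` is an illustrative name; its value is the first four residues of the protein sequence above) and prints the loop variable on each pass. Because a string is iterated from left to right, the order of the output is predictable, unlike with a set.

```python
short_peptide = "MEEL"  # first four residues of the protein sequence above

for amino_acid in short_peptide:
    # This indented line runs once for each letter in the string
    print("Current amino acid =", amino_acid)

print("Done!")  # unindented, so it runs once, after the loop has finished
```

This prints one line each for M, E, E, and L, followed by Done!.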

> **Terminology Sidenote: Iterations**
> When programmers talk about code repeating a task, they will often call each time the task repeats an *iteration*. So we might say, 'for each *iteration* of the for loop, the code does X, Y, and Z'.
> If it is important to keep track of how many times you've repeated a task with a number, an integer named
> i is often used to do so. In that case i stands for 'iteration'.


#### Storing data from a loop in a dictionary

Our version of the amino acid profiler using a for loop is much improved, but still isn't wholly satisfactory in one important respect: the data that we generate is printed to screen, but isn't stored in an accessible way.
Dictionaries are python objects that associate keys with values - just like a physical dictionary associates words with their definitions. In our amino acid composition profiler, we could use a dictionary to store the counts of each amino acid. In that case strings representing the amino acids would be the keys, and the counts of those amino acids would be the values.
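A minimal sketch of the basic dictionary operations, using made-up counts (the name `example_counts` and its values are illustrative only):

```python
# Create an empty dictionary, then add key/value pairs one at a time
example_counts = {}
example_counts["A"] = 2   # the key "A" now maps to the value 2
example_counts["G"] = 4

# Look up the value stored under a key
print(example_counts["G"])    # 4

# Check whether a key is present
print("C" in example_counts)  # False

# Loop over the keys and look up each value
for amino_acid in example_counts:
    print(amino_acid, example_counts[amino_acid])
```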

> For a very brief introduction to dictionaries in python see the [Tour of Python Data Types](../04_exploring_python/exploring_python_data_types.ipynb)

The code below revises our amino acid profiler to use a dictionary to hold results. Therefore, it is no longer necessary to print each amino acid as it is calculated. I've added some extra comments to the code that would not normally be necessary highlighting what each new line does:

In [67]:
amino_acids = set(protein_sequence)
sequence_length = len(protein_sequence)

#create an empty dictionary
#to hold our results
amino_acid_profile = {} 

for amino_acid in amino_acids:
    amino_acid_count = protein_sequence.count(amino_acid)
    amino_acid_percent = (amino_acid_count/sequence_length) * 100
    
    #add a new entry to our amino_acid_profile 
    #using the amino acid (e.g. "D") as the key, 
    #and its count as the value (e.g. 36)
    #the general syntax to set a dictionary key equal
    #to a value is my_dictionary[key] = value
    amino_acid_profile[amino_acid] = amino_acid_count

print("Counts for each amino acid:",amino_acid_profile)

Counts for each amino acid: {'D': 36, 'N': 27, 'G': 43, 'W': 5, 'P': 24, 'E': 46, 'M': 10, 'S': 34, 'F': 19, 'H': 14, 'R': 50, 'Q': 21, 'K': 34, 'C': 6, 'A': 34, 'Y': 14, 'V': 44, 'T': 25, 'I': 27, 'L': 31}
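
With the counts stored in a dictionary, later code can reuse them instead of recounting. The sketch below uses a small hand-written dictionary (`example_profile`, with illustrative values) in place of amino_acid_profile so that it runs on its own; it converts the counts to percentages and finds the most common entry.

```python
# Illustrative counts for a tiny sequence; stands in for amino_acid_profile
example_profile = {"A": 1, "G": 2, "K": 1}
total = sum(example_profile.values())  # 4

# Build a second dictionary that holds percentages instead of counts
percent_profile = {}
for amino_acid in example_profile:
    percent_profile[amino_acid] = example_profile[amino_acid] / total * 100

print(percent_profile)  # {'A': 25.0, 'G': 50.0, 'K': 25.0}

# Find the key with the largest count
most_common = max(example_profile, key=example_profile.get)
print("Most common:", most_common)  # Most common: G
```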


-----------------------------------------------------------------------------------------------------------------
**Exercise: use the approaches outlined above to write code to calculate the frequency of each nucleotide in an RNA sequence.**
Keep these things in mind:
* Be sure that the code can be easily run on new sequences. 
* Use DRY coding methods and a for loop to avoid lots of repeated code
* Be sure to check your code using a sequence where you know the right answer. For example, on the sequence:
  "AAUUGGGG", your code should return frequencies A: 25%, U:25%, G:50%, C:0%.

-----------------------------------------------------------------------------------------------------------------
Once you have jotted down your answers, you can check them in the [answer key](./biological_sequences_exercise_answers.ipynb)

### If statements in python

Programs often need to make decisions based on data. For example, when comparing two sequences, we might want to do one thing if they are very similar in sequence, and something else if they are different. If statements are a key way that python scripts can make decisions based on conditions. 

Here is some code that uses an if statement:


In [4]:
from random import choice
print("Will you win the coin flip?")

coin_flip = choice(["Heads","Tails"])

if coin_flip == "Heads":
    print("You win!")
else:
    print("You lose!")

Will you win the coin flip?
You lose!


Here are some key features of the if statement:
* it starts with the keyword 'if' 
* after the 'if' we place any expression that evaluates to True or False (examples of such expressions are listed after these bullets)
* we end the first line of the if statement with a colon
* the code that is indented after the if statement (here `print("You win!")`) will execute
  only if the condition evaluates to True
* if the condition does not evaluate to True, the code indented below the else statement will execute instead. You can think about an if/else pair like this as a fork or decision point in a flowchart.
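
The same pattern applies to sequence data. As a hedged sketch, the code below labels a short made-up sequence as GC-rich or not; the 50% cutoff and the variable names are arbitrary choices for illustration, not a biological standard.

```python
example_sequence = "GGCCGATC"  # made-up sequence for illustration
gc_count = example_sequence.count("G") + example_sequence.count("C")
gc_percent = gc_count / len(example_sequence) * 100

# Arbitrary cutoff, chosen only to demonstrate the if/else fork
if gc_percent > 50:
    label = "GC-rich"
else:
    label = "not GC-rich"

print(f"{gc_percent}% GC: {label}")  # 75.0% GC: GC-rich
```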

We can design new if statements by using different expressions that evaluate to True or False in python.
   - tests for equality **(x == y)**. Note that two equals signs must be used to distinguish this from when you *set* x equal to y. If you use just one (x = y) in an if statement python will send you a SyntaxError as a reminder :).
   - inequalities, including less than **(x < y)**, less than or equal to **(x <= y)**, greater than **(x > y)**, greater than or equal to **(x >= y)**, or not equal **(x != y)**.
   - tests for whether a sequence includes a specific item **(x in y)**
   - tests for whether two variables refer to exactly the same thing (not just equal things), for example **(x is y)**.
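
A short sketch of these expressions in action, using made-up variables. The `is` comparison is shown with lists rather than strings, because two separately built lists are reliably distinct objects even when their contents are equal.

```python
x = "ATG"
y = "ATG"
sequence = "CCATGCC"

print(x == y)                  # True: the two strings have equal contents
print(len(x) < len(sequence))  # True: 3 is less than 7
print(x != sequence)           # True: the strings differ
print(x in sequence)           # True: "ATG" occurs inside "CCATGCC"
print("U" in sequence)         # False: there is no "U" in the sequence

bases = ["A", "T"]
same_list = bases              # a second name for the very same list
equal_copy = ["A", "T"]        # a different list with equal contents

print(same_list is bases)      # True: both names refer to one object
print(equal_copy == bases)     # True: the contents are equal
print(equal_copy is bases)     # False: they are two separate objects
```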

Interestingly, in python many normal objects also have truth values (this is called truthiness). We can also use these in if statements. Here are some examples:
   - 0 is treated as False, but any non-zero integer is treated as True
   - an empty list, dict, set, or string is treated as False, but one with contents is treated as True.
   
You can figure out the truthiness of an object by calling the bool() function on it.
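
For instance, the following sketch calls bool() on a few values and then uses the truthiness of a string directly in an if statement. The variable names are illustrative.

```python
print(bool(0))       # False
print(bool(42))      # True
print(bool(""))      # False: an empty string
print(bool("ATG"))   # True: a string with contents
print(bool([]))      # False: an empty list
print(bool(["A"]))   # True: a list with contents

sequence = ""
if sequence:
    # Runs only when the string contains at least one character
    message = "Sequence has at least one letter"
else:
    message = "Sequence is empty"
print(message)       # Sequence is empty
```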

#### Application:  Using a for loop and if statement together to compare sequence identity

A common way to measure the similarity of two sequences is by comparing **sequence identity**. Sequence identity is defined as the percentage of positions in two sequences in which they have exactly the same letter. A related idea of **sequence similarity** counts up the percentage of positions in a sequence that have 'similar' units. What 'similar' means depends on the application, but generally refers to the biochemical properties of the nucleotides or amino acids (e.g. two hydrophobic amino acids might count as 'similar' even if not identical).

Let's write code to calculate the sequence identity of two sequences:

In [8]:
#Find the number of mismatches between two sequences
seq1 = "ATATCGCGGCGCGCTTACGATGCTACGTCGCGGCGGGGTATATTAGCGGGATTCA"
seq2 = "ATATCGCGGCGCCCTTACGATGCTACGTCGCGGCGCGGTATATTAGCGGGATACA"
length_of_shorter_seq = min(len(seq1),len(seq2))

shared = 0
different = 0
for i in range(length_of_shorter_seq):
    if seq1[i] == seq2[i]:
        shared += 1
    else:
        different +=1

sequence_identity = shared/length_of_shorter_seq
print(f"There were {shared} shared nucleotides and {different} different nucleotides among the sequences")
print(f"Sequence identity: {round(sequence_identity,2)}")       

There were 52 shared nucleotides and 3 different nucleotides among the sequences
Sequence identity: 0.95


Some notes about the above code:
* the range() function generates the integers from 0 up to (but not including) a specified number, one by one. It is very commonly used in python to run a for loop a given number of times. Here we use it to run the for loop over each position in the shorter of the two sequences.
* sequence_identity is computed here as a fraction (0.95); multiply it by 100 to express it as a percentage, as in the definition above.
* if using an if statement inside a for loop, the code inside the if statement is indented an *additional* level, as shown above.
* notice that *before* we start the loop, we have already prepared two variables, shared and different, to hold the counts of shared and different nucleotides we generate inside the loop. It is very common to define some variable to hold your data before you start a loop. Notice that if these variables were instead defined *inside* the loop, they would reset the count every time the loop ran, resulting in the wrong answer.
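
To make the points about range() and counter variables concrete, the sketch below first shows what range() produces, then contrasts a counter defined before the loop with one that is mistakenly reset inside it. The short sequences and variable names are made up for illustration.

```python
print(list(range(4)))  # [0, 1, 2, 3]: starts at 0 and stops before 4

seq_a = "ACGT"
seq_b = "ACGA"

# Correct: the counter is created once, before the loop starts
shared = 0
for i in range(len(seq_a)):
    if seq_a[i] == seq_b[i]:
        shared += 1
print(shared)  # 3

# Mistake: the counter is reset to 0 on every pass through the loop,
# so only the result from the final position survives
for i in range(len(seq_a)):
    shared_reset = 0
    if seq_a[i] == seq_b[i]:
        shared_reset += 1
print(shared_reset)  # 0, because the last position (T vs A) does not match
```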