# Week 2 - Some tools before building an RNA Translator


Over the next couple of sessions, we're going to program an RNA translator.  
Here are the four key steps we need to complete:  
- Step 1: Load the codon table into a list of (codon,amino-acid) pairs.  
- Step 2: Split an RNA sequence into a list of its codons.
- Step 3: Given a single codon, loop over the codon table to find the amino-acid it codes for.  
- Step 4: Given a list of codons, create a list of the amino-acid each codon codes for.

Today we introduce most of the tools needed for this, then work on step 1. The main tools we'll cover are:
- Indexing and extracting part of a string
- Opening, reading, and manipulating a text file
- Lists
- Loops

### Indexing

You can access any letter of a string by specifying its position in the string (its "index") between square brackets. The first letter has index 0. For example:

In [32]:
word = 'Honors College'
print(word[0])

H


Here are the indices of all the letters in 'Honors College':

    H   o   n   o   r   s       C   o   l   l   e   g   e    #String
    0   1   2   3   4   5   6   7   8   9   10  11  12  13   #Positive indexing. Starts at 0.

If you have a long string and you need to select an item towards the end, you can count backwards from the end of the string, starting at the index number -1.

     H   o   n   o   r   s       C   o   l   l   e   g   e   #String
    -14 -13 -12 -11 -10 -9  -8  -7  -6  -5  -4  -3  -2  -1   #Negative indexing. Starts at -1.

By referencing an index number in square brackets, we can isolate a single character of a string.

In [36]:
# Positive and negative indices are two equivalent ways of obtaining the same information.
# Use whichever one is easiest in your situation.
# Here are the two ways to get the 's' in 'Honors College':
word = 'Honors College' 
print(word[5])
print(word[-9])


s
s
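
To make the link between the two numbering schemes concrete, here is a small sketch. It relies only on `len`, which we used last week; the variable names `n`, `last_positive`, and `last_negative` are made up for this illustration.

```python
word = 'Honors College'
n = len(word)   # 14 characters, counting the space

# A negative index -k points at the same character as the positive index n - k.
print(word[-9])      # the 's'
print(word[n - 9])   # the same 's', reached through positive index 5

# The last character sits at index n - 1, or equivalently at index -1.
last_positive = word[n - 1]
last_negative = word[-1]
print(last_positive, last_negative)
```

Both numbering schemes reach the same characters, so pick whichever requires less counting.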


### Extracting part of a string (aka "slicing")

What do we do if we have a long string, but we only want a short portion of it? This is known as taking a substring, and it has its own notation in Python. To get a substring, we follow the variable name with a pair of square brackets which enclose a start and stop position, separated by a colon. Again, this is probably easier to visualize with a couple of examples, so let's take a section of a DNA sequence from an earlier task and extract parts of it:

In [5]:
dna = "ACTGATTGG"

# Print positions 2 to 5.
# Position 5 is not included; we get positions 2, 3, 4.
print(dna[2:5])

# Positions start at zero, not one.
print(dna[0:6])

# If we leave out the last number, the slice goes to the end of the string.
print(dna[2:])

# If we omit the first number, the slice starts at the beginning of the string (position 0).
print(dna[:5])

# Again, if we omit the second number, the slice ends at the end of the string.
print(dna[5:])

# We can also slice with negative numbers. This is the last two characters of the string.
print(dna[-2:])

# A sliced string is still a string.
# We can do with it anything we could do with the original string.
# For example, concatenate it with another string:
print(dna[:5]+dna[5:])
# This is positions 0 to 5 (not included), then 5 to the end.
# In other words, it's the entire string.

TGA
ACTGAT
TGATTGG
ACTGA
TTGG
GG
ACTGATTGG


To recap, `dna[i1:i2]` lets you access the part of the string between indices i1 and i2. i1 and i2 can be positive or negative; if they're negative, they're counted from the end. If you omit i1, it's assumed you want to start at the beginning of the string. If you omit i2, it's assumed you want to end at the end of the string.
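
The examples so far only used a negative number for the start of the slice. As a quick sketch of the other cases mentioned in the recap, here are slices where the stop index is negative too, plus one behavior of slices that is easy to miss:

```python
dna = "ACTGATTGG"

# Both indices negative: start 5 from the end, stop just before 2 from the end.
print(dna[-5:-2])   # ATT

# Positive start, negative stop: drop the first two and the last two characters.
print(dna[2:-2])    # TGATT

# Slices are forgiving: a stop index past the end of the string is treated as the end.
print(dna[5:100])   # TTGG

# Indexing a single character is not forgiving: dna[100] would raise an IndexError.
```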

There's one more thing we can do with this tool: instead of every character between index i1 and index i2, we can request every n-th character (e.g., every other character, every third character, etc) with `dna[i1:i2:n]`.

In [6]:
# Here is every other character starting at index 1 and ending just before index 5.
print(dna[1:5:2])

# Here is every other character starting at the beginning and ending at the end.
print(dna[::2])

# The number after the two colons is called the "increment". It tells python how many 
# indices to move forward to get the next character.

# If the increment is negative, the string is read backwards. 
# Here is the entire string backwards.
print(dna[::-1])

CG
ATATG
GGTTAGTCA


It takes a little bit of practice to be fully comfortable with slicing. A good way to practice is to make up a slicing operation, say `dna[3::2]` or `dna[1:4:-1]`, but instead of just executing it, first try to predict what the outcome will be. Then execute it and compare the result with your prediction. If there's a discrepancy, take the time to understand where it comes from. Rinse and repeat until you feel comfortable with slicing.

<div class="alert alert-block alert-danger">
<b>Task 1:</b>
Extract and print the first codon from the variable "dna", then the second one, then the third.
</div>

### Opening, reading, and manipulating a text file
modified from https://pythonforbiologists.com

The data that we as biologists work with is usually stored in files, so we need a way to get the data out of files and into our programs (and vice versa). Usually real life data will be much longer than what we have in the examples.

In programming, when we talk about text files, we are not necessarily talking about something that is human readable. Rather, we are talking about a file that contains characters and lines – something that you could open and view in a text editor, regardless of whether you could actually make sense of the file or not. Examples of text files which you might encounter include:

- FASTA files of DNA or protein sequences
- files containing output from command-line programs (e.g. BLAST)
- FASTQ files containing DNA sequencing reads
- HTML files

and of course, Python code itself.

In contrast, most files that you encounter day-to-day will be binary files – ones which are not made up of characters and lines, but of bytes. Examples include:

- image files (JPEGs and PNGs)
- audio files
- video files
- compressed files (e.g. ZIP files) 

#### Using open to read a file

In Python, as in the physical world, we have to open a file before we can read what's inside it. The Python function that carries out the job of opening a file is very sensibly called open(). It takes one argument – a string which contains the name of the file – and returns a file object:

In [7]:
my_file = open('dna_sequence.txt')

A file object is a little more complicated than the string and number types that we have seen. With strings and numbers it was easy to understand what they represented: a single bit of text, or a single number. A file object, in contrast, represents something a bit less tangible: a file on your computer's hard drive.

The way that we use file objects is a bit different to strings and numbers as well. As you saw in the week 1 class, most of the time when we want to use a variable containing a string or number we just use the variable name:

In [8]:
#Week1 example with a string
my_dna = "ACTGATCGATTACTTTTTTTTTTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
print(length)

54


But if we try to view our file object in the same way we just get Python's representation of it, and not the content of the file:

In [9]:
my_file

<_io.TextIOWrapper name='dna_sequence.txt' mode='r' encoding='UTF-8'>

This answer is not useful. When we're working with file objects most of our interaction will be through methods. This style of programming will seem unusual at first, but as we'll see in this lesson, the file type has a well thought-out set of methods which let us do lots of useful things.

The first thing we need to be able to do is to read the contents of the file. The file type has a read() method which does this. It doesn't take any arguments, and the return value is a string, which can be stored in a variable. Once we've read the file contents into a variable, we can treat them just like any other string; for example, we can print them.

#### Read the contents of a file.

Once the file is open you can read its entire content with `read()`.

In [10]:
my_file = open('dna_sequence.txt')
file_contents = my_file.read()
print(file_contents)

ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA



This is how we get information out of a file and into our Python program. The text that ends up in the `file_contents` variable is exactly the same as the text in the file. Try editing the `dna_sequence.txt` file in a text editor (Jupyter has a built-in text editor that you can use if you like) and check that when you re-run the bit of code above you see the new text.
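
The cells in this lesson open files but never close them, which is fine for short exercises. As a side note, a common Python pattern is to open the file inside a `with` block, which closes it automatically. The sketch below is not part of the course files: it first writes a small scratch file (the name `example_sequence.txt` and its content are made up for this illustration) in the system's temporary folder, then reads it back.

```python
import os
import tempfile

# Build a path to a scratch file in the system's temporary folder.
# 'example_sequence.txt' is a made-up name, not one of the course files.
path = os.path.join(tempfile.gettempdir(), 'example_sequence.txt')

# Mode 'w' opens the file for writing (and creates it if needed).
with open(path, 'w') as out_file:
    out_file.write('ATGTCTTTA\n')

# With no mode given, open() opens the file for reading, as in the cells above.
with open(path) as in_file:
    contents = in_file.read()

print(contents)         # the sequence, followed by a blank line
print(in_file.closed)   # True: the file was closed when the "with" block ended
```

Everything indented under `with` can use the file object; once the indented block ends, the file is closed for us.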

In [11]:
#Just to compare and remember:

my_sequence_name = "my_dna.txt" # this is a string, and it stores the name of a file on disk.
my_file = open("dna_sequence.txt") # this is a file object, and it represents the file itself.
my_file_contents = my_file.read() # this is a string, and it stores the text that is in the file.


In [12]:
my_sequence_name

'my_dna.txt'

In [13]:
my_file

<_io.TextIOWrapper name='dna_sequence.txt' mode='r' encoding='UTF-8'>

In [14]:
my_file_contents

'ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA\n'

The `\n` at the end is a *newline character*. It indicates that the line is over and any additional text will belong to the next line. When we use `print`, the `\n` is not shown as `\n` but as an actual new line. Since there's nothing on this new line, it's blank.

In [15]:
print(my_file_contents)

ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA



`\n` is a so-called *special character*. It's rendered differently from regular characters (e.g., `a`), but for the purpose of slicing it's just a character. In particular, we can get rid of it by grabbing every character up to, but not including, the last one.

In [16]:
# This is the dna sequence without the newline character at the end.
my_file_contents[:-1]

'ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA'

In [17]:
# If we use print, we get the same sequence but no extra blank line below.
print(my_file_contents[:-1])

ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA


This will be important in a moment because if we load a dna sequence with a `\n` at the end and use `len` to get its length, what we really get is the number of characters, i.e., the actual length of the dna sequence plus 1 for the `\n` character.

In [18]:
print(my_file_contents)
print(len(my_file_contents))

print(my_file_contents[:-1])
print(len(my_file_contents[:-1]))

ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA

44
ATGTCTTTAATGTCTTTAATGTCTTTAATGTCTTTAAAAAAAA
43
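
Slicing with `[:-1]` removes the last character whatever it is, so it assumes the text really ends with a newline. As a hedged alternative, strings have an `rstrip` method that removes the given trailing characters only if they are present. The two short sequences below are made up for this comparison; they are not the content of `dna_sequence.txt`.

```python
with_newline = 'ATGTCTTTAATG\n'
without_newline = 'ATGTCTTTAATG'

# [:-1] always drops the last character, newline or not.
print(with_newline[:-1])      # ATGTCTTTAATG
print(without_newline[:-1])   # ATGTCTTTAAT  (we lost the final G)

# rstrip("\n") only removes newline characters at the end, if there are any.
print(with_newline.rstrip("\n"))      # ATGTCTTTAATG
print(without_newline.rstrip("\n"))   # ATGTCTTTAATG
```

In this lesson we stick with `[:-1]` because it reuses the slicing we just learned, but keep in mind that it is only safe when the last character is known to be `\n`.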


### Using information within a file 

<div class="alert alert-block alert-danger">
<b>Task 2:</b>
Compute the GC content of the dna sequence in `dna_sequence.txt`. You're not allowed to copy and paste the content of the file; you have to have python open and read the file as explained above.  
*Hint:* Once you've read the file you can use last week's task to compute the GC content.
</div>

### Lists

A very common situation in biological research is to have a large collection of data that all need to be processed in the same way. To handle this, we need a few more tools.
The first tool we will introduce is the list.

To make a new list, we put several strings or numbers inside square brackets, separated by commas:

In [None]:
# Here is a list of strings.
Animals = ["Human", "Dog", "Fish", "Coral"] 
# Here is a list of numbers (made up GC content values, not those of the animals above).
GC_content = [90, 98, 88, 70]

Each individual item in a list is called an element. We can add an element at the end of a list with `append`. This means we can define an empty list and fill it later:

In [24]:
# Here is an empty list.
Animals = []
Animals.append("Human")
Animals.append("Dog")
Animals.append("Fish")
Animals.append("Coral")
print(Animals)

['Human', 'Dog', 'Fish', 'Coral']


We can add two lists. The result is a single list made up of the elements of the first list followed by those of the second list.

In [None]:
Animals1 = ["Human", "Dog"] 
Animals2 = ["Fish", "Coral"] 
print(Animals1 + Animals2)
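
One difference between `+` and `append` is worth seeing once: `+` builds a new list and leaves the original lists alone, while `append` changes the list it is called on. Here is a small sketch; the variable name `combined` and the extra element `"Cat"` are made up for this illustration.

```python
Animals1 = ["Human", "Dog"]
Animals2 = ["Fish", "Coral"]

# Adding two lists creates a new list; Animals1 and Animals2 are unchanged.
combined = Animals1 + Animals2
print(combined)    # ['Human', 'Dog', 'Fish', 'Coral']
print(Animals1)    # ['Human', 'Dog']

# append modifies the list in place.
Animals1.append("Cat")
print(Animals1)    # ['Human', 'Dog', 'Cat']

# The list we built earlier with + is not affected by the later append.
print(combined)    # ['Human', 'Dog', 'Fish', 'Coral']
```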

To get a single element from the list, you can write the variable name followed by the index of the element you want in square brackets. It's the same syntax you would use to access a single character in a string:

In [55]:
Animals = ["Human", "Dog", "Fish", "Coral"] 
GC_content= [90, 98, 88, 70]

print(Animals[0])
last_content = GC_content[3]
print(last_content)

Human
70


Remember that in Python we start counting from zero rather than one, so the first element of a list is always at index zero. If we give a negative number, Python starts counting from the end of the list rather than the beginning – so it's easy to get the last element from a list:

In [56]:
last_animal = Animals[-1]
print(last_animal)

Coral


What if we want to get more than one element from a list? We can give a start and stop position, separated by a colon, to specify a range of elements. This is similar to what we did before when extracting part of a string.

In [58]:
Animals = ["Human", "Dog", "Fish", "Coral"] 
Vertebrates = Animals[0:3] # Stop right before index 3, which leaves out "Coral".
print(Vertebrates)
print(Animals[::2])

['Human', 'Dog', 'Fish']
['Human', 'Fish']


Again, accessing the elements of a list works very much like accessing the characters in a string.
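
To see the parallel side by side, here is a short sketch that applies the same operations to a string and to a list holding the same letters. The built-in `list()` function, which turns a string into a list of its characters, has not been introduced in this lesson; it is used here only to set up the comparison.

```python
codon = "AUG"
letters = list(codon)   # ['A', 'U', 'G']
print(letters)

# Indexing works the same way on both.
print(codon[0], letters[0])     # A A
print(codon[-1], letters[-1])   # G G

# Slicing works the same way too, but a slice of a string is a string
# and a slice of a list is a list.
print(codon[0:2])     # AU
print(letters[0:2])   # ['A', 'U']

# len counts characters in a string and elements in a list.
print(len(codon), len(letters))   # 3 3
```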

### Slicing a list

You can extract part of a list the same way you'd extract part of a string.

Say you have a list L, then `L[i1:i2:n]` gives you every n-th element of the list between indices i1 and i2, not including i2.

In [None]:
Animals = ["Human", "Dog", "Fish", "Coral"] 

# Stop right before index 3, which leaves out "Coral".
Vertebrates = Animals[0:3]
print(Vertebrates)

# Here is a list of a few national parks.
A = ["Yellowstone", "Yosemite", "Grand Canyon", "Everglades"]

# Here we start at index 1 (=second element) and stop right before index 2.
# In other words, only keep index 1 ("Yosemite").
print(A[1:2])
# Note that the result is not just "Yosemite" though. The brackets are still there.
# It's a list with a single element, and that element is "Yosemite"

# Here we start at the beginning (index 0) and stop right before index -2 ("Grand Canyon").
# In other words, we keep the first two parks.
print(A[:-2])

Overall, strings and lists are very similar. In many ways a string is a list of characters. Conversely, you can think of a list as a generalization of strings wherein the elementary building block, instead of being a character, can be just about anything. `Animals` above is a list of strings like `"Human"`. `GC_content` above is a list of integers like `90`. You can even have a list of lists:

In [28]:
list_of_lists = [ [1,2], [3,4] ]
print(list_of_lists)

# The first element of list_of_lists is the "inner list" or "sublist" [1,2].
print(list_of_lists[0])
# To access the 3, we request the first element of the second sublist of list_of_lists.
print(list_of_lists[1][0])

[[1, 2], [3, 4]]
[1, 2]
3


In [30]:
# We can also build a list of lists with append. 
# To append a new sublist, first build the sublist, then append it.
# Let's start with a single sublist.
list_of_lists = [ [1,2] ]
# Now create a second sublist.
new_sublist = [3,4]
# Now append the new sublist to the list of sublists.
list_of_lists.append(new_sublist)
# We get the same list as in the previous cell.
print(list_of_lists)

# We don't really need to create a variable for the new sublist.
# We can just create it on the fly.
list_of_lists = [ [1,2] ]
list_of_lists.append([3,4])
print(list_of_lists)
# The result is the same.

[[1, 2], [3, 4]]
[[1, 2], [3, 4]]
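
The sublists do not have to contain numbers. As a preview of the (codon,amino-acid) pairs from step 1, here is a sketch with two pairs typed in by hand; the variable name `pairs` is made up for this illustration. It also shows that indices can be chained one more time to reach a character inside a string.

```python
# Two pairs in the [codon, amino-acid] layout.
pairs = [['UUU', 'Phe'], ['UCU', 'Ser']]

second_pair = pairs[1]          # the whole second sublist
codon = pairs[1][0]             # first element of the second sublist
amino_acid = pairs[1][1]        # second element of the second sublist
first_letter = pairs[1][1][0]   # first character of that amino-acid name

print(second_pair)    # ['UCU', 'Ser']
print(codon)          # UCU
print(amino_acid)     # Ser
print(first_letter)   # S
```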


### Loops


In [None]:
# Imagine we wanted to take our list of animals and print out each element on a separate line, like this:
Animals = ["Human", "Dog", "Fish", "Coral"]
    
# Human is an animal
# Dog is an animal
# Fish is an animal
# Coral is an animal

# One way to do it would be to just print each element separately:
print(Animals[0] + " is an animal") 
print(Animals[1] + " is an animal") 
print(Animals[2] + " is an animal") 
print(Animals[3] + " is an animal")

This is not a great solution though. What if we didn't know in advance how many animals are in the list? What if there were a thousand animals in the list? We can't realistically write a thousand lines like this.

What we need is a way to say something along the lines of:  
For each element in the list of Animals, print out the element, followed by the words "is an animal".

This is exactly what loops are for. Here is the syntax.

In [71]:
Animals = ["Human", "Dog", "Fish"]
Animals.append("Coral")
print(Animals)

for animal in Animals:
    print(animal + " is an animal")

['Human', 'Dog', 'Fish', 'Coral']
Human is an animal
Dog is an animal
Fish is an animal
Coral is an animal


The print statement is executed as many times as there are elements in the list `Animals`. Each time, the variable `animal` takes on the value of the next element of `Animals`, until all elements have been used. This is a bit different from the variable behavior we've seen before. Usually a variable is defined explicitly (e.g., `animal="Human"`), after which it keeps the same value until we change it explicitly (e.g., `animal="Dog"`). Here the value of `animal` changes every time the loop body is run again (every "iteration" of the loop).

Pay attention to the syntax of the loop. The `:` at the end of the first line is not optional.
Neither are the four spaces before the second line. More on that in a moment.

Here is the computer's "thought process" as it reads the loop:  
<pre>
    animal = Animals[0]                                          \ First iteration
    print(animal + " is an animal")                              / of the loop.
    Was that the last element? No. Move on to the next element.   
    animal = Animals[1]                                          \ Second iteration
    print(animal + " is an animal")                              / of the loop.
    Was that the last element? No. Move on to the next element.   
    animal = Animals[2]                                          \ Third iteration
    print(animal + " is an animal")                              / of the loop.
    Was that the last element? No. Move on to the next element.   
    animal = Animals[3]                                          \ Fourth and last iteration
    print(animal + " is an animal")                              / of the loop.
    Was that the last element? Yes. Stop here.                    
</pre>

Note that the result would be exactly the same if the variable `animal` was named anything else. What matters is that the name used on the "for" line matches the one used below. For example, this does the exact same thing:

In [None]:
for some_ridiculous_name in Animals:
    print(some_ridiculous_name + " is an animal")

A loop can contain multiple instructions:

In [69]:
for animal in Animals:
    print("Is "+animal+" an animal?")
    print("Yes it is.")

Is Human an animal?
Yes it is.
Is Dog an animal?
Yes it is.
Is Fish an animal?
Yes it is.
Is Coral an animal?
Yes it is.


Note how the lines that are inside the loop (the two print statements) start with four white spaces. We say they are indented. That's how python knows they are part of the loop. Being part of the loop means they are executed multiple times, once for each element of `Animals`.

What if we want the program to do something only once, after the loop is over? All we need to do is not indent it. See the last print statement in the example below.

In [70]:
for animal in Animals:
    print("Is "+animal+" an animal?")
    print("Yes it is.")
print("That's it for now.")

Is Human an animal?
Yes it is.
Is Dog an animal?
Yes it is.
Is Fish an animal?
Yes it is.
Is Coral an animal?
Yes it is.
That's it for now.
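
Since a string behaves much like a list of characters, a loop can also go over a string one character at a time. Here is a minimal sketch using the short sequence from the slicing section; the loop variable name `base` is just a choice made for this example.

```python
dna = "ACTGATTGG"

# Each iteration, "base" takes on the value of the next character of the string.
for base in dna:
    print(base)

# This line is not indented, so it runs once, after the loop is over.
# At this point "base" still holds the last character it was given.
print("The last base was " + base)
```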


<div class="alert alert-block alert-danger">
<b>Task 3:</b>
Use a loop to print the first letter of each element of the list `Animals`.
</div>

In [None]:
Animals = ["Human", "Dog", "Fish", "Coral"]

### Looping over a file

`for` loops are also useful for reading a file line by line. Instead of a string or a list, what we loop over here is the file handle (the file object returned by `open`).

In [72]:
file_handle = open('rna_codon-table.txt')
for line in file_handle:
    print(line)

UUU,Phe

UCU,Ser

UAU,Tyr

UGU,Cys

UUC,Phe

UCC,Ser

UAC,Tyr

UGC,Cys

UUA,Leu

UCA,Ser

UAA,---

UGA,---

UUG,Leu

UCG,Ser

UAG,---

UGG,Urp

CUU,Leu

CCU,Pro

CAU,His

CGU,Arg

CUC,Leu

CCC,Pro

CAC,His

CGC,Arg

CUA,Leu

CCA,Pro

CAA,Gln

CGA,Arg

CUG,Leu

CCG,Pro

CAG,Gln

CGG,Arg

AUU,Ile

ACU,Thr

AAU,Asn

AGU,Ser

AUC,Ile

ACC,Thr

AAC,Asn

AGC,Ser

AUA,Ile

ACA,Thr

AAA,Lys

AGA,Arg

AUG,Met

ACG,Thr

AAG,Lys

AGG,Arg

GUU,Val

GCU,Ala

GAU,Asp

GGU,Gly

GUC,Val

GCC,Ala

GAC,Asp

GGC,Gly

GUA,Val

GCA,Ala

GAA,Glu

GGA,Gly

GUG,Val

GCG,Ala

GAG,Glu

GGG,Gly



There is something a bit strange here. Why do we get a blank line in between every line of the file? The answer is two-fold. Each line ends with a newline (`\n`) character. That's how the text file tells you the line is over. Then `print` adds its own newline character. Why would it do that? It's so you can write something like this:

In [73]:
print(1)
print(2)

1
2


Getting rid of the blank line in the loop above is easy: just don't print the last character of each line (the newline character):

In [74]:
file_handle = open('rna_codon-table.txt')
for line in file_handle:
    print(line[:-1])

UUU,Phe
UCU,Ser
UAU,Tyr
UGU,Cys
UUC,Phe
UCC,Ser
UAC,Tyr
UGC,Cys
UUA,Leu
UCA,Ser
UAA,---
UGA,---
UUG,Leu
UCG,Ser
UAG,---
UGG,Urp
CUU,Leu
CCU,Pro
CAU,His
CGU,Arg
CUC,Leu
CCC,Pro
CAC,His
CGC,Arg
CUA,Leu
CCA,Pro
CAA,Gln
CGA,Arg
CUG,Leu
CCG,Pro
CAG,Gln
CGG,Arg
AUU,Ile
ACU,Thr
AAU,Asn
AGU,Ser
AUC,Ile
ACC,Thr
AAC,Asn
AGC,Ser
AUA,Ile
ACA,Thr
AAA,Lys
AGA,Arg
AUG,Met
ACG,Thr
AAG,Lys
AGG,Arg
GUU,Val
GCU,Ala
GAU,Asp
GGU,Gly
GUC,Val
GCC,Ala
GAC,Asp
GGC,Gly
GUA,Val
GCA,Ala
GAA,Glu
GGA,Gly
GUG,Val
GCG,Ala
GAG,Glu
GGG,Gly
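
Another way to avoid the blank lines, mentioned here only as a side note, is to leave each line untouched and instead tell `print` not to add its own newline, using its `end` argument. To keep the sketch self-contained, it loops over a short hand-written list of lines (the name `lines` is made up) rather than over the file itself.

```python
# Three lines as they would come out of the file, each ending with a newline.
lines = ["UUU,Phe\n", "UCU,Ser\n", "UAU,Tyr\n"]

for line in lines:
    # end="" replaces the newline that print normally adds with nothing,
    # so only the newline already present in "line" is shown.
    print(line, end="")
```

Note the difference with slicing: here `line` still contains its `\n`, which matters as soon as we want to do something with the line other than print it.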


Soon we'll need to treat the codon and the amino-acid on each line separately. Let's practice by printing only the amino-acids.

<div class="alert alert-block alert-danger">
<b>Task 4:</b>
Loop over `rna_codon-table.txt`, but only print the amino-acid on each line.
</div>

### Using a loop to build a list

Our first step towards an RNA translator will be to create a python object to hold the codon table. Specifically, we're going to build a list of codon-aminoacid pairs:  
`[ ['UUU','Phe'], ['UCU','Ser'], ['UAU','Tyr'], ['UGU','Cys'], ... ]`  
where each pair in the list corresponds to a line of `rna_codon-table.txt`.

The basic idea is to start with an empty list and populate it as we read `rna_codon-table.txt`.
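
Here is that pattern on a list we already know, as a minimal sketch that does not involve the codon table: start with an empty list, then append one new element per iteration. The name `name_lengths` is made up for this example.

```python
Animals = ["Human", "Dog", "Fish", "Coral"]

# Start with an empty list.
name_lengths = []

# For each animal, compute the length of its name and append it to the list.
for animal in Animals:
    name_lengths.append(len(animal))

# Once the loop is over, the list has one entry per animal.
print(name_lengths)   # [5, 3, 4, 5]
```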

Let's start by building a list of the amino-acids in `rna_codon-table.txt`.

<div class="alert alert-block alert-danger">
<b>Task 5:</b>
Create a list of the amino-acids in `rna_codon-table.txt`. First create an empty list. Call it `aacids`. Then, open `rna_codon-table.txt`. Loop over it. For every line of `rna_codon-table.txt`, extract the amino-acid from the line and add it at the end of (in other words, append it to) the `aacids` list. Once the loop is over, print the list you just created.
</div>

## RNA translator, step 1

<div class="alert alert-block alert-danger">
<b>Task 6:</b>
Create an empty list `codon_table`. Open `rna_codon-table.txt`. For every line in `rna_codon-table.txt`, extract the codon and the amino-acid, put them together in a pair (a list with two elements), and add that to `codon_table`.
Once you're done building `codon_table`, print it.
</div>

## Recap

We've learned new tools that are basic to programming and that will allow us to build the RNA translator.

- Indexing and extracting part of a string.
- Opening, reading, and manipulating a text file.
- Lists. Indexing and extracting parts of them (basically the same as for strings).
- Loops, and using them to automate repetitive tasks, a.k.a. what programming is best at.

We've completed the first step of the RNA translator: loading the codon table into a list of (codon,amino-acid) pairs. Next week we'll see why this way of storing the codon table is so convenient in the context of RNA translation.