# Exercise 1: Reading FASTA files

The [FASTA](http://en.wikipedia.org/wiki/FASTA_format) format is a text-based format for nucleotide and protein sequences. A FASTA record begins with a single-line description, marked by a leading ">", followed by one or more lines of sequence data.

Example:

> \>gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]
>
> MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

* Parse the nucleotide sequence in the file `GPD1_seq.fasta` (omit the description line).
* Write a function `compute_nt_composition( sequence )` that returns a dictionary containing the number of occurrences of each base in a given sequence.
* Compute the nucleotide composition of the GPD1 sequence and pickle the result to a file.


In [49]:
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import numpy as np

# Read all lines of the FASTA file
with open('GPD1_seq.fasta', 'r') as data:
    data_lines = data.readlines()

# Short protein sequence used as a second test input
test = 'MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKIIRRRLQLNPTQAFFLLVNQHSMVSVS'

# Skip the description line and join the remaining lines into one sequence
data_list = []
for line in data_lines:
    if line.startswith('>'):
        continue
    data_list.append(line.strip())
data_string = ''.join(data_list)

def compute_nt_composition( sequence ):
    # One counter per distinct character; np.zeros makes the counts floats
    opt = set(sequence)
    zeros = np.zeros(len(opt))
    nt_dict = dict(zip(opt, zeros))

    for c in sequence:
        nt_dict[c] += 1
    return nt_dict

print(compute_nt_composition(data_string))
compute_nt_composition(test)

{'A': 324.0, 'C': 240.0, 'T': 336.0, 'G': 276.0}


{'A': 4.0,
 'C': 2.0,
 'D': 5.0,
 'E': 4.0,
 'F': 5.0,
 'G': 2.0,
 'H': 3.0,
 'I': 6.0,
 'K': 9.0,
 'L': 8.0,
 'M': 4.0,
 'N': 3.0,
 'P': 7.0,
 'Q': 7.0,
 'R': 7.0,
 'S': 7.0,
 'T': 2.0,
 'V': 10.0,
 'Y': 1.0}
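
The last bullet of the exercise, pickling the result, could be sketched as follows. This is a minimal example under stated assumptions, not the reference solution: the output file name `GPD1_nt_composition.pkl` is hypothetical, a toy sequence stands in for the parsed GPD1 sequence, and `collections.Counter` is swapped in for the counting loop so that the counts are integers.

```python
import os
import pickle
import tempfile
from collections import Counter


def compute_nt_composition(sequence):
    """Return a dict mapping each character in `sequence` to its count."""
    return dict(Counter(sequence))


# Toy sequence for illustration; in the exercise this would be the
# sequence parsed from GPD1_seq.fasta
composition = compute_nt_composition("ATGCATTA")

# Hypothetical output file, written to the temporary directory here
out_path = os.path.join(tempfile.gettempdir(), "GPD1_nt_composition.pkl")

# Pickle files must be opened in binary mode
with open(out_path, "wb") as handle:
    pickle.dump(composition, handle)

# Load the file again to check the round trip
with open(out_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == composition)
```

Opening the file with `"wb"` and `"rb"` matters because pickle writes bytes, not text.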

# Exercise 2: Plot a histogram

Take the nucleotide composition of the gene above and plot a histogram of the A, T, G and C frequencies. Label the axes and give the plot a title. Choose whether to display horizontal or vertical bars. Advanced options include changing the color of individual bars, the width of the bars, and the alignment of labels and bars.

In [79]:
%matplotlib inline 
from pylab import *
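
One possible starting point is sketched below. It uses the counts printed in Exercise 1 and the `matplotlib.pyplot` interface rather than the `pylab` star import; the colors, bar width and label texts are arbitrary choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt

# Counts taken from the Exercise 1 output for GPD1
composition = {"A": 324, "C": 240, "T": 336, "G": 276}

bases = ["A", "T", "G", "C"]
total = sum(composition[b] for b in bases)
# Relative frequency of each base
frequencies = [composition[b] / float(total) for b in bases]

fig, ax = plt.subplots()
positions = list(range(len(bases)))
bars = ax.bar(positions, frequencies, width=0.6, align="center",
              color=["green", "red", "orange", "blue"])

# Put the base letters under the bars and label the plot
ax.set_xticks(positions)
ax.set_xticklabels(bases)
ax.set_xlabel("Nucleotide")
ax.set_ylabel("Relative frequency")
ax.set_title("Nucleotide composition of GPD1")
```

For horizontal bars, `ax.barh` takes the same positions and values, with `height` in place of `width`. Outside a notebook, `plt.show()` is needed to display the figure.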

# Exercise 3: Plot a scatterplot

The file `mycoplasma_gene_sequences.csv` contains the genomic sequences of all *Mycoplasma genitalium* genes. The file contains two columns separated by a comma, the `WholeCellModelID` and the `Sequence`.

* Read and parse the file and compute the nucleotide composition for each gene using the `compute_nt_composition( seq )` function that you defined in Exercise 1. Collect the nucleotide compositions for all genes. Then use the `scatter` function to plot A content versus T content for each gene (don't forget to normalize the nucleotide content by gene length).

* Indicate the length of each sequence by the dot size in the scatterplot (hint: the `s` argument of the `scatter` function).

* Plot the scatterplot for each combination of A, G, T and C (use `subplot`).
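
The three steps above could be sketched roughly as follows. The gene IDs and sequences are made up and stand in for the real file, a header row with the two column names is assumed, and `nt_fractions` is a hypothetical helper used here in place of `compute_nt_composition` so the example is self-contained.

```python
import csv
import io
from collections import Counter
from itertools import combinations

import matplotlib.pyplot as plt

# Made-up stand-in for mycoplasma_gene_sequences.csv (header row assumed)
csv_text = """WholeCellModelID,Sequence
gene_1,ATGAAATTTGGC
gene_2,ATGCCCGGGTTTAAATAA
gene_3,ATGATATATATAGCGC
"""


def nt_fractions(seq):
    """Return the fraction of A, C, G and T in `seq`, normalized by its length."""
    counts = Counter(seq.upper())
    return {base: counts[base] / float(len(seq)) for base in "ACGT"}


# For the real data, pass an open file handle instead of io.StringIO
reader = csv.DictReader(io.StringIO(csv_text))
sequences = [row["Sequence"] for row in reader]
fractions = [nt_fractions(seq) for seq in sequences]
lengths = [len(seq) for seq in sequences]

# One panel per unordered pair of bases: 6 pairs in a 2 x 3 grid
pairs = list(combinations("ACGT", 2))
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (x_base, y_base) in zip(axes.flat, pairs):
    ax.scatter([f[x_base] for f in fractions],
               [f[y_base] for f in fractions],
               s=lengths,  # marker area follows gene length
               alpha=0.5)
    ax.set_xlabel(x_base + " content")
    ax.set_ylabel(y_base + " content")
fig.tight_layout()
```

With real gene lengths in the hundreds or thousands, the values passed to `s` will probably need to be scaled down (for example divided by a constant) to keep the dots readable.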

# Exercise 4: Plot the phase space

In the NumPy tutorial yesterday, you examined how populations of predator and prey can theoretically evolve over time (the Lotka-Volterra system). Today, revisit the system and plot the phase space of the two species. In a phase space plot, the two variables are plotted against each other instead of against time.

As a next step, visualize how different starting conditions affect population behavior: plot the trajectories for several initial conditions in the same phase space plot.

In [1]:
import scipy.integrate
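
A minimal sketch of such a phase space plot is shown below. The parameter values, time range and starting conditions are assumptions chosen for illustration, not the ones from the tutorial, and `scipy.integrate.odeint` is only one of several integrators that could be used.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

# Assumed parameters: prey growth, predation rate,
# predator death, predator growth per prey eaten
alpha, beta, gamma, delta = 1.0, 0.1, 1.5, 0.075


def lotka_volterra(state, t):
    """Right-hand side of the Lotka-Volterra equations."""
    prey, predator = state
    dprey = alpha * prey - beta * prey * predator
    dpredator = delta * prey * predator - gamma * predator
    return [dprey, dpredator]


t = np.linspace(0, 15, 1000)
# Hypothetical starting conditions as (prey, predator)
initial_conditions = [(10, 5), (10, 10), (10, 15)]

fig, ax = plt.subplots()
trajectories = []
for x0 in initial_conditions:
    solution = odeint(lotka_volterra, x0, t)
    trajectories.append(solution)
    # Prey on the x axis, predator on the y axis: time is implicit
    ax.plot(solution[:, 0], solution[:, 1], label="start = %s" % (x0,))
ax.set_xlabel("Prey population")
ax.set_ylabel("Predator population")
ax.set_title("Lotka-Volterra phase space")
ax.legend()
```

Each starting condition traces its own closed orbit around the fixed point at (gamma/delta, alpha/beta), which is (20, 10) for these assumed parameters.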