In [None]:
#Counting Codons Lab Oct. 17 Submission
#Nov 5, 2020
#Zeke Van Dehy

# Codon Count Dictionary #

We first met the codon counting problem when we learned to use loops. The goal is to count how many times each codon is used in coding gene sequences. Codon frequencies are the basis of some pattern recognition methods that are used to predict whether a DNA sequence contains a previously unknown gene.

The first thing we need is a *reference* list of all the possible codons, so that we can tell .count() what to count when we get to the codons in an actual sequence.

We did this problem once already to show how loops could nest. You need an outer loop that steps through A,T,G,C to get the first nucleotide, and then an inner loop that does the same thing to get the second nucleotide, and then a further inner loop that gets the third nucleotide.

Create a function called ref_codons that loops through and makes a list of all the possible codons, then returns the list.

In [3]:
def ref_codons():
    codons = []
    for base1 in ["A","T","G","C"]:
        for base2 in ["A","T","G","C"]:
            for base3 in ["A","T","G","C"]:
                codons.append(base1+base2+base3)
    return codons
print(ref_codons())

['AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATT', 'ATG', 'ATC', 'AGA', 'AGT', 'AGG', 'AGC', 'ACA', 'ACT', 'ACG', 'ACC', 'TAA', 'TAT', 'TAG', 'TAC', 'TTA', 'TTT', 'TTG', 'TTC', 'TGA', 'TGT', 'TGG', 'TGC', 'TCA', 'TCT', 'TCG', 'TCC', 'GAA', 'GAT', 'GAG', 'GAC', 'GTA', 'GTT', 'GTG', 'GTC', 'GGA', 'GGT', 'GGG', 'GGC', 'GCA', 'GCT', 'GCG', 'GCC', 'CAA', 'CAT', 'CAG', 'CAC', 'CTA', 'CTT', 'CTG', 'CTC', 'CGA', 'CGT', 'CGG', 'CGC', 'CCA', 'CCT', 'CCG', 'CCC']
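
As a cross-check on the nested loops, the same 64 codons can also be generated with itertools.product from the standard library. This is only a sketch and not part of the required solution; the name ref_codons_product is made up for illustration.

```python
from itertools import product

def ref_codons_product(bases="ATGC"):
    """Return all 3-letter codons over the given bases, in nested-loop order."""
    # product(..., repeat=3) varies the last position fastest,
    # which matches three nested for loops over the same bases
    return ["".join(triplet) for triplet in product(bases, repeat=3)]

codons = ref_codons_product()
print(len(codons))    # 64
print(codons[:4])     # ['AAA', 'AAT', 'AAG', 'AAC']
```

The nested-loop version makes the structure of the problem more visible, so it is probably the better choice while learning loops.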


# Extracting codons from a gene sequence #

If we just straight up used .count() on the sequence to count each of our types of codons, that wouldn't quite get us what we want. We need to count only the codons in the reading frame of the gene.
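
To see why, here is a small sketch with a made-up 9-base sequence. Calling .count() on the string finds a match at any position, while counting in a list of frame 1 codons only sees the codons the gene actually uses.

```python
sequence = "AAATGACCC"   # made-up example; frame 1 codons are AAA, TGA, CCC

# str.count looks at every position, so it finds ATG starting at index 2,
# even though ATG is not one of the frame 1 codons
anywhere = sequence.count("ATG")

# splitting into frame 1 codons first gives the count we actually want
frame1 = [sequence[i:i+3] for i in range(0, len(sequence) - 2, 3)]
in_frame = frame1.count("ATG")

print(anywhere, in_frame)   # 1 0
```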

For this version of the problem, assume that the gene starts where the sequence starts and that your gene codes in reading frame 1.  

To make a list of the reading frame codons, you'll need to use a range with step.  

Reminder: here's how you can use a range with step:

```
for i in range(0,100,5):
    print(i)
```
    
Reminder: here's how you can make a slice based on an index:

```
for i in range(0,len(mystring),5):
    print(mystring[i:i+5])
```
    
Create a function that steps through the sequence and takes a 3-letter slice of the sequence every 3 characters. Append that slice to a list of codons and return the list at the end. Be sure that you end your range before the sequence ends so that you don't try to take a slice that goes outside the length of the string.
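
The range bound matters because a Python slice that runs past the end of a string does not raise an error; it quietly comes back short. A quick sketch with a made-up 7-character string:

```python
mystring = "AAACCCG"   # 7 characters: two full codons plus one leftover base

# slicing past the end does not raise an error, it just returns what is left
print(mystring[6:9])   # G

# stopping the range at len - 2 means every start index has 3 characters available
starts = list(range(0, len(mystring) - 2, 3))
print(starts)                              # [0, 3]
print([mystring[i:i+3] for i in starts])   # ['AAA', 'CCC']
```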

In [4]:
def getCodons(sequence):
    codons = []
    # stop at len - 2 so the last slice is always a full 3-letter codon
    for i in range(0,len(sequence)-2,3):
        codons.append(sequence[i:i+3])
    return codons
print(getCodons("AAACCCBBBTTTAAACCCBBBA"))

['AAA', 'CCC', 'BBB', 'TTT', 'AAA', 'CCC', 'BBB']


# Counting the codons in the list #

Now you need a function that will take the list of codons created by the slicing function, and actually make a count of each type of codon. To do this, you can loop through the reference codon list and count each kind.

The last time we did this, we didn't know any data structures to put the codons and the counts in, so we only counted stop codons and incremented a counter if the codon matched a stop codon pattern. 

This time, we know about dictionaries and have a way to save the count of each type of codon.

Create a function that steps through each possible codon type, and adds the codon and the number of times it occurs to a dictionary as a key-value pair.

Remember, you add a key-value pair to a dictionary with an assignment statement, like codoncounts[codon] = count.

In [5]:
def countCodons(sequence):
    codon_counts = {}
    seq_codons = getCodons(sequence)
    for codon in ref_codons():
        count = seq_codons.count(codon)
        if count != 0:
            codon_counts[codon] = count
    return codon_counts

print(countCodons("AAAGGGCCCGGGAAA"))
    
    

{'AAA': 2, 'GGG': 2, 'CCC': 1}
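
For comparison, the standard library's collections.Counter can build the same kind of dictionary in one pass over the codon list. This is a sketch of an alternative, not the approach the lab asks for; count_codons_counter is a made-up name.

```python
from collections import Counter

def count_codons_counter(sequence):
    """Count frame 1 codons in a sequence; returns a plain dict of codon -> count."""
    # take full 3-letter slices only, starting at position 0
    codons = [sequence[i:i+3] for i in range(0, len(sequence) - 2, 3)]
    # Counter tallies every distinct item it sees in the list
    return dict(Counter(codons))

counts = count_codons_counter("AAAGGGCCCGGGAAA")
print(counts)   # {'AAA': 2, 'GGG': 2, 'CCC': 1}
```

One difference to keep in mind: Counter tallies whatever triplets appear, including ones with letters other than A, T, G, C, whereas looping over the reference list only counts valid codons.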


# Counting codons in a collection of sequences #

There are two genome files, genomeA.fna and genomeB.fna, that go along with this exercise. There's also a smaller file, test.fna, that you can use to try out your script.

The files are in FASTA format:

```
> description line
SEQUENCESEQUENCESEQUENCE
SEQUENCESEQUENCESEQUENCE
SEQUENCESEQUENCESEQUENCE
```

To read this file, we need to re-create the FASTA file parser that we used before. When the parser reaches a '>' and the variable 'name' already has a value, it moves its current values into a data structure and starts collecting a new set of values. Let's write that function again.

In [19]:
def genomic_fasta(genomeFile):
    """genomic_fasta: parses the sequences out of a multi-record DNA FASTA file
    parameters: expects an open file object
    return: a list of DNA sequence strings, one per FASTA record
    """
    DNASeqs = [] # list of completed DNA sequences
    currentSeq = [] # lines of the record currently being read

    for line in genomeFile:
        line = line.strip() # removes the newline and any stray whitespace

        if line.startswith(">"): # a header line starts a new record
            if currentSeq: # store the previous record, if there was one
                DNASeqs.append(''.join(currentSeq))
                currentSeq = []
        elif line: # skips blank lines, keeps sequence lines
            currentSeq.append(line)

    if currentSeq: # store the last record, which has no header after it
        DNASeqs.append(''.join(currentSeq))

    return DNASeqs # returns list of DNA sequences, in file order

DNAFile = open("genomeB.fna") # creates a file object from a stored file
DNASeqs = genomic_fasta(DNAFile) # passes the file object to the fasta parser function
DNAFile.close() # closes the file object

for seq in DNASeqs[:10]:
    print(seq[:10])

ATGAATACTA
ATGAATATTT
ATGGTTATAT
ATGAATAATT
ATGAATTTCA
ATGTTTAAAT
ATGCTAAATG
ATGATTATTA
ATGATAGGAA
ATGAATTTAT
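
A small in-memory file is a convenient way to check that a FASTA parser keeps every record, including the last one, without needing the genome files. The sketch below uses io.StringIO from the standard library; read_fasta_records and the two gene records are made up for illustration and are not the lab's parser.

```python
import io

def read_fasta_records(handle):
    """Hypothetical helper: return {name: sequence} for each record in a FASTA file object."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # a header starts a new record; everything after '>' is the name
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # sequence lines are glued onto the current record
            records[name] += line
    return records

# StringIO behaves like an open text file, so it can stand in for open("test.fna")
fake_file = io.StringIO(">gene1\nATGAAA\nTTTTAA\n>gene2\nATGCCCTGA\n")
records = read_fasta_records(fake_file)
print(records)   # {'gene1': 'ATGAAATTTTAA', 'gene2': 'ATGCCCTGA'}
```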


# Frequencies #

When you're trying to represent a distribution and especially when you're going to compare two distributions, it is more useful to express the values as frequencies rather than as simple counts.

A frequency could be expressed as count_of_this_codon/total_number_codons.

Modify your codon dictionary builder so it stores codon:frequency instead of codon:count. Store the frequencies rounded to four decimal places using round(number, ndigits).
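
Note that the second argument of round() counts digits after the decimal point, not significant figures. A quick sketch of the difference, using made-up values:

```python
freq = 1 / 6
tiny = 0.00004

four_places = round(freq, 4)      # four digits after the decimal point
tiny_places = round(tiny, 4)      # a small nonzero frequency can round to 0.0
print(four_places, tiny_places)   # 0.1667 0.0

# four significant figures, if that is what is needed instead
tiny_sig = float(f"{tiny:.4g}")
print(tiny_sig)                   # 4e-05
```

This is why a rare codon can show up in a frequency dictionary with a value of 0.0 even though it occurs at least once.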

In [25]:
def getFrequency(seq_codons):
    codon_freq = {}
    for codon in ref_codons():
        freq = seq_codons.count(codon)/(len(seq_codons))
        if  freq != 0:
            codon_freq[codon] = round(freq,4)
    return codon_freq

print(getFrequency(getCodons("AAAGGGAAAGGG")))
print(getFrequency(getCodons("AAAGGGAAAGGGTTTTTT")))
print(getFrequency(getCodons("AAACCCAAAGGGAAATTT")))

{'AAA': 0.5, 'GGG': 0.5}
{'AAA': 0.3333, 'TTT': 0.3333, 'GGG': 0.3333}
{'AAA': 0.5, 'TTT': 0.1667, 'GGG': 0.1667, 'CCC': 0.1667}


# Now you have everything you need #

In the main body of your code, open the genome A file and make a frequency dictionary for it. Then do the same for genome B.

To get frequencies for the whole genome, you will need to add together all the codon lists from each of the sequences BEFORE you get the counts and frequencies.  Since lists can be added together with +=, you could do this as you loop over the collection of sequences.
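
The pooling step can be sketched with two made-up sequences standing in for parsed genes. The codon lists are merged first, and the frequency is computed once from the combined list.

```python
sequences = ["AAACCC", "GGGTTTAAA"]   # made-up stand-ins for parsed gene sequences

all_codons = []
for seq in sequences:
    # += extends the list in place with this gene's frame 1 codons
    all_codons += [seq[i:i+3] for i in range(0, len(seq) - 2, 3)]

print(all_codons)   # ['AAA', 'CCC', 'GGG', 'TTT', 'AAA']

# one frequency over the whole pool, not an average of per-gene frequencies
aaa_freq = all_codons.count("AAA") / len(all_codons)
print(aaa_freq)     # 0.4
```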

# Extract some values #

For each genome, get the codons with the highest and the lowest frequency values. Use max() and min() on the values, and then use each value to find the key (or keys), as we did in the examples earlier.

In [41]:
def getGenomeFreq(fileName):
    with open(fileName) as fo:
        genome_codons = []
        for seq in genomic_fasta(fo):
            genome_codons += getCodons(seq)
        return getFrequency(genome_codons)

def getMinAndMax(genome_freq):
    maxFreq = max(genome_freq.values())
    minFreq = min(genome_freq.values())
    maxCodon = ""
    minCodon = ""
    for codon,freq in genome_freq.items():
        if freq == maxFreq:
            maxCodon = codon
        if freq == minFreq:
            minCodon = codon
    print("max:", maxCodon, maxFreq)
    print("min:", minCodon, minFreq)
    return {"max":(maxCodon,maxFreq), "min":(minCodon,minFreq)}


In [43]:
aDict = getGenomeFreq("genomeA.fna")
bDict = getGenomeFreq("genomeB.fna")
print("genomeA")
getMinAndMax(aDict)
print("genomeB")
getMinAndMax(bDict)

genomeA
max: GAA 0.0401
min: TGA 0.0005
genomeB
max: AAA 0.1412
min: CGC 0.0


{'max': ('AAA', 0.1412), 'min': ('CGC', 0.0)}
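
When several codons tie for the highest or lowest frequency, a loop that keeps a single codon reports only the last one it sees. A minimal sketch of collecting every matching key instead, using a made-up toy dictionary and a hypothetical helper name:

```python
def keys_with_value(freqs, target):
    """Return every key in freqs whose value equals target."""
    return [key for key, value in freqs.items() if value == target]

toy = {"AAA": 0.4, "GGG": 0.2, "CCC": 0.2, "TTT": 0.2}

top = keys_with_value(toy, max(toy.values()))
bottom = keys_with_value(toy, min(toy.values()))
print(top)      # ['AAA']
print(bottom)   # ['GGG', 'CCC', 'TTT']
```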

# Compare some values #

Iterate over the keys of one of the dictionaries and use each codon key to get the corresponding values from both dictionaries. Because getFrequency leaves out codons that never occur, the two dictionaries share all their keys only if both genomes use the same set of codons. Which codon has the biggest change in frequency from genome A to genome B? delta_freq = abs(freq_a - freq_b).

In [48]:
maxChangeCodon = ""
maxChange = -1 # smaller than any possible change, so the first codon always replaces it
for key in aDict.keys():
    freq_a = aDict[key]
    freq_b = bDict.get(key, 0.0) # a codon missing from genome B has frequency 0
    delta_freq = round(abs(freq_a - freq_b),4)
    if delta_freq > maxChange:
        maxChange = delta_freq
        maxChangeCodon = key
print(maxChangeCodon, maxChange)
print("genomeA","genomeB")
print(aDict[maxChangeCodon], bDict[maxChangeCodon])

AAA 0.1053
genomeA genomeB
0.0359 0.1412
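
The same comparison can be written with max() and a key function. The sketch below uses two made-up toy dictionaries whose keys do not fully overlap, and treats a missing codon as frequency 0; the names here are for illustration only.

```python
freq_a = {"AAA": 0.50, "GGG": 0.25, "CCC": 0.25}
freq_b = {"AAA": 0.20, "GGG": 0.30, "TTT": 0.50}

def delta(codon):
    """Absolute change in frequency, treating a missing codon as 0."""
    return abs(freq_a.get(codon, 0.0) - freq_b.get(codon, 0.0))

# look at every codon that appears in either dictionary
all_codons = set(freq_a) | set(freq_b)
biggest = max(all_codons, key=delta)
print(biggest, round(delta(biggest), 4))   # TTT 0.5
```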
