# Step 3 Hamming Analysis
This notebook takes the nucleotide counts produced in step 2 (FastQ to Nucleotides) and converts them to amino acid counts while estimating and subtracting the expected error from Illumina sequencing, a process termed Hamming filtering.

Previously, in FastQ to Nucleotides (analysis step 2), we threw out any read where a base call in the coding region (specifically, the coding region of the sublibrary of interest) had a Q score of less than 30. Here we will conservatively assume that every nucleotide has a Q score of 40, i.e. a 0.01% chance of being an errant base call.
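
As a quick check of the numbers above, the standard Phred relationship P = 10^(-Q/10) can be evaluated directly. This is a minimal sketch; the helper name `phred_to_error_probability` is made up for illustration and is not used elsewhere in this notebook.

```python
def phred_to_error_probability(q_score):
    """Convert a Phred quality score Q into a base-call error probability.

    Uses the standard Phred definition P = 10^(-Q/10).
    The function name is hypothetical and only serves this illustration.
    """
    return 10 ** (-q_score / 10)


# Compare the Q30 read filter from step 2 with the Q40 value assumed here.
for q in (30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability per base call = {p:.4%}")
```

The value for Q40 is the same quantity that the filter cell computes later as `np.power(10, hamming_value)` with `hamming_value = -4.0`.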

Hamming Analysis, created by James McCormick (in this context), is the process of accounting for the error rate in Illumina sequencing. Each read is counted as a single mutant, and a base call with a quality score of Q40 has a 99.99% base call accuracy. But when there are a large number of reads, numbering into the billions for this set of experiments, reads that are called as mutants but are actually wild type (or mutants that are misread as wild type) become an important contribution to the counts.

This source of noise is also not uniform. If we sequence only an identical wild-type sequence with a very large number of reads, apparent mutations that require two base miscalls in the same codon are much less likely than ones that are a single base miscall away. For example, compare ATG being falsely called as ATC (Met to Ile) with ATG being falsely called as AAC (Met to Asn).

From the Hamming distance Wikipedia article: "In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming."
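
To make the definition concrete, here is a minimal sketch that counts mismatched positions for the two codon examples mentioned earlier. The function name `hamming_distance` is an illustrative choice; the notebook's own implementation is `HAMCALCULATOR` further down.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


# One miscall away: ATG (Met) read as ATC (Ile)
print(hamming_distance("ATG", "ATC"))
# Two miscalls away: ATG (Met) read as AAC (Asn)
print(hamming_distance("ATG", "AAC"))
```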

The probability of two errant base calls happening (a Hamming distance of two) is 0.0001 x 0.0001, and of three errant base calls 0.0001 x 0.0001 x 0.0001. In general it is P^H, where P is the probability of an errant base call and H is the Hamming distance.

Once we have the probability of an errant mutant stemming from another sequence, we can calculate the number of errant calls we would expect from a given number of other counts, e.g. wild type. If there are 1,000,000 reads of the wild-type sequence and mutation A is one Hamming distance from wild type, we can estimate that (1,000,000*0.0001^1) = 100 of the reads counted as mutant A are actually miscalled wild type. We can repeat this for every observed mutation type at that codon.
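
The arithmetic above can be sketched in a few lines. The function name `expected_errant_reads` and its default error probability (Q40, i.e. 0.0001) are assumptions made for this illustration only.

```python
def expected_errant_reads(source_count, hamming_dist, p_error=1e-4):
    """Expected number of reads of a source codon that are miscalled as a
    codon `hamming_dist` base changes away: N * P^H.

    p_error defaults to the Q40 value assumed in this notebook.
    """
    return source_count * p_error ** hamming_dist


# 1,000,000 wild-type reads leaking into mutants 1, 2 and 3 miscalls away
for hd in (1, 2, 3):
    print(hd, expected_errant_reads(1_000_000, hd))
```

With these assumptions only Hamming distance 1 neighbours receive a meaningful number of errant reads; the contributions at distance 2 and 3 fall well below a single read per million.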

### This process makes a few simplifying assumptions:

1. That every other observed mutant/wild type is correct.
2. There are no mutants that are misread in one location and then misread again to produce this mutation. 
    e.g. ATTCGG --> ATCCGA. 
3. The probability of the sequencer miscalling a base does not depend on the resulting base.
    i.e. an erroneous ATC -> ATG is as likely as an erroneous ATC -> ATT.
4. That every base call has a Q score equal to the average as reported by Genewiz.
    This is not true, and is a source of bias, as the average Q score is better at the start of a read, which is why USEARCH (step 1) and paired-end coverage are employed. Scores below Q30 were thrown out in the previous script, and USEARCH increases the Q score above that of the Illumina base calling.
    
These assumptions result in a "hard filtering" where more reads are thrown away than strictly need to be.

Error not accounted for: index hopping, where the index is incorrectly assigned.
This happens in general, and in this library about 1.5% of the time. I can't filter it out and have to live with it.


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
import pandas as pd
from scipy import stats #as stats
from Bio.Seq import Seq
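try:
    from Bio.Alphabet import generic_dna #Bio.Alphabet was removed in Biopython 1.78
except ImportError:
    generic_dna = None #newer Biopython releases no longer use alphabets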
from Bio.SeqUtils import seq3
from Bio.SeqUtils import seq1
from matplotlib import rcParams
from pathlib import Path #introduced in Python 3.4

Set up some constants

In [2]:
#for this set of turbidostat experiments there is no non chimeric protein (DL121 is the wild type)
#However the LOV2 insertion is not mutated, and is not sequenced. 
#the wt sequence of DHFR
dhfr_wt = 'MISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERR'
dhfr_wt_ix = np.arange(len(dhfr_wt[:]))

#477 nt long, or 159 residues. WT codon sequence of DHFR
dhfr_nuc_wt = 'ATGATCAGTCTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAACCTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGCCGCCATACCTGGGAATCGATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTGAGCTCACAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCAATTGCGGCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGCCGCGTTTATGAACAGTTCTTGCCAAAAGCGCAAAAGCTTTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACCCATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCTGATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGG'
dhfr_nuc_wt_ix = np.arange(len(dhfr_nuc_wt[:]))

#all amino acids (* is stop) 
aas = 'ACDEFGHIKLMNPQRSTVWY*'
aas_ix = np.arange(len(aas[:]))

#to build my list of all 64 codons. 

codon_list = ['GCT', 'GCC', 'GCA', 'GCG', 'CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG', 'AAT', 'AAC', 'GAT', 'GAC', 'TGT', 'TGC'\
, 'CAA', 'CAG', 'GAA', 'GAG', 'GGT', 'GGC', 'GGA', 'GGG', 'CAT', 'CAC', 'ATT', 'ATC', 'ATA'\
, 'CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG', 'AAA', 'AAG', 'ATG', 'TTT', 'TTC', 'CCT', 'CCC', 'CCA', 'CCG'\
, 'TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC', 'ACT', 'ACC', 'ACA', 'ACG', 'TGG', 'TAT', 'TAC'\
, 'GTT', 'GTC', 'GTA', 'GTG', 'TAA', 'TAG', 'TGA']

Import the count of each nucleotide from step 2, DL121_fastq_analysis.py

In [3]:
#importing each sample and storing the data in the form of a dictionary

#import the counts
data_dict = {}
path_to_input = Path("../Step_2_fastq_to_nucleotides/output/")
#all filenames in masterfile so the selection can be changed quickly
filenames = open('masterfile.txt','r').readlines() 
for filename in filenames: #open the counts in each sample
    filename = filename.strip() #removes surrounding whitespace and the trailing newline
    filepath = path_to_input / filename #builds the path to the previous step's output file
    data = open(filepath,'r').readlines() 
    filename = filename[:-7]
    data_dict[filename] = {} #add sample to the dict that contains all of your data

    for line in data: #iterate through each line in the sample
        sp_line = line.split('\t') 
        if sp_line[0][0:2] != 'SL' and sp_line[0][0:2] != 'CL': #only looks at lines that start with SL/CL
            continue #skip over statistics in this file 
        else: 
            sl_id = sp_line[0] #save the SL it belongs to
            mut = sp_line[1] #determine the mutant 
            mut_count = sp_line[2] #determine the number of reads for that mutant
            
            # adds the reads for that mutant to the dictionary and categorizes the SL this datapoint goes to
            if sl_id in data_dict[filename].keys(): 
                data_dict[filename][sl_id][mut] = float(mut_count.strip())
            else:
                data_dict[filename][sl_id] = {}
                data_dict[filename][sl_id][mut] = float(mut_count.strip())
#data_dict['T6V2']['SL1']['WT'] #example of how to access the data point in dictionary [sample][sub-lib][mutant]
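
The parsing loop above implies a tab-separated layout of sublibrary id, mutant name, and read count per line, with any other lines skipped. The following is a minimal sketch of the same idea on in-memory lines. The sample lines are invented for illustration and are not real output from step 2, and `parse_counts` is a hypothetical helper name.

```python
def parse_counts(lines):
    """Collect {sublibrary: {mutant: count}} from tab-separated lines.

    Only lines whose first field starts with 'SL' or 'CL' are kept,
    mirroring the filter in the loop above; everything else is skipped.
    """
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0][:2] not in ("SL", "CL"):
            continue  # statistics or malformed lines
        sl_id, mut, mut_count = fields[0], fields[1], fields[2]
        counts.setdefault(sl_id, {})[mut] = float(mut_count)
    return counts


# Invented example lines (not real data)
example_lines = [
    "total reads\t1000\n",
    "SL1\tWT\t900\n",
    "SL1\tATG1GCC\t12\n",
    "SL2\tWT\t850\n",
]
print(parse_counts(example_lines))
```

Using `setdefault` removes the need for the separate "key exists" and "key missing" branches.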

In [4]:
#the matrix will have 64 columns and 159 rows
full_codon_matrix = np.zeros(shape = [len(dhfr_wt_ix),len(codon_list)])

In [5]:
#for going from nucleotides to aa
def translate_seq(seq):
    seq = Seq(seq) #no alphabet argument is needed; Bio.Alphabet was removed in Biopython 1.78
    seq_translate = str(seq.translate().strip())
    return seq_translate

In [6]:
def ix_mutant(mut,mut_count,reads_mat,i,j):
    
    #organizing mutant counts into matrices
    #reads_mat is the matrix of zeros that will store the counts; i and j are its row (residue) and column (codon)
    #mut is the mutation id and mut_count is the number of reads for this mutant
    
    if mut in ['WT','fail_multimutant']:
        pass
    else:
        #determine position of the mutant
        mut_pos = mut[3:-3] #use name of mutant to pull out the position in DHFR
        mut_ix = dhfr_wt_ix[int(mut_pos)-1] #find the ix of the mutant position

        #determine position of the aa of the mutation
        mut_nuc = mut[-3:] #this gives the nucleotide sequence
        #mut_aa = translate_seq(mut_nuc) #converts it into aa
        mut_nuc_ix = codon_list.index(mut_nuc)
        
        #set corresponding position in the matrix to the mutant count! 
        reads_mat[i,j] = mut_count 

We then use i to index each of the 159 amino acid positions of DHFR that have been mutated (all of them) and j to index the 64 possible codons. Using a series of loops, we iterate through every possible codon mutation in DHFR and pull the observed count out of the dictionary, which converts that information into a matrix. This is done for every timepoint and vial separately.
Due to the read length restriction of the Illumina library, the DHFR protein is broken up into 4 sublibraries, each with its own wild-type count, hence the SL# used below.
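
The two pieces of bookkeeping described here, choosing the sublibrary for a residue index and building the dictionary key for a codon mutation, can be sketched as small helpers. The function names are hypothetical; the sublibrary boundaries (40 residues each, with SL4 ending at residue index 158) and the key format (wild-type codon, 1-based position, mutant codon) are taken from the loop in the next cell.

```python
def sublibrary_for_position(i):
    """Sublibrary label for a zero-based residue index i (0-158)."""
    if not 0 <= i <= 158:
        raise ValueError("residue index out of range for DHFR")
    return f"SL{i // 40 + 1}"


def mutant_name(wt_nuc_seq, i, codon):
    """Dictionary key for a codon mutation: wt codon + 1-based position + new codon."""
    wt_codon = wt_nuc_seq[3 * i : 3 * i + 3]
    return f"{wt_codon}{i + 1}{codon}"


# First three codons of dhfr_nuc_wt, used as a short stand-in sequence
first_codons = "ATGATCAGT"
print(sublibrary_for_position(0), sublibrary_for_position(40), sublibrary_for_position(158))
print(mutant_name(first_codons, 1, "GCC"))
```

Helpers like these would let the four near-identical SL1 to SL4 branches share one code path, although the cell below keeps them written out.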


In [7]:
#This is to build matrix of all counts. 
#full_codon_matrix is the size, codon_list is the nucleotide list

#if the mutant is not in the dictionary (never seen) it is saved in missing_mut
missing_mut = [] 

#a non-wild-type codon ending in A/T (which shouldn't be made by NNS primers) is saved in wrong_mut
#but is not converted into a residue count, as it most likely comes from sequencing error
wrong_mut = []

nucleotide_matrix = {}
for condition in list(data_dict.keys()):
    working_dict = data_dict[condition]
    reads_mat = np.zeros(shape = [len(dhfr_wt_ix),len(codon_list)])
    i = 0 # Down the matrix (dhfr_wt aas) 0-158
    j = 0 # Across the matrix (nucleotides) 0-63
    while (i <= 158):
        j = 0
        while (j <= 63):
            ione = (np.multiply(i,3)) #these allow for grabbing wt codon from wt aa sequence position.
            itwo = (np.multiply(i,3) +1)
            ithree = (np.multiply(i,3) +2)        
            #this gives the name in the directory to look up
            position_name = (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]+str((i+1))+codon_list[j])
            if (i < 40): #SL1
                sl = 'SL1' #sublibrary one
                if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #WILD TYPE!
                    mut_count = working_dict[sl]['WT']
                    ix_mutant(position_name,mut_count,reads_mat,i,j)
                    j = j + 1
                    continue    
                else: #A and T should not show up in NNS mutations and are not counted
                    if position_name in working_dict[sl].keys():
                        if codon_list[j][-1] == 'A':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue
                        elif codon_list[j][-1] == 'T':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue                          
                        else:
                            mut_count = working_dict[sl][position_name]
                            ix_mutant(position_name,mut_count,reads_mat,i,j)
                            j = j + 1
                            continue 
                    else:
                        missing_mut.append(position_name)
                        j = j + 1
                        continue 
            elif (i >= 40) and (i < 80): #SL2
                sl = 'SL2' #sublibrary two
                #check if the mutant matches the wild type sequence
                if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #WILD TYPE!
                    mut_count = working_dict[sl]['WT']
                    ix_mutant(position_name,mut_count,reads_mat,i,j)
                    j = j + 1
                    continue    
                else: #A and T should not show up in NNS mutations and are not counted
                    if position_name in working_dict[sl].keys():
                        if codon_list[j][-1] == 'A':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue
                        elif codon_list[j][-1] == 'T':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue                            
                        else:
                            mut_count = working_dict[sl][position_name]
                            ix_mutant(position_name,mut_count,reads_mat,i,j)
                            j = j + 1
                            continue 
                    else:
                        missing_mut.append(position_name)
                        j = j + 1
                        continue
            elif (i >= 80) and (i < 120): #SL3
                sl = 'SL3' #sublibrary three
                if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #WILD TYPE!
                    mut_count = working_dict[sl]['WT']
                    ix_mutant(position_name,mut_count,reads_mat,i,j)
                    j = j + 1
                    continue    
                else: #A and T should not show up in NNS mutations and are not counted
                    if position_name in working_dict[sl].keys():
                        if codon_list[j][-1] == 'A': #NNS filtering
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue
                        elif codon_list[j][-1] == 'T': #NNS filtering
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue                            
                        else:
                            mut_count = working_dict[sl][position_name]
                            ix_mutant(position_name,mut_count,reads_mat,i,j)
                            j = j + 1
                            continue 
                    else:
                        missing_mut.append(position_name)
                        j = j + 1
                        continue
            elif (i >= 120) and (i < 159): #SL4
                sl = 'SL4' #sublibrary four
                if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #WILD TYPE!
                    mut_count = working_dict[sl]['WT']
                    ix_mutant(position_name,mut_count,reads_mat,i,j)
                    j = j + 1
                    continue    
                else: #A and T should not show up in NNS mutations and are not counted
                    if position_name in working_dict[sl].keys():
                        if codon_list[j][-1] == 'A':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue
                        elif codon_list[j][-1] == 'T':
                            wrong_mut.append(position_name)
                            j = j + 1
                            continue                              
                        else:
                            mut_count = working_dict[sl][position_name]
                            ix_mutant(position_name,mut_count,reads_mat,i,j)
                            j = j + 1
                            continue  
                    else:
                        missing_mut.append(position_name)
                        j = j + 1
                        continue
            else: #This should never be triggered. 
                print('something is wrong') #for error reporting. 
                j = j + 1
                continue
        i = i + 1    

    nucleotide_matrix[condition] = reads_mat

This function calculates the hamming distance between two 3 letter strings. 
It is given the string being compared and the reference one that the count is being adjusted from.

In [8]:
def HAMCALCULATOR(ham_calc_wt,ham_calc_compare):
    ham_dist = 3 #starts off with the assumption they are totally different strings
    for i in range(0, 3): #loops for each of the nucleotides
        #if the nucleotide matches that being compared against, subtract one from the hamming distance
        if ham_calc_wt[i] == ham_calc_compare[i]: 
            ham_dist = np.subtract(ham_dist, 1)
        else:
            continue
    return ham_dist #a number from 0 to 3 (0 only if the two codons are identical)

## This is the hard hamming filter. 
"Hard" means that, while one position is being adjusted, all other counts are assumed to be correct.
This code section takes about 2 minutes to run. 

It iterates in the same way as before: We use I to signify each of the 159 amino acids of DHFR that have been mutated (all of them) and J to be the 64 different possible codons. Each wild type from each sublibrary has to be calculated separately. 

It loops through every codon in each timepoint and vial separately and calculates the number of errant reads expected from another codon, given:
## \begin{equation*} NErrant_t^{Mut} = N_t^{Wt}\left(10^{-\frac{\bar{x}Q}{10}}\right)^{HD} \end{equation*}
Where the number of errant mutants observed, $NErrant_t^{Mut}$, is equal to the number of observed wild type reads, $N_t^{Wt}$ (or any other mutant count being compared), multiplied by the probability that a base call is incorrect, $10^{-\frac{\bar{x}Q}{10}}$, raised to the power of the Hamming distance, $HD$.

 The sum of $NErrant_t^{Mut}$ over every other codon at that position is then subtracted from the actual mutant count. This is calculated for every possible mutation observed (mutant codons ending in A or T, which NNS primers should not produce, are not added to the nucleotide_matrix and thus do not contribute).
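
The per-position correction for mutant codons can also be written as a matrix product. The sketch below is an alternative formulation for illustration, not the code this notebook runs. It handles a single row of codon counts, does not reproduce the separate per-sublibrary wild-type adjustment, and uses hypothetical names (`hamming_matrix`, `hard_hamming_filter_row`) and made-up counts.

```python
import numpy as np


def hamming_matrix(codons):
    """Pairwise Hamming distances between all codons in the list."""
    n = len(codons)
    dist = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(n):
            dist[a, b] = sum(x != y for x, y in zip(codons[a], codons[b]))
    return dist


def hard_hamming_filter_row(counts, dist, mean_q=40.0):
    """Subtract the expected errant reads from one row of codon counts.

    errant[j] = sum over k != j of counts[k] * (10^(-Q/10))^HD(k, j);
    adjusted counts are clamped at zero, as in the mutant branch of the loop.
    """
    p_error = 10 ** (-mean_q / 10)
    weights = p_error ** dist.astype(float)
    np.fill_diagonal(weights, 0.0)  # a codon does not correct itself
    errant = counts @ weights
    return np.clip(counts - errant, 0.0, None)


# Toy example with three codons and invented counts
codons = ["ATG", "ATC", "AAC"]
counts = np.array([1_000_000.0, 150.0, 5.0])
print(hard_hamming_filter_row(counts, hamming_matrix(codons)))
```

Because the distance matrix depends only on the codon list, it could be computed once and reused for every position and condition.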


In [9]:
#This is the hard hamming filter. It will make a new matrix of every position.
#"hard" means that, while one position is being adjusted, all other counts are assumed to be correct.
#This code section takes about 2 minutes to run. 


hamming_value = -4.0 #this is -1/10th the Illumina Q score average. (Or filtering value)
#Genewiz (the sequencing company) reported an average Q score across all lanes + runs as Q40. 
hamming_count = 0 #sets the count as zero, the observed mutant count will be subtracted by hamming_count
#to get the final adjusted count. 
hard_hamming_matrix = {} #creates a blank dictionary to store the hamming counts sorted by condition e.g. T6V2


for condition in list(nucleotide_matrix.keys()): #37 of em. 
    #FOR WILD TYPE CALCULATION, which has to be done in blocks
    wt_hamming_sl1 = data_dict[condition]['SL1']['WT']
    wt_hamming_sl2 = data_dict[condition]['SL2']['WT']
    wt_hamming_sl3 = data_dict[condition]['SL3']['WT']
    wt_hamming_sl4 = data_dict[condition]['SL4']['WT']
    I = 0 #amino acid position of DHFR
    J = 0 #cycle through list of 3 letter nucleotides
    while (I <= 158):
        J = 0 
        while (J <= 63):
            hamming_count = 0
            K = 0 #K iterates over every other codon to compare against
            Ione = (np.multiply(I,3)) #these allow for grabbing wt codon from wt aa sequence position. 
            Itwo = (np.multiply(I,3) +1)
            Ithree = (np.multiply(I,3) +2) 
            #check if the sequence is wild type, if so it is adjusted below. 
            #we have to cycle through every mutant that can contribute to an errant wild type
            #which is every other mutant in the sublibrary, different than a mutation where we only consider
            #the contributing reads at that codon.
            if (dhfr_nuc_wt[Ione]+dhfr_nuc_wt[Itwo]+dhfr_nuc_wt[Ithree]) == codon_list[J]:
                #WT SL1                   
                if (I < 40): #SL1
                    while (K <=63):
                        if K == J :
                            K = K + 1
                            continue
                        else:
                            temp_hamming_count = 0
                            hamming_number = HAMCALCULATOR(codon_list[K],codon_list[J])
                            if hamming_number == 1 : #hamming number 1, value is 10^(Qscore/-10)
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          np.power(10,hamming_value))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 2 : #Hamming number of 2, values squared
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 3 : #Hamming number of 3. Values cubed
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 0 :
                                print('SOMETHING IS WRONG') #check to see if it is throwing in wild type for some reason. 
                                K = K + 1
                                continue
                    #here we subtract the estimated errant counts from the observed count, producing the wt adjusted count.
                    wt_hamming_sl1 = (wt_hamming_sl1 - hamming_count)
                    J = J + 1
                    continue
                    
                #WT SL2                     
                elif (I >= 40) and (I < 80): #SL2
                    while (K <=63):
                        if K == J :
                            K = K + 1
                            continue
                        else:
                            temp_hamming_count = 0
                            hamming_number = HAMCALCULATOR(codon_list[K],codon_list[J])
                            if hamming_number == 1 : #hamming number 1, value is 10^(Qscore/-10)
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          np.power(10,hamming_value))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 2 : #Hamming number of 2, values squared
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 3 : #Hamming number of 3. Values cubed
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 0 :
                                print('SOMETHING IS WRONG') #check to see if it is throwing in wild type for some reason. 
                                K = K + 1
                                continue
                    #here we subtract the estimated errant counts from the observed count, producing the wt adjusted count.
                    wt_hamming_sl2 = (wt_hamming_sl2 - hamming_count)
                    J = J + 1
                    continue                                       
                #WT SL3                       
                elif (I >= 80) and (I < 120): #SL3
                    while (K <=63):
                        if K == J :
                            K = K + 1
                            continue
                        else:
                            temp_hamming_count = 0
                            hamming_number = HAMCALCULATOR(codon_list[K],codon_list[J])
                            if hamming_number == 1 : #hamming number 1, value is 10^(Qscore/-10)
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          np.power(10,hamming_value))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 2 : #Hamming number of 2, values squared
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 3 : #Hamming number of 3. Values cubed
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 0 :
                                print('SOMETHING IS WRONG') #check to see if it is throwing in wild type for some reason. 
                                K = K + 1
                                continue
                    #here we subtract the estimated errant counts from the observed count, producing the wt adjusted count.
                    wt_hamming_sl3 = (wt_hamming_sl3 - hamming_count)
                    J = J + 1
                    continue                                              
                #WT SL4                     
                elif (I >= 120) and (I < 159): #SL4
                    while (K <=63):
                        if K == J :
                            K = K + 1
                            continue
                        else:
                            temp_hamming_count = 0
                            hamming_number = HAMCALCULATOR(codon_list[K],codon_list[J])
                            if hamming_number == 1 : #hamming number 1, value is 10^(Qscore/-10)
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          np.power(10,hamming_value))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 2 : #Hamming number of 2, values squared
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 3 : #Hamming number of 3. Values cubed
                                temp_hamming_count = (nucleotide_matrix[condition][I,K] * \
                                                          (np.power(10,hamming_value) * np.power(10,hamming_value) * np.power(10,hamming_value)))
                                hamming_count = (hamming_count + temp_hamming_count)
                                K = K + 1
                                continue
                            elif hamming_number == 0 :
                                print('SOMETHING IS WRONG') #check to see if it is throwing in wild type for some reason. 
                                K = K + 1
                                continue
                    #here we subtract the estimated errant counts from the observed count, producing the wt adjusted count.
                    wt_hamming_sl4 = (wt_hamming_sl4 - hamming_count)
                    J = J + 1
                    continue                       
            #NOT WT
            else:#not wild type
                J = J + 1
                continue
        I = I + 1
                   
    #END WILD TYPE CALCULATIONS, now onto the mutations:       
    hamming_array = np.zeros(shape = [len(dhfr_wt_ix),len(codon_list)])
    i = 0 # Down the matrix (dhfr_wt) 0-158
    j = 0 # Across the matrix (nucleotides) 0-63
    while (i <= 158):
        j = 0
        while (j <= 63):
            hamming_count = 0
            k = 0 #k iterates over every other codon to compare against
            ione = (np.multiply(i,3)) #these allow for grabbing wt codon from wt aa sequence position. 
            itwo = (np.multiply(i,3) +1)
            ithree = (np.multiply(i,3) +2)        
            #this gives the name in the directory to look up, may not use here
            position_name = (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]+str((i+1))+codon_list[j])
            
            #check if we hit a wild type in this loop.
            if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #WILD TYPE!
                if (i < 40): #SL1
                    hamming_array[i,j] = wt_hamming_sl1 #calculated above
                    j = j + 1
                    continue
                elif (i >= 40) and (i < 80): #SL2
                    hamming_array[i,j] = wt_hamming_sl2 #calculated above
                    j = j + 1
                    continue
                elif (i >= 80) and (i < 120): #SL3
                    hamming_array[i,j] = wt_hamming_sl3 #calculated above
                    j = j + 1
                    continue
                elif (i >= 120) and (i < 159): #SL4
                    hamming_array[i,j] = wt_hamming_sl4 #calculated above
                    j = j + 1
                    continue
            else: #not wild type
                while (k <=63):
                    if k == j : #this prevents including a mutant's own count in its hamming adjustment
                        k = k + 1
                        continue
                    else: 
                    #implement hamming filtering. 
                        temp_hamming_count = 0
                        hamming_number = HAMCALCULATOR(codon_list[k],codon_list[j])
                        if hamming_number == 1 : #hamming number 1, value is 10^(Qscore/-10)
                            temp_hamming_count = (nucleotide_matrix[condition][i,k] * \
                                                      np.power(10,hamming_value))
                            hamming_count = (hamming_count + temp_hamming_count)
                            k = k + 1
                            continue
                        elif hamming_number == 2 : #Hamming number of 2, values squared
                            temp_hamming_count = (nucleotide_matrix[condition][i,k] * \
                                                      (np.power(10,hamming_value) * np.power(10,hamming_value)))
                            hamming_count = (hamming_count + temp_hamming_count)
                            k = k + 1
                            continue
                        elif hamming_number == 3 : #Hamming number of 3. Values cubed
                            temp_hamming_count = (nucleotide_matrix[condition][i,k] * \
                                                      (np.power(10,hamming_value) * np.power(10,hamming_value) * np.power(10,hamming_value)))
                            hamming_count = (hamming_count + temp_hamming_count)
                            k = k + 1
                            continue
                        elif hamming_number == 0 :
                            print('SOMETHING IS WRONG') #check to see if it is throwing in wild type for some reason. 
                            k = k + 1
                            continue
                #here we make sure that we don't set a negative mutant count.
                if (nucleotide_matrix[condition][i,j] - hamming_count) < 0 :
                    hamming_array[i,j] = 0
                else:
                    #here we subtract the estimated errant counts from the observed count, producing the adjusted count.
                    hamming_array[i,j] = (nucleotide_matrix[condition][i,j] - hamming_count)
                hamming_count = 0
                j = j + 1
                continue
        i = i + 1       
    #After looping through every possible mutation in DHFR we then save the array and advance to the next condition.
    hard_hamming_matrix[condition] = hamming_array
    #sorry about all the loops. 
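
The non-wild-type branch of the loop above can be condensed into a small helper. What follows is a minimal sketch, not the notebook's implementation: `hamming_distance`, `expected_miscalls`, `adjusted_count`, and the `TCAG` codon ordering are names and choices assumed for illustration (the notebook uses `HAMCALCULATOR` and its own `codon_list`), and the per-base error rate is assumed to be the Q40 value of 10^(-40/10) = 0.0001 described in the introduction. The wild-type codon, which the notebook handles with separately calculated `wt_hamming_sl*` values, is not covered here.

```python
from itertools import product

BASES = "TCAG"  # assumed ordering; the notebook's codon_list may differ
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # all 64 codons


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def expected_miscalls(counts, j, q_score=40.0):
    """Estimate how many reads of codon j are miscalls of the other 63 codons.

    counts is a length-64 sequence of observed counts at one position.
    """
    error_rate = 10.0 ** (-q_score / 10.0)  # Q40 -> 0.0001 per base
    total = 0.0
    for k, codon in enumerate(CODONS):
        if k == j:
            continue  # a codon does not contribute to its own correction
        d = hamming_distance(codon, CODONS[j])
        total += counts[k] * error_rate ** d  # d independent miscalls
    return total


def adjusted_count(counts, j, q_score=40.0):
    """Observed count minus expected miscalls, floored at zero."""
    return max(counts[j] - expected_miscalls(counts, j, q_score), 0.0)


# Made-up counts: deep coverage of ATG and a handful of ATC reads.
counts = [0.0] * 64
counts[CODONS.index("ATG")] = 1_000_000
counts[CODONS.index("ATC")] = 150
# About 100 of the 150 ATC reads are expected to be single-base miscalls of ATG.
print(adjusted_count(counts, CODONS.index("ATC")))
```

With these numbers a codon two substitutions away from ATG, such as AAC, would receive an expected contribution of only about 0.01 reads, which is the non-uniformity the introduction describes.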

This function translates a DNA sequence (here, a single codon) into its single-letter amino acid code.

In [10]:
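#turn the DNA into protein
def translate_seq(seq):
    #Seq comes from Bio.Seq; the generic_dna alphabet argument is omitted because
    #Bio.Alphabet was removed in Biopython 1.78 and translate() does not need it
    seq = Seq(seq)
    seq_translate = str(seq.translate().strip())
    return seq_translate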
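
For readers who want to check the translation step without Biopython, the same lookup can be sketched with the standard genetic code (NCBI translation table 1) and the standard library only. The names `CODON_TABLE` and `translate_codons` are assumptions made for this example and are not part of the notebook.

```python
from itertools import product

# Standard genetic code (NCBI table 1), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codons(seq):
    """Translate a DNA string whose length is a multiple of 3; '*' marks a stop."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


# The codons used as examples in the introduction: Met, Ile, and Asn.
print(translate_codons("ATG"), translate_codons("ATC"), translate_codons("AAC"))
```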

Function to write the amino acid mutant counts for one condition into a tab-separated .txt file located in ./output/. It reads the counts from the global aa_counts dictionary.

In [11]:
def writeOutputFile_aa(outName):
    #outName is the condition, e.g. T1V1 (for timepoint 1, vial 1)
    #the with block closes the file even if a write fails
    with open('output/'+outName+'hard_adjusted_residue'+'.txt','w') as output_file:
        #write out mutant counts 
        for sl in aa_counts.keys():
            for key in aa_counts[sl].keys():
                #this produces a tab separated list of the sublibrary, the mutant, and the count
                #e.g. SL1 \t M1A \t 2770.0
                output_file.write(sl+'\t'+key+'\t'+str(aa_counts[sl][key])+'\n')
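
The files written above have three tab-separated columns: sublibrary, mutant, count. A minimal sketch of reading them back into the same nested-dictionary shape is shown below. `read_residue_counts` is a hypothetical helper, not part of the notebook, and the example rows are invented to show the format.

```python
import io


def read_residue_counts(lines):
    """Parse 'sublibrary<TAB>mutant<TAB>count' lines into {sublibrary: {mutant: count}}."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sl, mutant, value = line.split("\t")
        counts.setdefault(sl, {})[mutant] = float(value)
    return counts


# Invented rows in the output format; a real file object from open() works the same way.
example = io.StringIO("SL1\tWT\t1000000.0\nSL1\tM1A\t2770.0\nSL2\tWT\t500.0\n")
parsed = read_residue_counts(example)
print(parsed["SL1"]["M1A"])
```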

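Here we translate each codon into its amino acid, round each adjusted codon count to a whole number, sum the counts of codons that encode the same amino acid, and then save one file per condition. Synonymous codons that still encode the wild-type residue are pooled under SNEAKY_WT.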

In [12]:
#save out hard hamming matrix
for condition in list(hard_hamming_matrix.keys()): #37 conditions.       
    aa_counts = {'SL1':{},'SL2':{},'SL3':{},'SL4':{}}
    i = 0 # Down the matrix (dhfr_wt) 0-158
    j = 0 # Across the matrix (nucleotides) 0-63
    while (i <= 158):
        j = 0
        while (j <= 63):
            ione = (np.multiply(i,3)) #these allow for grabbing wt codon from wt aa sequence position. 
            itwo = (np.multiply(i,3) +1)
            ithree = (np.multiply(i,3) +2)        
            #this gives the name in the dictionary to look up, may not use here
            position_name = (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]+str((i+1))+codon_list[j])
            
            ####### WILD TYPE#######
            if (dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree]) == codon_list[j]: #True_WILD TYPE!
                if (i < 40): #SL1
                    sl = 'SL1'
                    if 'WT' in aa_counts[sl].keys():
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl]['WT'] = round(hard_hamming_matrix[condition][i,j])
                    j = j + 1
                    continue
                elif (i >= 40) and (i < 80): #SL2
                    sl = 'SL2'
                    if 'WT' in aa_counts[sl].keys():
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl]['WT'] = round(hard_hamming_matrix[condition][i,j])
                    j = j + 1
                    continue
                elif (i >= 80) and (i < 120): #SL3
                    sl = 'SL3'
                    if 'WT' in aa_counts[sl].keys():
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl]['WT'] = round(hard_hamming_matrix[condition][i,j])
                    j = j + 1
                    continue
                elif (i >= 120) and (i < 159): #SL4
                    sl = 'SL4'
                    if 'WT' in aa_counts[sl].keys():
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl]['WT'] = round(hard_hamming_matrix[condition][i,j])
                    j = j + 1
                    continue
            else: #not wild type
                wild_type = translate_seq(dhfr_nuc_wt[ione]+dhfr_nuc_wt[itwo]+dhfr_nuc_wt[ithree])
                residue_name = translate_seq(codon_list[j])
                aa_position_name = (wild_type+str((i+1))+residue_name)
                if wild_type == residue_name:
                    aa_position_name =  'SNEAKY_WT' #synonymous codon: the nucleotides changed but the residue is still wild type
                if (i < 40): #SL1
                    sl = 'SL1'
                    if aa_position_name in aa_counts[sl].keys():
                        aa_counts[sl][aa_position_name] +=round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl][aa_position_name] = round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                elif (i >= 40) and (i < 80): #SL2
                    sl = 'SL2'
                    if aa_position_name in aa_counts[sl].keys():
                        aa_counts[sl][aa_position_name] +=round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl][aa_position_name] = round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                elif (i >= 80) and (i < 120): #SL3
                    sl = 'SL3'
                    if aa_position_name in aa_counts[sl].keys():
                        aa_counts[sl][aa_position_name] +=round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl][aa_position_name] = round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                elif (i >= 120) and (i < 159): #SL4
                    sl = 'SL4'
                    if aa_position_name in aa_counts[sl].keys():
                        aa_counts[sl][aa_position_name] +=round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue
                    else:
                        aa_counts[sl][aa_position_name] = round(hard_hamming_matrix[condition][i,j])
                        j = j + 1
                        continue

        i = i + 1
            
    writeOutputFile_aa(condition)
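
The four sublibrary branches in the cell above differ only in the label, so the same aggregation can be sketched more compactly. This is an illustrative rewrite under stated assumptions, not the notebook's code: `sublibrary_name` and `aggregate_by_residue` are hypothetical names, the codon table is the standard genetic code in `TCAG` order (which may not match the notebook's `codon_list`), and counts are rounded to Python ints rather than floats. As in the cell above, `WT` keeps only the wild-type codon count from the first position of each sublibrary, and `SNEAKY_WT` pools synonymous wild-type codons across the sublibrary.

```python
from itertools import product

BASES = "TCAG"  # assumed ordering for this sketch
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def sublibrary_name(i, size=40, n_sublibraries=4):
    """Map a 0-based residue index to 'SL1'..'SL4' (40 residues each, last one shorter)."""
    return "SL" + str(min(i // size, n_sublibraries - 1) + 1)


def aggregate_by_residue(adjusted, wt_nuc):
    """Collapse a (positions x 64) table of adjusted codon counts into residue counts."""
    aa_counts = {}
    for i in range(len(adjusted)):
        wt_codon = wt_nuc[3 * i:3 * i + 3]
        wt_aa = CODON_TO_AA[wt_codon]
        sl_counts = aa_counts.setdefault(sublibrary_name(i), {})
        for j, codon in enumerate(CODONS):
            count = round(float(adjusted[i][j]))
            if codon == wt_codon:
                # Only the first wild-type value seen in a sublibrary is kept.
                sl_counts.setdefault("WT", count)
                continue
            aa = CODON_TO_AA[codon]
            name = "SNEAKY_WT" if aa == wt_aa else wt_aa + str(i + 1) + aa
            sl_counts[name] = sl_counts.get(name, 0) + count
    return aa_counts


# Toy two-residue "gene" (Met, Ile) with invented adjusted counts.
toy = [[0.0] * 64 for _ in range(2)]
toy[0][CODONS.index("ATG")] = 1000.4  # wild type at position 1
toy[0][CODONS.index("GCT")] = 10.6    # Ala codon
toy[0][CODONS.index("GCC")] = 5.2     # another Ala codon
toy[1][CODONS.index("ATC")] = 900.0   # wild type at position 2
toy[1][CODONS.index("ATT")] = 7.0     # synonymous Ile codon
result = aggregate_by_residue(toy, "ATGATC")
print(result["SL1"]["WT"], result["SL1"]["M1A"], result["SL1"]["SNEAKY_WT"])
```

Indexing with `adjusted[i][j]` keeps the sketch usable with both nested lists and a 2-D NumPy array such as the notebook's `hard_hamming_matrix[condition]`.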