# Prepping files for TargetedTilingPrimers input

In [1]:
import pandas as pd
import dms_tools2
from Bio import SeqIO
from Bio import SeqUtils
import Bio

First, we need a codon frequency table so that our mutant primers can be codon-optimized. I pulled a codon frequency table for Influenza H3N2 from [here](https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=41857). The cell below just reformats it into the input format the script expects.

In [2]:
# read in rough codon freq table
df = pd.read_csv('data/h3n2_codon_freq_table.txt', sep = '\t', names=['data'])
df = df.data.str.split(expand=True)

# add column names and drop unnecessary data
df.columns = ['aa', 'codon', 'extra', 'frequency', 'extra']
df = df.drop(columns='extra')

# change 'aa' to single letter code and drop stop codons
df = df[df.aa != 'End'].copy()  # copy so the column assignments below act on an independent frame
df['aa'] = df.apply(lambda x: SeqUtils.IUPACData.protein_letters_3to1[x.aa], axis=1)

# calculate freq as a fraction (source numbers are freq out of 1000)
df['frequency'] = df['frequency'].astype('float') / 1000

# save formatted output
df.to_csv('results/TargetedTilingPrimers_inputs/h3n2_codon_freq_table_formatted.csv', index=False)

Next, the script specifies that the range of nucleotides you want to edit should be uppercase, and the flanking sequences should be lowercase. Define a function to format the sequence correctly given a fasta file and a range of codons to mutate.

In [3]:
# get a correctly formatted input seq from a fasta file and *codon* range to mutate
# mut_range is in format (first_codon, last_codon)
def get_input_seq(fasta, mut_range, outfile):
    # read in just the first entry of the sequence file
    wt_nts = next(SeqIO.parse(fasta, 'fasta'))
    
    # get uppercase string for sequence to mutate
    mut_seq = wt_nts.seq[(mut_range[0]*3)-3 : mut_range[1]*3]
    mut_seq = str(mut_seq).upper()
    
    # get lowercase string for flanking seqs
    flank5 = str(wt_nts.seq[:(mut_range[0]*3)-3]).lower()
    flank3 = str(wt_nts.seq[mut_range[1]*3:]).lower()
    
    # get full seq and export as .txt file
    primer_input_seq = flank5 + mut_seq + flank3
    
    with open(outfile, "w") as f:
        f.write(primer_input_seq)
    
    return primer_input_seq
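
The slice arithmetic in `get_input_seq` converts a 1-based, inclusive codon range into a 0-based, end-exclusive nucleotide slice. Here is a minimal sketch of that conversion on a toy sequence. The sequence and the helper name `mark_mutated_region` are made up for illustration, and no fasta parsing is involved.

```python
def mark_mutated_region(seq, first_codon, last_codon):
    """Uppercase codons first_codon..last_codon (1-based, inclusive); lowercase the rest."""
    start = (first_codon - 1) * 3  # 0-based index of the first nucleotide to mutate
    end = last_codon * 3           # end-exclusive index just past the last mutated codon
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


toy = "ATGAAACCCGGGTTT"  # five toy codons: ATG AAA CCC GGG TTT
marked = mark_mutated_region(toy, 2, 4)
print(marked)
```

Codons 2 through 4 come out uppercase, with one lowercase codon on each flank, which is the layout described above for the script's sequence input.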

The csv of mutations to make uses the numbering of the WT sequences for HK/45/2019 and Kansas/14/2017, to stay consistent with the WT Perth09 sequence being analyzed. However, we now want to make primers that will complement the *chimeric* constructs. These constructs include the first 19 amino acids from WSN HA at the 5' end, which are then directly joined to the ectodomain (starting at **codon 17** in the WT H3 sequences). The first two codons of the ectodomain have also been edited to remove a polyA run. So, we want mutagenesis to begin at the **third codon** of the ectodomain (WT codon 19), and finish at the end of the ectodomain. Within the ectodomain, chimera numbering is WT numbering plus 3, so in our chimeric construct these are codon numbers 22 to 523.
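
The offset between WT and chimera numbering follows from the two numbers given above (19 leading WSN HA codons, ectodomain starting at WT codon 17). Below is a small sketch of that arithmetic. The constant and function names are hypothetical and are not part of the notebook or of the TargetedTilingPrimers script.

```python
WSN_LEADER_CODONS = 19    # WSN HA codons at the 5' end of the chimera
ECTODOMAIN_START_WT = 17  # first ectodomain codon in WT H3 numbering


def wt_to_chimera(wt_codon):
    """Convert a WT H3 ectodomain codon number to chimeric-construct numbering."""
    if wt_codon < ECTODOMAIN_START_WT:
        raise ValueError("codon lies before the ectodomain start")
    # position within the ectodomain (1-based), shifted past the WSN leader
    return (wt_codon - ECTODOMAIN_START_WT + 1) + WSN_LEADER_CODONS


first_mutated = wt_to_chimera(19)  # third ectodomain codon
last_mutated = wt_to_chimera(520)  # WT codon whose chimera number is 523
n_mutated_codons = last_mutated - first_mutated + 1
print(first_mutated, last_mutated, n_mutated_codons)
```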

In [4]:
# output seq files for both HK/19 and KS/17
hk19_chim_seq = get_input_seq('data/hk19_chimeric_coding_seq.fasta', (22, 523), 
                              'results/TargetedTilingPrimers_inputs/wsnha_hk19_chimeric_coding_seq.txt')
ks17_chim_seq = get_input_seq('data/ks17_chimeric_coding_seq.fasta', (22, 523), 
                              'results/TargetedTilingPrimers_inputs/wsnha_ks17_chimeric_coding_seq.txt')

Because the table of mutants to generate uses the WT H3 sequence numbering, the third codon of the ectodomain (where we want our first mutant) is site 19. Again, this corresponds directly to site 22 in the chimera.

For the targeted tiling primers input, the sequence to be mutated is uppercase, and it's assumed that the first uppercase site is codon 1. So all we need to do to keep this numbering consistent with chimera numbering is to change site 19 (in WT numbering) to site 1, that is, subtract 18 from every site.
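
To keep the three numbering schemes straight, it can help to write the conversions down once. The sketch below uses hypothetical helper names and assumes, based on the rows displayed further down, that the `gen_mut` column holds the WT-numbered site followed by the mutant amino acid.

```python
WT_OFFSET = 18       # WT H3 site 19 becomes script site 1
CHIMERA_OFFSET = 21  # chimera codon 22 becomes script site 1


def script_site_from_wt(wt_site):
    """Site number as the primer script sees it (first uppercase codon = 1)."""
    return wt_site - WT_OFFSET


def chimera_codon(script_site):
    """Codon number in the full chimeric coding sequence."""
    return script_site + CHIMERA_OFFSET


def wt_label(script_site, mutant):
    """WT-numbered mutation label, e.g. script site 1 with mutant A gives '19A'."""
    return f"{script_site + WT_OFFSET}{mutant}"


print(script_site_from_wt(19), chimera_codon(1), wt_label(1, "A"))
```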

In [5]:
hk19_targeted_muts = pd.read_csv('data/targeted_mutations_hk-45-2019.csv')
hk19_targeted_muts['site'] = hk19_targeted_muts['site'] - 18
hk19_targeted_muts

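| Unnamed: 0 | site | wildtype | mutant | gen_mut |
|---|---|---|---|---|
| 0 | 1 | I | A | 19A |
| 1 | 1 | I | D | 19D |
| 2 | 1 | I | E | 19E |
| 3 | 1 | I | F | 19F |
| 4 | 1 | I | G | 19G |
| ... | ... | ... | ... | ... |
| 2500 | 502 | G | S | 520S |
| 2501 | 502 | G | T | 520T |
| 2502 | 502 | G | V | 520V |
| 2503 | 502 | G | W | 520W |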


In [6]:
ks17_targeted_muts = pd.read_csv('data/targeted_mutations_ks-14-2017.csv')
ks17_targeted_muts['site'] = ks17_targeted_muts['site'] - 18
ks17_targeted_muts

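| Unnamed: 0 | site | wildtype | mutant | gen_mut |
|---|---|---|---|---|
| 0 | 1 | I | A | 19A |
| 1 | 1 | I | D | 19D |
| 2 | 1 | I | E | 19E |
| 3 | 1 | I | F | 19F |
| 4 | 1 | I | G | 19G |
| ... | ... | ... | ... | ... |
| 2500 | 502 | G | S | 520S |
| 2501 | 502 | G | T | 520T |
| 2502 | 502 | G | V | 520V |
| 2503 | 502 | G | W | 520W |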


We want to make absolutely sure that this numbering is consistent. So pull the uppercase sequence from our chimeric sequence input, and make sure it lines up with the wildtype AAs in our mutant input table, assuming that both numberings start at 'site 1'.

In [7]:
def check_numbering_conversion(wt_mut_csv, chim_seq): 
    # pare csv down to single site entries
    wt_mut_csv = wt_mut_csv.drop_duplicates(subset=['site'])
    
    # get the chimeric seq to be mutated (uppercase portion) and translate
    chim_seq = ''.join(x for x in chim_seq if not x.islower())
    chim_seq = Bio.Seq.translate(chim_seq)
    
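    # loop through every chimeric seq AA, defining the first as site 1.
    # If that site is in our mutant table, make sure the WT amino acid is consistent.
    # (membership is tested against the 'site' column only, not every value in the table)
    mut_sites = set(wt_mut_csv['site'].tolist())
    for site, chim_aa in enumerate(chim_seq, start=1):
        if site in mut_sites:
            wt_aa = wt_mut_csv.loc[wt_mut_csv['site'] == site, 'wildtype'].iloc[0]
            assert chim_aa == wt_aa, f'site {site}: chimera has {chim_aa}, table has {wt_aa}'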

In [8]:
check_numbering_conversion(hk19_targeted_muts, hk19_chim_seq)
check_numbering_conversion(ks17_targeted_muts, ks17_chim_seq)
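
To see what this check guards against, here is a toy sketch in which a table shifted by one site fails the comparison. It is pure Python with a hand-written codon table covering only the toy codons; it is not a full genetic code and does not replace the Biopython translation used above. All names are hypothetical.

```python
# Toy codon table covering only the codons used below (not a full genetic code).
TOY_CODONS = {"ATG": "M", "AAA": "K", "CCC": "P", "GGG": "G", "TTT": "F"}


def uppercase_translation(marked_seq):
    """Translate only the uppercase (to-be-mutated) part of the sequence."""
    coding = "".join(nt for nt in marked_seq if nt.isupper())
    return "".join(TOY_CODONS[coding[i:i + 3]] for i in range(0, len(coding), 3))


def numbering_matches(marked_seq, wildtype_by_site):
    """True if every table site (1-based) has the amino acid found in the sequence."""
    protein = uppercase_translation(marked_seq)
    return all(
        0 < site <= len(protein) and protein[site - 1] == aa
        for site, aa in wildtype_by_site.items()
    )


marked = "atgAAACCCGGGttt"                 # uppercase region translates to KPG
correct_table = {1: "K", 2: "P", 3: "G"}
shifted_table = {2: "K", 3: "P", 4: "G"}   # same residues, numbering off by one

print(numbering_matches(marked, correct_table))
print(numbering_matches(marked, shifted_table))
```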

Both checks pass without raising an assertion error, so the numbering is consistent. Now export the mutation csvs in the format required by the targeted tiling primers script. This is just a site column and a mutant column.

In [9]:
# index=False keeps the output to just the site and mutant columns
ks17_primer_input = ks17_targeted_muts[['site','mutant']]
ks17_primer_input.to_csv('results/TargetedTilingPrimers_inputs/ks17_muts_input.csv', index=False)

hk19_primer_input = hk19_targeted_muts[['site','mutant']]
hk19_primer_input.to_csv('results/TargetedTilingPrimers_inputs/hk19_muts_input.csv', index=False)
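
If the script expects exactly these two columns, checking the header of an exported file is a cheap safeguard. By default, pandas `to_csv` also writes the DataFrame index as an unnamed leading column. The sketch below uses only the standard library and inline example text rather than the real output files; `has_exact_columns` is a hypothetical helper.

```python
import csv
import io


def has_exact_columns(csv_text, expected=("site", "mutant")):
    """True if the first row of the csv text is exactly the expected header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return tuple(header) == tuple(expected)


with_index = ",site,mutant\n0,1,A\n1,1,D\n"  # shape written when the index is included
without_index = "site,mutant\n1,A\n1,D\n"    # shape written with index=False

print(has_exact_columns(with_index))
print(has_exact_columns(without_index))
```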

Now, the `results/TargetedTilingPrimers_inputs` directory has all of the files we need for running the [TargetedTilingPrimers script](https://github.com/jbloomlab/TargetedTilingPrimers) and generating primers.