## Multi-factor study of variants

This notebook calculates cumulative values of selected amino acid properties for each variant in a list and writes them to a table. Some graphs are also plotted.

#### Inputs 
* A list of variants, each written as a set of mutations over a backbone, that you want to compare.
* A reference table of amino acid properties. Ideally you would use your own; see the table below for inspiration.

| | Name | 3-Letter symbol | 1-Letter symbol | Molecular weight | Molecular formula | Residue formula | Residue Weight(-H20) | pKa | pKb | pKx | pi | Hydrophobicity at ph2 | Hydrophobicity at ph7|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0 | Alanine | Ala | A | 89.10 | C3H7NO2 | C3H5NO | 71.08 | 2.34 | 9.69 | – | 6 | 47.0 | 41.0|
|1 | Arginine | Arg | R | 174.20 | C6H14N4O2 | C6H12N4O | 156.19 | 2.17 | 9.04 | 12.48 | 10.76 | -26.0 | -14.0|
|2 | Asparagine | Asn | N | 132.12 | C4H8N2O3 | C4H6N2O2 | 114.11 | 2.02 | 8.8 | – | 5.41 | -41.0 | -28.0|
|3 | Aspartic acid | Asp | D | 133.11 | C4H7NO4 | C4H5NO3 | 115.09 | 1.88 | 9.6 | 3.65 | 2.77 | -18.0 | -55.0|
|4 | Cysteine | Cys | C | 121.16 | C3H7NO2S | C3H5NOS | 103.15 | 1.96 | 10.28 | 8.18 | 5.07 | 52.0 | 49.0|
|5 | Glutamic acid | Glu | E | 147.13 | C5H9NO4 | C5H7NO3 | 129.12 | 2.19 | 9.67 | 4.25 | 3.22 | 8.0 | -31.0|
|6 | Glutamine | Gln | Q | 146.15 | C5H10N2O3 | C5H8N2O2 | 128.13 | 2.17 | 9.13 | – | 5.65 | -18.0 | -10.0|
|7 | Glycine | Gly | G | 75.07 | C2H5NO2 | C2H3NO | 57.05 | 2.34 | 9.6 | – | 5.97 | 0.0 | 0.0|
|8 | Histidine | His | H | 155.16 | C6H9N3O2 | C6H7N3O | 137.14 | 1.82 | 9.17 | 6 | 7.59 | -42.0 | 8.0|
|9 | Hydroxyproline | Hyp | O | 131.13 | C5H9NO3 | C5H7NO2 | 113.11 | 1.82 | 9.65 | – | – | NaN | NaN|
|10 | Isoleucine | Ile | I | 131.18 | C6H13NO2 | C6H11NO | 113.16 | 2.36 | 9.6 | – | 6.02 | 100.0 | 99.0|
|11 | Leucine | Leu | L | 131.18 | C6H13NO2 | C6H11NO | 113.16 | 2.36 | 9.6 | – | 5.98 | 100.0 | 97.0|
|12 | Lysine | Lys | K | 146.19 | C6H14N2O2 | C6H12N2O | 128.18 | 2.18 | 8.95 | 10.53 | 9.74 | -37.0 | -23.0|
|13 | Methionine | Met | M | 149.21 | C5H11NO2S | C5H9NOS | 131.20 | 2.28 | 9.21 | – | 5.74 | 74.0 | 74.0|
|14 | Phenylalanine | Phe | F | 165.19 | C9H11NO2 | C9H9NO | 147.18 | 1.83 | 9.13 | – | 5.48 | 92.0 | 100.0|
|15 | Proline | Pro | P | 115.13 | C5H9NO2 | C5H7NO | 97.12 | 1.99 | 10.6 | – | 6.3 | -46.0 | -46.0|
|16 | Pyroglutamatic | Glp | U | 139.11 | C5H7NO3 | C5H5NO2 | 121.09 | – | – | – | 5.68 | NaN | NaN|
|17 | Serine | Ser | S | 105.09 | C3H7NO3 | C3H5NO2 | 87.08 | 2.21 | 9.15 | – | 5.68 | -7.0 | -5.0|
|18 | Threonine | Thr | T | 119.12 | C4H9NO3 | C4H7NO2 | 101.11 | 2.09 | 9.1 | – | 5.6 | 13.0 | 13.0|
|19 | Tryptophan | Trp | W | 204.23 | C11H12N2O2 | C11H10N2O | 186.22 | 2.83 | 9.39 | – | 5.89 | 84.0 | 97.0|
|20 | Tyrosine | Tyr | Y | 181.19 | C9H11NO3 | C9H9NO2 | 163.18 | 2.2 | 9.11 | 10.07 | 5.66 | 49.0 | 63.0|
|21 | Valine | Val | V | 117.15 | C5H11NO2 | C5H9NO | 99.13 | 2.32 | 9.62 | – | 5.96 | 79.0 | 76.0|



#### Steps
* Read this documentation.
* Clean up the input files: remove extra spaces, blank lines, etc.
* Edit the file names and sheet names in the cells below to match your own files.
* Enter the list of columns from the reference table that are to be studied.
* Check the delimiter between mutations within a variant and change it if required. The default is a space.
* Run all cells.
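
The clean-up step can also be done in code rather than by hand. The following is a minimal sketch, assuming each variant is a single string of mutations separated by one delimiter; `clean_variant` is a hypothetical helper and is not part of this notebook.

```python
def clean_variant(raw, delimiter=" "):
    """Trim a variant string and drop empty pieces left by repeated delimiters.

    Hypothetical helper: splits on the delimiter, strips whitespace
    (including tabs and newlines) around each mutation, discards empty
    pieces, and joins the rest with a single delimiter.
    """
    parts = (part.strip() for part in raw.split(delimiter))
    return delimiter.join(part for part in parts if part)


print(clean_variant("  A101L   D48P "))
print(clean_variant("A101L, D48P,,", delimiter=","))
```

Once the `mut` column holds only strings, the helper could be applied with `variants_df['mut'].map(clean_variant)` before the ratings are calculated.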

#### Imports

In [1]:
import pandas as pd
import re
import warnings
warnings.filterwarnings("ignore")
from openpyxl import load_workbook

In [12]:
def derieve_rating(variant, ref_table, rating_type):
    """Return the summed change in `rating_type` over all mutations in `variant`."""
    var_mut_list = variant.split(" ")  # Change the delimiter here.

    total_rating = 0

    # Loop through each mutation in the variant
    for mut in var_mut_list:
        org = mut[0]
        # Trailing letters, or '/' for a deletion (letters alone never match 'A<n>/')
        sub = re.findall(r'([a-zA-Z]+|/)$', mut)[0]
        pos = re.findall(r'(\d+)', mut)[0]  # Position; parsed but not used in the rating

        # Insertions are written as *<n>aAbB; the uppercase letters are the inserted residues
        if org == '*':
            insertion = list(filter(str.isupper, sub))
            change_insertion = 0
            # Loop through each inserted residue
            for each in insertion:
                insertion_rating = ref_table.loc[ref_table['1-Letter symbol'] == each, rating_type].item()
                change_insertion += insertion_rating
            change = change_insertion

        # Deletions are written as A<n>/
        elif sub == '/':
            change = -(ref_table.loc[ref_table['1-Letter symbol'] == org, rating_type].item())

        # Regular substitutions
        else:
            sub_rating = ref_table.loc[ref_table['1-Letter symbol'] == sub, rating_type].item()
            org_rating = ref_table.loc[ref_table['1-Letter symbol'] == org, rating_type].item()
            change = sub_rating - org_rating

        # Rating of the variant is the sum of all changes
        total_rating += change

    return total_rating
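
To make the mutation notation concrete, here is a small self-contained sketch that classifies one mutation string and works out its change for a single property. It is an illustration under stated assumptions, not the notebook's own code: `classify_mutation`, `mutation_change` and the `MOLECULAR_WEIGHT` dictionary are hypothetical names, and the dictionary only stands in for the reference table, with values copied from it.

```python
import re

# Hypothetical stand-in for the reference table: molecular weights of a few
# residues, copied from the reference table in this notebook.
MOLECULAR_WEIGHT = {"A": 89.10, "L": 131.18, "N": 132.12, "G": 75.07}


def classify_mutation(mut):
    """Return (kind, original, position, residues) for one mutation string."""
    match = re.fullmatch(r"([A-Z*])(\d+)([a-zA-Z]+|/)", mut)
    if match is None:
        raise ValueError(f"Unrecognised mutation: {mut!r}")
    org, pos, sub = match.groups()
    if org == "*":
        # Insertion written as *<n>aAbB: only uppercase letters are residues.
        return "insertion", None, int(pos), [c for c in sub if c.isupper()]
    if sub == "/":
        # Deletion written as A<n>/
        return "deletion", org, int(pos), []
    return "substitution", org, int(pos), [sub]


def mutation_change(mut, ratings):
    """Change in one property caused by a single mutation."""
    kind, org, _pos, residues = classify_mutation(mut)
    if kind == "insertion":
        return sum(ratings[r] for r in residues)  # inserted residues add their value
    if kind == "deletion":
        return -ratings[org]                      # deleted residue removes its value
    return ratings[residues[0]] - ratings[org]    # substitute minus original


print(classify_mutation("A101L"))
print(round(mutation_change("A101L", MOLECULAR_WEIGHT), 2))
```

For `A101L` the change in molecular weight is the value for leucine minus the value for alanine, 131.18 - 89.10. The rating of a variant with several mutations is the sum of such changes.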

In [3]:
amino_acid_ref = pd.read_excel('../../Amino_acid_reference.xlsx', 'Sheet1', header=0)
amino_acid_ref

Unnamed: 0,Name,3-Letter symbol,1-Letter symbol,Molecular weight,Molecular formula,Residue formula,Residue Weight(-H20),pKa,pKb,pKx,pi,Hydrophobicity at ph2,Hydrophobicity at ph7
0,Alanine,Ala,A,89.1,C3H7NO2,C3H5NO,71.08,2.34,9.69,–,6,47.0,41.0
1,Arginine,Arg,R,174.2,C6H14N4O2,C6H12N4O,156.19,2.17,9.04,12.48,10.76,-26.0,-14.0
2,Asparagine,Asn,N,132.12,C4H8N2O3,C4H6N2O2,114.11,2.02,8.8,–,5.41,-41.0,-28.0
3,Aspartic acid,Asp,D,133.11,C4H7NO4,C4H5NO3,115.09,1.88,9.6,3.65,2.77,-18.0,-55.0
4,Cysteine,Cys,C,121.16,C3H7NO2S,C3H5NOS,103.15,1.96,10.28,8.18,5.07,52.0,49.0
5,Glutamic acid,Glu,E,147.13,C5H9NO4,C5H7NO3,129.12,2.19,9.67,4.25,3.22,8.0,-31.0
6,Glutamine,Gln,Q,146.15,C5H10N2O3,C5H8N2O2,128.13,2.17,9.13,–,5.65,-18.0,-10.0
7,Glycine,Gly,G,75.07,C2H5NO2,C2H3NO,57.05,2.34,9.6,–,5.97,0.0,0.0
8,Histidine,His,H,155.16,C6H9N3O2,C6H7N3O,137.14,1.82,9.17,6,7.59,-42.0,8.0
9,Hydroxyproline,Hyp,O,131.13,C5H9NO3,C5H7NO2,113.11,1.82,9.65,–,–,,


#### List desired columns or properties 

In [4]:
properties_list = ['Molecular weight', 'Residue Weight(-H20)', 'pKa','pKb', 'pi', 'Hydrophobicity at ph2', 'Hydrophobicity at ph7']
#properties_list
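
The reference table marks missing values with "–", so a column such as `pi` may be read as text rather than as numbers, and summing it would then fail. The sketch below shows one possible check: confirm that every listed property is a column of the table, then coerce those columns to numeric. The small `ref` frame and the `properties` list are hypothetical stand-ins for `amino_acid_ref` and `properties_list`.

```python
import pandas as pd

# Hypothetical miniature of the reference table; '–' marks a missing value.
ref = pd.DataFrame({
    "1-Letter symbol": ["A", "H", "O"],
    "pKa": [2.34, 1.82, 1.82],
    "pi": [6, 7.59, "–"],
})
properties = ["pKa", "pi"]

# Fail early if a requested property is not a column of the table.
missing = [p for p in properties if p not in ref.columns]
if missing:
    raise KeyError(f"Properties not in reference table: {missing}")

# Convert each property to numbers; non-numeric entries such as '–' become NaN.
for p in properties:
    ref[p] = pd.to_numeric(ref[p], errors="coerce")

print(ref.dtypes)
```

With this approach a variant that involves a residue with a missing value gets NaN for that property instead of raising an error part-way through the loop.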

#### The variant list

In [10]:
variants_df = pd.read_excel('../../Project/Data_files/Variants_mutations.xlsx', 'Purified_Protein', header=0)
variants_df

Unnamed: 0,mut,Temp,Time(min),Detergent,HIF(WT),HIF(WT_SD)
0,A101L,30,30,90% Model,1.743793,0.306161
1,A101N,30,30,90% Model,2.177001,0.453622
2,A111L,30,30,90% Model,1.151486,0.132087
3,A1M,30,30,90% Model,0.836132,0.069533
4,A23M,30,30,90% Model,1.071493,0.083965
5,A23S,30,30,90% Model,1.216458,0.103861
6,D385A,30,60,30% Model,1.283781,0.145094
7,D385F,30,60,30% Model,1.393745,0.353281
8,D48P,30,30,90% Model,2.975107,0.210086
9,D63C,30,30,90% Model,1.808257,0.130443


In [23]:
## Iterate over each variant and each specified property
for index, row in variants_df.iterrows():
  # Skip rows that have no mutation string
  if pd.isna(row['mut']):
    continue
  for j in properties_list:
    each_metrics = derieve_rating(row['mut'], amino_acid_ref, j)
    # Create the result column on first use, filled with NaN
    if j not in variants_df.columns:
      variants_df[j] = float('nan')
    variants_df.loc[index,j] = each_metrics
variants_df

In [25]:
save_location = '/foo/bar/Data_files/Variants_mutations.xlsx'
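# Append to the existing workbook and replace the sheet if it already exists.
# mode='a' with if_sheet_exists needs pandas 1.3 or newer; assigning writer.book
# and calling writer.save() no longer work in pandas 2.x.
with pd.ExcelWriter(save_location, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    variants_df.to_excel(writer, sheet_name = 'Purified_protein')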

#### Plot Graphs

In the cell below, change the columns used for the X and Y axes, and the titles of the graph and axes, as required.
The `text` property supplies additional hover information.

In [None]:
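import plotly.graph_objs as go
import plotly.offline as py  # assumed: `py` refers to plotly's offline module

py.init_notebook_mode(connected=True)  # enable inline plots in the notebook
offline_config = dict(showLink=False)  # assumed value; not defined elsewhere in this notebook

# Column names must match variants_df exactly
X_axis = variants_df['HIF(WT)']
Y_axis = variants_df['Molecular weight']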

trace = go.Scatter(
  x = X_axis,
  y = Y_axis,
  mode = 'markers',
  marker = dict(
    size = 10,
    opacity = 0.3
  ),
  text = variants_df['mut'] ## Hover information. Choose a column that exists in variants_df
)

layout = go.Layout(
  title = 'HIF vs Molecular weight',
  xaxis = dict(
    title = 'HIF'
  ),
  yaxis = dict(
    title = 'Molecular weight'
  ),
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, config=offline_config)