## Introduced mutation deducer notebook

This notebook deduces the introduced mutations for variants in the Knowledge Capture (KC) workbook. Backbone mutations are looked up in the fermentation Excel file, and the mutations introduced in each variant are then deduced relative to its backbone.

The notebook is an alternative to bulky VLOOKUP formulas in Excel.

A future version may look up the backbone mutation information directly from sequoia.
   
### Libraries required

In [1]:
## Import libraries
import pandas as pd
import numpy as np
from openpyxl import load_workbook

### Read in files from SharePoint and proman

Check the links before running.

In [2]:
## Read the 'Purified data' sheet from the KC workbook on SharePoint
## Store it temporarily in a local file. This file gets overwritten at the end of this notebook
inpath = 'http://promanweb/sites/2505/Working%20Documents/Test_KC_wb.xlsx'
outpath = '/z/home/smbk/NoteBooks/Project/Mannanase/Test_KC_wb.xlsx'

!curl -u : --negotiate "{inpath}" --output {outpath}
    
# Load spreadsheet: xl
xl = pd.ExcelFile(outpath)

# Load the sheet into a dataframe and verify that the headings and data look OK
purified_data = pd.read_excel(xl, 'Purified data', skiprows=0, header=0) 
purified_data.head()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    16  100    16    0     0    175      0 --:--:-- --:--:-- --:--:--   175
100 1443k  100 1443k    0     0  6053k      0 --:--:-- --:--:-- --:--:-- 6053k


Unnamed: 0,SL No.,Batch,Date,Type,Back bone,Sample,UID,Total Mutation,Position,Mutation introduced,...,CV (%),IF (Average),IF_SD,CV (%).1,T1/2(Ave),T1/2(Ave_SD),HIF(BB),HIF(SD),HIF(WT),HIF(WT_SD)
0,1,Batch 1,,SDM,GH26_Pill_0008,GH26_Pill_0008,U1N9E,WT,,WT,...,8.991258,1.0,0.089913,8.991258,34.110758,2.527275,1.0,0.07409,1.0,0.07409
1,2,Batch 1,,SDM,GH26_Pill_0008,GH26_Pill_0025,U1NEQ,Q168D Q183E,183.0,Q183E,...,10.129505,0.98406,0.09968,10.129505,33.681446,2.828973,0.987414,0.082935,0.987414,0.082935
2,3,Batch 1,,SDM,GH26_Pill_0008,GH26_Pill_0026,U1NER,Q168D K198N,198.0,K198N,...,2.81423,0.950011,0.026735,2.81423,32.688056,0.726771,0.958292,0.021306,0.958292,0.021306
3,4,Batch 1,,SDM,GH26_Pill_0008,GH26_Pill_0027,U1NES,Q168D A210E,210.0,A210E,...,7.423872,1.172466,0.087042,7.423872,39.224015,2.700435,1.149902,0.079167,1.149902,0.079167
4,5,Batch 1,,SDM,GH26_Pill_0008,GH26_Pill_0028,U1NET,Q168D A210Q,210.0,A210Q,...,5.077142,1.094873,0.055588,5.077142,36.810746,1.654572,1.079154,0.048506,1.079154,0.048506


In [3]:
## Read from sharepoint the file of interest - Fermentation info 
## Temp store it in a file
inpath1 = 'https://zymernet.nzcorp.net/sites/NZ_IN_New_Detergent_Mannanase/Shared%20Documents/GH26-NZIN/Fermentation%20of%20SSL%20hits.xlsx'
outpath1 = 'test1.xlsx'

!curl -u : --negotiate "{inpath1}" --output {outpath1}
    
# Load spreadsheet: xl
xl1 = pd.ExcelFile(outpath1)

# Load sheet into a dataframe
variant_data = pd.read_excel(xl1, 'Fermentation SSL', skiprows=0, header=0) 

## Remove outpath now that we have the df
!rm {outpath1}

## Select the columns of interest ('Codes' and 'Total mutations') and rename them to match the KC naming convention
variant_data = variant_data[['Codes', 'Total mutations']]
variant_data = variant_data.rename(columns={'Codes' : 'Sample', 'Total mutations': 'Total Mutation'})

variant_data.head()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    16  100    16    0     0     73      0 --:--:-- --:--:-- --:--:--    73
100 1528k  100 1528k    0     0  2949k      0 --:--:-- --:--:-- --:--:-- 2949k


Unnamed: 0,Sample,Total Mutation
0,GH26_Pill8_0001,Q168D L413A
1,GH26_Pill8_0002,Q168D K408C
2,GH26_Pill8_0003,Q168D L413V
3,GH26_Pill8_0004,Q168D L413T
4,GH26_Pill8_0005,Q168D K408S
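
Because `variant_data` serves as a lookup table keyed on `Sample`, a duplicated code in the fermentation sheet would duplicate rows in a left join and misalign any values copied back by index. The following is a minimal sketch on made-up data (the sample IDs `S1` to `S3` and the frame names `kc` and `ferm` are hypothetical, not taken from the workbooks) of how such a backfill could be guarded with `drop_duplicates` and the `validate` argument of `merge`:

```python
import pandas as pd

# Hypothetical stand-ins for the KC sheet and the fermentation sheet
kc = pd.DataFrame({'Sample': ['S1', 'S2', 'S3'],
                   'Total Mutation': ['WT', 0, 'Q168D A210E']})
ferm = pd.DataFrame({'Sample': ['S2', 'S2'],
                     'Total Mutation': ['Q168D L413A', 'Q168D L413A']})

# Keep one row per sample so the left join cannot add rows
ferm_unique = ferm.drop_duplicates('Sample')

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
merged = kc.merge(ferm_unique, how='left', on='Sample',
                  suffixes=('_kc', '_ferm'), validate='many_to_one')

# Prefer the fermentation value; fall back to the KC value where it is missing
kc['Total Mutation'] = merged['Total Mutation_ferm'].combine_first(
    merged['Total Mutation_kc'])
print(kc)
```

With unique keys on the right-hand side, `merged` has exactly one row per row of `kc`, so the index-aligned assignment in the last step is safe.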


### Functions used in the notebook

In [4]:
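## Compare the backbone mutations with the variant mutations.
## Both arguments are whitespace-separated mutation strings, e.g. 'Q168D Q183E'.
## The result is the mutations introduced in the variant, as a space-separated string.
def introduced_mutations(bb_mut, var_mut):
    bb_list = bb_mut.split()
    var_list = var_mut.split()

    # Keep the variant mutations that are absent from the backbone, in their original order
    diff_list = [mut for mut in var_list if mut not in bb_list]
    return ' '.join(diff_list)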
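
The helper above expects both arguments to be strings, so a missing value (NaN) read from Excel would raise an error. As a hedged sketch, a tolerant variant might look like the following; the name `introduced_mutations_safe` is hypothetical and not part of the notebook:

```python
import pandas as pd

def introduced_mutations_safe(bb_mut, var_mut):
    """Return the mutations in var_mut that are absent from bb_mut, in order."""
    # Treat NaN/None as "no mutations" so that split() is only called on strings
    bb_set = set() if pd.isna(bb_mut) else set(str(bb_mut).split())
    var_list = [] if pd.isna(var_mut) else str(var_mut).split()
    return ' '.join(mut for mut in var_list if mut not in bb_set)

# Backbone carries Q168D; the variant adds Q183E on top of it
print(introduced_mutations_safe('Q168D', 'Q168D Q183E'))   # Q183E
# A missing backbone value means every variant mutation counts as introduced
print(introduced_mutations_safe(float('nan'), 'Q168D'))    # Q168D
```

Using a set for the backbone keeps the membership test cheap, while iterating over the variant list preserves the order in which the mutations were written.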


### The main magic happens now

In [5]:
#### Part 1 (get mutation information from variant data)
## Merge purified data from KC with Variant data using a left join
merged_data = pd.merge(purified_data, variant_data, how='left', on='Sample')

purified_data['Total Mutation'] = merged_data['Total Mutation_y'].where(~merged_data['Total Mutation_y'].isnull(), purified_data['Total Mutation'])
#purified_data.head()
########################################

#### Part 2 (clean and  prepare purified data)
## Clean up undefined values in the Total Mutation column.
## 0 is replaced with NaN first, and then all NaNs are converted to blanks (for ease of reading).
## Assigning the result back avoids in-place edits on a column selection.
purified_data['Total Mutation'] = purified_data['Total Mutation'].replace(0, np.nan).fillna('')
#purified_data

## If there is no backbone mutation column, add it
if 'Backbone_mutation' not in purified_data.columns:
    purified_data['Backbone_mutation'] = ''
########################################

#### Part 3 (update introduced mutations)
## Loop through all rows of the data frame and set the value of Backbone_mutation
## by looking up the backbone in the Sample column and getting its Total Mutation value
for index, row in purified_data.iterrows():  # Iterate through the dataframe
    backbone = row['Back bone']  # Get the backbone for each row

    try:
        # Check if the backbone has a record under the Sample column. If it does, get the first such row
        sample_mut_row = purified_data[purified_data['Sample'] == backbone].index[0]
    except IndexError:
        # No record found for the backbone, so leave its mutation empty
        print("No mutation found for BB " + str(backbone))
        bb_mut_val = ''
    else:
        # Get the Total Mutation value for the row found above.
        # This runs only when the lookup succeeded, so a stale row is never reused.
        bb_mut_val = purified_data.loc[sample_mut_row, 'Total Mutation']

    # Set that as the value for the backbone mutation
    purified_data.loc[index, 'Backbone_mutation'] = bb_mut_val

    var_mut_val = row['Total Mutation']
    intro_mut_val = introduced_mutations(bb_mut_val, var_mut_val)
    purified_data.loc[index, 'Mutation introduced'] = intro_mut_val

No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0176
No mutation found for BB GH26_Pill8_0175
No mutation found for BB GH26_Pill8_0175
No mutation found for BB GH26_Pill8_0175
No mutation foun
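
The row-by-row loop above can also be expressed as a lookup with `Series.map`, which avoids filtering the whole frame once per row. This is a minimal sketch on made-up data, assuming the same column names as the KC sheet; the sample IDs (`BB_1`, `VAR_1`, `VAR_2`, `BB_MISSING`) and the helper `diff` are hypothetical:

```python
import pandas as pd

# Toy frame with the KC column names; the values are invented for illustration
toy = pd.DataFrame({
    'Sample': ['BB_1', 'VAR_1', 'VAR_2'],
    'Back bone': ['BB_1', 'BB_1', 'BB_MISSING'],
    'Total Mutation': ['Q168D', 'Q168D Q183E', 'Q168D K198N'],
})

# Sample -> Total Mutation lookup; the first occurrence wins, as in the loop
lookup = toy.drop_duplicates('Sample').set_index('Sample')['Total Mutation']

# Backbones without a record map to NaN, which is blanked out
toy['Backbone_mutation'] = toy['Back bone'].map(lookup).fillna('')

def diff(row):
    # Variant mutations that are absent from the backbone, in original order
    bb = set(row['Backbone_mutation'].split())
    return ' '.join(m for m in row['Total Mutation'].split() if m not in bb)

toy['Mutation introduced'] = toy.apply(diff, axis=1)
print(toy[['Sample', 'Backbone_mutation', 'Mutation introduced']])
```

In this sketch a backbone with no record gets an empty `Backbone_mutation`, so all of the variant's mutations are reported as introduced.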

### Save the information back in proman

In [6]:
## Overwrite the 'Purified data' sheet in the local copy with the new information.
## Finally, upload the workbook to SharePoint.
sp_location = 'http://promanweb/sites/2505/Working%20Documents/'
save_location = '/z/home/smbk/NoteBooks/Project/Mannanase/Test_KC_wb.xlsx'

# Append mode keeps the other sheets of the workbook intact.
# Note: if_sheet_exists='overlay' needs pandas >= 1.4. Older versions instead assigned
# writer.book / writer.sheets and called writer.save(), which recent pandas no longer allows.
with pd.ExcelWriter(save_location, engine='openpyxl', mode='a', if_sheet_exists='overlay') as writer:
    purified_data.to_excel(writer, sheet_name='Purified data')

!curl --negotiate -u : --upload-file "{save_location}" "{sp_location}"

Below is a test snippet to verify that the introduced mutation calculation works. If required, convert the cell from markdown to code and run it.

## Test
bb_mut_val = purified_data[purified_data['Sample']=='GH26_Pill8_0001']
bb_mut_val


purified_data.loc[purified_data['Sample'] == 'GH26_Pill8_0001', 'Total Mutation'] = 'abc'
purified_data.head()

bb_mut_val1 = purified_data[purified_data['Sample']=='GH26_Pill8_0001']
bb_mut_val1

purified_data.loc[:,['Back bone', 'Sample', 'Backbone_mutation', 'Total Mutation', 'Mutation introduced']]