# Overview

In this example, the Fast Fourier Transform (FFT) is used to convert the amino acid sequence of a protein into two-dimensional spectral data. These data are then combined with least squares regression to train a machine learning model that predicts the activity of new mutation sites.
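As a rough illustration of the encoding step, the sketch below maps each residue to a numeric property value and takes the magnitude of the FFT of that signal. The property scale `TOY_SCALE` and the helper `encode_spectrum` are hypothetical and made up for illustration. The actual encoding lives in the `data` and `model` modules, relies on AAindex entries, and may differ in normalization and other details.

```python
import numpy as np

# Hypothetical per-residue property values (NOT a real AAindex entry).
TOY_SCALE = {"A": 0.1, "D": -0.8, "H": -0.4, "K": -0.9, "N": -0.6, "Y": 0.3}

def encode_spectrum(seq, scale=TOY_SCALE):
    """Map a sequence to numbers and return its FFT magnitude spectrum."""
    signal = np.array([scale[aa] for aa in seq], dtype=float)
    signal = signal - signal.mean()          # remove the constant (zero-frequency) offset
    spectrum = np.abs(np.fft.fft(signal))    # magnitude of each frequency component
    return spectrum[: len(seq) // 2 + 1]     # keep the non-redundant half (real input)

wild_type = "ADHKNYAD"   # made-up sequence
mutant = "ANHKNYAD"      # the same sequence carrying D2N
print(encode_spectrum(wild_type).round(3))
print(encode_spectrum(mutant).round(3))
```

A single substitution changes the whole spectrum rather than one entry, which is why spectral features can carry information about the sequence context of a mutation.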


In [1]:
FFT_src_folder_path = "./src/"

import sys
sys.path.insert(0, FFT_src_folder_path)


%load_ext autoreload
%autoreload 2

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from data import getxulie
from data import getseqs
from data import getmutimutants,getsinglemutants
from model import index_search
from model import make_prediction

In [2]:
# the main function of every task
def main(wt_seq,train_m,target,predict_m,task_name,flag,cv):

    # Generate the sequences of the mutants
    train_m_dict=getseqs(wt_seq,train_m)
    if isinstance(predict_m, int):  ## Specify the sites for performing saturation mutagenesis
        predict_m_dict = getsinglemutants(wt_seq,predict_m)
    elif '.csv' in predict_m:  ## Load pre-generated mutant sequences from the input folder
        predict_m_dict = pd.read_csv('./input/'+predict_m,index_col=0).to_dict()['0']
    else:  ## Default to combinatorial mutagenesis at specified sites
        predict_m_dict = getmutimutants(wt_seq,predict_m)

    #Screen the best AAindex
    indexlist=[]
    for i in range(flag):
        screenindex,score = index_search(train_m_dict,target,indexlist,cv)
        screenindex_= '_'.join(screenindex)
        score.to_csv(f'./output/{task_name}_{screenindex_}'+'.csv')
        indexlist.append(screenindex[-1])
        print('The best index of round '+str(i+1)+' : '+screenindex_)

    single_m = getseqs(wt_seq,[m for m in train_m if '/' not in m])
    #Model prediction
    all_m =dict(single_m,**predict_m_dict)  #Combine single and multiple mutants
    # Use the top-ranked index combination; .iloc[0] selects the first row by position
    result = make_prediction(train_m_dict,target,all_m,score.index[0].split('_'),score['n_components'].iloc[0])
    result=result.sort_values(by=score.index[0],ascending=False)
    return result


## SingleForMulti
In this example, data of single-point mutations is utilized as the training set. After selecting suitable AAindex features, the best features are used to build a model, which is then applied to predict the mutation landscape.
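Variant labels in this notebook, such as `D2N` or `D2N/H62Y`, appear to follow the usual wild-type residue, position, mutant residue pattern, with `/` separating the mutations of a multi-site variant. The sketch below shows one way such labels could be turned into sequences. `apply_mutations` is a hypothetical helper, not the `getseqs` function imported above, and it assumes 1-based positions.

```python
import re

def apply_mutations(wt_seq, variant):
    """Return the mutant sequence for a label such as 'D2N' or 'D2N/H62Y'."""
    seq = list(wt_seq)
    for label in variant.split("/"):
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
        if match is None:
            raise ValueError(f"Unrecognized mutation label: {label!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        # Positions are assumed to be 1-based; check the wild-type residue first.
        if seq[pos - 1] != old:
            raise ValueError(f"{label}: expected {old} at position {pos}, found {seq[pos - 1]}")
        seq[pos - 1] = new
    return "".join(seq)

toy_wt = "MDKHV"  # made-up sequence, not the IFRS sequence
print(apply_mutations(toy_wt, "D2N"))      # MNKHV
print(apply_mutations(toy_wt, "D2N/K3N"))  # MNNHV
```

Checking the wild-type residue before substituting catches labels that were numbered against a different reference sequence.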

### INPUT

Sequence file: `seq_IFRS.txt`  # the IFRS sequence

Traindata: `data.xlsx` (sheet_name: `Trainset1`)  # The training set, including the fitness of the single-point mutations and WT.

Then put them in the `input` folder.

### OUTPUT
`result_SingleForMulti_pred.csv`  # The predicted fitness of the specified combinatory mutants is saved in the `output` folder.

### FOR USER
First, create a file named `data.xlsx` in the `input` folder and put the training set in a sheet named `Trainset1`. The variants go in the `Variants` column and their fitness values in the `Fitness` column. Also create a file named `XXX.txt` in the `input` folder and put the WT sequence in it.

Then change the following parameters in the code:

`seq_WT_protein_path`  # the path of the original amino acid sequence

`Combinatory_mutants`  # the mutation sites to be combined

`output_file`  # the name of the output file

`flag`  # the number of indexes to be screened

`cv`  # the number of cross-validation folds

Then run the code.

Finally, the predicted fitness of the specified combinatory mutants will be saved in the `output` folder.
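To make "mutation sites to be combined" concrete, the sketch below enumerates every non-empty combination of a list of single mutations. `combine_mutations` is a hypothetical helper written for illustration. It is not the `getmutimutants` function, whose handling of the wild type and of label ordering is not shown in this notebook.

```python
from itertools import combinations

def combine_mutations(single_mutations):
    """Yield every non-empty combination of single mutations as 'A/B/C' labels."""
    for size in range(1, len(single_mutations) + 1):
        for combo in combinations(single_mutations, size):
            yield "/".join(combo)

sites = ['D2N', 'K3N', 'R19H', 'H29R', 'V31I', 'T56P',
         'R61K', 'H62Y', 'H63Y', 'A100E', 'T122S', 'S193R']
labels = list(combine_mutations(sites))
print(len(labels))   # 4095 non-empty combinations (2**12 - 1)
print(labels[:3])    # ['D2N', 'K3N', 'R19H']
```

With 12 sites there are 2^12 = 4096 subsets once the unmutated wild type is counted, so the number of sequences doubles with every site added to the list.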

In [3]:
task_name= 'SingleForMulti'  # the name of the task

In [4]:
seq_WT_protein_path='./input/seq_IFRS.txt'  ## Specify the original sequence
Combinatory_mutants= ['D2N','K3N','R19H','H29R','V31I','T56P','R61K','H62Y','H63Y','A100E','T122S','S193R'] 
output_file = f'result_{task_name}_pred'  # f-string, so that {task_name} is substituted

wt_seq = getxulie(seq_WT_protein_path)  
data = pd.read_excel('./input/data.xlsx',sheet_name='Trainset1')  # Specify the training set
target = data['Fitness'].values
train_m =  [m for m in data['Variants'].tolist() if m==m]  # m==m is False for NaN, so empty cells are dropped
predict_m =Combinatory_mutants  # Specify the mutation sites to be combined

flag=1  # the number of indexes to be screened
cv=len(train_m) # the fold of cross validation, here represents leave-one-out cross validation


In [5]:
# The predicted fitness of all combinatory mutants
result = main(wt_seq,train_m,target,predict_m,task_name,flag,cv)

# The predicted Fitness of single, double and triple mutants
result.loc[[m for m in result.index if m.count('/')<3],].head(10)

Processing: 100%|██████████| 566/566 [00:35<00:00, 15.94it/s]


The best index of round 1 : OOBM850103
The number of mutants for training: 13
The number of mutants for prediction: 4096


| Variant | OOBM850103 |
| --- | --- |
| D2N/R61K/H62Y | 4.817919 |
| D2N/V31I/H62Y | 4.741733 |
| D2N/H62Y | 4.581115 |
| D2N/H62Y/A100E | 4.461181 |
| D2N/H29R/H62Y | 4.204006 |
| D2N/H62Y/S193R | 4.197835 |
| D2N/K3N/H62Y | 4.162269 |
| D2N/V31I/R61K | 4.058469 |
| D2N/R61K/A100E | 4.026157 |
| D2N/R19H/H62Y | 3.984228 |


In [6]:
# save the result to csv in the output folder
result.to_csv(f'./output/{output_file}'+'.csv')

## MultiForCom1
In this example, data from single-, double- and triple-site mutations is used as the training set. After suitable AAindex features are selected, the best features are used to build a model, which is then applied to predict the mutation landscape. The top 8 predicted mutants are then selected for experimental validation.

### INPUT

Sequence file: `seq_IFRS.txt`  # the WT sequence

Traindata: `data.xlsx` (sheet_name: `Trainset2`)  # The training set, including the fitness of the single-point and multi-point mutations and WT.

Then put them in the `input` folder.

### OUTPUT
`result_MultiForCom1_pred.csv`  # The predicted fitness of the specified combinatory mutants is saved in the `output` folder.


### FOR USER
First, create a file named `data.xlsx` in the `input` folder and put the training set in a sheet named `Trainset2`. The variants go in the `Variants` column and their fitness values in the `Fitness` column. Also create a file named `XXX.txt` in the `input` folder and put the WT sequence in it.

Then change the following parameters in the code:

`seq_WT_protein_path`  # the path of the original amino acid sequence

`Combinatory_mutants`  # the mutation sites to be combined

`output_file`  # the name of the output file

`flag`  # the number of indexes to be screened

`cv`  # the number of cross-validation folds

Then run the code.

Finally, the predicted fitness of the specified combinatory mutants will be saved in the `output` folder.
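The `cv` parameter sets the number of cross-validation folds. Setting it to the number of training variants, as the code cell below does with `cv=len(train_m)`, gives leave-one-out cross-validation: each variant is held out once while the model is fitted on all the others. The sketch below shows the splitting logic in plain Python. `kfold_indices` is a hypothetical helper, and whether `index_search` shuffles the data or delegates to a library splitter is not shown in this notebook.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

# n_folds == n_samples: every test fold holds exactly one sample (leave-one-out).
folds = list(kfold_indices(n_samples=5, n_folds=5))
for train, test in folds:
    print(train, test)
```

Leave-one-out uses the data most fully when only a few dozen variants are available, at the cost of fitting one model per variant for every candidate index.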

In [7]:
task_name= 'MultiForCom1'  # the name of the task

In [8]:
seq_WT_protein_path='./input/seq_IFRS.txt'  ## Specify the original sequence
# Specify the mutation sites to be combined
Combinatory_mutants= ['D2N','K3N','R19H','H29R','V31I','T56P','R61K','H62Y','H63Y','A100E','T122S','S193R']  
output_file = f'result_{task_name}_pred'  # f-string, so that {task_name} is substituted

wt_seq = getxulie(seq_WT_protein_path)  
data = pd.read_excel('./input/data.xlsx',sheet_name='Trainset2')
target = data['Fitness'].values
train_m =  [m for m in data['Variants'].tolist() if m==m]  # m==m is False for NaN, so empty cells are dropped
predict_m =Combinatory_mutants

flag=1  # the number of indexes to be screened
cv=len(train_m)  # the fold of cross validation, here represents leave-one-out cross validation


In [9]:
result = main(wt_seq,train_m,target,predict_m,task_name,flag,cv)
result.head(8)

Processing: 100%|██████████| 566/566 [01:09<00:00,  8.12it/s]


The best index of round 1 : RADA880104
The number of mutants for training: 38
The number of mutants for prediction: 4096


| Variant | RADA880104 |
| --- | --- |
| D2N/K3N/T56P/R61K/H62Y/S193R | 8.163454 |
| D2N/K3N/V31I/T56P/R61K/H62Y/S193R | 8.133763 |
| D2N/K3N/T56P/R61K/H62Y/T122S/S193R | 8.133705 |
| D2N/K3N/V31I/T56P/R61K/H62Y/T122S/S193R | 8.128409 |
| D2N/T56P/R61K/H62Y/S193R | 8.065958 |
| D2N/V31I/T56P/R61K/H62Y/S193R | 8.064915 |
| D2N/V31I/T56P/R61K/H62Y/T122S/S193R | 8.059374 |
| D2N/T56P/R61K/H62Y/T122S/S193R | 8.052968 |


In [10]:
result.to_csv(f'./output/{output_file}'+'.csv') # save the result to csv in the output folder

## Com1_SingleForSingle
In this example, data from single mutations based on Com1-IFRS is used as the training set to predict new single mutations.

### INPUT

Sequence file: `seq_Com1.txt`  # the Com1-IFRS sequence

Traindata: `data.xlsx` (sheet_name: `Trainset3`)  # The training set, including the fitness of the single-point mutations and WT based on Com1-IFRS.

Then put them in the `input` folder.

### OUTPUT
`result_Com1_SingleForSingle_pred.csv`  # The predicted fitness of the specified single-point mutants is saved in the `output` folder.


### FOR USER
First, create a file named `data.xlsx` in the `input` folder and put the training set in a sheet named `Trainset3`. The variants go in the `Variants` column and their fitness values in the `Fitness` column. Also create a file named `XXX.txt` in the `input` folder and put the WT sequence in it.

Then change the following parameters in the code:

`seq_WT_protein_path`  # the path of the original amino acid sequence

`Mutations_scaning_domain`  # the domain to be screened, e.g. 240 refers to the first 240 amino acids of the WT sequence

`output_file`  # the name of the output file

`flag`  # the number of indexes to be screened

`cv`  # the number of cross-validation folds

Then run the code.

Finally, the predicted fitness of the specified single-point mutants will be saved in the `output` folder.
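Passing an integer as `predict_m` makes `main` take the saturation mutagenesis branch. The sketch below shows one way to enumerate every single substitution in the first `domain_length` positions. `saturation_mutants` is a hypothetical helper, not the `getsinglemutants` function, so the number of sequences it produces need not match the count reported by the real helper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def saturation_mutants(wt_seq, domain_length):
    """Return {label: sequence} for all single substitutions in the first domain_length positions."""
    mutants = {}
    for pos in range(1, min(domain_length, len(wt_seq)) + 1):
        old = wt_seq[pos - 1]  # positions are assumed to be 1-based
        for new in AMINO_ACIDS:
            if new == old:
                continue  # skip the substitution that leaves the residue unchanged
            mutants[f"{old}{pos}{new}"] = wt_seq[:pos - 1] + new + wt_seq[pos:]
    return mutants

toy_wt = "MDKHV"  # made-up sequence, not the Com1-IFRS sequence
library = saturation_mutants(toy_wt, domain_length=3)
print(len(library))    # 57 = 3 positions x 19 substitutions
print(library["D2N"])  # MNKHV
```

The library grows linearly with the scanned length, so scanning a whole domain stays cheap compared with combining even a dozen sites.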


In [11]:
task_name= 'Com1_SingleForSingle'  # the name of the task

In [12]:
seq_WT_protein_path='./input/seq_Com1.txt'  ## Specify the original sequence
Mutations_scaning_domain = 240 # The tRNA binding domain, which includes 240 amino acids
output_file = f'result_{task_name}_pred'  # f-string, so that {task_name} is substituted

wt_seq = getxulie(seq_WT_protein_path)   ## Read the original sequence
data = pd.read_excel('./input/data.xlsx',sheet_name='Trainset3')

target = data['Fitness'].values
train_m =  [m for m in data['Variants'].tolist() if m==m]  # m==m is False for NaN, so empty cells are dropped
predict_m=Mutations_scaning_domain # An integer selects saturation mutagenesis over the tRNA binding domain


In [13]:
flag=3  # the number of indexes to be screened
cv=10 # the fold of cross validation
result = main(wt_seq,train_m,target,predict_m,task_name,flag,cv)
result.head(10)

Processing: 100%|██████████| 566/566 [00:35<00:00, 16.07it/s]


The best index of round 1 : QIAN880114


Processing: 100%|██████████| 566/566 [00:43<00:00, 12.99it/s]


The best index of round 2 : QIAN880114_OOBM770105


Processing: 100%|██████████| 566/566 [01:18<00:00,  7.18it/s]


The best index of round 3 : QIAN880114_OOBM770105_QIAN880125
The number of mutants for training: 96
The number of mutants for prediction: 4801


| Variant | QIAN880114_OOBM770105_QIAN880125 |
| --- | --- |
| H63G | 2.060392 |
| H63S | 1.63696 |
| H63A | 1.545795 |
| E199G | 1.544404 |
| N80I | 1.516855 |
| E199S | 1.444043 |
| K67G | 1.415897 |
| H63T | 1.410607 |
| D76G | 1.408622 |
| H28A | 1.384394 |


In [14]:
result.to_csv(f'./output/{output_file}'+'.csv') # save the result to csv in the output folder

## Com1_MultiForCom2
In this example, data from single and double mutations based on Com1-IFRS is used as the training set to predict combinatory mutants.

### INPUT

Sequence file: `seq_Com1.txt`  # the Com1-IFRS sequence

Traindata: `data.xlsx` (sheet_name: `Trainset4`)  # The training set, including the fitness of the single-point and multi-point mutations and WT based on Com1-IFRS.

### OUTPUT
`result_Com1_MultiForCom2_pred.csv`  # The predicted fitness of the specified combinatory mutants is saved in the `output` folder.


### FOR USER
First, create a file named `data.xlsx` in the `input` folder and put the training set in a sheet named `Trainset4`. The variants go in the `Variants` column and their fitness values in the `Fitness` column. Also create a file named `XXX.txt` in the `input` folder and put the WT sequence in it.

Then change the following parameters in the code:

`seq_WT_protein_path`  # the path of the original amino acid sequence

`Combinatory_mutants`  # the mutation sites to be combined, or the name of a `.csv` file in the `input` folder that stores the mutant sequences

`output_file`  # the name of the output file

`flag`  # the number of indexes to be screened

`cv`  # the number of cross-validation folds

Then run the code.

Finally, the predicted fitness of the specified combinatory mutants will be saved in the `output` folder.
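With `flag` greater than 1, the loop in `main` calls `index_search` once per round and passes in the indexes chosen so far, which reads like greedy forward selection: each round keeps the earlier picks and adds the single index that scores best alongside them. The sketch below illustrates that idea with a toy scoring function. `greedy_select`, `toy_score`, and `TOY_VALUE` are hypothetical names, and the real scoring inside `index_search` (cross-validated model performance) is not reproduced here.

```python
def greedy_select(candidates, score_fn, n_rounds):
    """Pick n_rounds features, each round adding the one that scores best with those already chosen."""
    selected = []
    for _ in range(n_rounds):
        remaining = [c for c in candidates if c not in selected]
        best = max(remaining, key=lambda c: score_fn(selected + [c]))
        selected.append(best)
    return selected

# Toy scoring function (purely illustrative): each index has a made-up value,
# and the pair ("B", "C") earns a bonus when both are present.
TOY_VALUE = {"A": 0.50, "B": 0.40, "C": 0.10, "D": 0.05}

def toy_score(indices):
    bonus = 0.5 if {"B", "C"} <= set(indices) else 0.0
    return sum(TOY_VALUE[i] for i in indices) + bonus

print(greedy_select(list(TOY_VALUE), toy_score, n_rounds=3))  # ['A', 'B', 'C']
```

Greedy selection evaluates roughly one pass over the candidates per round instead of every possible combination, but it can miss a set whose members only score well together.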

In [15]:
task_name= 'Com1_MultiForCom2'  # the name of the task

In [16]:
seq_WT_protein_path='./input/seq_Com1.txt'  ## Specify the original sequence
# Combinatory_mutants= ['D2N','K3N','R19H','H29R','V31I','T56P','R61K','H62Y','H63Y','A100E','T122S','S193R'] 
Combinatory_mutants= 'com1_11520_multi_m.csv'  # Generating the combinatorial mutant sequences takes a long time, so they are stored in a file in the input folder.
output_file = f'result_{task_name}_pred'  # f-string, so that {task_name} is substituted

wt_seq = getxulie(seq_WT_protein_path)  ## Read the original sequence
data = pd.read_excel('./input/data.xlsx',sheet_name='Trainset4')
target = data['Fitness'].values
train_m =  [m for m in data['Variants'].tolist() if m==m]  # m==m is False for NaN, so empty cells are dropped
predict_m=Combinatory_mutants

In [17]:
flag=3  # the number of indexes to be screened
cv=10  # the fold of cross validation
result = main(wt_seq,train_m,target,predict_m,task_name,flag,cv)
result.head(10)

Processing: 100%|██████████| 566/566 [00:42<00:00, 13.20it/s]


The best index of round 1 : AVBF000109


Processing: 100%|██████████| 566/566 [00:52<00:00, 10.85it/s]


The best index of round 2 : AVBF000109_JUNJ780101


Processing: 100%|██████████| 566/566 [01:01<00:00,  9.18it/s]


The best index of round 3 : AVBF000109_JUNJ780101_JUKT750101
The number of mutants for training: 120
The number of mutants for prediction: 11520


| Variant | AVBF000109_JUNJ780101_JUKT750101 |
| --- | --- |
| N7Y/H63L/K67N/V74W/D76N | 2.249757 |
| N7Y/H63A/K67N/V74W/D76N | 2.244994 |
| N7Y/H63L/K67N/V74W | 2.219965 |
| N7Y/H63A/K67N/V74W/D76H | 2.214921 |
| N7Y/H63A/K67N/V74W | 2.19305 |
| N7Y/H63A/K67N/V74W/D76S | 2.180639 |
| N7Y/H63D/K67N/V74W | 2.178101 |
| N7Y/H63I/K67N/V74W | 2.170245 |
| N7Y/H63A/K67N/T68F/V74W | 2.146943 |
| N7Y/H63L/K67N/V74W/D76H | 2.145405 |


In [18]:
result.to_csv(f'./output/{output_file}'+'.csv') # save the result to csv in the output folder