# Predictions with QAFI


--- 
This notebook provides the functions and pipeline needed to make predictions with the QAFI predictor.

Before running the QAFI predictions, follow the `Features.ipynb` notebook to obtain the features for your protein of interest.

--- 
## Requirements

- **A folder named after your protein's UniProt ID**
    - `output/QAFI_predictions/#uniprotID_of_your_protein#`
    - Make sure this output folder exists to reproduce the notebook (or modify the code to suit your setup)
    - This folder should include the following file:
        - `#uniprotID_of_your_protein#_featuresAll.csv`
            - Final output of the `Features.ipynb` notebook
            - Dataframe of your protein of interest with all features


- **Dataset with 30 DMS assays and features**
    - Available at `QAFI/data/Dataset_30proteins_features.csv`
   
- **List of the 10 selected proteins**
    - Available at `data/QAFI_10proteins_list.csv`
        - Note: for how these 10 proteins were selected, see the "Determining the ten most effective predictors for the QAFI predictor" section in the `ProteinSpecificPredictors.ipynb` notebook
  
  
- **Folders/paths specified for storing the outputs**
    - Make sure the following output folders exist to reproduce the notebook (or modify the code to suit your setup):
        - `output/QAFI_predictions/train_one_predict_rest`
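
Before running the cells, it can help to check that the inputs are in place. The helper below is a minimal sketch, not part of the `QAFI` module: the name `missing_inputs` is hypothetical, and the paths are the ones read or written by the code cells of this notebook (adjust them if your layout differs).

```python
from pathlib import Path

def missing_inputs(uniprot_id, root="."):
    """Return the required files/folders that do not exist under `root`.

    Hypothetical helper; paths mirror those used in this notebook's code cells.
    """
    root = Path(root)
    required = [
        # feature table read for the test protein
        root / "data" / "proteins" / uniprot_id / f"{uniprot_id}_featuresAll.csv",
        # training dataset and list of the 10 selected proteins
        root / "data" / "Dataset_30proteins_features.csv",
        root / "data" / "QAFI_10proteins_list.csv",
        # output folders passed to the QAFI functions
        root / "output" / "QAFI_predictions" / "train_one_predict_rest",
        root / "output" / "log",
    ]
    return [str(p) for p in required if not p.exists()]
```

Calling `missing_inputs('Q9Y375')` from the repository root returns an empty list when everything is available.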

--- 

For a detailed description of the features, please refer to our QAFI paper. The preprint is available at:
- Selen Ozkan, Natàlia Padilla, Xavier de la Cruz et al. QAFI: A Novel Method for Quantitative Estimation of Missense Variant Impact Using Protein-Specific Predictors and Ensemble Learning. 07 May 2024, preprint (Version 1), Research Square. https://doi.org/10.21203/rs.3.rs-4348948/v1

In [1]:
import pandas as pd
from os.path import join
import os
import QAFI
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Testing protein

UniProt ID: Q9Y375

https://www.uniprot.org/uniprotkb/Q9Y375/entry

In [2]:
uniprot = 'Q9Y375'
db_protein_test = pd.read_csv(f'data/proteins/{uniprot}/{uniprot}_featuresAll.csv')
                            
db_protein_test.head()

Unnamed: 0,uniprot,protein,variant,first,pos,wt_pos,second,Blosum62,PSSM,Shannon's entropy,...,neco,pLDDT,pLDDT bin,colasi,fraction cons. 3D neighbor,fanc,fbnc,M.J. potential,access.dependent vol.,laar
0,Q9Y375,NDUFAF1,M1A,M,1,M1,A,-1.0,6.52,0.943,...,-0.1941,38.45,0.0,0.006,0.0,0.0,0.0,0.0,-0.254,0.327763
1,Q9Y375,NDUFAF1,M1Y,M,1,M1,Y,-1.0,6.52,0.943,...,-0.5577,38.45,0.0,0.006,0.0,0.0,0.0,0.0,0.438,-0.153484
2,Q9Y375,NDUFAF1,M1W,M,1,M1,W,-1.0,6.52,0.943,...,-0.444,38.45,0.0,0.006,0.0,0.0,0.0,0.0,0.55,0.189655
3,Q9Y375,NDUFAF1,M1V,M,1,M1,V,1.0,6.52,0.943,...,0.1777,38.45,0.0,0.006,0.0,0.0,0.0,0.0,-0.207,0.067623
4,Q9Y375,NDUFAF1,M1T,M,1,M1,T,-1.0,6.52,0.943,...,0.182,38.45,0.0,0.006,0.0,0.0,0.0,0.0,-0.142,0.470787


In [3]:
# Create a directory for the testing protein if it doesn't exist

baseDir = 'output/QAFI_predictions'
test_protein_uniprot = db_protein_test.uniprot.unique()[0]  # UniProt ID of the test protein
print(test_protein_uniprot)

dir_path = os.path.join(baseDir, test_protein_uniprot)
os.makedirs(dir_path, exist_ok=True)

Q9Y375


<div class="alert alert-info">
  <strong> <h1>1. Train each protein-specific predictor (PSP) and predict the test protein</h1> </strong>
</div>

In [4]:
# Prepare dataframe
DB = pd.read_csv('data/Dataset_30proteins_features.csv')
DB_train=DB.copy()

# structural features (7/9)
columns_to_update = ['colasi','fraction cons. 3D neighbor','fanc','fbnc','M.J. potential','access.dependent vol.','laar']

# Update structural features where 'pLDDT bin' is 0
DB_train.loc[DB_train['pLDDT bin'] == 0, columns_to_update] = 0.0
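
The masking step above sets the structural features to zero for every row whose `pLDDT bin` is 0 and leaves all other columns untouched. A minimal sketch on a toy table (the values are made up for illustration, and only two of the structural columns are included):

```python
import pandas as pd

# Toy data: rows 0 and 2 fall in pLDDT bin 0
toy = pd.DataFrame({
    "pLDDT bin": [0, 1, 0],
    "colasi": [0.5, 0.7, 0.2],
    "laar": [0.3, -0.1, 0.4],
    "PSSM": [6.5, 1.2, -3.0],
})
structural = ["colasi", "laar"]

# Boolean row mask plus a column list: only the selected cells are overwritten
toy.loc[toy["pLDDT bin"] == 0, structural] = 0.0
print(toy)
```

The `PSSM` column and the row with `pLDDT bin` equal to 1 keep their original values.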

In [5]:
# Define base directory and protein directory for QAFI proteins

baseDir = 'output/QAFI_predictions/train_one_predict_rest'

# selected proteins for QAFI
proteins_10 = list(pd.read_csv('data/QAFI_10proteins_list.csv').protein.values)

# Create the directory if it doesn't exist
for prot in proteins_10:
    dir_path = os.path.join(baseDir, prot)
    os.makedirs(dir_path, exist_ok=True)

In [6]:
# model input

features = ['Blosum62', 'PSSM',"Shannon's entropy", "Shannon's entropy of seq. neighbours",'neco',
            'pLDDT','pLDDT bin', 'colasi', 'fraction cons. 3D neighbor', 'fanc', 'fbnc',
            'M.J. potential', 'access.dependent vol.', 'laar']


predictor_name = 'PSP'
undersample=True
target = 'score_log_normalized'

path_save =  'output/QAFI_predictions/train_one_predict_rest/'
path_log= 'output/log/'
output_path = 'output/QAFI_predictions/'


# UniProt ID of the protein to be tested (a single ID: the first unique one in the feature table)
proteins_to_be_tested_uni = db_protein_test.uniprot.unique()[0]

In [7]:
# Example usage
QAFI.QAFI_train_psp_test_protein(proteins_10, DB_train, proteins_to_be_tested_uni, output_path, features, target, predictor_name, path_save, path_log)


       PROTEIN training: haeIIIM 1350
testing... NDUFAF1

       PROTEIN training: MSH2 16749
testing... NDUFAF1

       PROTEIN training: TP53 7444
testing... NDUFAF1

       PROTEIN training: neo 4234
testing... NDUFAF1

       PROTEIN training: PTEN 6564
testing... NDUFAF1

       PROTEIN training: ADRB2 7800
testing... NDUFAF1

       PROTEIN training: TPMT 3689
testing... NDUFAF1

       PROTEIN training: bla 4997
testing... NDUFAF1

       PROTEIN training: SUMO1 1700
testing... NDUFAF1

       PROTEIN training: amiE 6227
testing... NDUFAF1
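
The log above reflects a "train on one protein, predict the test protein" loop, repeated for each of the ten proteins. The internals of `QAFI.QAFI_train_psp_test_protein` are not shown in this notebook, so the following is only a sketch of the idea under stated assumptions: the model is assumed to be an ordinary least-squares linear regression (the output column name `QAFI(MLR_median_10)` hints at multiple linear regression), and the helper names, protein names, and data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefficients...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply a fitted coefficient vector to a new feature matrix."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef

# Toy stand-ins for two training proteins, each with its own features and scores
rng = np.random.default_rng(0)
true_coefs = {"protA": [1.0, 0.5, -0.2], "protB": [0.8, 0.3, 0.1]}
train_sets = {}
for name, c in true_coefs.items():
    X = rng.normal(size=(50, 2))
    y = c[0] + X @ np.array(c[1:])  # noise-free scores for illustration
    train_sets[name] = (X, y)

# Features of the (toy) test protein
X_test = rng.normal(size=(5, 2))

# One predictor per training protein, each applied to the same test protein
psp_predictions = {
    name: predict_linear(fit_linear(X, y), X_test)
    for name, (X, y) in train_sets.items()
}
```

Each entry of `psp_predictions` holds one vector of predictions for the test protein, one per training protein, which is the shape of output the later aggregation step works with.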



<div class="alert alert-info">
  <strong> <h1>2. Prepare the prediction file(s)</h1> </strong>
</div>

In [8]:
path_psp = 'output/QAFI_predictions/train_one_predict_rest/'
output_path = 'output/QAFI_predictions/'


test_protein_uniprot = db_protein_test.uniprot.unique()[0]

proteins_10 = list(pd.read_csv('data/QAFI_10proteins_list.csv').protein.values)
predictor_name = 'PSP'


In [9]:
QAFI.aggregate_predictions(proteins_10, path_psp, test_protein_uniprot, db_protein_test, predictor_name, output_path)
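
Conceptually, aggregation attaches one `PSP_trainedby_<protein>` column per predictor to the test protein's table. The sketch below illustrates that with a merge on `variant`. It is an assumption about what the step amounts to, not the implementation of `QAFI.aggregate_predictions`: the in-memory `per_psp` tables and the `pred` column name are hypothetical, while the numbers are taken from the first rows of the output shown in this notebook.

```python
import pandas as pd

# Test-protein table (only the key column is kept here)
variants = pd.DataFrame({"variant": ["M1A", "M1Y", "M1W"]})

# Hypothetical per-PSP prediction tables; row order may differ between them
per_psp = {
    "haeIIIM": pd.DataFrame({"variant": ["M1A", "M1Y", "M1W"],
                             "pred": [0.468, 0.479, 0.475]}),
    "MSH2": pd.DataFrame({"variant": ["M1W", "M1A", "M1Y"],
                          "pred": [1.781, 1.827, 1.760]}),
}

merged = variants.copy()
for protein, preds in per_psp.items():
    col = f"PSP_trainedby_{protein}"
    # Merging on the variant key keeps rows aligned regardless of file order
    merged = merged.merge(preds.rename(columns={"pred": col}),
                          on="variant", how="left")
print(merged)
```

Merging on a key rather than concatenating by position avoids silent misalignment if the per-predictor outputs are not in the same order.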

<div class="alert alert-info">
  <strong> <h1>3. Take the median of the 10 PSPs to obtain the QAFI predictions</h1> </strong>
</div>

In [10]:
output_path = 'output/QAFI_predictions/'

# Load the aggregated PSP predictions for the test protein
db_protein_test = pd.read_csv(join(output_path, test_protein_uniprot, f'{test_protein_uniprot}_predictions_all.csv'))
db_protein_test.head()

Unnamed: 0,uniprot,protein,variant,first,pos,wt_pos,second,Blosum62,PSSM,Shannon's entropy,...,PSP_trainedby_haeIIIM,PSP_trainedby_MSH2,PSP_trainedby_TP53,PSP_trainedby_neo,PSP_trainedby_PTEN,PSP_trainedby_ADRB2,PSP_trainedby_TPMT,PSP_trainedby_bla,PSP_trainedby_SUMO1,PSP_trainedby_amiE
0,Q9Y375,NDUFAF1,M1A,M,1,M1,A,-1.0,6.52,0.943,...,0.468,1.827,0.876,0.647,0.802,2.696,0.748,1.02,0.597,0.937
1,Q9Y375,NDUFAF1,M1Y,M,1,M1,Y,-1.0,6.52,0.943,...,0.479,1.76,0.83,0.582,0.751,2.668,0.725,0.999,0.548,0.866
2,Q9Y375,NDUFAF1,M1W,M,1,M1,W,-1.0,6.52,0.943,...,0.475,1.781,0.844,0.602,0.767,2.677,0.732,1.006,0.563,0.888
3,Q9Y375,NDUFAF1,M1V,M,1,M1,V,1.0,6.52,0.943,...,0.561,2.044,0.954,0.793,0.913,2.9,0.848,1.167,0.698,1.063
4,Q9Y375,NDUFAF1,M1T,M,1,M1,T,-1.0,6.52,0.943,...,0.457,1.897,0.924,0.714,0.854,2.725,0.77,1.042,0.648,1.01


In [11]:
db_protein_test_preds = QAFI.calculate_median_predictions(db_protein_test, proteins_10, predictor_name)


Tested protein:	NDUFAF1

Proteins selected for median:

  ['ADRB2', 'MSH2', 'PTEN', 'SUMO1', 'TP53', 'TPMT', 'amiE', 'bla', 'haeIIIM', 'neo']


10


In [12]:
uniprot='Q9Y375'
db_protein_test_preds.to_csv(f'output/QAFI_predictions/{uniprot}/{uniprot}_preds_QAFI.csv', index=False)

db_protein_test_preds.head()

Unnamed: 0,uniprot,protein,variant,first,pos,wt_pos,second,Blosum62,PSSM,Shannon's entropy,...,PSP_trainedby_MSH2,PSP_trainedby_TP53,PSP_trainedby_neo,PSP_trainedby_PTEN,PSP_trainedby_ADRB2,PSP_trainedby_TPMT,PSP_trainedby_bla,PSP_trainedby_SUMO1,PSP_trainedby_amiE,QAFI(MLR_median_10)
0,Q9Y375,NDUFAF1,M1A,M,1,M1,A,-1.0,6.52,0.943,...,1.827,0.876,0.647,0.802,2.696,0.748,1.02,0.597,0.937,0.839
1,Q9Y375,NDUFAF1,M1Y,M,1,M1,Y,-1.0,6.52,0.943,...,1.76,0.83,0.582,0.751,2.668,0.725,0.999,0.548,0.866,0.79
2,Q9Y375,NDUFAF1,M1W,M,1,M1,W,-1.0,6.52,0.943,...,1.781,0.844,0.602,0.767,2.677,0.732,1.006,0.563,0.888,0.806
3,Q9Y375,NDUFAF1,M1V,M,1,M1,V,1.0,6.52,0.943,...,2.044,0.954,0.793,0.913,2.9,0.848,1.167,0.698,1.063,0.934
4,Q9Y375,NDUFAF1,M1T,M,1,M1,T,-1.0,6.52,0.943,...,1.897,0.924,0.714,0.854,2.725,0.77,1.042,0.648,1.01,0.889
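
As a sanity check, the QAFI score can be recomputed as the row-wise median of the ten PSP columns. The sketch below assumes a plain median and uses the ten PSP values shown above for variant M1A; the column name `median_of_PSPs` is hypothetical, and the internals of `QAFI.calculate_median_predictions` are not shown in this notebook.

```python
import pandas as pd

proteins = ["haeIIIM", "MSH2", "TP53", "neo", "PTEN",
            "ADRB2", "TPMT", "bla", "SUMO1", "amiE"]
psp_cols = [f"PSP_trainedby_{p}" for p in proteins]

# PSP predictions for variant M1A, copied from the notebook output
row_m1a = [0.468, 1.827, 0.876, 0.647, 0.802, 2.696, 0.748, 1.02, 0.597, 0.937]
df = pd.DataFrame([row_m1a], columns=psp_cols, index=["M1A"])

# With ten values, the median is the mean of the 5th and 6th sorted values
df["median_of_PSPs"] = df[psp_cols].median(axis=1).round(3)
print(df["median_of_PSPs"])
```

For M1A the two middle values are 0.802 and 0.876, giving 0.839, which agrees with the `QAFI(MLR_median_10)` value reported for that variant.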
