## DBP - Modelling

First we will import all dependencies.

In [1]:
import os
import pandas as pd

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# To visualize pipeline diagram - 'text', or 'diagram'
from sklearn import set_config
# Import XGBoost
from xgboost import XGBClassifier

random_state = 10

In [2]:
# Import the script from different folder
import sys  
sys.path.append('../../scripts')

import file_utilities as fu
import modelling_utilities as mu

#### Set one of three project tasks (*acp*, *amp*, *dna_binding*)

In [3]:
# task - ['acp', 'amp', 'dna_binding']
task = 'acp'

### Define Pipelines

Define pipelines for ML algorithms.  
As preprocessing steps we will use `MinMaxScaler()` and `PCA()`.

In [5]:
# Define number of PCA components
num_pca_components = 1000

pipelines = {
    'xgb' : make_pipeline(MinMaxScaler(), 
                          PCA(num_pca_components),
                          XGBClassifier(random_state=random_state)),
    'lr' : make_pipeline(MinMaxScaler(),
                         PCA(num_pca_components),
                         LogisticRegression(max_iter=25000, random_state=random_state)),    
    'svm' : make_pipeline(MinMaxScaler(),
                          PCA(num_pca_components),
                          SVC(random_state=random_state)),
    'rf' : make_pipeline(MinMaxScaler(),
                         PCA(num_pca_components),
                         RandomForestClassifier(random_state=random_state))
}
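
One constraint is worth checking before running the pipelines: scikit-learn's `PCA` requires `n_components` to be no larger than `min(n_samples, n_features)` of the data passed to `fit()`, and during cross-validation each pipeline is fitted on a training fold rather than on the full training set. The following is a minimal sketch of such a check. The shapes (60 samples, 1280 features, 5 folds) are illustrative assumptions, not values taken from this project, and `max_pca_components` is a hypothetical helper that assumes plain k-fold splitting (stratified folds can differ by a few rows).

```python
import numpy as np
from sklearn.decomposition import PCA


def max_pca_components(n_samples, n_features, n_folds=5):
    """Largest n_components PCA accepts on the smallest CV training fold.

    Hypothetical helper: assumes plain k-fold splitting, where the smallest
    training fold holds n_samples - ceil(n_samples / n_folds) rows.
    """
    largest_val_fold = -(-n_samples // n_folds)  # ceiling division
    smallest_train_fold = n_samples - largest_val_fold
    return min(smallest_train_fold, n_features)


# Illustrative shapes only (assumed, not taken from the project data)
n_samples, n_features = 60, 1280
limit = max_pca_components(n_samples, n_features)

rng = np.random.default_rng(10)
X_fold = rng.random((limit, n_features))  # stands in for one training fold

# Fitting with n_components == limit works
PCA(n_components=limit).fit(X_fold)

# Asking for one component more than the fold allows raises ValueError
try:
    PCA(n_components=limit + 1).fit(X_fold)
    too_many_ok = True
except ValueError:
    too_many_ok = False

print(limit, too_many_ok)
```

If the value returned for the real data is below `num_pca_components`, the grid search would fail inside `PCA.fit()`, so the number of components would need to be lowered.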

### Define Hyperparameter Grids

Define hyperparameter grids for the chosen ML algorithms.

In [7]:
# XGBoost
xgb_grid = {
        'xgbclassifier__max_depth': [3, 5],
         'xgbclassifier__n_estimators': [100, 200],
        }
# SVC
svm_grid = {
        'svc__kernel' : ['linear', 'rbf'],
        'svc__C': [0.01, 0.1, 1]
    }
# Random Forest
rf_grid = {
        'randomforestclassifier__n_estimators' : [100, 150],
        'randomforestclassifier__min_samples_leaf' : [1, 3],
        'randomforestclassifier__min_samples_split' : [2, 3]
    }
# Logistic Regression
lr_grid = {
        'logisticregression__C' : [0.1, 1],
        'logisticregression__solver' : ['lbfgs', 'saga']
    }
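
The grid keys follow the `<step name>__<parameter>` convention. `make_pipeline()` names each step after its lowercased class name, which is why the prefixes are `svc`, `randomforestclassifier`, and so on. A quick way to check the names a grid has to use is sketched here on a smaller, illustrative pipeline (the PCA size is arbitrary):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Illustrative pipeline; the number of PCA components here is arbitrary
pipe = make_pipeline(MinMaxScaler(), PCA(2), SVC(random_state=10))

# Step names are generated from the lowercased class names
step_names = list(pipe.named_steps)
print(step_names)

# Every key in a grid must be a valid parameter name of the pipeline
svm_grid = {'svc__kernel': ['linear', 'rbf'], 'svc__C': [0.01, 0.1, 1]}
valid_params = pipe.get_params()
unknown_keys = [key for key in svm_grid if key not in valid_params]
print(unknown_keys)
```

An empty `unknown_keys` list means that the grid can be passed to `GridSearchCV()` together with that pipeline without a parameter-name error.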

#### Create dictionary for hyperparameter grids

In [33]:
# Create hyperparameter grids dictionary
hp_grids = {
    'lr' : lr_grid,
    'svm' : svm_grid,
    'rf' : rf_grid,
    'xgb' : xgb_grid
}

### Get embedding folders and fasta files for the task

For our modelling we will need to use previously created fasta files:

In [None]:
!tree -nhDL 1 ../../data/"{task}"/ -fP *.fa | grep fa

├── [ 17K Sep 10 20:16]  ../../data/acp/test_data.fa
└── [ 66K Sep  3 22:26]  ../../data/acp/train_data.fa


<br>

and embedding `.pt` files in these folders:

In [None]:
!tree -nhDL 3 ../../data/"{task}"/ -df | grep 'esm1\|mt\|dlm'

│   │   ├── [4.0K Sep 10 20:50]  ../../data/acp/esm/test/acp_test_esm1b_mean
│   │   └── [4.0K Sep 10 20:47]  ../../data/acp/esm/test/acp_test_esm1v_mean
│       ├── [4.0K Sep  6 16:40]  ../../data/acp/esm/train/acp_train_esm1b_mean
│       └── [4.0K Sep  6 16:01]  ../../data/acp/esm/train/acp_train_esm1v_mean
    │   ├── [4.0K Sep 10 20:35]  ../../data/acp/prose/test/acp_test_dlm_avg
    │   ├── [4.0K Sep 10 20:37]  ../../data/acp/prose/test/acp_test_dlm_max
    │   ├── [4.0K Sep 10 20:37]  ../../data/acp/prose/test/acp_test_dlm_sum
    │   ├── [4.0K Sep 10 20:37]  ../../data/acp/prose/test/acp_test_mt_avg
    │   ├── [4.0K Sep 10 20:38]  ../../data/acp/prose/test/acp_test_mt_max
    │   └── [4.0K Sep 10 20:38]  ../../data/acp/prose/test/acp_test_mt_sum
        ├── [4.0K Sep  5 15:10]  ../../data/acp/prose/train/acp_train_dlm_avg
        ├── [4.0K Sep  5 15:18]  ../../data/acp/prose/train/acp_train_dlm_max
        ├── [4.0K Sep  5 15:28]  ../../data/acp/prose/train/acp_train_dlm_sum
 

<br>  

To get paths for embedding folders and fasta files we will use the function `get_emb_folders()`.

In [4]:
pt_folders, fa_files = mu.get_emb_folders(task)

## Modelling Loop

The modelling loop includes the following steps:

1. Loop through train and test embedding folders
2. Run the function `read_embeddings()` for train embeddings to get `X_train` and `y_train`
3. Run the function `read_embeddings()` for test embeddings to get `X_test` and `y_test`
4. Define and print the output header
5. Use the function `fit_tune_CV()` to do the following:
   - use above defined `pipelines` and `hp_grids` dictionaries and `GridSearchCV()` to get models
   - save the models with `joblib`
   - create a dictionary of the models for one set of embedding folders
6. Run the function `evaluation()` to create an evaluation dataframe for one set of embedding folders
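
The source of `fit_tune_CV()` lives in `modelling_utilities` and is not shown in this notebook. As a rough sketch of what step 5 involves, assuming a 5-fold `GridSearchCV()` and one `joblib.dump()` call per model, something like the following would do the job. The function name `fit_tune_cv_sketch`, its parameters, the `out_dir` argument and the file naming are all hypothetical, and the toy data stands in for the real embeddings.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def fit_tune_cv_sketch(pipelines, hp_grids, scoring, X, y, out_dir, cv=5):
    """Hypothetical stand-in for mu.fit_tune_CV(): tune, fit and save each model."""
    fitted = {}
    for name, pipeline in pipelines.items():
        # Exhaustive search over the grid that belongs to this pipeline
        search = GridSearchCV(pipeline, hp_grids[name], scoring=scoring, cv=cv)
        search.fit(X, y)
        # Persist the fitted search object (the file naming is an assumption)
        joblib.dump(search, os.path.join(out_dir, f'{name}.joblib'))
        fitted[name] = search
    return fitted


# Toy data in place of the real embeddings
X, y = make_classification(n_samples=100, n_features=20, random_state=10)
toy_pipelines = {'lr': make_pipeline(MinMaxScaler(),
                                     LogisticRegression(max_iter=1000))}
toy_grids = {'lr': {'logisticregression__C': [0.1, 1]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    models = fit_tune_cv_sketch(toy_pipelines, toy_grids, 'accuracy', X, y, tmp_dir)
    saved_files = sorted(os.listdir(tmp_dir))

print(models['lr'].best_params_, round(models['lr'].best_score_, 3))
```

Returning the fitted search objects in a dictionary keyed by model name keeps the best cross-validation score and the refitted best estimator available for the evaluation step.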


In [34]:
# Initialize dictionary to keep evaluation dataframes 
# One dataframe per embeddings folder (train+test, or all_data)
df_models = {}

for i in range(len(pt_folders)):
    
    # Train
    # second index: 0 - train, 1 - test
    path_pt = pt_folders[i][0]
    # Different fasta files for ESM and ProSE
    # Fasta files index: esm - 0, prose - 1
    fa_idx = 0 if 'esm' in path_pt else 1
    path_fa = fa_files[fa_idx][0]
    pool = os.path.split(path_pt)[1].split('_')[-1]
    emb_layer = 33 if 'esm' in path_pt else 'layer'
    X_train, y_train, sequence_id_train = fu.read_embeddings(path_fa, path_pt, pool, emb_layer, print_dims=False)
    
    # Test
    path_fa, path_pt = fa_files[fa_idx][1], pt_folders[i][1]
    # Keep the test sequence ids in their own variable instead of overwriting the train ids
    X_test, y_test, sequence_id_test = fu.read_embeddings(path_fa, path_pt, pool, emb_layer, print_dims=False)

    # Extensions for evaluations dataframes
    df_ext = os.path.split(path_pt)[1].split('_', 1)[1].split('_', 1)[1]
    
    # Printing output header
    ptm = df_ext.split('_')[0]
    ptr = 'ESM' if 'esm' in ptm else 'ProSE'
    print('-' * 75)
    print(f'\tPretrained Model "{ptm}" by {ptr} - Pooling Operation: "{pool}"')
    print('-' * 75)
    
    # Grid search and fit
    fitted_models = mu.fit_tune_CV(pipelines, hp_grids, 'accuracy', path_pt, X_train, y_train, task)
    
    # Save evaluation dataframe into dictionary
    df_models[f'eval_{df_ext}'] = mu.evaluation(fitted_models, X_test, y_test)
  

---------------------------------------------------------------------------
	Pretrained Model "esm1v" by ESM - Pooling Operation: "mean"
---------------------------------------------------------------------------
esm1v_mean_xgb has been fitted and saved
esm1v_mean_lr has been fitted and saved
---------------------------------------------------------------------------
	Pretrained Model "dlm" by ProSE - Pooling Operation: "avg"
---------------------------------------------------------------------------
dlm_avg_xgb has been fitted and saved
dlm_avg_lr has been fitted and saved
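
The labels in the output header are derived from the embedding folder name alone. Tracing the string operations from the loop on two of the folder paths listed earlier shows what each variable ends up holding. `parse_folder` is a hypothetical wrapper added only for this illustration.

```python
import os


def parse_folder(path_pt):
    """Repeat the string handling from the loop for one embedding folder."""
    folder = os.path.split(path_pt)[1]
    pool = folder.split('_')[-1]                       # pooling operation
    df_ext = folder.split('_', 1)[1].split('_', 1)[1]  # drop '<task>_<split>_'
    ptm = df_ext.split('_')[0]                         # pretrained model name
    ptr = 'ESM' if 'esm' in ptm else 'ProSE'           # model family
    return pool, df_ext, ptm, ptr


esm_parts = parse_folder('../../data/acp/esm/test/acp_test_esm1v_mean')
prose_parts = parse_folder('../../data/acp/prose/test/acp_test_dlm_avg')
print(esm_parts)    # ('mean', 'esm1v_mean', 'esm1v', 'ESM')
print(prose_parts)  # ('avg', 'dlm_avg', 'dlm', 'ProSE')
```

The parsing assumes folder names of the form `<task>_<split>_<model>_<pool>` with no underscore inside the task prefix.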


<br>

Let's list the keys of the `df_models` dictionary:

In [10]:
df_models.keys()

dict_keys(['eval_esm1v_mean', 'eval_dlm_avg'])

<br>
Let's check the dataframe for a randomly chosen key (one set of embedding folders):

In [32]:
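import random
# random.choice() always returns an existing key; random.randint(0, len(df_models))
# includes the upper bound and can therefore raise an IndexError
df_models[random.choice(list(df_models.keys()))]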

| model | cv_best_score | f1_macro | accuracy |
|---|---|---|---|
| esm1v_mean_xgb | 0.687963 | 0.691694 | 0.69186 |
| esm1v_mean_svm | 0.729318 | 0.726291 | 0.726744 |
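
The implementation of `mu.evaluation()` is not shown in this notebook. A dataframe with the same three columns (`cv_best_score`, `f1_macro`, `accuracy`) could be assembled roughly as follows, assuming each fitted model is a `GridSearchCV` object. `evaluation_sketch` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split


def evaluation_sketch(fitted_models, X_test, y_test):
    """Hypothetical stand-in for mu.evaluation(): one row of scores per model."""
    rows = []
    for name, model in fitted_models.items():
        y_pred = model.predict(X_test)
        rows.append({
            'model': name,
            'cv_best_score': model.best_score_,  # mean CV score of the best params
            'f1_macro': f1_score(y_test, y_pred, average='macro'),
            'accuracy': accuracy_score(y_test, y_pred),
        })
    return pd.DataFrame(rows).set_index('model')


# Toy data and a single tuned model in place of the real embeddings
X, y = make_classification(n_samples=200, n_features=20, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1]},
                      scoring='accuracy', cv=5).fit(X_train, y_train)

eval_df = evaluation_sketch({'toy_lr': search}, X_test, y_test)
print(eval_df)
```

Comparing `cv_best_score` with the test-set `accuracy` in such a table gives a quick indication of whether the tuned model generalizes beyond the cross-validation folds.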


## Collecting Evaluation Results into a DataFrame

To compare evaluations for all models, we collect the results from all dataframes into one dataframe.

To do that, we will concatenate all dataframes from the dictionary `df_models`.

In [38]:
# Create dataframe with evaluations for all models

# Concatenate all dataframes from the dictionary df_models in a single call;
# reset_index() turns the 'model' index of each dataframe into a regular column
eval_df_all = pd.concat([df.reset_index() for df in df_models.values()])

# Set the column 'model' as an index
eval_df_all = eval_df_all.set_index('model')

#### Display the results sorted by "accuracy"

In [None]:
# Display the dataframe
eval_df_all.sort_values(by=['accuracy'], ascending=False)

### Saving dataframe for future use

In [15]:
TASK = 'DBP' if task == 'dna_binding' else task.upper()
file_path = f'../../results/{TASK}_classifiers.csv'
eval_df_all.to_csv(file_path)

<br>

To work with the results from that file later, read it with `index_col='model'` so that the model names are restored as the index:
```python
df = pd.read_csv(file_path, index_col='model')
```