# Best models

This notebook retrieves the best trained model object for each configuration, ranked by a metric (e.g. balanced accuracy) across all seeds.

Ten models have been trained for each (model type, dataset, version, normalization) combination, one per seed (10 different seeds)
=> keep the best of them (best seed)

This matters for the interpretability part: SHAP, for instance, takes a single trained model as input and cannot work with a model aggregated over seeds, so in effect we are performing a grid search over seeds. Methodologically speaking, we cannot cherry-pick models this way when comparing models against each other. Here, however, the aim is biological relevance, so we want the model that shows the best predictive power, as we believe it will best reveal the embedding features that contribute most to the discrimination between healthy and sepsis.

In [3]:
import pandas as pd

all_df=pd.read_csv('../../results/score_tables/scores_all_seeds.csv')

In [4]:
all_df

Unnamed: 0,model,input,normalization,version,balanced_accuracy,precision,recall,f1,mcc,auroc,auprc,brier,seed
0,random_forest,Complex_protein_embeddings,none,v2.10,0.778708,0.897436,0.921053,0.909091,0.577079,0.925837,0.979335,0.098828,0
1,random_forest,Complex_sample_embeddings,none,v2.10,0.500000,0.775510,1.000000,0.873563,0.000000,0.270335,0.675974,0.201608,0
2,random_forest,concatenated_protein_embeddings,none,v2.10,0.824163,0.921053,0.921053,0.921053,0.648325,0.947368,0.985782,0.085299,0
3,random_forest,concatenated_sample_embeddings,none,v2.10,0.577751,0.804348,0.973684,0.880952,0.270636,0.478469,0.798355,0.190975,0
4,random_forest,gene_expression,none,no version,0.714115,0.860465,0.973684,0.913580,0.545073,0.921053,0.976706,0.105057,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1475,svm,concatenated_protein_embeddings,minmax,v2.11,0.875598,0.969697,0.842105,0.901408,0.668382,0.933014,0.980298,0.095506,9
1476,svm,RGCN_protein_embeddings,minmax,v2.11,0.875598,0.969697,0.842105,0.901408,0.668382,0.933014,0.980298,0.095506,9
1477,xgboost,Complex_protein_embeddings,minmax,v2.11,0.706938,0.868421,0.868421,0.868421,0.413876,0.918660,0.978005,0.110757,9
1478,xgboost,concatenated_protein_embeddings,minmax,v2.11,0.706938,0.868421,0.868421,0.868421,0.413876,0.918660,0.977969,0.112612,9


In [5]:
# -- group by model, input, normalization and version, then keep the row with the max balanced accuracy
# -- (idxmax returns the first row when several seeds tie on the max)
best_models = all_df.loc[all_df.groupby(['model','input','normalization','version'])['balanced_accuracy'].idxmax()]
best_models

Unnamed: 0,model,input,normalization,version,balanced_accuracy,precision,recall,f1,mcc,auroc,auprc,brier,seed
348,random_forest,Complex_protein_embeddings,log1p,v2.10,0.954545,0.974359,1.000000,0.987013,0.941159,0.997608,0.999325,0.042023,2
1160,random_forest,Complex_protein_embeddings,log1p,v2.11,0.909091,0.950000,1.000000,0.974359,0.881631,1.000000,1.000000,0.055732,7
1100,random_forest,Complex_protein_embeddings,minmax,v2.10,0.954545,0.974359,1.000000,0.987013,0.941159,0.997608,0.999325,0.046440,7
876,random_forest,Complex_protein_embeddings,minmax,v2.11,0.909091,0.950000,1.000000,0.974359,0.881631,0.997608,0.999325,0.065387,5
296,random_forest,Complex_protein_embeddings,none,v2.10,0.941388,0.973684,0.973684,0.973684,0.882775,0.992823,0.998074,0.046525,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
790,xgboost,concatenated_protein_embeddings,standard,v2.10,0.551435,0.795455,0.921053,0.853659,0.141798,0.655502,0.841143,0.181915,5
122,xgboost,concatenated_protein_embeddings,standard,v2.11,0.583732,0.809524,0.894737,0.850000,0.199681,0.715311,0.877203,0.162022,0
24,xgboost,concatenated_sample_embeddings,none,v2.10,0.564593,0.800000,0.947368,0.867470,0.196865,0.459330,0.773625,0.200869,0
393,xgboost,concatenated_sample_embeddings,none,v2.11,0.759569,0.880952,0.973684,0.925000,0.619010,0.901914,0.967005,0.100349,2
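
Several seeds can reach the same balanced accuracy within a group, and `idxmax` then keeps whichever row comes first. The following is a minimal sketch of one possible alternative that breaks such ties with a second metric. Using `auroc` as the tie-breaker is an assumption made for illustration, not a choice taken from this notebook, and `pick_best_seed` and the toy table are hypothetical.

```python
import pandas as pd

GROUP_KEYS = ["model", "input", "normalization", "version"]


def pick_best_seed(scores, metric="balanced_accuracy", tie_breaker="auroc"):
    """Return one row per configuration: highest metric, ties broken by tie_breaker."""
    # Sort so that the preferred row of every group comes first.
    ranked = scores.sort_values([metric, tie_breaker], ascending=False)
    # Keep only the first (best) row of each configuration.
    best = ranked.drop_duplicates(subset=GROUP_KEYS, keep="first")
    return best.sort_values(GROUP_KEYS)


# Toy scores (made-up numbers): seeds 1 and 2 tie on balanced accuracy for random_forest.
toy_scores = pd.DataFrame({
    "model": ["random_forest", "random_forest", "random_forest", "svm", "svm"],
    "input": ["Complex_protein_embeddings"] * 5,
    "normalization": ["none"] * 5,
    "version": ["v2.10"] * 5,
    "balanced_accuracy": [0.90, 0.95, 0.95, 0.80, 0.70],
    "auroc": [0.99, 0.96, 0.98, 0.85, 0.90],
    "seed": [0, 1, 2, 0, 1],
})

picked = pick_best_seed(toy_scores)

# Same selection as in the cell above, for comparison: first max per group.
first_max = toy_scores.loc[
    toy_scores.groupby(GROUP_KEYS)["balanced_accuracy"].idxmax()
]
```

On the toy table, `first_max` keeps seed 1 for `random_forest` (the first of the tied rows), while `picked` keeps seed 2 because of its higher `auroc`.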


In [8]:
best_models.to_csv('../../results/score_tables/best_models_all_seeds_scores.csv', index=False)

In [6]:
MAIN_DUMP='../../dump_seeds/'
best_model_dump='../../dump/best_models/'

for item in best_models.itertuples():
    model_name=item.model
    input_name=item.input
    normalization=item.normalization
    version=item.version
    seed=item.seed

    # -- gene expression has no version; its training and retrieval default to v2.10, so use that version here
    if input_name=='gene_expression':
        version='v2.10'

    source_path=f"{MAIN_DUMP}dump_{seed}/{version}_{normalization}/{model_name}_{input_name}_gridsearch_model.joblib"
    target_path=f"{best_model_dump}{version}_{normalization}/"
    joblib_file=f'{model_name}_{input_name}_gridsearch_model.joblib'

    !mkdir -p {target_path}
    !cp -r {source_path}* {target_path}{joblib_file}
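
The copy step above relies on the `!mkdir` and `!cp` shell magics, which only work inside IPython and do not fail loudly when a source file is missing. Below is a hedged sketch of the same step in plain Python. It assumes each best model is a single `.joblib` file at the path pattern used above (the original copies `{source_path}*`, which may also match extra files). `resolve_version` and `copy_best_model` are hypothetical helpers, and the demo runs on a placeholder file in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_version(input_name, version, default="v2.10"):
    """Gene expression has no version of its own; fall back to the default."""
    return default if input_name == "gene_expression" else version


def copy_best_model(main_dump, best_dump, model_name, input_name,
                    normalization, version, seed):
    """Copy one best-seed model file into the best-models folder."""
    version = resolve_version(input_name, version)
    file_name = f"{model_name}_{input_name}_gridsearch_model.joblib"
    source = Path(main_dump) / f"dump_{seed}" / f"{version}_{normalization}" / file_name
    target_dir = Path(best_dump) / f"{version}_{normalization}"
    if not source.is_file():
        # Fail early instead of silently skipping a missing model.
        raise FileNotFoundError(source)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / file_name))


# Demo on a throwaway directory with a placeholder file (not a real model).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fake = (root / "dump_seeds" / "dump_0" / "v2.10_none"
            / "random_forest_gene_expression_gridsearch_model.joblib")
    fake.parent.mkdir(parents=True)
    fake.write_bytes(b"placeholder")

    copied = copy_best_model(root / "dump_seeds", root / "best_models",
                             "random_forest", "gene_expression",
                             "none", "no version", 0)
    copied_exists = copied.is_file()
    copied_parent = copied.parent.name
```

In the notebook, the loop body would reduce to one call such as `copy_best_model(MAIN_DUMP, best_model_dump, item.model, item.input, item.normalization, item.version, item.seed)`.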

We only care about the protein embeddings datasets for the interpretability part (and maybe gene expression).  
We also remove the models normalized with the standard or robust approaches, as they do not perform well overall.

In [16]:
best_models=best_models[(best_models['input'].str.endswith('protein_embeddings')) & (best_models['normalization']!='standard') & (best_models['normalization']!='robust')]
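
Note that the filter above keeps only inputs ending in `protein_embeddings`, so gene expression rows are dropped. As a rough sketch, the same selection could be written with `isin` and an optional switch that keeps gene expression as well. The helper `select_for_interpretability`, its `include_gene_expression` flag, and the toy table are hypothetical and only illustrate the idea.

```python
import pandas as pd

EXCLUDED_NORMALIZATIONS = ["standard", "robust"]


def select_for_interpretability(best, include_gene_expression=False):
    """Keep protein-embedding rows (optionally gene expression) with accepted normalizations."""
    keep_input = best["input"].str.endswith("protein_embeddings")
    if include_gene_expression:
        keep_input = keep_input | (best["input"] == "gene_expression")
    keep_norm = ~best["normalization"].isin(EXCLUDED_NORMALIZATIONS)
    return best[keep_input & keep_norm]


# Toy table with made-up rows, only the two columns the filter needs.
toy_best = pd.DataFrame({
    "input": ["Complex_protein_embeddings", "Complex_sample_embeddings",
              "gene_expression", "RGCN_protein_embeddings"],
    "normalization": ["log1p", "none", "none", "standard"],
})

protein_only = select_for_interpretability(toy_best)
with_gene_expression = select_for_interpretability(
    toy_best, include_gene_expression=True
)
```

With the default arguments the result matches the filter in the cell above; setting `include_gene_expression=True` additionally keeps the gene expression rows.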

### v2.10

In [17]:
best_models[best_models['version']=='v2.10']

Unnamed: 0,model,input,normalization,version,balanced_accuracy,precision,recall,f1,mcc,auroc,auprc,brier,seed
348,random_forest,Complex_protein_embeddings,log1p,v2.10,0.954545,0.974359,1.0,0.987013,0.941159,0.997608,0.999325,0.042023,2
1100,random_forest,Complex_protein_embeddings,minmax,v2.10,0.954545,0.974359,1.0,0.987013,0.941159,0.997608,0.999325,0.04644,7
296,random_forest,Complex_protein_embeddings,none,v2.10,0.941388,0.973684,0.973684,0.973684,0.882775,0.992823,0.998074,0.046525,2
350,random_forest,RGCN_protein_embeddings,log1p,v2.10,0.687799,0.853659,0.921053,0.886076,0.424009,0.777512,0.922978,0.155947,2
510,random_forest,RGCN_protein_embeddings,minmax,v2.10,0.720096,0.871795,0.894737,0.883117,0.455719,0.708134,0.841931,0.151588,3
449,random_forest,RGCN_protein_embeddings,none,v2.10,0.655502,0.837209,0.947368,0.888889,0.395863,0.696172,0.871956,0.156536,3
793,random_forest,concatenated_protein_embeddings,log1p,v2.10,0.954545,0.974359,1.0,0.987013,0.941159,0.997608,0.999325,0.047835,5
805,random_forest,concatenated_protein_embeddings,minmax,v2.10,0.954545,0.974359,1.0,0.987013,0.941159,1.0,1.0,0.048176,5
1334,random_forest,concatenated_protein_embeddings,none,v2.10,0.92823,0.972973,0.947368,0.96,0.831005,0.935407,0.980247,0.080021,9
1387,sklearn_mlp,Complex_protein_embeddings,log1p,v2.10,0.92823,0.972973,0.947368,0.96,0.831005,0.944976,0.982075,0.077402,9
