# 2.3A Selecting a machine learning model

In the machine learning model selection step, we:
- import the data obtained in step 2, on which the models will be trained
- convert the molecular structure data into fingerprints (a bit-vector encoding of the structure, which gives us the features, X)
- choose the most suitable combination of fingerprint, classifier, sampling technique, scaler, and outlier detection technique

# Importing libraries and general functions

In [1]:
%run __A_knjiznice.py

from __A_knjiznice import *
from __B_funkcije import *
import __C_konstante as kon
%matplotlib inline

# Importing the data processed in step 2. Data processing and analysis

## Data overview

We are limited to a maximum of 4142 samples. This is a fairly small sample, but we are restricted to the existing data.

Two rules of thumb are often considered when estimating the required size of a training set.

1. Rule of thumb for prediction classes
The first rule suggests a sample size of at least 50 to 1000 times the number of prediction classes. Since we are dealing with binary classification (2 classes), this guideline requires between 100 and 2000 samples. With 4142 samples we exceed even the upper end of this range, so from the perspective of prediction classes alone our sample size is adequate.
2. Rule of thumb for observations vs. features
The more challenging guideline in our case suggests having at least 20 times as many observations as features. With up to 4860 features for certain fingerprints, this rule implies roughly 100,000 observations (20 × 4860 = 97,200), far more than our current sample size. This guideline matters in machine learning because it helps avoid overfitting, where a model learns the noise in the training data instead of the actual signal and therefore generalizes poorly to new data.

In order to stay as close as possible to the second rule, I decided to exclude fingerprints with more than 1024 features, because such high-dimensional fingerprints are more likely to lead to overfitting.
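The arithmetic behind the two rules of thumb can be checked with a few lines of Python. This is only a sketch: the function names and the default multipliers (50, 1000, 20) are taken from the rules as stated above and are not part of the notebook's own code.

```python
N_SAMPLES = 4142   # number of molecules available
N_CLASSES = 2      # binary classification


def class_rule_range(n_classes, low=50, high=1000):
    """Sample-size range suggested by the 'classes' rule of thumb."""
    return n_classes * low, n_classes * high


def feature_rule_minimum(n_features, ratio=20):
    """Minimum number of observations suggested by the 'features' rule."""
    return n_features * ratio


low, high = class_rule_range(N_CLASSES)
print(f"Classes rule: {low} to {high} samples, available: {N_SAMPLES}")

# Compare the largest fingerprint with the 1024-feature cut-off
for n_features in (4860, 1024):
    needed = feature_rule_minimum(n_features)
    print(f"{n_features} features -> {needed} observations suggested, "
          f"available: {N_SAMPLES} ({N_SAMPLES / n_features:.2f} per feature)")
```

Even at 1024 features the second rule would suggest 20,480 observations, so the cut-off only brings us closer to the guideline and does not satisfy it.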



In [2]:
df = pd.read_csv(f'{kon.path_files}/dp.csv')
df

Unnamed: 0,Smiles,ROMol,Activity
0,O=C1c2cc([N+](=O)[O-])ccc2-n2c1nc1ccccc1c2=O,<rdkit.Chem.rdchem.Mol object at 0x16fce49e0>,1
1,Cc1cc(C2CC2)ncc1-c1ccc(C2(C(=O)Nc3ccc(F)cc3)CO...,<rdkit.Chem.rdchem.Mol object at 0x16fcb09e0>,1
2,O=C(Nc1ccc(C2(C(=O)Nc3ccc(F)cc3)COC2)cc1)c1ccc...,<rdkit.Chem.rdchem.Mol object at 0x16fcdcba0>,1
3,O=C(Nc1ccc(F)cc1)C1(C2CCC3C(CCCN3c3ccnc(C(F)(F...,<rdkit.Chem.rdchem.Mol object at 0x16fcddcb0>,1
4,O=C1CC(c2c[nH]c3ccc(F)cc23)C(=O)N1,<rdkit.Chem.rdchem.Mol object at 0x16fce25e0>,1
...,...,...,...
4137,FC(F)(F)c1ccc(-c2c[nH]nn2)cc1,<rdkit.Chem.rdchem.Mol object at 0x16fcbef10>,0
4138,c1ccc2[nH]nnc2c1,<rdkit.Chem.rdchem.Mol object at 0x16fcf54d0>,0
4139,Cc1cccc(NC(=O)C(F)(F)F)c1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb5230>,0
4140,Cc1ccc(N)cc1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb2500>,0


In [3]:
activity_counts = df['Activity'].value_counts()
print(activity_counts)

Activity
1    2103
0    2039
Name: count, dtype: int64
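
As a quick sanity check on class balance, the counts printed above can be turned into a majority-class share, which is also the accuracy a trivial "always predict the majority class" model would reach on the full data set. The snippet below is a small sketch that hard-codes the two counts from the output, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
activity_counts = {1: 2103, 0: 2039}

total = sum(activity_counts.values())
majority_share = max(activity_counts.values()) / total
imbalance_ratio = max(activity_counts.values()) / min(activity_counts.values())

print(f"Total samples:        {total}")
print(f"Majority-class share: {majority_share:.4f}")
print(f"Imbalance ratio:      {imbalance_ratio:.3f}")
```

The classes are close to balanced, so any accuracy well above roughly 0.51 reflects signal learned from the fingerprints rather than from the class prior.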


# Overview of combinations of selected fingerprints, classification models and preprocessing steps

In [4]:
# https://medium.com/artificialis/why-how-we-split-train-valid-and-test-fb4d6746ede

In [5]:
input_directory = f'{kon.path_files}/molekulski_prstni_odtisi'

generated_fingerprints = [
    'df_circular.csv',
    'df_morgan.csv'
]

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # used in `classifiers` below
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold
from sklearn.cluster import FeatureAgglomeration
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
import numpy as np

# Define classifiers (defaults apart from n_jobs, max_iter and the fixed random seed)
classifiers = {
    "RandomForestClassifier": RandomForestClassifier(n_jobs=-1, random_state=kon.random_seed),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=kon.random_seed)
}

# Dimensionality reduction / feature selection methods with fixed, untuned settings
dim_reduction_methods = {
    "None": None,
    "SelectKBest": SelectKBest(score_func=chi2, k=150),
    "LDA": LinearDiscriminantAnalysis(n_components=1),
    "FeatureAgglomeration": FeatureAgglomeration(n_clusters=100),
    "PCA": PCA(n_components=0.95)  
}

# Resampling methods for handling imbalanced data
# NOTE: categorical_features=[0, 1] marks only the first two feature columns as categorical for SMOTENC
sampling_techniques = {
    "None": None,
    "SMOTENC": SMOTENC(categorical_features=[0, 1], random_state=kon.random_seed),
    "RandomUnderSampler": RandomUnderSampler(random_state=kon.random_seed)
}

# Store results
results_list = []

# Define Stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=kon.random_seed)

# List of specific filenames to process
for filename in generated_fingerprints:  # Assuming generated_fingerprints is defined
    file_path = os.path.join(input_directory, filename)
    
    if os.path.exists(file_path):  # Check if the file exists
        print(f'Processing fingerprint DataFrame: {filename}')
        
        df = pd.read_csv(file_path)
        y = df[['Activity']].values.ravel()  # Assuming 'Activity' is the target
        X = df.iloc[:, 3:]  # Assuming features start from the 4th column

        # Remove constant features
        selector = VarianceThreshold()
        X = pd.DataFrame(selector.fit_transform(X), columns=selector.get_feature_names_out())

        # Split the data into train, validation, and test sets
        X_interim, X_test, y_interim, y_test = train_test_split(X, y, test_size=0.15, random_state=kon.random_seed, shuffle=True, stratify=y)
        X_train, X_val, y_train, y_val = train_test_split(X_interim, y_interim, test_size=15/85, random_state=kon.random_seed, shuffle=True, stratify=y_interim)

        # Train and evaluate each classifier
        for clf_name, clf in classifiers.items():
            for dr_name, dr_method in dim_reduction_methods.items():
                for fs_name, fs_method in sampling_techniques.items():
                    steps = []
                    # fs_method is a resampler (SMOTENC / RandomUnderSampler), not a feature selector
                    if fs_method is not None:
                        steps.append(('sampling', fs_method))
                    if dr_method is not None:
                        steps.append(('dim_reduction', dr_method))
                    steps.append(('classifier', clf))
                    
                    # Create the pipeline
                    pipeline = ImbPipeline(steps)

                    # Perform cross-validation
                    cv_results = []
                    for train_index, val_index in cv.split(X_train, y_train):
                        X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
                        y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]

                        # Fit the model
                        pipeline.fit(X_train_cv, y_train_cv)

                        # Evaluate on the validation set
                        y_val_pred = pipeline.predict(X_val_cv)
                        val_accuracy = accuracy_score(y_val_cv, y_val_pred)
                        val_f1 = f1_score(y_val_cv, y_val_pred)
                        val_precision = precision_score(y_val_cv, y_val_pred)
                        val_recall = recall_score(y_val_cv, y_val_pred)
                        # ROC AUC is computed from hard class labels here, not from predict_proba scores
                        val_roc_auc = roc_auc_score(y_val_cv, y_val_pred)

                        # Store the results for this fold
                        cv_results.append({
                            'Val_Accuracy': val_accuracy,
                            'Val_F1': val_f1,
                            'Val_Precision': val_precision,
                            'Val_Recall': val_recall,
                            'Val_ROC_AUC': val_roc_auc,
                        })

                    # Calculate mean metrics across all folds
                    mean_cv_results = pd.DataFrame(cv_results).mean()

                    # Fit the model on the entire training set
                    pipeline.fit(X_train, y_train)

                    # Evaluate on the training set (train metrics)
                    y_train_pred = pipeline.predict(X_train)
                    train_accuracy = accuracy_score(y_train, y_train_pred)
                    train_f1 = f1_score(y_train, y_train_pred)
                    train_precision = precision_score(y_train, y_train_pred)
                    train_recall = recall_score(y_train, y_train_pred)
                    train_roc_auc = roc_auc_score(y_train, y_train_pred)

                    # Evaluate on the hold-out validation set
                    y_val_final_pred = pipeline.predict(X_val)
                    val_accuracy_final = accuracy_score(y_val, y_val_final_pred)
                    val_f1_final = f1_score(y_val, y_val_final_pred)
                    val_precision_final = precision_score(y_val, y_val_final_pred)
                    val_recall_final = recall_score(y_val, y_val_final_pred)
                    val_roc_auc_final = roc_auc_score(y_val, y_val_final_pred)

                    # Evaluate on the hold-out test set
                    y_test_final_pred = pipeline.predict(X_test)
                    test_accuracy_final = accuracy_score(y_test, y_test_final_pred)
                    test_f1_final = f1_score(y_test, y_test_final_pred)
                    test_precision_final = precision_score(y_test, y_test_final_pred)
                    test_recall_final = recall_score(y_test, y_test_final_pred)
                    test_roc_auc_final = roc_auc_score(y_test, y_test_final_pred)

                    # Append results to the list
                    results_temp = {
                        'Fingerprint': filename,  # Use the filename for identification
                        'Feature_Selection': fs_name,  # holds the name of the sampling technique
                        'Dim_Reduction': dr_name,
                        'Classifier': clf_name,
                        'CV_Mean_Accuracy': mean_cv_results['Val_Accuracy'],
                        'CV_Mean_F1': mean_cv_results['Val_F1'],
                        'CV_Mean_Precision': mean_cv_results['Val_Precision'],
                        'CV_Mean_Recall': mean_cv_results['Val_Recall'],
                        'CV_Mean_ROC_AUC': mean_cv_results['Val_ROC_AUC'],
                        'Train_Accuracy': train_accuracy,
                        'Train_F1': train_f1,
                        'Train_Precision': train_precision,
                        'Train_Recall': train_recall,
                        'Train_ROC_AUC': train_roc_auc,
                        'Val_Accuracy': val_accuracy_final,
                        'Val_F1': val_f1_final,
                        'Val_Precision': val_precision_final,
                        'Val_Recall': val_recall_final,
                        'Val_ROC_AUC': val_roc_auc_final,
                        'Test_Accuracy': test_accuracy_final,
                        'Test_F1': test_f1_final,
                        'Test_Precision': test_precision_final,
                        'Test_Recall': test_recall_final,
                        'Test_ROC_AUC': test_roc_auc_final
                    }
                    results_list.append(results_temp)
                    print("\nResults:")
                    print(results_temp)

# Create DataFrame from the results list
results_df = pd.DataFrame(results_list)

Processing fingerprint DataFrame: df_circular.csv

Results:
{'Fingerprint': 'df_circular.csv', 'Feature_Selection': 'None', 'Dim_Reduction': 'None', 'Classifier': 'RandomForestClassifier', 'CV_Mean_Accuracy': 0.8816365910309095, 'CV_Mean_F1': 0.8838005513323293, 'CV_Mean_Precision': 0.8813497593711164, 'CV_Mean_Recall': 0.8864637380375878, 'CV_Mean_ROC_AUC': 0.881551093653808, 'Train_Accuracy': 0.9993098688750862, 'Train_F1': 0.9993201903467029, 'Train_Precision': 0.9993201903467029, 'Train_Recall': 0.9993201903467029, 'Train_ROC_AUC': 0.999309709749385, 'Val_Accuracy': 0.882636655948553, 'Val_F1': 0.8854003139717426, 'Val_Precision': 0.8785046728971962, 'Val_Recall': 0.8924050632911392, 'Val_ROC_AUC': 0.8824770414494911, 'Test_Accuracy': 0.8842443729903537, 'Test_F1': 0.8849840255591054, 'Test_Precision': 0.8935483870967742, 'Test_Recall': 0.8765822784810127, 'Test_ROC_AUC': 0.8843695706130553}

Results:
{'Fingerprint': 'df_circular.csv', 'Feature_Selection': 'SMOTENC', 'Dim_Reduction
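
The two chained `train_test_split` calls above produce an approximately 70/15/15 train/validation/test split: 15% is held out for testing first, and 15/85 of the remaining 85% is again 15% of the total. The sketch below reproduces the resulting set sizes under two assumptions: that each fingerprint file contains all 4142 molecules, and that the test portion is rounded up as scikit-learn does. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction=0.15, val_fraction_of_rest=15 / 85):
    """Return (n_train, n_val, n_test) for two chained hold-out splits.

    The held-out part of each split is rounded up (assumed to mirror
    scikit-learn's train_test_split); the remainder goes to the other part.
    """
    n_test = math.ceil(test_fraction * n_samples)
    n_interim = n_samples - n_test
    n_val = math.ceil(val_fraction_of_rest * n_interim)
    n_train = n_interim - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(4142)
for name, n in (("train", n_train), ("validation", n_val), ("test", n_test)):
    print(f"{name:<10} {n:>5} samples ({n / 4142:.1%})")
```

These sizes are consistent with the reported metrics: for example, the first validation accuracy of 0.882636655948553 equals 549/622.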

In [7]:
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,...,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC,Test_Accuracy,Test_F1,Test_Precision,Test_Recall,Test_ROC_AUC
0,df_circular.csv,,,RandomForestClassifier,0.881637,0.883801,0.88135,0.886464,0.881551,0.99931,...,0.882637,0.8854,0.878505,0.892405,0.882477,0.884244,0.884984,0.893548,0.876582,0.88437
1,df_circular.csv,SMOTENC,,RandomForestClassifier,0.884744,0.886829,0.884734,0.889187,0.884665,0.99931,...,0.881029,0.884013,0.875776,0.892405,0.880843,0.877814,0.879365,0.882166,0.876582,0.877834
2,df_circular.csv,RandomUnderSampler,,RandomForestClassifier,0.88164,0.88321,0.884977,0.881709,0.881631,0.996549,...,0.876206,0.877971,0.879365,0.876582,0.8762,0.876206,0.876404,0.889251,0.863924,0.876406
3,df_circular.csv,,SelectKBest,RandomForestClassifier,0.871288,0.87264,0.876639,0.868788,0.871318,0.995514,...,0.861736,0.864353,0.861635,0.867089,0.861649,0.863344,0.862682,0.881188,0.844937,0.863645
4,df_circular.csv,SMOTENC,SelectKBest,RandomForestClassifier,0.869561,0.87056,0.877344,0.864031,0.869642,0.994134,...,0.861736,0.863492,0.866242,0.860759,0.861752,0.861736,0.859477,0.888514,0.832278,0.862218
5,df_circular.csv,RandomUnderSampler,SelectKBest,RandomForestClassifier,0.871288,0.871491,0.883487,0.859952,0.871459,0.992063,...,0.87299,0.875197,0.873817,0.876582,0.872932,0.860129,0.858537,0.882943,0.835443,0.860532
6,df_circular.csv,,LDA,RandomForestClassifier,0.817455,0.819227,0.823069,0.815748,0.817469,0.99931,...,0.81672,0.815534,0.834437,0.797468,0.817035,0.823151,0.823151,0.836601,0.810127,0.823364
7,df_circular.csv,SMOTENC,LDA,RandomForestClassifier,0.81262,0.816905,0.810687,0.823249,0.812454,0.99931,...,0.81672,0.819048,0.821656,0.816456,0.816725,0.808682,0.805237,0.833898,0.778481,0.809175
8,df_circular.csv,RandomUnderSampler,LDA,RandomForestClassifier,0.812273,0.814531,0.816735,0.812363,0.812273,0.996549,...,0.83119,0.833597,0.834921,0.832278,0.831172,0.819936,0.81759,0.842282,0.794304,0.820355
9,df_circular.csv,,FeatureAgglomeration,RandomForestClassifier,0.879566,0.881791,0.879309,0.884425,0.879484,0.99931,...,0.876206,0.877971,0.879365,0.876582,0.8762,0.864952,0.865385,0.876623,0.85443,0.865124
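
The row index of `results_df` follows the nesting order of the loops (fingerprint, classifier, dimensionality reduction, sampling technique). A small sketch with `itertools.product` shows how many combinations are evaluated and which combination a given row index corresponds to. The lists below only repeat the dictionary keys and file names used above.

```python
from itertools import product

fingerprints = ["df_circular.csv", "df_morgan.csv"]
classifier_names = ["RandomForestClassifier", "LogisticRegression"]
dim_reduction_names = ["None", "SelectKBest", "LDA", "FeatureAgglomeration", "PCA"]
sampling_names = ["None", "SMOTENC", "RandomUnderSampler"]

# Same nesting order as the for-loops in the training cell
grid = list(product(fingerprints, classifier_names,
                    dim_reduction_names, sampling_names))

print(f"Combinations evaluated: {len(grid)}")
# Each combination is fitted 5 times in cross-validation plus once on the full training set
print(f"Pipeline fits in total: {len(grid) * (5 + 1)}")
print(f"Row 10 corresponds to:  {grid[10]}")
```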


In [8]:
results_df.sort_values(by=['Val_Accuracy'], ascending=False, inplace=True)
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,...,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC,Test_Accuracy,Test_F1,Test_Precision,Test_Recall,Test_ROC_AUC
0,df_circular.csv,,,RandomForestClassifier,0.881637,0.883801,0.88135,0.886464,0.881551,0.99931,...,0.882637,0.8854,0.878505,0.892405,0.882477,0.884244,0.884984,0.893548,0.876582,0.88437
32,df_morgan.csv,RandomUnderSampler,,RandomForestClassifier,0.880606,0.882831,0.880198,0.885802,0.880522,0.991028,...,0.881029,0.884375,0.873457,0.89557,0.880791,0.864952,0.864516,0.881579,0.848101,0.865227
1,df_circular.csv,SMOTENC,,RandomForestClassifier,0.884744,0.886829,0.884734,0.889187,0.884665,0.99931,...,0.881029,0.884013,0.875776,0.892405,0.880843,0.877814,0.879365,0.882166,0.876582,0.877834
10,df_circular.csv,SMOTENC,FeatureAgglomeration,RandomForestClassifier,0.880603,0.883369,0.876423,0.890545,0.880444,0.99931,...,0.879421,0.88189,0.877743,0.886076,0.879312,0.879421,0.880763,0.884984,0.876582,0.879468
31,df_morgan.csv,SMOTENC,,RandomForestClassifier,0.883018,0.884921,0.884295,0.885793,0.88297,0.992754,...,0.877814,0.88125,0.87037,0.892405,0.877575,0.871383,0.873016,0.875796,0.870253,0.871401
9,df_circular.csv,,FeatureAgglomeration,RandomForestClassifier,0.879566,0.881791,0.879309,0.884425,0.879484,0.99931,...,0.876206,0.877971,0.879365,0.876582,0.8762,0.864952,0.865385,0.876623,0.85443,0.865124
2,df_circular.csv,RandomUnderSampler,,RandomForestClassifier,0.88164,0.88321,0.884977,0.881709,0.881631,0.996549,...,0.876206,0.877971,0.879365,0.876582,0.8762,0.876206,0.876404,0.889251,0.863924,0.876406
11,df_circular.csv,RandomUnderSampler,FeatureAgglomeration,RandomForestClassifier,0.874048,0.876318,0.874043,0.878981,0.873957,0.996894,...,0.874598,0.876972,0.874214,0.879747,0.874514,0.87299,0.873194,0.885993,0.860759,0.87319
30,df_morgan.csv,,,RandomForestClassifier,0.882329,0.884741,0.880625,0.889192,0.882217,0.992754,...,0.874598,0.878125,0.867284,0.889241,0.874359,0.871383,0.873016,0.875796,0.870253,0.871401
5,df_circular.csv,RandomUnderSampler,SelectKBest,RandomForestClassifier,0.871288,0.871491,0.883487,0.859952,0.871459,0.992063,...,0.87299,0.875197,0.873817,0.876582,0.872932,0.860129,0.858537,0.882943,0.835443,0.860532
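
The ROC AUC columns stay very close to the accuracy columns because `roc_auc_score` receives hard 0/1 predictions. With only one operating point, the ROC curve is a two-segment line and its area equals balanced accuracy, (TPR + TNR) / 2. The sketch below illustrates this with a confusion matrix that is consistent with the validation metrics of row 0. These counts are inferred from the reported values under the assumption of 622 validation samples and are not printed by the notebook.

```python
def auc_from_hard_labels(tp, fn, tn, fp):
    """Area under the ROC curve when predictions are hard 0/1 labels."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # The curve passes through (0, 0), (fpr, tpr) and (1, 1)
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    # Trapezoidal rule over consecutive points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the true positive rate and the true negative rate."""
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


# Counts inferred from row 0 (Val_Accuracy, Val_Precision, Val_Recall)
tp, fn, tn, fp = 282, 34, 267, 39

auc = auc_from_hard_labels(tp, fn, tn, fp)
bal_acc = balanced_accuracy(tp, fn, tn, fp)
print(f"AUC from hard labels: {auc:.6f}")
print(f"Balanced accuracy:    {bal_acc:.6f}")
```

A threshold-independent AUC would require passing scores such as `pipeline.predict_proba(X)[:, 1]` instead of predicted labels.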


In [9]:
cv_stats = results_df['CV_Mean_Accuracy'].describe()
train_stats = results_df['Train_Accuracy'].describe()
val_stats = results_df['Val_Accuracy'].describe()

print('\nTrain cross validation accuracy\n')
print(cv_stats)
print('\nTrain accuracy\n')
print(train_stats)
print('\nValidation accuracy\n')
print(val_stats)


Train cross validation accuracy

count    60.000000
mean      0.834693
std       0.071970
min       0.603184
25%       0.841093
50%       0.858172
75%       0.870509
max       0.884744
Name: CV_Mean_Accuracy, dtype: float64

Train accuracy

count    60.000000
mean      0.958063
std       0.051106
min       0.850932
25%       0.946170
50%       0.984472
75%       0.992754
max       0.999310
Name: Train_Accuracy, dtype: float64

Validation accuracy

count    60.000000
mean      0.839577
std       0.041807
min       0.717042
25%       0.825965
50%       0.850482
75%       0.868167
max       0.882637
Name: Val_Accuracy, dtype: float64
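
The gap between training and validation accuracy gives a rough indication of overfitting. The sketch below computes it from the summary statistics printed above; the numbers are copied from the output, and the `generalization_gap` helper is purely illustrative.

```python
# Values copied from the describe() output and from row 0 of results_df
reported = {
    "mean over 60 combinations": {"train": 0.958063, "val": 0.839577},
    "row 0 (best Val_Accuracy)": {"train": 0.999310, "val": 0.882637},
}


def generalization_gap(train_acc, val_acc):
    """Difference between training and validation accuracy."""
    return train_acc - val_acc


gaps = {name: generalization_gap(acc["train"], acc["val"])
        for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
```

A gap of roughly 0.12 in both cases suggests that the models fit the training data much more closely than they generalize, which is in line with the concern about the ratio of observations to features.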
