# 2.3A Model selection for machine learning

In the model selection step for machine learning, we:
- import the data obtained in step 2, on which the models will be trained
- convert the molecular structure data into fingerprints (a bit-vector representation of the structure; this gives us the features, i.e. X)
- select the most suitable combination of fingerprint, classifier, sampling technique, scaler, and outlier detection technique
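
To make the idea of a bit-vector fingerprint concrete, here is a toy sketch only. It hashes overlapping character n-grams of a SMILES string into a fixed-length bit vector. This is not a chemically meaningful fingerprint and it is not the method used in this notebook; real fingerprints are derived from the molecular graph by a cheminformatics toolkit. The function `toy_fingerprint` and its parameters are hypothetical, and the two SMILES strings are taken from the dataset preview.

```python
import hashlib
from typing import List

def toy_fingerprint(smiles: str, n_bits: int = 64, ngram: int = 3) -> List[int]:
    """Toy illustration: hash character n-grams of a SMILES string into n_bits bits."""
    bits = [0] * n_bits
    for i in range(max(len(smiles) - ngram + 1, 1)):
        fragment = smiles[i:i + ngram]
        # Use a stable hash; the built-in hash() is salted per Python process
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits[position] = 1
    return bits

fp_a = toy_fingerprint("c1ccc2[nH]nnc2c1")
fp_b = toy_fingerprint("Cc1ccc(N)cc1-c1c[nH]nn1")

# Each molecule becomes one row of 0/1 features with the same length
print(len(fp_a), len(fp_b))
print(sum(fp_a), sum(fp_b))  # number of bits set in each vector
```

Every molecule maps to a vector of the same length, which is why the fingerprint type determines the number of feature columns the classifier sees.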

# Importing libraries and common functions

In [1]:
%run __A_knjiznice.py

from __A_knjiznice import *
from __B_funkcije import *
import __C_konstante as kon
%matplotlib inline

# Importing the data processed in step 2 (Data processing and analysis)

## Data overview

We are limited to at most 4142 samples. This is a fairly small sample, but we are restricted to the existing data.

Two rules of thumb are often considered when estimating the required size of a training set.

1. Rule of thumb for prediction classes
The first rule suggests a sample size of at least 50 to 1000 times the number of prediction classes. Since this is a binary classification problem (2 classes), the guideline calls for a minimum of 100 to 2000 samples. With 4142 samples we exceed even the upper end of this range, so from the perspective of prediction classes alone the sample size is adequate.
2. Rule of thumb for observations vs. features
The more demanding guideline in our case suggests having at least 20 times as many observations as features. With up to 4860 features for certain fingerprints, this rule would require 20 × 4860 = 97,200, i.e. roughly 100,000 observations, far more than we have. This guideline matters because it guards against overfitting, where a model learns the noise in the training data instead of the actual signal and therefore generalizes poorly to new data.

To get as close as possible to the second rule, I decided to exclude fingerprints with more than 1024 features, because such high-dimensional fingerprints could lead to overfitting.
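
The arithmetic behind the two rules of thumb can be checked with a few lines of plain Python. The numbers are the ones quoted above; the variable names are illustrative only.

```python
n_samples = 4142       # compounds available in this dataset
n_classes = 2          # binary classification
max_features = 4860    # largest fingerprint size mentioned above

# Rule 1: 50 to 1000 samples per prediction class
rule1_low, rule1_high = 50 * n_classes, 1000 * n_classes

# Rule 2: at least 20 observations per feature
rule2_required = 20 * max_features

# Largest feature count that rule 2 would allow for the samples we have
rule2_max_features = n_samples // 20

print(rule1_low, rule1_high)   # 100 2000
print(rule2_required)          # 97200
print(rule2_max_features)      # 207
print(20 * 1024)               # 20480 samples needed for 1024 features
```

Even at the 1024-feature cutoff the second rule is not satisfied (it would call for 20,480 samples), so the cutoff narrows the gap rather than closing it.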



In [2]:
df = pd.read_csv(f'{kon.path_files}/dp.csv')
df

Unnamed: 0,Smiles,ROMol,Activity
0,O=C1c2cc([N+](=O)[O-])ccc2-n2c1nc1ccccc1c2=O,<rdkit.Chem.rdchem.Mol object at 0x16fce49e0>,1
1,Cc1cc(C2CC2)ncc1-c1ccc(C2(C(=O)Nc3ccc(F)cc3)CO...,<rdkit.Chem.rdchem.Mol object at 0x16fcb09e0>,1
2,O=C(Nc1ccc(C2(C(=O)Nc3ccc(F)cc3)COC2)cc1)c1ccc...,<rdkit.Chem.rdchem.Mol object at 0x16fcdcba0>,1
3,O=C(Nc1ccc(F)cc1)C1(C2CCC3C(CCCN3c3ccnc(C(F)(F...,<rdkit.Chem.rdchem.Mol object at 0x16fcddcb0>,1
4,O=C1CC(c2c[nH]c3ccc(F)cc23)C(=O)N1,<rdkit.Chem.rdchem.Mol object at 0x16fce25e0>,1
...,...,...,...
4137,FC(F)(F)c1ccc(-c2c[nH]nn2)cc1,<rdkit.Chem.rdchem.Mol object at 0x16fcbef10>,0
4138,c1ccc2[nH]nnc2c1,<rdkit.Chem.rdchem.Mol object at 0x16fcf54d0>,0
4139,Cc1cccc(NC(=O)C(F)(F)F)c1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb5230>,0
4140,Cc1ccc(N)cc1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb2500>,0


In [3]:
activity_counts = df['Activity'].value_counts()
print(activity_counts)

Activity
1    2103
0    2039
Name: count, dtype: int64


# Overview of combinations of selected fingerprints, classification models, and preprocessing steps

In [4]:
# https://medium.com/artificialis/why-how-we-split-train-valid-and-test-fb4d6746ede

In [5]:
input_directory = f'{kon.path_files}/molekulski_prstni_odtisi'

generated_fingerprints = [
    'df_extended.csv',
    'df_fp2.csv',
    'df_circular.csv'
]

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold
from sklearn.cluster import FeatureAgglomeration
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
import numpy as np
from sklearn.svm import SVC  # explicit import, in case the star imports above do not provide SVC

# Define classifiers with random_state for reproducibility
classifiers = {
    'SupporVectorMachine': SVC(probability=True, random_state=kon.random_seed)
}

# Dimensionality reduction methods (SelectKBest keeps the 150 best features by chi2; PCA keeps 50 components)
dim_reduction_methods = {
    "None": None,
    "SelectKBest": SelectKBest(score_func=chi2, k=150),
    "PCA": PCA(n_components=50)  
}

# Methods for handling imbalanced data (SMOTENC treats feature columns 0 and 1 as categorical)
sampling_techniques = {
    "None": None,
    "SMOTENC": SMOTENC(categorical_features=[0, 1], random_state=kon.random_seed),
    "RandomUnderSampler": RandomUnderSampler(random_state=kon.random_seed)
}

# Store results
results_list = []

# Define Stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=kon.random_seed)

# List of specific filenames to process
for filename in generated_fingerprints:  # Assuming generated_fingerprints is defined
    file_path = os.path.join(input_directory, filename)
    
    if os.path.exists(file_path):  # Check if the file exists
        print(f'Processing fingerprint DataFrame: {filename}')
        
        df = pd.read_csv(file_path)
        y = df[['Activity']].values.ravel()  # Assuming 'Activity' is the target
        X = df.iloc[:, 3:]  # Assuming features start from the 4th column

        # Split the data into train, validation, and test sets
        X_interim, X_test, y_interim, y_test = train_test_split(X, y, test_size=0.10, random_state=kon.random_seed, shuffle=True, stratify=y)
        X_train, X_val, y_train, y_val = train_test_split(X_interim, y_interim, test_size=10/90, random_state=kon.random_seed, shuffle=True, stratify=y_interim)

        # Remove constant features (the selector is fitted on the training set only)
        selector = VarianceThreshold()
        X_train = pd.DataFrame(selector.fit_transform(X_train), columns=selector.get_feature_names_out())
        
        # Apply the same transformation to the validation and test sets
        X_val = pd.DataFrame(selector.transform(X_val), columns=selector.get_feature_names_out())
        X_test = pd.DataFrame(selector.transform(X_test), columns=selector.get_feature_names_out())
        
        # Train and evaluate each classifier
        for clf_name, clf in classifiers.items():
            for dr_name, dr_method in dim_reduction_methods.items():
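                for fs_name, fs_method in sampling_techniques.items():
                    # Note: fs_name / fs_method hold the sampling technique, which is
                    # reported under the 'Feature_Selection' key in the results
                    steps = []
                    if fs_method is not None:
                        steps.append(('feature_selection', fs_method))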
                    if dr_method is not None:
                        steps.append(('dim_reduction', dr_method))
                    steps.append(('classifier', clf))
                    
                    # Create the pipeline
                    pipeline = ImbPipeline(steps)

                    # Perform cross-validation
                    cv_results = []
                    for train_index, val_index in cv.split(X_train, y_train):
                        X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
                        y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]

                        # Fit the model
                        pipeline.fit(X_train_cv, y_train_cv)

                        # Evaluate on the held-out fold
                        y_val_pred = pipeline.predict(X_val_cv)
                        val_accuracy = accuracy_score(y_val_cv, y_val_pred)
                        val_f1 = f1_score(y_val_cv, y_val_pred)
                        val_precision = precision_score(y_val_cv, y_val_pred)
                        val_recall = recall_score(y_val_cv, y_val_pred)
                        # Note: ROC AUC is computed from hard class labels here, not from predicted scores
                        val_roc_auc = roc_auc_score(y_val_cv, y_val_pred)

                        # Store the results for this fold
                        cv_results.append({
                            'Val_Accuracy': val_accuracy,
                            'Val_F1': val_f1,
                            'Val_Precision': val_precision,
                            'Val_Recall': val_recall,
                            'Val_ROC_AUC': val_roc_auc,
                        })

                    # Calculate mean metrics across all folds
                    mean_cv_results = pd.DataFrame(cv_results).mean()

                    # Fit the model on the entire training set
                    pipeline.fit(X_train, y_train)

                    # Evaluate on the training set (train metrics)
                    y_train_pred = pipeline.predict(X_train)
                    train_accuracy = accuracy_score(y_train, y_train_pred)
                    train_f1 = f1_score(y_train, y_train_pred)
                    train_precision = precision_score(y_train, y_train_pred)
                    train_recall = recall_score(y_train, y_train_pred)
                    train_roc_auc = roc_auc_score(y_train, y_train_pred)

                    # Evaluate on the hold-out validation set
                    y_val_final_pred = pipeline.predict(X_val)
                    val_accuracy_final = accuracy_score(y_val, y_val_final_pred)
                    val_f1_final = f1_score(y_val, y_val_final_pred)
                    val_precision_final = precision_score(y_val, y_val_final_pred)
                    val_recall_final = recall_score(y_val, y_val_final_pred)
                    val_roc_auc_final = roc_auc_score(y_val, y_val_final_pred)


                    # Append results to the list
                    results_temp = {
                        'Fingerprint': filename,  # Use the filename for identification
                        'Feature_Selection': fs_name,
                        'Dim_Reduction': dr_name,
                        'Classifier': clf_name,
                        'CV_Mean_Accuracy': mean_cv_results['Val_Accuracy'],
                        'CV_Mean_F1': mean_cv_results['Val_F1'],
                        'CV_Mean_Precision': mean_cv_results['Val_Precision'],
                        'CV_Mean_Recall': mean_cv_results['Val_Recall'],
                        'CV_Mean_ROC_AUC': mean_cv_results['Val_ROC_AUC'],
                        'Train_Accuracy': train_accuracy,
                        'Train_F1': train_f1,
                        'Train_Precision': train_precision,
                        'Train_Recall': train_recall,
                        'Train_ROC_AUC': train_roc_auc,
                        'Val_Accuracy': val_accuracy_final,
                        'Val_F1': val_f1_final,
                        'Val_Precision': val_precision_final,
                        'Val_Recall': val_recall_final,
                        'Val_ROC_AUC': val_roc_auc_final,
                    }
                    results_list.append(results_temp)
                    print("\nResults:")
                    print(results_temp)

# Create DataFrame from the results list
results_df = pd.DataFrame(results_list)
print("\nFinal Results:")

Processing fingerprint DataFrame: df_extended.csv

Results:
{'Fingerprint': 'df_extended.csv', 'Feature_Selection': 'None', 'Dim_Reduction': 'None', 'Classifier': 'SupporVectorMachine', 'CV_Mean_Accuracy': 0.8695701233938775, 'CV_Mean_F1': 0.8738869856342596, 'CV_Mean_Precision': 0.8581668676366643, 'CV_Mean_Recall': 0.8905501549732318, 'CV_Mean_ROC_AUC': 0.869240736621735, 'Train_Accuracy': 0.9054951690821256, 'Train_F1': 0.9081303199295568, 'Train_Precision': 0.8962920046349943, 'Train_Recall': 0.9202855443188578, 'Train_ROC_AUC': 0.9052684619203117, 'Val_Accuracy': 0.9012048192771084, 'Val_F1': 0.9048723897911833, 'Val_Precision': 0.8863636363636364, 'Val_Recall': 0.9241706161137441, 'Val_ROC_AUC': 0.9008107982529505}

Results:
{'Fingerprint': 'df_extended.csv', 'Feature_Selection': 'SMOTENC', 'Dim_Reduction': 'None', 'Classifier': 'SupporVectorMachine', 'CV_Mean_Accuracy': 0.8704782877734504, 'CV_Mean_F1': 0.8743569873507505, 'CV_Mean_Precision': 0.8612884939738203, 'CV_Mean_Recall
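
The two `train_test_split` calls in the cell above give an approximately 80/10/10 split: the first call holds out 10% as the test set, and the second holds out 10/90 of the remainder, which is again about 10% of the full dataset. The sketch below reproduces the arithmetic in plain Python. It assumes 4142 rows per fingerprint file and assumes that a fractional test size is rounded up, which is my understanding of how scikit-learn behaves; the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples: int, held_out_fraction: float):
    """Return (n_remaining, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 4142
n_interim, n_test = split_sizes(n_total, 0.10)       # first split: 90% / 10%
n_train, n_val = split_sizes(n_interim, 10 / 90)     # second split on the remaining 90%

print(n_train, n_val, n_test)   # 3312 415 415
```

These sizes are consistent with the reported scores: a validation accuracy of 0.9012048... corresponds to 374 correct predictions out of 415.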

In [7]:
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,Train_F1,Train_Precision,Train_Recall,Train_ROC_AUC,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC
0,df_extended.csv,,,SupporVectorMachine,0.86957,0.873887,0.858167,0.89055,0.869241,0.905495,0.90813,0.896292,0.920286,0.905268,0.901205,0.904872,0.886364,0.924171,0.900811
1,df_extended.csv,SMOTENC,,SupporVectorMachine,0.870478,0.874357,0.861288,0.888173,0.870199,0.908514,0.910751,0.901984,0.919691,0.908343,0.898795,0.902326,0.885845,0.919431,0.898441
2,df_extended.csv,RandomUnderSampler,,SupporVectorMachine,0.870175,0.87374,0.862931,0.885204,0.869938,0.906401,0.908824,0.898778,0.919096,0.906206,0.901205,0.904872,0.886364,0.924171,0.900811
3,df_extended.csv,,SelectKBest,SupporVectorMachine,0.86746,0.871363,0.859666,0.88402,0.867197,0.895229,0.89767,0.890058,0.905413,0.895073,0.903614,0.907407,0.886878,0.92891,0.90318
4,df_extended.csv,SMOTENC,SelectKBest,SupporVectorMachine,0.866251,0.870001,0.859277,0.881636,0.866005,0.897947,0.900059,0.894768,0.905413,0.897832,0.903614,0.907407,0.886878,0.92891,0.90318
5,df_extended.csv,RandomUnderSampler,SelectKBest,SupporVectorMachine,0.86897,0.87199,0.865239,0.879269,0.868807,0.897343,0.899587,0.893255,0.906008,0.89721,0.901205,0.904872,0.886364,0.924171,0.900811
6,df_extended.csv,,PCA,SupporVectorMachine,0.859605,0.864175,0.849084,0.880438,0.859273,0.883756,0.886865,0.876307,0.89768,0.883543,0.898795,0.903226,0.878924,0.92891,0.898279
7,df_extended.csv,SMOTENC,PCA,SupporVectorMachine,0.861116,0.865363,0.85189,0.879843,0.860816,0.883454,0.88627,0.877992,0.894706,0.883282,0.901205,0.905312,0.882883,0.92891,0.900729
8,df_extended.csv,RandomUnderSampler,PCA,SupporVectorMachine,0.861116,0.865285,0.85222,0.879248,0.860827,0.881643,0.884638,0.875364,0.894111,0.881451,0.898795,0.903226,0.878924,0.92891,0.898279
9,df_fp2.csv,,,SupporVectorMachine,0.86655,0.870531,0.858011,0.884003,0.86627,0.90006,0.902847,0.891078,0.914932,0.899832,0.903614,0.907407,0.886878,0.92891,0.90318
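
Note that the `*_ROC_AUC` columns are computed from hard class predictions (`pipeline.predict`), not from continuous scores. With binary 0/1 predictions, ROC AUC reduces to the mean of recall and specificity (balanced accuracy), so it carries no ranking information. The sketch below shows the difference with a small rank-based AUC written in plain Python. The function `auc` and the toy data are illustrative assumptions, not the notebook's data.

```python
def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
proba = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]            # hypothetical predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]     # hard predictions at threshold 0.5

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
recall = sum(l for y, l in zip(y_true, labels) if y == 1) / n_pos
specificity = sum(1 - l for y, l in zip(y_true, labels) if y == 0) / n_neg

print(auc(y_true, proba))              # AUC from continuous scores
print(auc(y_true, labels))             # AUC from hard labels
print((recall + specificity) / 2)      # balanced accuracy, equal to the line above
```

In scikit-learn, passing `pipeline.predict_proba(X)[:, 1]` or `pipeline.decision_function(X)` to `roc_auc_score` would give the score-based value; the classifier above is already created with `probability=True`.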


In [8]:
results_df.sort_values(by=['Val_Accuracy'], ascending=False, inplace = True)
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,Train_F1,Train_Precision,Train_Recall,Train_ROC_AUC,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC
23,df_circular.csv,RandomUnderSampler,SelectKBest,SupporVectorMachine,0.866849,0.869385,0.865393,0.87388,0.866734,0.899155,0.901417,0.894552,0.908388,0.899013,0.910843,0.914153,0.895455,0.933649,0.910452
22,df_circular.csv,SMOTENC,SelectKBest,SupporVectorMachine,0.866549,0.868789,0.86655,0.87151,0.866469,0.902174,0.903858,0.901717,0.906008,0.902115,0.906024,0.909931,0.887387,0.933649,0.90555
19,df_circular.csv,SMOTENC,,SupporVectorMachine,0.87893,0.882359,0.870925,0.894706,0.878673,0.927838,0.929602,0.920653,0.938727,0.927671,0.903614,0.906103,0.897674,0.914692,0.903424
3,df_extended.csv,,SelectKBest,SupporVectorMachine,0.86746,0.871363,0.859666,0.88402,0.867197,0.895229,0.89767,0.890058,0.905413,0.895073,0.903614,0.907407,0.886878,0.92891,0.90318
4,df_extended.csv,SMOTENC,SelectKBest,SupporVectorMachine,0.866251,0.870001,0.859277,0.881636,0.866005,0.897947,0.900059,0.894768,0.905413,0.897832,0.903614,0.907407,0.886878,0.92891,0.90318
21,df_circular.csv,,SelectKBest,SupporVectorMachine,0.86655,0.869345,0.863573,0.875673,0.866403,0.898853,0.9015,0.891279,0.911957,0.898652,0.903614,0.907834,0.883408,0.933649,0.903099
11,df_fp2.csv,RandomUnderSampler,,SupporVectorMachine,0.866855,0.870076,0.862093,0.878659,0.866664,0.898853,0.901151,0.894028,0.908388,0.898707,0.903614,0.907407,0.886878,0.92891,0.90318
9,df_fp2.csv,,,SupporVectorMachine,0.86655,0.870531,0.858011,0.884003,0.86627,0.90006,0.902847,0.891078,0.914932,0.899832,0.903614,0.907407,0.886878,0.92891,0.90318
10,df_fp2.csv,SMOTENC,,SupporVectorMachine,0.868965,0.87207,0.864983,0.879839,0.868786,0.902174,0.904312,0.897947,0.910767,0.902042,0.901205,0.904872,0.886364,0.924171,0.900811
20,df_circular.csv,RandomUnderSampler,,SupporVectorMachine,0.878326,0.881645,0.87123,0.892921,0.878087,0.928442,0.930191,0.921237,0.939322,0.928275,0.901205,0.903981,0.893519,0.914692,0.900973


In [9]:
cv_stats = results_df['CV_Mean_Accuracy'].describe()
train_stats = results_df['Train_Accuracy'].describe()
val_stats = results_df['Val_Accuracy'].describe()

print('\nTrain cross validation accuracy\n')
print(cv_stats)
print('\nTrain accuracy\n')
print(train_stats)
print('\nValidation accuracy\n')
print(val_stats)


Train cross validation accuracy

count    27.000000
mean      0.866429
std       0.008188
min       0.846931
25%       0.864437
50%       0.866849
75%       0.870327
max       0.880139
Name: CV_Mean_Accuracy, dtype: float64

Train accuracy

count    27.000000
mean      0.896314
std       0.014864
min       0.875906
25%       0.884058
50%       0.897343
75%       0.902174
max       0.929046
Name: Train_Accuracy, dtype: float64

Validation accuracy

count    27.000000
mean      0.898617
std       0.007676
min       0.879518
25%       0.898795
50%       0.901205
75%       0.903614
max       0.910843
Name: Val_Accuracy, dtype: float64
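
The `count` of 27 in each summary matches the size of the search grid: 3 fingerprint files × 1 classifier × 3 dimensionality-reduction options × 3 sampling options. The sketch below enumerates the same grid with `itertools.product`. The names are copied as strings from the cells above and no models are trained here.

```python
from itertools import product

fingerprints = ['df_extended.csv', 'df_fp2.csv', 'df_circular.csv']
classifier_names = ['SupporVectorMachine']
dim_reduction_names = ['None', 'SelectKBest', 'PCA']
sampling_names = ['None', 'SMOTENC', 'RandomUnderSampler']

# Same nesting order as the loops above: fingerprint, classifier, dim. reduction, sampling
grid = list(product(fingerprints, classifier_names, dim_reduction_names, sampling_names))

# Each combination is fitted once per CV fold (10) plus once on the full training set
n_fits = len(grid) * (10 + 1)

print(len(grid))   # 27
print(grid[3])     # ('df_extended.csv', 'SupporVectorMachine', 'SelectKBest', 'None')
print(n_fits)      # 297
```

The position of each tuple in `grid` lines up with the row index in `results_df` before sorting, which can help when tracing a row back to its configuration.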
