# 2.3A Selecting a machine learning model

In the model selection step, we:
- import the data obtained in step 2, on which the models will be trained
- convert the molecular structure data into fingerprints (a bit-vector representation of the structure, which gives us the features X)
- select the most suitable combination of fingerprint, classifier, sampling technique, scaler, and outlier detection technique
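
The fingerprint conversion mentioned in the list can be sketched as follows. This is a minimal illustration that assumes RDKit's Morgan (circular) fingerprint generator with radius 2 and 1024 bits; the fingerprint types and parameters actually used in this project are defined elsewhere and may differ.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Example SMILES taken from the dataset shown later in this notebook.
smiles = "c1ccc2[nH]nnc2c1"
mol = Chem.MolFromSmiles(smiles)

# Assumed settings: radius 2, 1024 bits (illustrative, not the project's).
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)
fingerprint = generator.GetFingerprintAsNumPy(mol)  # array of 0/1 values

print(fingerprint.shape)
print(int(fingerprint.sum()), "bits set")
```

Each position of the resulting vector becomes one feature column, so a 1024-bit fingerprint yields 1024 features per molecule.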

# Importing libraries and general functions

In [1]:
%run __A_knjiznice.py

from __A_knjiznice import *
from __B_funkcije import *
import __C_konstante as kon
%matplotlib inline

# Importing the data processed in step 2 (Data processing and analysis)

## Data overview

We are limited to a maximum of 4142 samples. This sample size is fairly small, but we are constrained by the existing data.

Two rules of thumb are often used to estimate the required size of a training set.

1. Rule-of-Thumb for Prediction Classes
The first rule suggests a sample size of at least 50 to 1000 times the number of prediction classes. Since we are dealing with binary classification (2 classes), this guideline requires a minimum of 100 to 2000 samples. With 4142 samples, we comfortably exceed this range, suggesting that, from the perspective of prediction classes alone, our sample size is adequate.
2. Rule-of-Thumb for Observations vs. Features
The more challenging guideline in our case is the one suggesting at least 20 times as many observations as features. With up to 4860 features for certain fingerprints, this rule would imply that we need around 100,000 observations (20 × 4860 = 97,200), far more than our current sample size. This guideline matters in machine learning because it helps avoid overfitting, where a model learns the noise in the training data instead of the actual signal and therefore generalizes poorly to new data.

To get as close as possible to the second rule, I decided to exclude fingerprints with more than 1024 features, since wider fingerprints are more likely to lead to overfitting.
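
The arithmetic behind the two rules can be checked directly. A minimal sketch using only the figures quoted above (the variable names are illustrative):

```python
# Figures taken from the text above.
n_samples = 4142          # compounds available
n_classes = 2             # active / inactive
max_features_all = 4860   # widest fingerprint considered
max_features_kept = 1024  # cut-off chosen in this notebook

# Rule 1: 50 to 1000 samples per prediction class.
rule1_low, rule1_high = 50 * n_classes, 1000 * n_classes

# Rule 2: at least 20 observations per feature.
rule2_all = 20 * max_features_all
rule2_kept = 20 * max_features_kept

print(f"Rule 1 range: {rule1_low} to {rule1_high} samples")
print(f"Rule 2, 4860 features: {rule2_all} samples")
print(f"Rule 2, 1024 features: {rule2_kept} samples")
print(f"Features supportable by {n_samples} samples: {n_samples // 20}")
```

Under the 20:1 rule, 4142 samples would support only about 207 features, so the 1024-feature cut-off reduces the risk of overfitting rather than removing it.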



In [2]:
df = pd.read_csv(f'{kon.path_files}/dp.csv')
df

Unnamed: 0,Smiles,ROMol,Activity
0,O=C1c2cc([N+](=O)[O-])ccc2-n2c1nc1ccccc1c2=O,<rdkit.Chem.rdchem.Mol object at 0x16fce49e0>,1
1,Cc1cc(C2CC2)ncc1-c1ccc(C2(C(=O)Nc3ccc(F)cc3)CO...,<rdkit.Chem.rdchem.Mol object at 0x16fcb09e0>,1
2,O=C(Nc1ccc(C2(C(=O)Nc3ccc(F)cc3)COC2)cc1)c1ccc...,<rdkit.Chem.rdchem.Mol object at 0x16fcdcba0>,1
3,O=C(Nc1ccc(F)cc1)C1(C2CCC3C(CCCN3c3ccnc(C(F)(F...,<rdkit.Chem.rdchem.Mol object at 0x16fcddcb0>,1
4,O=C1CC(c2c[nH]c3ccc(F)cc23)C(=O)N1,<rdkit.Chem.rdchem.Mol object at 0x16fce25e0>,1
...,...,...,...
4137,FC(F)(F)c1ccc(-c2c[nH]nn2)cc1,<rdkit.Chem.rdchem.Mol object at 0x16fcbef10>,0
4138,c1ccc2[nH]nnc2c1,<rdkit.Chem.rdchem.Mol object at 0x16fcf54d0>,0
4139,Cc1cccc(NC(=O)C(F)(F)F)c1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb5230>,0
4140,Cc1ccc(N)cc1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb2500>,0


In [3]:
activity_counts = df['Activity'].value_counts()
print(activity_counts)

Activity
1    2103
0    2039
Name: count, dtype: int64
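
The two classes are almost evenly represented. A small sketch that computes the class shares from the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the output above.
counts = {1: 2103, 0: 2039}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total samples: {total}")
for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.3f}")
```

With a ratio this close to 1, resampling techniques such as SMOTENC or RandomUnderSampler would be expected to change the results only slightly.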


# Overview of combinations of selected fingerprints, classification models, and preprocessing steps

In [4]:
# https://medium.com/artificialis/why-how-we-split-train-valid-and-test-fb4d6746ede

In [5]:
input_directory = f'{kon.path_files}/molekulski_prstni_odtisi'

generated_fingerprints = [
    'df_circular.csv',
]

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold
from sklearn.cluster import FeatureAgglomeration
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
import numpy as np

# Define classifiers with random_state for reproducibility
classifiers = {
    "RandomForestClassifier": RandomForestClassifier(n_jobs=-1, random_state=kon.random_seed),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_jobs=-1, random_state=kon.random_seed),
    # "LogisticRegression": LogisticRegression(max_iter=1000, random_state=kon.random_seed),
    # "XGBClassifier": XGBClassifier(eval_metric='logloss', n_jobs=-1, random_state=kon.random_seed)
}

# Dimensionality reduction methods (feature/component counts are set explicitly, not defaults)
dim_reduction_methods = {
    "None": None,
    "SelectKBest": SelectKBest(score_func=chi2, k=150),
    "LDA": LinearDiscriminantAnalysis(n_components=1),
    "FeatureAgglomeration": FeatureAgglomeration(n_clusters=100),
    "PCA": PCA(n_components=50)  
}

# Resampling methods for handling imbalanced data
sampling_techniques = {
    "None": None,
    # NOTE: only columns 0 and 1 are declared categorical here; SMOTENC treats all other columns as continuous
    "SMOTENC": SMOTENC(categorical_features=[0, 1], random_state=kon.random_seed),
    "RandomUnderSampler": RandomUnderSampler(random_state=kon.random_seed)
}

# Store results
results_list = []

# Define Stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=kon.random_seed)

# List of specific filenames to process
for filename in generated_fingerprints:  # Assuming generated_fingerprints is defined
    file_path = os.path.join(input_directory, filename)
    
    if os.path.exists(file_path):  # Check if the file exists
        print(f'Processing fingerprint DataFrame: {filename}')
        
        df = pd.read_csv(file_path)
        y = df[['Activity']].values.ravel()  # Assuming 'Activity' is the target
        X = df.iloc[:, 3:]  # Assuming features start from the 4th column

        # Remove constant features
        selector = VarianceThreshold()
        X = pd.DataFrame(selector.fit_transform(X), columns=selector.get_feature_names_out())

        # Split the data into train, validation, and test sets
        X_interim, X_test, y_interim, y_test = train_test_split(X, y, test_size=0.10, random_state=kon.random_seed, shuffle=True, stratify=y)
        X_train, X_val, y_train, y_val = train_test_split(X_interim, y_interim, test_size=10/90, random_state=kon.random_seed, shuffle=True, stratify=y_interim)

        # Train and evaluate each classifier
        for clf_name, clf in classifiers.items():
            for dr_name, dr_method in dim_reduction_methods.items():
                # NOTE: fs_name / fs_method hold the sampling technique (not a feature selector)
                for fs_name, fs_method in sampling_techniques.items():
                    steps = []
                    if fs_method is not None:
                        steps.append(('sampler', fs_method))
                    if dr_method is not None:
                        steps.append(('dim_reduction', dr_method))
                    steps.append(('classifier', clf))
                    
                    # Create the pipeline
                    pipeline = ImbPipeline(steps)

                    # Perform cross-validation
                    cv_results = []
                    for train_index, val_index in cv.split(X_train, y_train):
                        X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
                        y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]

                        # Fit the model
                        pipeline.fit(X_train_cv, y_train_cv)

                        # Evaluate on the validation set
                        y_val_pred = pipeline.predict(X_val_cv)
                        val_accuracy = accuracy_score(y_val_cv, y_val_pred)
                        val_f1 = f1_score(y_val_cv, y_val_pred)
                        val_precision = precision_score(y_val_cv, y_val_pred)
                        val_recall = recall_score(y_val_cv, y_val_pred)
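                        # NOTE: computed from hard labels, so this equals balanced accuracy;
                        # pass predict_proba scores instead for a threshold-free AUC
                        val_roc_auc = roc_auc_score(y_val_cv, y_val_pred)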

                        # Store the results for this fold
                        cv_results.append({
                            'Val_Accuracy': val_accuracy,
                            'Val_F1': val_f1,
                            'Val_Precision': val_precision,
                            'Val_Recall': val_recall,
                            'Val_ROC_AUC': val_roc_auc,
                        })

                    # Calculate mean metrics across all folds
                    mean_cv_results = pd.DataFrame(cv_results).mean()

                    # Fit the model on the entire training set
                    pipeline.fit(X_train, y_train)

                    # Evaluate on the training set (train metrics)
                    y_train_pred = pipeline.predict(X_train)
                    train_accuracy = accuracy_score(y_train, y_train_pred)
                    train_f1 = f1_score(y_train, y_train_pred)
                    train_precision = precision_score(y_train, y_train_pred)
                    train_recall = recall_score(y_train, y_train_pred)
                    train_roc_auc = roc_auc_score(y_train, y_train_pred)

                    # Evaluate on the hold-out validation set
                    y_val_final_pred = pipeline.predict(X_val)
                    val_accuracy_final = accuracy_score(y_val, y_val_final_pred)
                    val_f1_final = f1_score(y_val, y_val_final_pred)
                    val_precision_final = precision_score(y_val, y_val_final_pred)
                    val_recall_final = recall_score(y_val, y_val_final_pred)
                    val_roc_auc_final = roc_auc_score(y_val, y_val_final_pred)


                    # Append results to the list
                    results_temp = {
                        'Fingerprint': filename,  # Use the filename for identification
                        'Feature_Selection': fs_name,
                        'Dim_Reduction': dr_name,
                        'Classifier': clf_name,
                        'CV_Mean_Accuracy': mean_cv_results['Val_Accuracy'],
                        'CV_Mean_F1': mean_cv_results['Val_F1'],
                        'CV_Mean_Precision': mean_cv_results['Val_Precision'],
                        'CV_Mean_Recall': mean_cv_results['Val_Recall'],
                        'CV_Mean_ROC_AUC': mean_cv_results['Val_ROC_AUC'],
                        'Train_Accuracy': train_accuracy,
                        'Train_F1': train_f1,
                        'Train_Precision': train_precision,
                        'Train_Recall': train_recall,
                        'Train_ROC_AUC': train_roc_auc,
                        'Val_Accuracy': val_accuracy_final,
                        'Val_F1': val_f1_final,
                        'Val_Precision': val_precision_final,
                        'Val_Recall': val_recall_final,
                        'Val_ROC_AUC': val_roc_auc_final,
                    }
                    results_list.append(results_temp)
                    print("\nResults:")
                    print(results_temp)

# Create DataFrame from the results list
results_df = pd.DataFrame(results_list)
print("\nFinal Results:")
print(results_df)



Processing fingerprint DataFrame: df_circular.csv

Results:
{'Fingerprint': 'df_circular.csv', 'Feature_Selection': 'None', 'Dim_Reduction': 'None', 'Classifier': 'RandomForestClassifier', 'CV_Mean_Accuracy': 0.8807401812688822, 'CV_Mean_F1': 0.8832933751016702, 'CV_Mean_Precision': 0.8776301329368424, 'CV_Mean_Recall': 0.8893561566638489, 'CV_Mean_ROC_AUC': 0.8806013163986609, 'Train_Accuracy': 0.9993961352657005, 'Train_F1': 0.9994054696789536, 'Train_Precision': 0.9988116458704694, 'Train_Recall': 1.0, 'Train_ROC_AUC': 0.9993868792152054, 'Val_Accuracy': 0.9132530120481928, 'Val_F1': 0.9158878504672897, 'Val_Precision': 0.9032258064516129, 'Val_Recall': 0.9289099526066351, 'Val_ROC_AUC': 0.9129843880680235}

Results:
{'Fingerprint': 'df_circular.csv', 'Feature_Selection': 'SMOTENC', 'Dim_Reduction': 'None', 'Classifier': 'RandomForestClassifier', 'CV_Mean_Accuracy': 0.8795253521639428, 'CV_Mean_F1': 0.8821665621842799, 'CV_Mean_Precision': 0.8759727186531094, 'CV_Mean_Recall': 0.888
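
The two successive `train_test_split` calls are meant to yield an 80/10/10 split. The sketch below reproduces the resulting set sizes with plain arithmetic, assuming the 4142 rows reported earlier and scikit-learn's behaviour of rounding a fractional test size up to the next whole sample:

```python
import math

n_total = 4142  # rows reported in the data overview

# First split: hold out 10 % as the test set.
n_test = math.ceil(0.10 * n_total)
n_interim = n_total - n_test

# Second split: 10/90 of the remainder becomes the validation set.
n_val = math.ceil((10 / 90) * n_interim)
n_train = n_interim - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:<10} {n:>5}  ({n / n_total:.1%})")
```

These sizes are consistent with the reported metrics: a training accuracy of 0.999396 corresponds to 2 errors in 3312 samples, and a validation accuracy of 0.913253 corresponds to 379 correct predictions out of 415.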

In [7]:
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,Train_F1,Train_Precision,Train_Recall,Train_ROC_AUC,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC
0,df_circular.csv,,,RandomForestClassifier,0.88074,0.883293,0.87763,0.889356,0.880601,0.999396,0.999405,0.998812,1.0,0.999387,0.913253,0.915888,0.903226,0.92891,0.912984
1,df_circular.csv,SMOTENC,,RandomForestClassifier,0.879525,0.882167,0.875973,0.888764,0.879383,0.999396,0.999405,0.998812,1.0,0.999387,0.906024,0.908665,0.898148,0.919431,0.905794
2,df_circular.csv,RandomUnderSampler,,RandomForestClassifier,0.881648,0.883583,0.881937,0.885792,0.881578,0.998188,0.998214,0.998809,0.99762,0.998197,0.896386,0.899767,0.885321,0.914692,0.896071
3,df_circular.csv,,SelectKBest,RandomForestClassifier,0.876507,0.877919,0.880996,0.875074,0.876526,0.992754,0.992887,0.989368,0.996431,0.992697,0.893976,0.898148,0.877828,0.919431,0.893539
4,df_circular.csv,SMOTENC,SelectKBest,RandomForestClassifier,0.88013,0.880974,0.887906,0.874486,0.880216,0.992754,0.992891,0.988791,0.997026,0.992688,0.889157,0.893023,0.876712,0.909953,0.8888
5,df_circular.csv,RandomUnderSampler,SelectKBest,RandomForestClassifier,0.878017,0.87868,0.88714,0.870914,0.878118,0.991244,0.991382,0.990499,0.992267,0.991228,0.874699,0.877934,0.869767,0.886256,0.874501
6,df_circular.csv,,LDA,RandomForestClassifier,0.817938,0.820767,0.821467,0.821559,0.817876,0.998792,0.99881,0.999404,0.998215,0.998801,0.845783,0.851852,0.832579,0.872038,0.845333
7,df_circular.csv,SMOTENC,LDA,RandomForestClassifier,0.827893,0.830958,0.829512,0.832847,0.827809,0.998792,0.99881,0.99881,0.99881,0.998792,0.838554,0.843091,0.833333,0.853081,0.838305
8,df_circular.csv,RandomUnderSampler,LDA,RandomForestClassifier,0.817022,0.81899,0.822832,0.815575,0.817029,0.996981,0.997022,0.998211,0.995836,0.996998,0.816867,0.820755,0.816901,0.824645,0.816734
9,df_circular.csv,,FeatureAgglomeration,RandomForestClassifier,0.879529,0.881653,0.879008,0.884608,0.879451,0.999396,0.999405,0.998812,1.0,0.999387,0.906024,0.909513,0.890909,0.92891,0.905631
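
The manual fold loop that produces the `CV_Mean_*` columns can also be written with scikit-learn's `cross_validate`. The sketch below is an alternative formulation, not the code used to produce the results above; it runs on synthetic data from `make_classification`, and the seed, sample count, and number of trees are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

SEED = 0  # placeholder for kon.random_seed

# Synthetic stand-in for a fingerprint matrix (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=64, random_state=SEED)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
clf = RandomForestClassifier(n_estimators=50, random_state=SEED)

# One call fits the model on each fold and scores it with every metric listed.
scores = cross_validate(
    clf, X, y, cv=cv,
    scoring=["accuracy", "f1", "precision", "recall", "roc_auc"],
)

# Average each metric over the folds, mirroring the CV_Mean_* columns.
cv_means = {name: float(np.mean(vals))
            for name, vals in scores.items() if name.startswith("test_")}
for name, value in cv_means.items():
    print(f"{name}: {value:.3f}")
```

Note that the built-in `roc_auc` scorer uses predicted probabilities, so its values are not directly comparable with the `CV_Mean_ROC_AUC` column, which is computed from hard class labels.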


In [8]:
results_df.sort_values(by=['Val_Accuracy'], ascending=False, inplace = True)
results_df

Unnamed: 0,Fingerprint,Feature_Selection,Dim_Reduction,Classifier,CV_Mean_Accuracy,CV_Mean_F1,CV_Mean_Precision,CV_Mean_Recall,CV_Mean_ROC_AUC,Train_Accuracy,Train_F1,Train_Precision,Train_Recall,Train_ROC_AUC,Val_Accuracy,Val_F1,Val_Precision,Val_Recall,Val_ROC_AUC
0,df_circular.csv,,,RandomForestClassifier,0.88074,0.883293,0.87763,0.889356,0.880601,0.999396,0.999405,0.998812,1.0,0.999387,0.913253,0.915888,0.903226,0.92891,0.912984
9,df_circular.csv,,FeatureAgglomeration,RandomForestClassifier,0.879529,0.881653,0.879008,0.884608,0.879451,0.999396,0.999405,0.998812,1.0,0.999387,0.906024,0.909513,0.890909,0.92891,0.905631
1,df_circular.csv,SMOTENC,,RandomForestClassifier,0.879525,0.882167,0.875973,0.888764,0.879383,0.999396,0.999405,0.998812,1.0,0.999387,0.906024,0.908665,0.898148,0.919431,0.905794
14,df_circular.csv,RandomUnderSampler,PCA,RandomForestClassifier,0.878624,0.88099,0.87653,0.885785,0.878509,0.996377,0.996424,0.998209,0.994646,0.996403,0.901205,0.904429,0.889908,0.919431,0.900892
16,df_circular.csv,SMOTENC,,ExtraTreesClassifier,0.884058,0.886122,0.883289,0.88936,0.883974,0.999396,0.999405,1.0,0.99881,0.999405,0.898795,0.901869,0.889401,0.914692,0.898522
11,df_circular.csv,RandomUnderSampler,FeatureAgglomeration,RandomForestClassifier,0.880739,0.882988,0.879176,0.886979,0.880638,0.997585,0.997618,0.998807,0.996431,0.997602,0.898795,0.902778,0.882353,0.924171,0.89836
15,df_circular.csv,,,ExtraTreesClassifier,0.882244,0.884378,0.881573,0.887574,0.88216,0.999396,0.999405,1.0,0.99881,0.999405,0.898795,0.901408,0.893023,0.909953,0.898604
27,df_circular.csv,,PCA,ExtraTreesClassifier,0.876214,0.878156,0.87684,0.879836,0.876148,0.999396,0.999405,0.998812,1.0,0.999387,0.896386,0.900232,0.881818,0.919431,0.89599
2,df_circular.csv,RandomUnderSampler,,RandomForestClassifier,0.881648,0.883583,0.881937,0.885792,0.881578,0.998188,0.998214,0.998809,0.99762,0.998197,0.896386,0.899767,0.885321,0.914692,0.896071
17,df_circular.csv,RandomUnderSampler,,ExtraTreesClassifier,0.883753,0.885716,0.88435,0.887574,0.883689,0.997585,0.997615,1.0,0.995241,0.99762,0.896386,0.898824,0.892523,0.905213,0.896234
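
The `*_ROC_AUC` columns track the accuracy columns closely because `roc_auc_score` receives hard class predictions in the evaluation cell; for binary labels that quantity equals balanced accuracy. The sketch below illustrates the difference on synthetic data (the dataset, seed, and model settings are assumptions made for the illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 0
X, y = make_classification(n_samples=400, n_features=32, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

clf = RandomForestClassifier(n_estimators=50, random_state=SEED).fit(X_train, y_train)

y_pred = clf.predict(X_val)               # hard 0/1 labels
y_score = clf.predict_proba(X_val)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_val, y_pred)
auc_from_scores = roc_auc_score(y_val, y_score)
balanced_acc = balanced_accuracy_score(y_val, y_pred)

print(f"AUC from labels:        {auc_from_labels:.4f}")
print(f"balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The first two numbers coincide, while the probability-based AUC also reflects how well the model ranks samples across all thresholds.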


In [9]:
cv_stats = results_df['CV_Mean_Accuracy'].describe()
train_stats = results_df['Train_Accuracy'].describe()
val_stats = results_df['Val_Accuracy'].describe()

print('\nTrain cross validation accuracy\n')
print(cv_stats)
print('\nTrain accuracy\n')
print(train_stats)
print('\nValidation accuracy\n')
print(val_stats)


Train cross validation accuracy

count    30.000000
mean      0.867391
std       0.023879
min       0.816118
25%       0.872649
50%       0.878320
75%       0.881421
max       0.884361
Name: CV_Mean_Accuracy, dtype: float64

Train accuracy

count    30.000000
mean      0.997373
std       0.002789
min       0.990942
25%       0.997056
50%       0.998792
75%       0.999396
max       0.999396
Name: Train_Accuracy, dtype: float64

Validation accuracy

count    30.000000
mean      0.881124
std       0.026078
min       0.816867
25%       0.878313
50%       0.891566
75%       0.896386
max       0.913253
Name: Val_Accuracy, dtype: float64
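
The summary statistics show a large gap between training accuracy (mean 0.997) and cross-validated accuracy (mean 0.867), which is a common sign of overfitting. A small sketch that quantifies this gap per configuration, using three RandomForestClassifier rows copied from the results table (the `Gap` column name is illustrative):

```python
import pandas as pd

# Three rows copied from the results table (RandomForestClassifier, no resampling).
rows = pd.DataFrame({
    "Dim_Reduction": ["None", "SelectKBest", "LDA"],
    "CV_Mean_Accuracy": [0.880740, 0.876507, 0.817938],
    "Train_Accuracy": [0.999396, 0.992754, 0.998792],
})

# Difference between training and cross-validated accuracy.
rows["Gap"] = rows["Train_Accuracy"] - rows["CV_Mean_Accuracy"]
print(rows.sort_values("Gap").to_string(index=False))
```

Among these three rows, the LDA configuration shows the widest gap, while the other two differ by roughly 0.12.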
