# 2.3A Model selection for machine learning

In the model selection step we:
- import the data obtained in step 2, on which the models will be trained
- convert the molecular structures into fingerprints (a bit-vector encoding of the structure, which gives us the feature matrix X)
- choose the most suitable combination of fingerprint, classifier, sampling technique, scaler, and outlier detection technique
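
As a rough illustration of the second point, the sketch below turns two SMILES strings that appear in the dataset into a bit matrix with RDKit MACCS keys. This is an assumption made for illustration only: the fingerprint files used later in this notebook may have been generated with a different tool, and the names `smiles_list` and `X_fp` are hypothetical.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Two SMILES strings that appear in the dataset
smiles_list = ['c1ccc2[nH]nnc2c1', 'Cc1ccc(N)cc1-c1c[nH]nn1']

rows = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)  # returns None if the SMILES cannot be parsed
    if mol is None:
        continue
    fp = MACCSkeys.GenMACCSKeys(mol)  # MACCS keys as an RDKit bit vector
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr and copies the bits into it
    rows.append(arr)

# One row per molecule, one column per fingerprint bit
X_fp = np.vstack(rows)
print(X_fp.shape)
```

Each column of `X_fp` is one binary feature, which is the form the classifiers below expect.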

# Importing libraries and shared functions

In [1]:
%run __A_knjiznice.py

from __A_knjiznice import *
from __B_funkcije import *
import __C_konstante as kon
%matplotlib inline

# Loading the data processed in step 2 (Data processing and analysis)

## Data overview

In [2]:
df = pd.read_csv(f'{kon.path_files}/dp.csv')
df

Unnamed: 0,Smiles,ROMol,Activity
0,O=C1c2cc([N+](=O)[O-])ccc2-n2c1nc1ccccc1c2=O,<rdkit.Chem.rdchem.Mol object at 0x16fce49e0>,1
1,Cc1cc(C2CC2)ncc1-c1ccc(C2(C(=O)Nc3ccc(F)cc3)CO...,<rdkit.Chem.rdchem.Mol object at 0x16fcb09e0>,1
2,O=C(Nc1ccc(C2(C(=O)Nc3ccc(F)cc3)COC2)cc1)c1ccc...,<rdkit.Chem.rdchem.Mol object at 0x16fcdcba0>,1
3,O=C(Nc1ccc(F)cc1)C1(C2CCC3C(CCCN3c3ccnc(C(F)(F...,<rdkit.Chem.rdchem.Mol object at 0x16fcddcb0>,1
4,O=C1CC(c2c[nH]c3ccc(F)cc23)C(=O)N1,<rdkit.Chem.rdchem.Mol object at 0x16fce25e0>,1
...,...,...,...
4137,FC(F)(F)c1ccc(-c2c[nH]nn2)cc1,<rdkit.Chem.rdchem.Mol object at 0x16fcbef10>,0
4138,c1ccc2[nH]nnc2c1,<rdkit.Chem.rdchem.Mol object at 0x16fcf54d0>,0
4139,Cc1cccc(NC(=O)C(F)(F)F)c1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb5230>,0
4140,Cc1ccc(N)cc1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb2500>,0


In [3]:
activity_counts = df['Activity'].value_counts()
print(activity_counts)

Activity
1    2103
0    2039
Name: count, dtype: int64
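
The two classes are almost balanced. A quick way to put the later accuracy values in context is to compute the class fractions and the accuracy of a trivial majority-class predictor. The sketch below rebuilds the counts printed above by hand; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({1: 2103, 0: 2039}, name='count')

# Fraction of each class in the dataset
fractions = counts / counts.sum()

# Accuracy of a model that always predicts the most frequent class
majority_baseline = fractions.max()

print(fractions.round(4))
print(f'Majority-class baseline accuracy: {majority_baseline:.4f}')
```

Any classifier worth keeping should clearly beat this baseline of roughly 0.51.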


# 3.3 Computing relevant metrics for the machine learning model

## Test different fingerprints

**standard**: Considers paths of a given length. These are hashed fingerprints, with a default length of 1024.
**extended**: Similar to the standard type, but takes rings and atomic properties into account.
**graph**: Similar to the standard type, but simply considers connectivity.
**hybridization**: Similar to the standard type, but only considers hybridization state.
**estate**: 79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.
**cdk-atompairs**: CDK's implementation of the atompairs fingerprint
**cdk-substructure**: CDK substructure fingerprint, basically identical to openbabel's fp4.
**pubchem**: 881 bit fingerprints defined by PubChem.
**klekota-roth**: 4860 bit fingerprint defined by Klekota and Roth.
**shortestpath**: A fingerprint based on the shortest paths between pairs of atoms and takes into account ring systems, charges etc.
**rdk-descriptor**: Various molecular descriptors implemented and calculated by RDKit.
**circular**: An implementation of the ECFP6 fingerprint.
**lingo**: An implementation of the LINGO fingerprint.
**rdkit**: Another implementation of a Daylight-like fingerprint by RDKit.
**maccs**: The popular 166 bit MACCS keys described by MDL.
**avalon**: Substructure or similarity Avalon fingerprint.
**atom-pair**: RDKit Atom-Pair fingerprint.
**topological-torsion**: RDKit Topological-Torsion Fingerprint.
**morgan**: RDKit Morgan fingerprint.
**fp2**: OpenBabel FP2 fingerprint, which indexes small molecule fragments based on linear segments of up to 7 atoms in length.
**fp3**: OpenBabel FP3 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**fp4**: OpenBabel FP4 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**spectrophore**: OpenBabel implementation of the spectrophore fingerprint (https://github.com/silicos-it/spectrophore).
**mol2vec**: Unsupervised machine learning approach for molecule representation.

# Overview of combinations of selected fingerprints and classification models

## In this step I want to check which combination of fingerprint and classification model performs best before I start manipulating the data. This step matters because it gives a baseline, so that in later steps I can see what I am actually improving

In [4]:
# https://medium.com/artificialis/why-how-we-split-train-valid-and-test-fb4d6746ede
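
The splitting scheme used in the next cells can be sketched on synthetic data: a stratified 20% test set is held out first, and the remaining 80% is divided into five stratified train/validation folds. The toy arrays and the seed `0` are assumptions for illustration and are not the notebook's data or `kon.random_seed`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
X_toy = rng.integers(0, 2, size=(200, 16))  # synthetic binary "fingerprint" bits
y_toy = np.array([0, 1] * 100)  # balanced toy labels

# 1) Hold out 20% as a test set that is never used for model selection
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# 2) Split the remaining 80% into 5 stratified train/validation folds
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(tr_idx), len(val_idx)) for tr_idx, val_idx in cv_toy.split(X_tr, y_tr)]

print(len(X_tr), len(X_te), fold_sizes)
```

The validation folds are used to compare configurations, while the test set is only there for a final, unbiased check.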

In [5]:
input_directory = f'{kon.path_files}/molekulski_prstni_odtisi'

generated_fingerprints = ['df_standard.csv', 'df_extended.csv', 'df_graph.csv', 'df_maccs.csv', 'df_pubchem.csv', 
 'df_estate.csv', 'df_hybridization.csv', 'df_lingo.csv', 'df_klekota-roth.csv', 'df_shortestpath.csv', 
 'df_cdk-substructure.csv', 'df_circular.csv', 'df_cdk-atompairs.csv', 'df_rdkit.csv', 'df_morgan.csv', 
 'df_rdk-maccs.csv', 'df_topological-torsion.csv', 'df_avalon.csv', 'df_atom-pair.csv', 'df_fp2.csv']

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, chi2, SelectKBest
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
import numpy as np

# Define classifiers with random_state for reproducibility
classifiers = [
    ('svm', SVC(probability=True, random_state=kon.random_seed)),
    ('lr', LogisticRegression(random_state=kon.random_seed, max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_jobs=-1,
        random_state=kon.random_seed,
        max_features='sqrt',
        min_samples_split=5,
        min_samples_leaf=5,
        max_depth=20,
        bootstrap=True,
        n_estimators=100
    )),
    ('xgb', XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss',
        random_state=kon.random_seed,
        n_jobs=-1,
        max_depth=6,
        min_child_weight=5,
        learning_rate=0.05,
        n_estimators=100,
        gamma=0.3
    ))
]

# Feature selection preprocessing steps
feature_selection_steps = [
    ('select_percentile_25', SelectPercentile(score_func=chi2, percentile=25)),
    ('select_percentile_50', SelectPercentile(score_func=chi2, percentile=50))
]

# Dimensionality reduction options
dimensionality_reduction = [
    ('pca', PCA(n_components=0.9, random_state=kon.random_seed)),
    ('none', 'passthrough')
]

# Stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=kon.random_seed)
scoring = {'accuracy': 'accuracy', 'f1': 'f1', 'precision': 'precision', 'recall': 'recall', 'roc_auc': 'roc_auc'}
results_list = []


for filename in generated_fingerprints:
    file_path = os.path.join(input_directory, filename)
    
    if os.path.exists(file_path):
        print(f'Processing fingerprint DataFrame: {filename}')
        
        # Load the fingerprint data
        df = pd.read_csv(file_path)
        fingerprint_name = filename.split('df_')[1].split('.')[0]
        y = df['Activity'].values.ravel()
        X = df.iloc[:, 3:]
        
        # Split the data into train and test
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=kon.random_seed, stratify=y)
        
        # Loop over classifiers, feature selection methods, and dimensionality reduction
        for clf_name, classifier in classifiers:
            for prep_name, prep_step in feature_selection_steps:
                for dr_name, dr_step in dimensionality_reduction:
                    print(f"Training {clf_name} with {prep_name} and {dr_name} on {fingerprint_name}...")
                    
                    # Define the pipeline steps
                    steps = [
                        ('variance_threshold', VarianceThreshold()),
                        (prep_name, prep_step),
                        (dr_name, dr_step),
                        ('classifier', classifier)
                    ]
                    
                    # Create the pipeline
                    pipeline = Pipeline(steps)
                    
                    # Cross-validation to evaluate the model
                    cv_results = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=scoring, return_train_score=True)
                    
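                    # Fit the pipeline on the full training split and predict on the held-out test set
                    pipeline.fit(X_train, y_train)
                    y_test_pred = pipeline.predict(X_test)
                    # ROC AUC needs continuous scores rather than hard labels,
                    # so that it is comparable with the 'roc_auc' scorer used in cross-validation
                    y_test_proba = pipeline.predict_proba(X_test)[:, 1]
                    
                    # Calculate metrics on the test set
                    test_accuracy = accuracy_score(y_test, y_test_pred)
                    test_f1 = f1_score(y_test, y_test_pred)
                    test_precision = precision_score(y_test, y_test_pred)
                    test_recall = recall_score(y_test, y_test_pred)
                    test_roc_auc = roc_auc_score(y_test, y_test_proba)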
                    
                    # Store the results including the feature selection and dimensionality reduction methods used
                    results_temp = {
                        'Fingerprint': fingerprint_name,
                        'Classifier': clf_name,
                        'Feature_Selection_Method': prep_name,  # Store feature selection method
                        'Dimensionality_Reduction_Method': dr_name,  # Store dimensionality reduction method
                        # 'Steps': str(steps),  # Store the steps as string representation
                        'CV_Train_Accuracy_Folds': list(cv_results['train_accuracy']),
                        'CV_Train_Mean_Accuracy': cv_results['train_accuracy'].mean(),
                        'CV_Train_Mean_F1': cv_results['train_f1'].mean(),
                        'CV_Train_Mean_Precision': cv_results['train_precision'].mean(),
                        'CV_Train_Mean_Recall': cv_results['train_recall'].mean(),
                        'CV_Train_Mean_ROC_AUC': cv_results['train_roc_auc'].mean(),
                        'CV_Val_Accuracy_Folds': list(cv_results['test_accuracy']),
                        'CV_Val_Mean_Accuracy': cv_results['test_accuracy'].mean(),
                        'CV_Val_Mean_F1': cv_results['test_f1'].mean(),
                        'CV_Val_Mean_Precision': cv_results['test_precision'].mean(),
                        'CV_Val_Mean_Recall': cv_results['test_recall'].mean(),
                        'CV_Val_Mean_ROC_AUC': cv_results['test_roc_auc'].mean(),
                        'Test_Accuracy': test_accuracy,
                        'Test_F1': test_f1,
                        'Test_Precision': test_precision,
                        'Test_Recall': test_recall,
                        'Test_ROC_AUC': test_roc_auc
                    }
                    
                    # Append the results for this configuration
                    results_list.append(results_temp)
                    print(results_temp)

# Convert the results list into a DataFrame
results_df = pd.DataFrame(results_list)

Processing fingerprint DataFrame: df_standard.csv
Training svm with select_percentile_25 and pca on standard...
{'Fingerprint': 'standard', 'Classifier': 'svm', 'Feature_Selection_Method': 'select_percentile_25', 'Dimensionality_Reduction_Method': 'pca', 'CV_Train_Accuracy_Folds': [0.8950943396226415, 0.8883018867924528, 0.889433962264151, 0.8890984534138061, 0.8902301018483592], 'CV_Train_Mean_Accuracy': 0.8904317487882821, 'CV_Train_Mean_F1': 0.8942105740798171, 'CV_Train_Mean_Precision': 0.8769680437647421, 'CV_Train_Mean_Recall': 0.9121587299833734, 'CV_Train_Mean_ROC_AUC': 0.9623834065923791, 'CV_Val_Accuracy_Folds': [0.8642533936651584, 0.8702865761689291, 0.8567119155354449, 0.851963746223565, 0.8625377643504532], 'CV_Val_Mean_Accuracy': 0.86115067918871, 'CV_Val_Mean_F1': 0.866663276436382, 'CV_Val_Mean_Precision': 0.8458380297438955, 'CV_Val_Mean_Recall': 0.8888300127172531, 'CV_Val_Mean_ROC_AUC': 0.9291842774501748, 'Test_Accuracy': 0.8757539203860072, 'Test_F1': 0.8760529482
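
A note on reading the ROC AUC columns: the `roc_auc` scorer used in cross-validation ranks samples by a continuous score, whereas passing hard 0/1 predictions to `roc_auc_score` can give a different value. The toy numbers below are invented purely to show the effect.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]
y_score = [0.2, 0.6, 0.55, 0.7, 0.9]  # toy predicted probabilities of class 1
y_label = [int(s >= 0.5) for s in y_score]  # hard predictions at a 0.5 threshold

# AUC computed from the continuous scores
auc_from_scores = roc_auc_score(y_true, y_score)

# AUC computed from thresholded labels, which discards the ranking information
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)
```

Cross-validated and test ROC AUC values are only directly comparable when both are computed from scores or probabilities.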

In [7]:
results_df.to_csv(f'{kon.path_files}/rezultati_modelov/results_df_new12.csv', index=False)
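
Once the results are saved, one possible way to compare configurations is to rank them by a cross-validated validation metric and to look at the gap between training and validation accuracy. The sketch below uses a small hypothetical table that only mimics the column layout of `results_df`; the fingerprint names and all numbers are made up, and `Overfit_Gap` is an illustrative column that does not exist in the original results.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of results_df; the values are invented
toy_results = pd.DataFrame({
    'Fingerprint': ['fp_a', 'fp_a', 'fp_b'],
    'Classifier': ['svm', 'rf', 'svm'],
    'CV_Train_Mean_Accuracy': [0.90, 0.99, 0.88],
    'CV_Val_Mean_Accuracy': [0.86, 0.84, 0.87],
    'CV_Val_Mean_ROC_AUC': [0.93, 0.91, 0.94],
})

# Gap between training and validation accuracy as a rough overfitting indicator
toy_results['Overfit_Gap'] = (
    toy_results['CV_Train_Mean_Accuracy'] - toy_results['CV_Val_Mean_Accuracy']
)

# Rank by the validation score so the test set stays reserved for the final check
ranked = toy_results.sort_values('CV_Val_Mean_ROC_AUC', ascending=False).reset_index(drop=True)

print(ranked[['Fingerprint', 'Classifier', 'CV_Val_Mean_ROC_AUC', 'Overfit_Gap']])
```

Selecting on validation metrics rather than on the `Test_*` columns keeps the test set usable as an unbiased estimate in later steps.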