# 2.3A Choosing a machine learning model

In the model selection step for machine learning, we:
- import the data obtained in step 2, on which the models will be trained
- convert the molecular structure data into fingerprints (a bit-vector representation of the structure, which gives us the features, X)
- choose the most suitable combination of fingerprint, classifier, sampling technique, scaler, and outlier detection technique

# Importing libraries and shared functions

In [1]:
%run __A_knjiznice.py

from __A_knjiznice import *
from __B_funkcije import *
import __C_konstante as kon
%matplotlib inline

# Importing the data processed in step 2 (Data processing and analysis)

## Data overview

In [2]:
df = pd.read_csv(f'{kon.path_files}/dp.csv')
df

Unnamed: 0,Smiles,ROMol,Activity
0,O=C1c2cc([N+](=O)[O-])ccc2-n2c1nc1ccccc1c2=O,<rdkit.Chem.rdchem.Mol object at 0x16fce49e0>,1
1,Cc1cc(C2CC2)ncc1-c1ccc(C2(C(=O)Nc3ccc(F)cc3)CO...,<rdkit.Chem.rdchem.Mol object at 0x16fcb09e0>,1
2,O=C(Nc1ccc(C2(C(=O)Nc3ccc(F)cc3)COC2)cc1)c1ccc...,<rdkit.Chem.rdchem.Mol object at 0x16fcdcba0>,1
3,O=C(Nc1ccc(F)cc1)C1(C2CCC3C(CCCN3c3ccnc(C(F)(F...,<rdkit.Chem.rdchem.Mol object at 0x16fcddcb0>,1
4,O=C1CC(c2c[nH]c3ccc(F)cc23)C(=O)N1,<rdkit.Chem.rdchem.Mol object at 0x16fce25e0>,1
...,...,...,...
4137,FC(F)(F)c1ccc(-c2c[nH]nn2)cc1,<rdkit.Chem.rdchem.Mol object at 0x16fcbef10>,0
4138,c1ccc2[nH]nnc2c1,<rdkit.Chem.rdchem.Mol object at 0x16fcf54d0>,0
4139,Cc1cccc(NC(=O)C(F)(F)F)c1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb5230>,0
4140,Cc1ccc(N)cc1-c1c[nH]nn1,<rdkit.Chem.rdchem.Mol object at 0x16fcb2500>,0


In [3]:
activity_counts = df['Activity'].value_counts()
print(activity_counts)

Activity
1    2103
0    2039
Name: count, dtype: int64
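
The two classes are almost balanced, which sets the baseline any classifier has to beat. A minimal sketch using only the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
counts = {1: 2103, 0: 2039}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"samples: {total}")
print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

With a baseline this close to 0.5, plain accuracy remains an informative metric here, which would not be the case for a strongly imbalanced data set.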


# 3.3 Computing relevant metrics for the machine learning model

## Test different fingerprints

**standard**: Considers paths of a given length. These are hashed fingerprints, with a default length of 1024.
**extended**: Similar to the standard type, but takes rings and atomic properties into account.
**graph**: Similar to the standard type, but simply considers connectivity.
**hybridization**: Similar to the standard type, but only considers hybridization state.
**estate**: 79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.
**cdk-atompairs**: CDK's implementation of the atompairs fingerprint
**cdk-substructure**: CDK substructure fingerprint, basically identical to openbabel's fp4.
**pubchem**: 881 bit fingerprints defined by PubChem.
**klekota-roth**: 4860 bit fingerprint defined by Klekota and Roth.
**shortestpath**: A fingerprint based on the shortest paths between pairs of atoms and takes into account ring systems, charges etc.
**rdk-descriptor**: Various molecular descriptors implemented and calculated by RDKit.
**circular**: An implementation of the ECFP6 fingerprint.
**lingo**: An implementation of the LINGO fingerprint.
**rdkit**: Another implementation of a Daylight-like fingerprint by RDKit.
**maccs**: The popular 166 bit MACCS keys described by MDL.
**avalon**: Substructure or similarity Avalon fingerprint.
**atom-pair**: RDKit Atom-Pair fingerprint.
**topological-torsion**: RDKit Topological-Torsion Fingerprint.
**morgan**: RDKit Morgan fingerprint.
**fp2**: OpenBabel FP2 fingerprint, which indexes small molecule fragments based on linear segments of up to 7 atoms in length.
**fp3**: OpenBabel FP3 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**fp4**: OpenBabel FP4 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**spectrophore**: OpenBabel implementation of the spectrophore fingerprint (https://github.com/silicos-it/spectrophore).
**mol2vec**: Unsupervised machine learning approach for molecule representation.
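
Several of the entries above are hashed fingerprints: fragments of the molecule are enumerated, each fragment is hashed, and the hash selects a bit in a fixed-length vector. The toy sketch below illustrates only that idea. It uses overlapping SMILES substrings (loosely in the spirit of the LINGO entry) instead of real chemical paths, and it is not the implementation used by any of the listed libraries. The names `toy_hashed_fingerprint`, `tanimoto`, `n_bits` and `q` are hypothetical.

```python
import zlib


def toy_hashed_fingerprint(smiles: str, n_bits: int = 1024, q: int = 4) -> list[int]:
    """Set one bit per overlapping q-character SMILES substring (toy example)."""
    bits = [0] * n_bits
    for i in range(max(len(smiles) - q + 1, 1)):
        fragment = smiles[i:i + q]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


def tanimoto(a: list[int], b: list[int]) -> float:
    """Tanimoto (Jaccard) similarity of two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Two SMILES strings taken from the data preview earlier in the notebook.
fp_a = toy_hashed_fingerprint("c1ccc2[nH]nnc2c1")
fp_b = toy_hashed_fingerprint("Cc1ccc(N)cc1-c1c[nH]nn1")

print(len(fp_a), sum(fp_a))
print(f"Tanimoto(a, a) = {tanimoto(fp_a, fp_a):.2f}")
print(f"Tanimoto(a, b) = {tanimoto(fp_a, fp_b):.2f}")
```

Because different fragments can hash to the same bit, a shorter vector means more collisions. This is one reason the fingerprint lengths quoted in the list (79, 166, 881, 1024, 4860 bits) matter for the downstream feature selection.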

# Overview of combinations of selected fingerprints and classification models

## In this step I want to check which combination of fingerprint and classification model performs best before I start manipulating the data. This step matters because it gives me a baseline against which I can judge improvements in later steps

In [4]:
# https://medium.com/artificialis/why-how-we-split-train-valid-and-test-fb4d6746ede
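
The evaluation cell further down holds out 20% of the data as a test set and runs 5-fold cross-validation on the remaining training data. A rough sketch of the resulting sample counts follows. It assumes the 4142 rows counted earlier, that the hold-out size is rounded up to a whole sample, and that folds are as equal in size as possible (stratification can shift an individual fold by a sample). All variable names are illustrative.

```python
import math

n_samples = 4142      # 2103 actives + 2039 inactives, from the counts above
test_size = 0.2       # hold-out fraction used in the evaluation cell
n_splits = 5          # number of cross-validation folds

# Assumption: the hold-out size is rounded up to a whole sample.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Assumption: the first (n_train % n_splits) folds get one extra sample.
base, extra = divmod(n_train, n_splits)
val_fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
train_fold_sizes = [n_train - v for v in val_fold_sizes]

print(n_test, n_train)
print(val_fold_sizes)
print(train_fold_sizes)
```

These sizes agree with the per-fold accuracies in the printed results: for example, a validation accuracy of 0.8702865761689291 corresponds to 577 correct predictions out of 663.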

In [5]:
input_directory = f'{kon.path_files}/molekulski_prstni_odtisi'

generated_fingerprints = ['df_standard.csv', 'df_extended.csv', 'df_graph.csv', 'df_maccs.csv', 'df_pubchem.csv', 
 'df_estate.csv', 'df_hybridization.csv', 'df_lingo.csv', 'df_klekota-roth.csv', 'df_shortestpath.csv', 
 'df_cdk-substructure.csv', 'df_circular.csv', 'df_cdk-atompairs.csv', 'df_rdkit.csv', 'df_morgan.csv', 
 'df_rdk-maccs.csv', 'df_topological-torsion.csv', 'df_avalon.csv', 'df_atom-pair.csv', 'df_fp2.csv']

In [6]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from imblearn.pipeline import Pipeline
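from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # used in `classifiers` below
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, chi2, SelectKBest
from xgboost import XGBClassifier  # used in `classifiers` below
import numpy as np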

# Define classifiers with random_state for reproducibility
classifiers = [
    ('svm', SVC(probability=True, random_state=kon.random_seed)),
    ('lr', LogisticRegression(random_state=kon.random_seed, max_iter=1000)),
    ('rf', RandomForestClassifier(
        n_jobs=-1,
        random_state=kon.random_seed,
        max_features='sqrt',  # default
        min_samples_split=5,  # Increased from default (2) to prevent overfitting
        min_samples_leaf=5,    # Increased from default (1) to prevent very small leaf nodes
        max_depth=20,          # Restrict the depth of the trees
        bootstrap=True,        # Use bootstrap sampling for more generalization
        n_estimators=100 # default value   
    )),
    ('xgb', XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss',
        random_state=kon.random_seed,
        n_jobs=-1,
        max_depth=6,          # default
        min_child_weight=5,   # Ensure a minimum sum of instance weights per leaf
        learning_rate=0.05,   # Lower learning rate to prevent overfitting
        n_estimators=100,     # default
        gamma=0.3             # Minimum loss reduction required for a split (library default is 0)
    ))
]

# Stratified k-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=kon.random_seed)
scoring = {'accuracy': 'accuracy', 'f1': 'f1', 'precision': 'precision', 'recall': 'recall', 'roc_auc': 'roc_auc'}
results_list = []

for filename in generated_fingerprints:
    file_path = os.path.join(input_directory, filename)
    
    if os.path.exists(file_path):
        print(f'Processing fingerprint DataFrame: {filename}')
        
        df = pd.read_csv(file_path)
        fingerprint_name = filename.split('df_')[1].split('.')[0]
        y = df['Activity'].values.ravel()
        X = df.iloc[:, 3:]
        
        # Train-test split (stratified, so both subsets keep the original class ratio)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=kon.random_seed, stratify=y)

        # Iterate over classifiers (the only feature selection is the VarianceThreshold + SelectKBest pair in the pipeline below)
        for clf_name, classifier in classifiers:
            print(f"Training {clf_name} on {fingerprint_name}...")
            
            # Define the pipeline steps: drop constant features, keep the top-scoring ones, then classify
            steps = [
                ('variance_threshold', VarianceThreshold()),  # Default threshold=0.0 removes only constant features
                # Keep the 300 features with the highest chi2 score
                # (k=300 exceeds the length of short fingerprints such as the 166-bit MACCS keys)
                ('SelectKBest', SelectKBest(score_func=chi2, k=300)),
                ('classifier', classifier)  # Add the classifier
            ]
            
            # Create the pipeline with VarianceThreshold, SelectKBest, and the classifier
            pipeline = Pipeline(steps)
            
            # Cross-validation
            cv_results = cross_validate(pipeline, X_train, y_train, cv=cv, scoring=scoring, return_train_score=True)
            
            # Fit the pipeline on the entire training set
            pipeline.fit(X_train, y_train)
            
            # Held-out test set evaluation (the pipeline was refit on the entire training set above)
            y_test_pred = pipeline.predict(X_test)
            test_accuracy = accuracy_score(y_test, y_test_pred)
            test_f1 = f1_score(y_test, y_test_pred)
            test_precision = precision_score(y_test, y_test_pred)
            test_recall = recall_score(y_test, y_test_pred)
            # Note: roc_auc_score receives hard 0/1 predictions here, so Test_ROC_AUC equals balanced accuracy
            # and is not directly comparable with the score-based CV ROC AUC.
            test_roc_auc = roc_auc_score(y_test, y_test_pred)
            
            # Store the results with corrected labels
            results_temp = {
                'Fingerprint': fingerprint_name,
                'Classifier': clf_name,
                'CV_Train_Accuracy_Folds': list(cv_results['train_accuracy']),  # Training accuracy for each fold
                'CV_Train_Mean_Accuracy': cv_results['train_accuracy'].mean(),
                'CV_Train_Mean_F1': cv_results['train_f1'].mean(),
                'CV_Train_Mean_Precision': cv_results['train_precision'].mean(),
                'CV_Train_Mean_Recall': cv_results['train_recall'].mean(),
                'CV_Train_Mean_ROC_AUC': cv_results['train_roc_auc'].mean(),
                'CV_Val_Accuracy_Folds': list(cv_results['test_accuracy']),  # Validation accuracy for each fold
                'CV_Val_Mean_Accuracy': cv_results['test_accuracy'].mean(),
                'CV_Val_Mean_F1': cv_results['test_f1'].mean(),
                'CV_Val_Mean_Precision': cv_results['test_precision'].mean(),
                'CV_Val_Mean_Recall': cv_results['test_recall'].mean(),
                'CV_Val_Mean_ROC_AUC': cv_results['test_roc_auc'].mean(),
                'Test_Accuracy': test_accuracy,  # Test accuracy
                'Test_F1': test_f1,
                'Test_Precision': test_precision,
                'Test_Recall': test_recall,
                'Test_ROC_AUC': test_roc_auc
            }
            
            results_list.append(results_temp)
            print(results_temp)

# Store all results in a DataFrame
results_df = pd.DataFrame(results_list)

Processing fingerprint DataFrame: df_standard.csv
Training svm on standard...
{'Fingerprint': 'standard', 'Classifier': 'svm', 'CV_Train_Accuracy_Folds': [0.9018867924528302, 0.8977358490566038, 0.8962264150943396, 0.8981516408902301, 0.8985288570350811], 'CV_Train_Mean_Accuracy': 0.898505910905817, 'CV_Train_Mean_F1': 0.9020805761270279, 'CV_Train_Mean_Precision': 0.8840277122083158, 'CV_Train_Mean_Recall': 0.9209284289951778, 'CV_Train_Mean_ROC_AUC': 0.9692254676613358, 'CV_Val_Accuracy_Folds': [0.8702865761689291, 0.8717948717948718, 0.861236802413273, 0.851963746223565, 0.8564954682779456], 'CV_Val_Mean_Accuracy': 0.8623554929757169, 'CV_Val_Mean_F1': 0.8678296543190974, 'CV_Val_Mean_Precision': 0.8469277891142826, 'CV_Val_Mean_Recall': 0.8900116574819839, 'CV_Val_Mean_ROC_AUC': 0.9307590277458161, 'Test_Accuracy': 0.873341375150784, 'Test_F1': 0.8748510131108462, 'Test_Precision': 0.8779904306220095, 'Test_Recall': 0.8717339667458432, 'Test_ROC_AUC': 0.8733669833729216}
...
{'Fingerprint': 'maccs', 'Classifier': 'svm', 'CV_Train_Accuracy_Folds': [0.8864150943396226, 0.8837735849056604, 0.8849056603773585, 0.8845718596755942, 0.8826857789513392], 'CV_Train_Mean_Accuracy': 0.8844703956499149, 'CV_Train_Mean_F1': 0.889287547616021, 'CV_Train_Mean_Precision': 0.8659413445621551, 'CV_Train_Mean_Recall': 0.913942232803239, 'CV_Train_Mean_ROC_AUC': 0.9521507863464004, 'CV_Val_Accuracy_Folds': [0.8702865761689291, 0.8642533936651584, 0.8687782805429864, 0.8549848942598187, 0.8580060422960725], 'CV_Val_Mean_Accuracy': 0.8632618373865931, 'CV_Val_Mean_F1': 0.8691841387088448, 'CV_Val_Mean_Precision': 0.8452265834305905, 'CV_Val_Mean_Recall': 0.8947717959587396, 'CV_Val_Mean_ROC_AUC': 0.9246013587768106, 'Test_Accuracy': 0.8697225572979493, 'Test_F1': 0.8735362997658079, 'Test_Precision': 0.8614318706697459, 'Test_Recall': 0.8859857482185273, 'Test_ROC_AUC': 0.8694634623445577}
Training lr on maccs...




{'Fingerprint': 'maccs', 'Classifier': 'lr', 'CV_Train_Accuracy_Folds': [0.8577358490566038, 0.8547169811320755, 0.8562264150943396, 0.8577895133911732, 0.8528857035081101], 'CV_Train_Mean_Accuracy': 0.8558708924364605, 'CV_Train_Mean_F1': 0.860910895836119, 'CV_Train_Mean_Precision': 0.84395295383135, 'CV_Train_Mean_Recall': 0.8785672542077034, 'CV_Train_Mean_ROC_AUC': 0.9256924462390353, 'CV_Val_Accuracy_Folds': [0.8371040723981901, 0.8401206636500754, 0.8431372549019608, 0.8338368580060423, 0.8413897280966768], 'CV_Val_Mean_Accuracy': 0.8391177154105891, 'CV_Val_Mean_F1': 0.8443896870059338, 'CV_Val_Mean_Precision': 0.8294564560873084, 'CV_Val_Mean_Recall': 0.860292143563657, 'CV_Val_Mean_ROC_AUC': 0.9062838653306924, 'Test_Accuracy': 0.8395657418576599, 'Test_F1': 0.8448074679113186, 'Test_Precision': 0.8302752293577982, 'Test_Recall': 0.8598574821852731, 'Test_ROC_AUC': 0.8392424665828326}
Training rf on maccs...




{'Fingerprint': 'maccs', 'Classifier': 'rf', 'CV_Train_Accuracy_Folds': [0.890943396226415, 0.890943396226415, 0.8856603773584906, 0.8887212372689551, 0.8887212372689551], 'CV_Train_Mean_Accuracy': 0.8889979288698461, 'CV_Train_Mean_F1': 0.8926000438165176, 'CV_Train_Mean_Precision': 0.8771839766958742, 'CV_Train_Mean_Recall': 0.9085921662422599, 'CV_Train_Mean_ROC_AUC': 0.9620033873814707, 'CV_Val_Accuracy_Folds': [0.8717948717948718, 0.8627450980392157, 0.8552036199095022, 0.8564954682779456, 0.8595166163141994], 'CV_Val_Mean_Accuracy': 0.861151134867147, 'CV_Val_Mean_F1': 0.8659789412697709, 'CV_Val_Mean_Precision': 0.8494355760379774, 'CV_Val_Mean_Recall': 0.8834711035749612, 'CV_Val_Mean_ROC_AUC': 0.9282019807435345, 'Test_Accuracy': 0.8636911942098915, 'Test_F1': 0.8665879574970484, 'Test_Precision': 0.8615023474178404, 'Test_Recall': 0.8717339667458432, 'Test_ROC_AUC': 0.8635630618042941}
Training xgb on maccs...




{'Fingerprint': 'maccs', 'Classifier': 'xgb', 'CV_Train_Accuracy_Folds': [0.900377358490566, 0.8984905660377358, 0.8947169811320754, 0.897774424745379, 0.899660505469634], 'CV_Train_Mean_Accuracy': 0.898203967175078, 'CV_Train_Mean_F1': 0.9011139124699772, 'CV_Train_Mean_Precision': 0.8889512193775474, 'CV_Train_Mean_Recall': 0.9136448350337224, 'CV_Train_Mean_ROC_AUC': 0.9656236439870927, 'CV_Val_Accuracy_Folds': [0.8582202111613876, 0.8476621417797888, 0.8567119155354449, 0.8640483383685801, 0.8670694864048338], 'CV_Val_Mean_Accuracy': 0.8587424186500071, 'CV_Val_Mean_F1': 0.8625258895211136, 'CV_Val_Mean_Precision': 0.8524387748985521, 'CV_Val_Mean_Recall': 0.8733891479440439, 'CV_Val_Mean_ROC_AUC': 0.9297608665752103, 'Test_Accuracy': 0.8624849215922799, 'Test_F1': 0.8652482269503546, 'Test_Precision': 0.8611764705882353, 'Test_Recall': 0.8693586698337292, 'Test_ROC_AUC': 0.8623754133482371}
Processing fingerprint DataFrame: df_pubchem.csv
Training svm on pubchem...
...
{'Fingerprint': 'estate', 'Classifier': 'svm', 'CV_Train_Accuracy_Folds': [0.8445283018867924, 0.8456603773584905, 0.8456603773584905, 0.8434552998868352, 0.8427008675971331], 'CV_Train_Mean_Accuracy': 0.8444010448175485, 'CV_Train_Mean_F1': 0.8491063199581148, 'CV_Train_Mean_Precision': 0.8363435734002707, 'CV_Train_Mean_Recall': 0.8623672508934638, 'CV_Train_Mean_ROC_AUC': 0.910786206720237, 'CV_Val_Accuracy_Folds': [0.8190045248868778, 0.8129713423831071, 0.8220211161387632, 0.8338368580060423, 0.8413897280966768], 'CV_Val_Mean_Accuracy': 0.8258447139022935, 'CV_Val_Mean_F1': 0.8310297309667247, 'CV_Val_Mean_Precision': 0.8187659474943778, 'CV_Val_Mean_Recall': 0.8442631058358062, 'CV_Val_Mean_ROC_AUC': 0.8900963478741246, 'Test_Accuracy': 0.8250904704463209, 'Test_F1': 0.8288075560802833, 'Test_Precision': 0.823943661971831, 'Test_Recall': 0.833729216152019, 'Test_ROC_AUC': 0.8249528433701271}
Training lr on estate...
...
{'Fingerprint': 'estate', 'Classifier': 'rf', 'CV_Train_Accuracy_Folds': [0.8441509433962264, 0.8430188679245283, 0.8445283018867924, 0.8442097321765372, 0.8415692191625802], 'CV_Train_Mean_Accuracy': 0.8434954129093329, 'CV_Train_Mean_F1': 0.8463906616748181, 'CV_Train_Mean_Precision': 0.84355493973274, 'CV_Train_Mean_Recall': 0.8492866099195192, 'CV_Train_Mean_ROC_AUC': 0.9100098114718123, 'CV_Val_Accuracy_Folds': [0.8295625942684767, 0.8205128205128205, 0.8250377073906485, 0.8308157099697885, 0.8413897280966768], 'CV_Val_Mean_Accuracy': 0.8294637120476823, 'CV_Val_Mean_F1': 0.8325258329926198, 'CV_Val_Mean_Precision': 0.8301526920615684, 'CV_Val_Mean_Recall': 0.8353415995478309, 'CV_Val_Mean_ROC_AUC': 0.8887338753054765, 'Test_Accuracy': 0.8238841978287093, 'Test_F1': 0.8266033254156769, 'Test_Precision': 0.8266033254156769, 'Test_Recall': 0.8266033254156769, 'Test_ROC_AUC': 0.823840878394113}
Training xgb on estate...




{'Fingerprint': 'estate', 'Classifier': 'xgb', 'CV_Train_Accuracy_Folds': [0.8464150943396226, 0.8441509433962264, 0.8452830188679246, 0.8423236514522822, 0.8445869483213881], 'CV_Train_Mean_Accuracy': 0.844551931275489, 'CV_Train_Mean_F1': 0.8489161228348487, 'CV_Train_Mean_Precision': 0.8379212152805602, 'CV_Train_Mean_Recall': 0.8602865712533901, 'CV_Train_Mean_ROC_AUC': 0.9122977589798025, 'CV_Val_Accuracy_Folds': [0.8114630467571644, 0.8129713423831071, 0.8174962292609351, 0.823262839879154, 0.8459214501510574], 'CV_Val_Mean_Accuracy': 0.8222229816862836, 'CV_Val_Mean_F1': 0.8269006427214098, 'CV_Val_Mean_Precision': 0.8174379798979412, 'CV_Val_Mean_Recall': 0.8371290801186945, 'CV_Val_Mean_ROC_AUC': 0.8888731092746094, 'Test_Accuracy': 0.8250904704463209, 'Test_F1': 0.8300117233294255, 'Test_Precision': 0.8194444444444444, 'Test_Recall': 0.8408551068883611, 'Test_ROC_AUC': 0.824839318150063}
Processing fingerprint DataFrame: df_hybridization.csv
Training svm on hybridization...
...
{'Fingerprint': 'cdk-substructure', 'Classifier': 'svm', 'CV_Train_Accuracy_Folds': [0.8716981132075472, 0.870188679245283, 0.8732075471698113, 0.8713692946058091, 0.8664654847227461], 'CV_Train_Mean_Accuracy': 0.8705858237902392, 'CV_Train_Mean_F1': 0.8747915378796233, 'CV_Train_Mean_Precision': 0.8597097414335592, 'CV_Train_Mean_Recall': 0.8904594088501245, 'CV_Train_Mean_ROC_AUC': 0.9331504681690153, 'CV_Val_Accuracy_Folds': [0.8552036199095022, 0.8461538461538461, 0.8491704374057315, 0.8549848942598187, 0.851963746223565], 'CV_Val_Mean_Accuracy': 0.8514953087904926, 'CV_Val_Mean_F1': 0.8568246554605597, 'CV_Val_Mean_Precision': 0.8393308622056219, 'CV_Val_Mean_Recall': 0.8751536668079695, 'CV_Val_Mean_ROC_AUC': 0.9119140420015974, 'Test_Accuracy': 0.8636911942098915, 'Test_F1': 0.8665879574970484, 'Test_Precision': 0.8615023474178404, 'Test_Recall': 0.8717339667458432, 'Test_ROC_AUC': 0.8635630618042941}
Training lr on cdk-substructure...
...
{'Fingerprint': 'cdk-substructure', 'Classifier': 'rf', 'CV_Train_Accuracy_Folds': [0.8667924528301887, 0.8649056603773585, 0.8641509433962264, 0.8649566201433422, 0.8623161071293851], 'CV_Train_Mean_Accuracy': 0.8646243567753003, 'CV_Train_Mean_F1': 0.8669437845559745, 'CV_Train_Mean_Precision': 0.8652731891330193, 'CV_Train_Mean_Recall': 0.8687592039196407, 'CV_Train_Mean_ROC_AUC': 0.9313512308190788, 'CV_Val_Accuracy_Folds': [0.8491704374057315, 0.8597285067873304, 0.832579185520362, 0.8489425981873112, 0.8383685800604229], 'CV_Val_Mean_Accuracy': 0.8457578615922318, 'CV_Val_Mean_F1': 0.8485030478221995, 'CV_Val_Mean_Precision': 0.8465110845658337, 'CV_Val_Mean_Recall': 0.8507753991804436, 'CV_Val_Mean_ROC_AUC': 0.9110564771006905, 'Test_Accuracy': 0.8480096501809409, 'Test_F1': 0.8478260869565217, 'Test_Precision': 0.8624078624078624, 'Test_Recall': 0.833729216152019, 'Test_ROC_AUC': 0.8482371570956173}
Training xgb on cdk-substructure...




{'Fingerprint': 'cdk-substructure', 'Classifier': 'xgb', 'CV_Train_Accuracy_Folds': [0.8686792452830189, 0.8637735849056604, 0.8679245283018868, 0.8713692946058091, 0.8687287815918522], 'CV_Train_Mean_Accuracy': 0.8680950869376455, 'CV_Train_Mean_F1': 0.8712568645011588, 'CV_Train_Mean_Precision': 0.8635563372929091, 'CV_Train_Mean_Recall': 0.8791614973734652, 'CV_Train_Mean_ROC_AUC': 0.9379338718131571, 'CV_Val_Accuracy_Folds': [0.8506787330316742, 0.8401206636500754, 0.8355957767722474, 0.8459214501510574, 0.8429003021148036], 'CV_Val_Mean_Accuracy': 0.8430433851439716, 'CV_Val_Mean_F1': 0.8464330990219755, 'CV_Val_Mean_Precision': 0.841058734416172, 'CV_Val_Mean_Recall': 0.8519694079412181, 'CV_Val_Mean_ROC_AUC': 0.9117531610234471, 'Test_Accuracy': 0.8480096501809409, 'Test_F1': 0.8485576923076923, 'Test_Precision': 0.8588807785888077, 'Test_Recall': 0.838479809976247, 'Test_ROC_AUC': 0.8481614736155745}
Processing fingerprint DataFrame: df_circular.csv
Training svm on circular...
...
{'Fingerprint': 'rdk-maccs', 'Classifier': 'svm', 'CV_Train_Accuracy_Folds': [0.8864150943396226, 0.8849056603773585, 0.8837735849056604, 0.8849490758204451, 0.8857035081101471], 'CV_Train_Mean_Accuracy': 0.8851493847106469, 'CV_Train_Mean_F1': 0.8903606023445473, 'CV_Train_Mean_Precision': 0.8638599606536583, 'CV_Train_Mean_Recall': 0.9185497992123157, 'CV_Train_Mean_ROC_AUC': 0.9534905543539838, 'CV_Val_Accuracy_Folds': [0.8778280542986425, 0.861236802413273, 0.8597285067873304, 0.8610271903323263, 0.8625377643504532], 'CV_Val_Mean_Accuracy': 0.864471663636405, 'CV_Val_Mean_F1': 0.8707655938715158, 'CV_Val_Mean_Precision': 0.8445306135125437, 'CV_Val_Mean_Recall': 0.8989349300551082, 'CV_Val_Mean_ROC_AUC': 0.9242998260726246, 'Test_Accuracy': 0.8697225572979493, 'Test_F1': 0.8752886836027713, 'Test_Precision': 0.851685393258427, 'Test_Recall': 0.9002375296912114, 'Test_ROC_AUC': 0.8692364119044294}
Training lr on rdk-maccs...




{'Fingerprint': 'rdk-maccs', 'Classifier': 'lr', 'CV_Train_Accuracy_Folds': [0.8516981132075472, 0.8550943396226415, 0.849811320754717, 0.8517540550735572, 0.8487363259147491], 'CV_Train_Mean_Accuracy': 0.8514188309146423, 'CV_Train_Mean_F1': 0.8563500189055123, 'CV_Train_Mean_Precision': 0.8409555344737694, 'CV_Train_Mean_Recall': 0.8723255467114457, 'CV_Train_Mean_ROC_AUC': 0.9218926584006322, 'CV_Val_Accuracy_Folds': [0.8431372549019608, 0.8446455505279035, 0.8340874811463047, 0.8293051359516617, 0.8368580060422961], 'CV_Val_Mean_Accuracy': 0.8376066857140254, 'CV_Val_Mean_F1': 0.8423204546965535, 'CV_Val_Mean_Precision': 0.8306517811553104, 'CV_Val_Mean_Recall': 0.854931468136216, 'CV_Val_Mean_ROC_AUC': 0.9003535530510008, 'Test_Accuracy': 0.827503015681544, 'Test_F1': 0.8315665488810365, 'Test_Precision': 0.8247663551401869, 'Test_Recall': 0.838479809976247, 'Test_ROC_AUC': 0.8273281402822411}
Training rf on rdk-maccs...




{'Fingerprint': 'rdk-maccs', 'Classifier': 'rf', 'CV_Train_Accuracy_Folds': [0.8924528301886793, 0.8932075471698113, 0.8879245283018868, 0.8932478310071671, 0.8875895888344021], 'CV_Train_Mean_Accuracy': 0.8908844651003894, 'CV_Train_Mean_F1': 0.8946689143192627, 'CV_Train_Mean_Precision': 0.8773054601065515, 'CV_Train_Mean_Recall': 0.9127534150477526, 'CV_Train_Mean_ROC_AUC': 0.9627542566168603, 'CV_Val_Accuracy_Folds': [0.8717948717948718, 0.8521870286576169, 0.8567119155354449, 0.8549848942598187, 0.8625377643504532], 'CV_Val_Mean_Accuracy': 0.8596432949196411, 'CV_Val_Mean_F1': 0.864318590701663, 'CV_Val_Mean_Precision': 0.8490853384082249, 'CV_Val_Mean_Recall': 0.8805055108096651, 'CV_Val_Mean_ROC_AUC': 0.9290174759157521, 'Test_Accuracy': 0.8685162846803377, 'Test_F1': 0.8716136631330977, 'Test_Precision': 0.8644859813084113, 'Test_Recall': 0.8788598574821853, 'Test_ROC_AUC': 0.8683514973685437}
Training xgb on rdk-maccs...




{'Fingerprint': 'rdk-maccs', 'Classifier': 'xgb', 'CV_Train_Accuracy_Folds': [0.8954716981132076, 0.8977358490566038, 0.8962264150943396, 0.8989060731799321, 0.9004149377593361], 'CV_Train_Mean_Accuracy': 0.8977509946406839, 'CV_Train_Mean_F1': 0.9009071466198695, 'CV_Train_Mean_Precision': 0.8867301556667864, 'CV_Train_Mean_Recall': 0.9155763738904202, 'CV_Train_Mean_ROC_AUC': 0.9657812314698792, 'CV_Val_Accuracy_Folds': [0.8642533936651584, 0.8642533936651584, 0.8582202111613876, 0.8625377643504532, 0.8625377643504532], 'CV_Val_Mean_Accuracy': 0.8623605054385222, 'CV_Val_Mean_F1': 0.866020513323531, 'CV_Val_Mean_Precision': 0.8564042304886147, 'CV_Val_Mean_Recall': 0.8763529744241911, 'CV_Val_Mean_ROC_AUC': 0.9315583068439937, 'Test_Accuracy': 0.864897466827503, 'Test_F1': 0.8679245283018868, 'Test_Precision': 0.8618266978922716, 'Test_Recall': 0.8741092636579573, 'Test_ROC_AUC': 0.8647507102603511}
Processing fingerprint DataFrame: df_topological-torsion.csv
...
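
In the printed results, `Test_ROC_AUC` sits next to `Test_Accuracy` and well below the cross-validated ROC AUC. A likely reason is that the test score is computed from hard class predictions, while the `roc_auc` scorer used in cross-validation ranks continuous scores. With only 0/1 predictions, ROC AUC collapses to the mean of sensitivity and specificity (balanced accuracy). The plain-Python sketch below demonstrates this. The confusion-matrix counts are an assumption: they were inferred so as to reproduce the reported maccs/svm test metrics and are not printed in the notebook. `pairwise_auc` is an illustrative helper.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed counts, chosen to reproduce the reported maccs/svm test metrics.
tp, fn, tn, fp = 373, 48, 348, 60

y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp   # hard 0/1 predictions

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

auc_from_labels = pairwise_auc(y_true, y_pred)

print(f"recall            = {sensitivity:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
print(f"AUC from labels   = {auc_from_labels:.4f}")
```

For a ranking-based test AUC that is comparable with the cross-validated one, the class-1 probabilities, for example `pipeline.predict_proba(X_test)[:, 1]`, can be passed to `roc_auc_score` instead of the predicted labels.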

In [7]:
results_df.to_csv(f'{kon.path_files}/rezultati_modelov/results_df_new3.csv', index=False)
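
Once the results are saved, the combinations are best compared on the cross-validation metrics, so that the held-out test set stays reserved for the final assessment. The sketch below ranks a few of the rows visible in the truncated output by validation ROC AUC and reports the gap between training and validation accuracy. The values are rounded copies of the printed numbers and cover only a subset of the combinations. The short keys (`train_acc`, `val_acc`, `val_auc`, `gap`) are illustrative stand-ins for the longer column names.

```python
# Rounded values copied from the printed results above (subset only).
rows = [
    {"Fingerprint": "standard", "Classifier": "svm",
     "train_acc": 0.8985, "val_acc": 0.8624, "val_auc": 0.9308},
    {"Fingerprint": "maccs", "Classifier": "xgb",
     "train_acc": 0.8982, "val_acc": 0.8587, "val_auc": 0.9298},
    {"Fingerprint": "estate", "Classifier": "rf",
     "train_acc": 0.8435, "val_acc": 0.8295, "val_auc": 0.8887},
    {"Fingerprint": "rdk-maccs", "Classifier": "xgb",
     "train_acc": 0.8978, "val_acc": 0.8624, "val_auc": 0.9316},
]

for row in rows:
    # A large gap between training and validation accuracy hints at overfitting.
    row["gap"] = round(row["train_acc"] - row["val_acc"], 4)

# Rank by cross-validated ROC AUC, best first.
ranked = sorted(rows, key=lambda r: r["val_auc"], reverse=True)

for row in ranked:
    print(f'{row["Fingerprint"]:<10} {row["Classifier"]:<4} '
          f'val_auc={row["val_auc"]:.4f} gap={row["gap"]:.4f}')
```

On the full table the same ranking can be obtained with `results_df.sort_values('CV_Val_Mean_ROC_AUC', ascending=False)`.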