<a id="content"></a>
## Notebook content

This notebook contains a scikit-learn representation of an AutoAI pipeline. It introduces commands for retrieving data, training the model, and testing the model.

Some familiarity with Python is helpful. This notebook uses Python 3.11 and scikit-learn 1.3.

# Imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, StratifiedKFold

In [3]:
def read_csv_file(file_path):
    """
    Reads a CSV file and returns its content as a pandas DataFrame.
    
    Parameters:
    file_path (str): The path to the CSV file.
    
    Returns:
    pd.DataFrame: DataFrame containing the CSV data.
    """
    try:
        df = pd.read_csv(file_path)
        return df
    except (OSError, ValueError) as e:
        # OSError covers a missing or unreadable file; ValueError covers parsing errors
        print(f"Error reading CSV file: {e}")
        return None

<a id="read"></a>
## Read training data



In [4]:
train_data = pd.read_csv('data/train_data_80.csv')

fingerprints_train = train_data.iloc[:, 200:-1]
labels_train = train_data.iloc[:, -1]

test_data = pd.read_csv('data/train_data_20.csv')
fingerprints_test = test_data.iloc[:, 200:-1]
labels_test = test_data.iloc[:, -1]
smiles_test = test_data.iloc[:, 0]

print(f"Train data shape: {fingerprints_train.shape}, Labels shape: {labels_train.shape}")
print(f"Test data shape: {fingerprints_test.shape}, Labels shape: {labels_test.shape}")


Train data shape: (7532, 4096), Labels shape: (7532,)
Test data shape: (1883, 4096), Labels shape: (1883,)
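
The `iloc[:, 200:-1]` slices above rely on a fixed column layout: the SMILES string in the first column, the feature columns starting at position 200, and the label in the last column. As a minimal sketch, the same slicing can be tried on a toy frame. The column names and the `n_skipped` value below are hypothetical and stand in for the real layout.

```python
import pandas as pd

# Toy frame with a hypothetical layout: identifier, one skipped column,
# three feature columns, then the label in the last column.
toy = pd.DataFrame({
    "smiles": ["CCO", "CCN", "CCC"],
    "descriptor": [0.1, 0.2, 0.3],
    "bit_0": [1, 0, 1],
    "bit_1": [0, 0, 1],
    "bit_2": [1, 1, 0],
    "label": [1, 0, 1],
})

n_skipped = 2  # plays the role of 200 in the notebook (assumption for this toy example)

features = toy.iloc[:, n_skipped:-1]  # columns between the skipped block and the label
labels = toy.iloc[:, -1]              # last column
smiles = toy.iloc[:, 0]               # first column

print(features.columns.tolist())
print(features.shape, labels.shape)
```

If the number of leading columns ever changes, selecting the feature columns by name is less fragile than relying on a hard-coded position.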


In [5]:
X = pd.concat([fingerprints_train, fingerprints_test])
y = pd.concat([labels_train, labels_test])
X_train, X_test, y_train, y_test = fingerprints_train, fingerprints_test, labels_train, labels_test

<a id="preview_model_to_python_code"></a>
## Create pipeline
The next cell contains the scikit-learn definition of the selected AutoAI pipeline.

#### Import statements.

#### Pre-processing & Estimator.

In [None]:
model = RandomForestClassifier(
    max_depth=None,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=225,
)
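
The cell above defines only the estimator, although `make_pipeline` and `SimpleImputer` are imported. If a preprocessing step were needed, one possible way to chain it with the classifier is sketched below. This is an illustration on synthetic data, not the AutoAI pipeline itself; the imputation strategy, `n_estimators=25`, and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic binary features: all 16 four-bit patterns (illustrative only).
X_toy = np.array([[(i >> b) & 1 for b in range(4)] for i in range(16)], dtype=float)
y_toy = X_toy[:, 0].astype(int)  # label equals the first bit

# Introduce two missing values so the imputer has something to do.
X_toy[3, 1] = np.nan
X_toy[10, 2] = np.nan

# Impute missing entries with the most frequent value, then classify.
pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    RandomForestClassifier(n_estimators=25, random_state=0),
)
pipeline.fit(X_toy, y_toy)
preds = pipeline.predict(X_toy)

print(list(pipeline.named_steps))
print(preds.shape)
```

Wrapping the steps in one pipeline means the imputer is fitted only on the training data passed to `fit`, which matters once cross-validation is used.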

<a id="train"></a>
## Train pipeline model


In [None]:
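def cohen_kappa(annotator1, annotator2):
    # Build the confusion matrix; map each distinct label to a row/column index
    # so that labels starting at 0 (or non-contiguous labels) are handled correctly
    labels = np.unique(np.concatenate([annotator1, annotator2]))
    index = {label: i for i, label in enumerate(labels)}
    confusion_matrix = np.zeros((len(labels), len(labels)))
    for a1, a2 in zip(annotator1, annotator2):
        confusion_matrix[index[a1], index[a2]] += 1

    # Compute p_o (observed agreement)
    po = np.trace(confusion_matrix) / len(annotator1)

    # Compute p_e (agreement expected by chance)
    row_sums = np.sum(confusion_matrix, axis=1)
    col_sums = np.sum(confusion_matrix, axis=0)
    pe = np.sum(row_sums * col_sums) / (len(annotator1) ** 2)

    # Compute kappa
    kappa = (po - pe) / (1 - pe)

    return kappa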
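
Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). As a sanity check, a hand-rolled computation can be compared against `sklearn.metrics.cohen_kappa_score`. The sketch below uses a small made-up label pair and a helper named `kappa_from_counts`, both hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def kappa_from_counts(y_a, y_b):
    """Cohen's kappa from a confusion matrix of raw counts."""
    labels = np.unique(np.concatenate([y_a, y_b]))
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for a, b in zip(y_a, y_b):
        counts[index[a], index[b]] += 1
    n = counts.sum()
    p_o = np.trace(counts) / n                                    # observed agreement
    p_e = np.sum(counts.sum(axis=1) * counts.sum(axis=0)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: 6 of 8 agree, and both raters use each class 4 times.
y_a = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_b = np.array([0, 1, 1, 1, 0, 0, 1, 0])

manual = kappa_from_counts(y_a, y_b)     # p_o = 0.75, p_e = 0.5, so kappa = 0.5
reference = cohen_kappa_score(y_a, y_b)
print(manual, reference)
```

Here p_o = 6/8 and p_e = (4·4 + 4·4)/64 = 0.5, so both computations give kappa = 0.5.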


In [8]:
model.fit(X_train, y_train)

In [9]:
y_pred = model.predict(X_test)
kappa_score = cohen_kappa(y_test.values, y_pred)
print(f"Cohen's Kappa Score: {kappa_score}")
print(f"Accuracy: {model.score(X_test, y_test)}")


Cohen's Kappa Score: 0.6455171807398956
Accuracy: 0.822623473181094




<a id="saving"></a>
## Store the model

In this section you will learn how to store the trained model.

In [10]:
from joblib import dump

# Save the trained model
dump(model, 'models/rf_model.joblib')

['models/rf_model.joblib']
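
A stored model can be read back with `joblib.load`. The sketch below shows a round trip on a small synthetic classifier written to a temporary directory; the toy data and file location are assumptions made for illustration. Note that `dump` does not create missing directories, so a target folder such as `models/` must exist beforehand.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (illustrative only): label equals the first feature.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_toy = X_toy[:, 0]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.joblib")
    dump(clf, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```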

Save the test-set predictions and predicted probabilities to CSV files.

In [14]:
def save_predictions(smiles_list, y_pred, name):
    """
    Save predicted labels to predictions/<name>.csv.
    
    Parameters:
    smiles_list (list): List of SMILES strings.
    y_pred (list): List of predicted labels.
    name (str): Output file name without extension; also used as the column name.
    """
    if len(smiles_list) != len(y_pred):
        raise ValueError("Length of SMILES list and predictions must match.")
    
    output_path = 'predictions/' + name + '.csv'
    
    # Create a DataFrame and save it to CSV
    df = pd.DataFrame({
        "smiles": smiles_list,
        name: y_pred
    })
    df.to_csv(output_path, index=False)
    print(f"✅ Predictions saved to: {output_path}")

def save_probas(smiles_list, y_proba, name):
    """
    Save predicted probabilities to predictions/<name>.csv.
    
    Parameters:
    smiles_list (list): List of SMILES strings.
    y_proba (list): List of predicted probabilities.
    name (str): Output file name without extension; also used as the column name.
    """
    if len(smiles_list) != len(y_proba):
        raise ValueError("Length of SMILES list and probabilities must match.")
    
    output_path = 'predictions/' + name + '.csv'

    # Create a DataFrame and save it to CSV
    df = pd.DataFrame({
        "smiles": smiles_list,
        name: y_proba
    })
    df.to_csv(output_path, index=False)
    print(f"✅ Predictions saved to: {output_path}")


print(smiles_test.shape)
print(y_pred.shape)


save_predictions(smiles_test, y_pred, 'rf_pred')
y_proba = model.predict_proba(X_test)
save_probas(smiles_test, y_proba[:, 1], 'rf_proba')

(1883,)
(1883,)
✅ Predictions saved to: predictions/rf_pred.csv


IndexError: list index out of range

In [None]:
# Train the random forest classifier, then evaluate it with cross-validation

model.fit(X.values, y.values)

# Define stratified cross-validation
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Run stratified cross-validation
scores = cross_val_score(model, X.values, y.values, cv=stratified_kfold)
# print(f"Cross-Validation Scores: {scores}")
print(f"Mean Cross-Validation Score: {np.mean(scores)}")
print(f"Standard Deviation of Cross-Validation Scores: {np.std(scores)}")
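
Stratified k-fold keeps the class proportions roughly equal in every fold, which matters when the classes are imbalanced. The sketch below illustrates this on synthetic data with an 80/20 class split; the data, the 5 folds, and `n_estimators=25` are assumptions chosen to keep the example small. `cross_val_score` fits clones of the estimator, so the scores do not depend on any earlier call to `fit`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data: 80 negatives, 20 positives (illustrative only).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([0] * 80 + [1] * 20)
X_toy[y_toy == 1] += 1.0  # shift the positive class so it is learnable

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count positives in each test fold: stratification spreads them evenly.
fold_positive_counts = [int(y_toy[test_idx].sum()) for _, test_idx in cv.split(X_toy, y_toy)]
print(fold_positive_counts)

# cross_val_score clones the estimator for each fold; clf itself stays unfitted.
clf = RandomForestClassifier(n_estimators=25, random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv)
print(scores.mean(), scores.std())
```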

# TEST 1

In [13]:
X_test_1_raw = pd.read_csv('data/test_1.csv')
X_test_1 = X_test_1_raw.iloc[:, 200:]
y_proba_1 = model.predict_proba(X_test_1)

print(X_test_1.shape)
print(y_proba_1.shape)

save_probas(X_test_1_raw.iloc[:, 0].tolist(), y_proba_1[:, 1], 'rf_proba_test_1')

IndexError: list index out of range