<a id="content"></a>
## Notebook content

This notebook contains a scikit-learn representation of an AutoAI pipeline. It introduces commands for retrieving data, training the model, and testing the model.

Some familiarity with Python is helpful. This notebook uses Python 3.11 and scikit-learn 1.3.

## Imports

The following cell imports the libraries used throughout this notebook.

In [2]:
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
import lightgbm as lgb


from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, StratifiedKFold

In [3]:
def read_csv_file(file_path):
    """
    Reads a CSV file and returns its content as a pandas DataFrame.
    
    Parameters:
    file_path (str): The path to the CSV file.
    
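    Returns:
    pd.DataFrame: DataFrame containing the CSV data, or None if reading fails.
    """
    try:
        df = pd.read_csv(file_path)
        return df
    except (OSError, ValueError) as e:
        # OSError covers missing or unreadable files; ValueError covers
        # parsing problems (pandas parser errors subclass ValueError)
        print(f"Error reading CSV file: {e}")
        return None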

<a id="read"></a>
## Read training data



In [4]:
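# Training split (80% of the labelled data)
train_data = pd.read_csv('data/train_data_80.csv')

# Columns from index 200 up to (but excluding) the last one hold the fingerprint bits;
# the last column is the label
fingerprints_train = train_data.iloc[:, 200:-1]
labels_train = train_data.iloc[:, -1]

# Held-out split (remaining 20%), sliced the same way
test_data = pd.read_csv('data/train_data_20.csv')
fingerprints_test = test_data.iloc[:, 200:-1]
labels_test = test_data.iloc[:, -1]
# The first column holds the SMILES string of each molecule
smiles_test = test_data.iloc[:, 0]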

print(f"Train data shape: {fingerprints_train.shape}, Labels shape: {labels_train.shape}")
print(f"Test data shape: {fingerprints_test.shape}, Labels shape: {labels_test.shape}")


Train data shape: (7532, 4096), Labels shape: (7532,)
Test data shape: (1883, 4096), Labels shape: (1883,)
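
The positional slicing used above can be illustrated on a toy frame. This is only a sketch: the column names (`smiles`, `desc_*`, `fp_*`, `label`) and the offset of 3 are hypothetical stand-ins for the real file, where the fingerprint block starts at column 200.

```python
import pandas as pd

# Toy stand-in for the real CSV: 1 identifier column, 2 descriptor columns,
# 3 fingerprint columns and a final label column (all names are hypothetical).
toy = pd.DataFrame({
    "smiles": ["CCO", "CCN", "c1ccccc1"],
    "desc_0": [0.1, 0.2, 0.3],
    "desc_1": [1.0, 2.0, 3.0],
    "fp_0": [0, 1, 0],
    "fp_1": [1, 1, 0],
    "fp_2": [0, 0, 1],
    "label": [1, 0, 1],
})

first_fp_col = 3  # plays the role of 200 in the notebook

fingerprints = toy.iloc[:, first_fp_col:-1]  # fingerprint block, label excluded
labels = toy.iloc[:, -1]                     # last column
smiles = toy.iloc[:, 0]                      # first column

print(list(fingerprints.columns))
print(fingerprints.shape, labels.shape)
```

Because the slice is positional, it silently selects the wrong columns if the file layout changes, so printing the selected column names once is a cheap check.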


In [5]:
X = pd.concat([fingerprints_train, fingerprints_test])
y = pd.concat([labels_train, labels_test])
X_train, X_test, y_train, y_test = fingerprints_train, fingerprints_test, labels_train, labels_test

<a id="preview_model_to_python_code"></a>
## Create pipeline
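The next cell defines the estimator through LightGBM's scikit-learn API. The commented-out block keeps an earlier hyperparameter set for reference; the active `LGBMClassifier` below it is the one that is trained.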

#### Pre-processing & Estimator.

In [6]:
# lgbm_classifier = lgb.LGBMClassifier(
#     class_weight="balanced",
#     colsample_bytree=0.5185878473906118,
#     learning_rate=0.2819183874082652,
#     min_child_samples=16,
#     min_child_weight=0.007076389406189793,
#     n_estimators=992,
#     num_leaves=64,
#     random_state=33,
#     reg_alpha=0.9288233666023282,
#     reg_lambda=0.551821503605106,
#     subsample=0.9955253005922045,
#     subsample_freq=1,
#     force_row_wise=True
# )

lgbm_classifier = lgb.LGBMClassifier(
    class_weight="balanced",
    learning_rate=0.05,
    n_estimators=500,
    num_leaves=64,
    min_child_samples=20,
    reg_alpha=0.1,
    reg_lambda=0.1,
    colsample_bytree=0.6,
    subsample=0.8,
    random_state=33
)

model = lgbm_classifier
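
To make the effect of `class_weight="balanced"` concrete, the sketch below applies scikit-learn's documented "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the class counts that LightGBM reports when the model is fitted on the training split (3849 positive, 3683 negative). The variable names are illustrative only.

```python
# Class counts taken from the LightGBM training log in this notebook
n_pos, n_neg = 3849, 3683
n_samples = n_pos + n_neg
n_classes = 2

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
w_pos = n_samples / (n_classes * n_pos)
w_neg = n_samples / (n_classes * n_neg)

# After weighting, both classes carry the same total weight
weighted_pos = w_pos * n_pos
weighted_neg = w_neg * n_neg
weighted_pos_fraction = weighted_pos / (weighted_pos + weighted_neg)

print(f"w_pos={w_pos:.4f}, w_neg={w_neg:.4f}")
print(f"weighted positive fraction={weighted_pos_fraction:.6f}")
```

The slightly more frequent positive class gets a weight just below 1 and the negative class a weight just above 1, so the weighted positive fraction is one half. That appears consistent with the `pavg=0.500000` line in the training log.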

<a id="train"></a>
## Train pipeline model


In [7]:
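def cohen_kappa(annotator1, annotator2):
    # Map every label that occurs to a row/column index, so that labels
    # starting at 0 (or non-consecutive labels) are handled correctly
    annotator1 = np.asarray(annotator1)
    annotator2 = np.asarray(annotator2)
    labels = np.unique(np.concatenate([annotator1, annotator2]))
    index = {label: i for i, label in enumerate(labels)}

    # Build the confusion matrix
    confusion_matrix = np.zeros((len(labels), len(labels)))
    for a1, a2 in zip(annotator1, annotator2):
        confusion_matrix[index[a1], index[a2]] += 1

    # Compute the observed agreement p_o
    po = np.trace(confusion_matrix) / len(annotator1)

    # Compute the agreement expected by chance p_e
    row_sums = np.sum(confusion_matrix, axis=1)
    col_sums = np.sum(confusion_matrix, axis=0)
    pe = np.sum(row_sums * col_sums) / (len(annotator1) ** 2)

    # Compute kappa
    kappa = (po - pe) / (1 - pe)

    return kappa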
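
As a small worked example of the kappa computation above, the following sketch evaluates the same three quantities (observed agreement, chance agreement, kappa) on four hand-picked binary labels. The arrays are made up for illustration and the code is a standalone re-derivation, not the notebook's function.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

# 2x2 confusion matrix: rows are true labels, columns are predicted labels
cm = np.zeros((2, 2))
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

n = cm.sum()
po = np.trace(cm) / n                                   # 3 of 4 agree -> 0.75
pe = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / n ** 2   # (2*1 + 2*3) / 16 -> 0.5
kappa = (po - pe) / (1 - pe)

print(po, pe, kappa)  # 0.75 0.5 0.5
```

`sklearn.metrics.cohen_kappa_score` computes the same statistic and can serve as an independent cross-check on real predictions.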




In [8]:
# model.fit(X_train, y_train)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="binary_logloss", callbacks=[lgb.log_evaluation(10)])


[LightGBM] [Info] Number of positive: 3849, number of negative: 3683
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036954 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9339
[LightGBM] [Info] Number of data points in the train set: 7532, number of used features: 3176
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
[10]	valid_0's binary_logloss: 0.595467
[20]	valid_0's binary_logloss: 0.542698
[30]	valid_0's binary_logloss: 0.505784
[40]	valid_0's binary_logloss: 0.480216
[50]	valid_0's binary_logloss: 0.461922
[60]	valid_0's binary_logloss: 0.446245
[70]	valid_0's binary_logloss: 0.435077
[80]	valid_0's binary_logloss: 0.426003
[90]	valid_0's binary_logloss: 0.418777
[100]	valid_0's binary_logloss: 0.413124
[110]	valid_0's binary_logloss: 0.408388
[12

In [9]:
y_pred = model.predict(X_test)
kappa_score = cohen_kappa(y_test.values, y_pred)
print(f"Cohen's Kappa Score: {kappa_score}")
print(f"Accuracy: {model.score(X_test, y_test)}")


Cohen's Kappa Score: 0.6591478513011405
Accuracy: 0.8295273499734467




Cohen's Kappa Score: 0.6598404429480795
Accuracy: 0.8300584174190122
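
The two metrics are linked through kappa = (p_o - p_e) / (1 - p_e), where p_o is the accuracy. As a rough sanity check, the sketch below takes the first pair of values printed above and backs out the number of correct predictions and the chance agreement p_e. The variable names are illustrative.

```python
n_test = 1883                      # rows in the held-out set
accuracy = 0.8295273499734467      # p_o, first accuracy value printed above
kappa = 0.6591478513011405         # first kappa value printed above

# Number of correctly classified test rows implied by the accuracy
n_correct = round(accuracy * n_test)

# Rearranging kappa = (p_o - p_e) / (1 - p_e) for p_e
pe = (accuracy - kappa) / (1 - kappa)

print(n_correct)
print(round(pe, 3))
```

A chance agreement close to 0.5 is what one would expect for a nearly balanced binary problem, in which case kappa is approximately `2 * accuracy - 1`.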

<a id="saving"></a>
## Store the model

In this section you will learn how to store the trained model.

In [10]:
from joblib import dump

# Save the trained model
dump(model, 'models/lgbm_model.joblib')

['models/lgbm_model.joblib']
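
A stored model can be read back with `joblib.load`. The sketch below shows the round trip on a small stand-in estimator; the `LogisticRegression` model, the toy data, and the temporary file name are assumptions made to keep the example self-contained, and in this notebook the object would be the fitted `LGBMClassifier`.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Small stand-in model; in the notebook this would be the fitted LGBMClassifier
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(demo_model, path)        # same call as in the cell above
    restored = load(path)         # returns the unpickled estimator

# The restored estimator should reproduce the original predictions
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Joblib files are pickle-based, so they should only be loaded from trusted sources, ideally with the same library versions that were used to save them.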

Save the predicted labels and class probabilities for the held-out set.

In [12]:
def save_predictions(smiles_list, y_pred, name):
    """
    Save predicted labels to predictions/<name>.csv.
    
    Parameters:
    smiles_list (list): List of SMILES strings.
    y_pred (list): List of predicted labels.
    name (str): Base name of the output file, also used as the prediction column name.
    """
    if len(smiles_list) != len(y_pred):
        raise ValueError("Length of SMILES list and predictions must match.")
    
    output_path='predictions/' + name + '.csv'
    
    # Create a DataFrame and save it to CSV 
    df = pd.DataFrame({
        "smiles": smiles_list,
        name : y_pred
    })
    df.to_csv(output_path, index=False)
    print(f"✅ Prédictions enregistrées dans : {output_path}")

def save_probas(smiles_list, y_proba, name):
    """
    Save predicted probabilities to predictions/<name>.csv.
    
    Parameters:
    smiles_list (list): List of SMILES strings.
    y_proba (list): List of predicted probabilities for the positive class.
    name (str): Base name of the output file, also used as the probability column name.
    """
    if len(smiles_list) != len(y_proba):
        raise ValueError("Length of SMILES list and predictions must match.")
    
    output_path='predictions/' + name + '.csv'

    # Create a DataFrame and save it to CSV 
    df = pd.DataFrame({
        "smiles": smiles_list,
        name : y_proba
    })
    df.to_csv(output_path, index=False)
    print(f"✅ Prédictions enregistrées dans : {output_path}")


print(smiles_test.shape)
print(y_pred.shape)



save_predictions(smiles_test, y_pred, 'lgbm_pred')
y_proba = model.predict_proba(X_test)
save_probas(smiles_test, y_proba[:, 1], 'lgbm_proba')

(1883,)
(1883,)
✅ Prédictions enregistrées dans : predictions/lgbm_pred.csv
✅ Prédictions enregistrées dans : predictions/lgbm_proba.csv
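
The probability file can later be turned back into hard labels by thresholding. The sketch below reads a few made-up rows in the same two-column layout (`smiles`, `lgbm_proba`) and applies an assumed cut-off of 0.5; the SMILES strings, probabilities, and the `label` column name are hypothetical.

```python
import io

import pandas as pd

# Hypothetical content in the same layout as predictions/lgbm_proba.csv
csv_text = "smiles,lgbm_proba\nCCO,0.91\nCCN,0.12\nc1ccccc1,0.50\n"
proba_df = pd.read_csv(io.StringIO(csv_text))

threshold = 0.5  # assumed cut-off; adjust to trade precision against recall
proba_df["label"] = (proba_df["lgbm_proba"] > threshold).astype(int)

print(proba_df["label"].tolist())  # [1, 0, 0]
```

Keeping the probabilities rather than only the labels makes it possible to revisit the threshold without rerunning the model.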


## Cross validation

In [13]:
# Refit the LightGBM classifier on the full dataset (train and held-out splits combined)
model.fit(X.values, y.values)

# Define the stratified cross-validation
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Run the stratified cross-validation (each fold trains a fresh clone of the model)
scores = cross_val_score(model, X.values, y.values, cv=stratified_kfold)
# print(f"Cross-Validation Scores: {scores}")
print(f"Mean Cross-Validation Score: {np.mean(scores)}")
print(f"Standard Deviation of Cross-Validation Scores: {np.std(scores)}")

[LightGBM] [Info] Number of positive: 4816, number of negative: 4599
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.242335 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9896
[LightGBM] [Info] Number of data points in the train set: 9415, number of used features: 3352
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Info] Number of positive: 4334, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.209029 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9629
[LightGBM] [Info] Number of data points in the train set: 8473, number of used features: 3281
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[Lig



[LightGBM] [Info] Number of positive: 4334, number of negative: 4139
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.087650 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9651
[LightGBM] [Info] Number of data points in the train set: 8473, number of used features: 3272
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000




[LightGBM] [Info] Number of positive: 4334, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.304693 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9634
[LightGBM] [Info] Number of data points in the train set: 8473, number of used features: 3277
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000




[LightGBM] [Info] Number of positive: 4334, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.310064 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9657
[LightGBM] [Info] Number of data points in the train set: 8473, number of used features: 3276
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000




[LightGBM] [Info] Number of positive: 4334, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.267469 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9603
[LightGBM] [Info] Number of data points in the train set: 8473, number of used features: 3272
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000




[LightGBM] [Info] Number of positive: 4334, number of negative: 4140
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.292099 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9620
[LightGBM] [Info] Number of data points in the train set: 8474, number of used features: 3276
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000




[LightGBM] [Info] Number of positive: 4335, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.069809 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9668
[LightGBM] [Info] Number of data points in the train set: 8474, number of used features: 3281
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000




[LightGBM] [Info] Number of positive: 4335, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.055479 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9651
[LightGBM] [Info] Number of data points in the train set: 8474, number of used features: 3285
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000




[LightGBM] [Info] Number of positive: 4335, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.047295 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9629
[LightGBM] [Info] Number of data points in the train set: 8474, number of used features: 3276
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000




[LightGBM] [Info] Number of positive: 4335, number of negative: 4139
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.066902 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9662
[LightGBM] [Info] Number of data points in the train set: 8474, number of used features: 3279
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Mean Cross-Validation Score: 0.8319746125434613
Standard Deviation of Cross-Validation Scores: 0.019552695933633048
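
The training-set sizes in the log above follow from simple arithmetic on the 10-fold split, and the reported standard deviation can be turned into a rough standard error of the mean score. The sketch below shows both; it is a back-of-the-envelope check with illustrative variable names, not a call into scikit-learn.

```python
import math

n_samples = 7532 + 1883   # train and held-out rows concatenated into X
n_splits = 10

# K-fold style splitting: the first (n_samples % n_splits) folds get one extra test row
base, extra = divmod(n_samples, n_splits)
test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
train_sizes = [n_samples - t for t in test_sizes]

# Standard error of the mean accuracy from the reported fold standard deviation
cv_mean = 0.8319746125434613
cv_std = 0.019552695933633048
std_error = cv_std / math.sqrt(n_splits)

print(sorted(set(train_sizes)))  # [8473, 8474]
print(round(std_error, 4))
```

The standard error is only indicative: `np.std` uses the population formula by default, and cross-validation folds share training data, so the fold scores are not independent.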




## TEST 1

In [14]:
X_test_1_raw = pd.read_csv('data/test_1.csv')
X_test_1 = X_test_1_raw.iloc[:, 200:]
y_proba_1 = model.predict_proba(X_test_1)

print(X_test_1.shape)
print(y_proba_1.shape)

save_probas(X_test_1_raw.iloc[:, 0].tolist(), y_proba_1[:, 1], 'lgbm_proba_test_1')

(750, 4096)
(750, 2)
✅ Prédictions enregistrées dans : predictions/lgbm_proba_test_1.csv
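
The test file is sliced with `iloc[:, 200:]` because it has no label column, which assumes its feature columns appear in the same order as in the training data. One way to guard against a layout mismatch is to align the columns by name before predicting. The sketch below uses two tiny hypothetical frames (`fp_0` to `fp_2`) to illustrate the idea.

```python
import pandas as pd

# Hypothetical training and test feature frames with the same columns in a different order
train_features = pd.DataFrame({"fp_0": [0, 1], "fp_1": [1, 0], "fp_2": [1, 1]})
test_features = pd.DataFrame({"fp_2": [0], "fp_0": [1], "fp_1": [1]})

missing = set(train_features.columns) - set(test_features.columns)
if missing:
    raise ValueError(f"Test data lacks columns: {sorted(missing)}")

# Reorder the test columns to match the training layout
aligned = test_features[train_features.columns]

print(list(aligned.columns))  # ['fp_0', 'fp_1', 'fp_2']
```

This check only helps when the model is used with column names; after fitting on plain NumPy arrays, as in the cross-validation cell, the order of the columns is all the model sees.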
