# Model Evaluation - Sentiment Analysis

> **Note**: This notebook is compatible with both Google Colab and local Jupyter environments. Colab-specific sections are clearly marked.

This notebook demonstrates the training and evaluation of traditional machine learning models for **sentiment analysis** on the Amazon Reviews dataset. After preprocessing text in a separate pipeline, we use **TF-IDF vectorization** to extract features and train models including Logistic Regression, XGBoost, LightGBM, and CatBoost. We apply **Optuna** for hyperparameter optimization and **MLflow** for experiment tracking.

Key components covered:
- TF-IDF feature extraction from preprocessed text
- Hyperparameter tuning with Optuna
- Evaluation using metrics such as accuracy, precision, recall, F1, AUROC, and AUPRC
- Confusion matrix, ROC and PR curve visualizations
- End-to-end pipeline logging and model versioning with MLflow

This project is part of my Week 1 exploration of **core NLP foundations**, and provides a strong baseline for future comparisons with neural and transformer-based models.

## Import Libraries

In [None]:
!pip install xgboost==2.1.4
!pip install lightgbm
!pip install optuna
!pip install catboost
!pip install swifter
!pip install optuna-integration
!pip install mlflow


In [None]:
import sys
import os
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')

    project_path = '/content/drive/MyDrive/NLP_Projects/Week_1/sentiment-analysis/'
    if os.path.exists(project_path):
        os.chdir(project_path)
        print(f"Changed working directory to: {project_path}")
    else:
        raise FileNotFoundError(f"Project path not found: {project_path}")
else:
    print("Not running in Colab — skipping Drive mount.")

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
import swifter

import optuna
from optuna.integration import XGBoostPruningCallback, LightGBMPruningCallback, CatBoostPruningCallback

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc, precision_recall_curve

import warnings
warnings.filterwarnings('ignore')

import pickle

import matplotlib.pyplot as plt

import mlflow
mlflow.autolog(disable = True)

import ast

## Loading and Cleaning the Data

We begin by loading the pre-split text datasets for training, validation, and testing. The .squeeze() method is used to ensure the data is stored as a 1D pandas.Series rather than a DataFrame with a single column.

In [None]:
X_train = pd.read_csv('./data/X_train_nltk.csv').squeeze()
X_val = pd.read_csv('./data/X_val_nltk.csv').squeeze()
X_test = pd.read_csv('./data/X_test_nltk.csv').squeeze()

y_train = pd.read_csv('./data/y_train.csv').squeeze()
y_val = pd.read_csv('./data/y_val.csv').squeeze()
y_test = pd.read_csv('./data/y_test.csv').squeeze()
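
As a small illustration of what .squeeze() changes, the sketch below reads a one-column CSV from an in-memory string. The CSV content is made up for the example and only stands in for files such as X_train_nltk.csv.

```python
import io

import pandas as pd

# Hypothetical one-column CSV standing in for one of the text files loaded above.
csv_text = "text\ngreat product\nterrible quality\n"

# Without squeeze: a DataFrame with a single column.
frame = pd.read_csv(io.StringIO(csv_text))

# With squeeze: the single column collapses into a 1D Series.
series = pd.read_csv(io.StringIO(csv_text)).squeeze()

print(type(frame).__name__, frame.shape)
print(type(series).__name__, series.shape)
```

One caveat worth knowing: on a frame with exactly one row and one column, .squeeze() returns a scalar. Passing "columns" as the axis, as in .squeeze("columns"), keeps the result a Series in that edge case.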

Since some rows may be missing values, we filter out any entries in X that are NaN, and remove the corresponding labels from y to ensure alignment.

In [None]:
inds = X_train.isna()
X_train = X_train[~inds]
y_train = y_train[~inds]

inds = X_val.isna()
X_val = X_val[~inds]
y_val = y_val[~inds]

inds = X_test.isna()
X_test = X_test[~inds]
y_test = y_test[~inds]
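
The same masking pattern on toy data shows why features and labels stay aligned. The values below are invented for the illustration; the only assumption carried over from the notebook is that X and y share the same index.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a text Series and its labels (same default index 0..3).
X = pd.Series(["good", np.nan, "bad", np.nan])
y = pd.Series([1, 0, 0, 1])

# Boolean mask of missing texts, reused for both Series.
missing = X.isna()
X_clean = X[~missing]
y_clean = y[~missing]

print(X_clean.tolist(), y_clean.tolist())
```

Because the mask is applied by index label, this only works when the feature and label files were written with matching row order.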

## End-to-End Model Training, Tuning, and Evaluation with Optuna and MLflow

In this section, we define a full ML pipeline that includes:

- Constructing and tuning a TF-IDF + classifier pipeline using Optuna

- Logging hyperparameters, metrics, and artifacts to MLflow for experiment tracking

- Evaluating the best model on train, validation, and test sets

- Visualizing performance via ROC/PR curves and confusion matrices

Each component is modular and supports Logistic Regression, XGBoost, LightGBM, and CatBoost. This setup enables reproducibility and scalable experimentation for binary classification tasks.

In [None]:
def create_model(model_name, model_params):
  """
  Creates and returns a classification model based on the specified model name and hyperparameters.

  Args:
      model_name (str): The name of the model to create. One of:
          - 'lr'   : Logistic Regression
          - 'xgb'  : XGBoost Classifier
          - 'lgbm' : LightGBM Classifier
          - 'cat'  : CatBoost Classifier
      model_params (dict): Dictionary of hyperparameters to initialize the model with.

  Returns:
      model (sklearn.base.BaseEstimator): An instance of the specified classification model,
                                          initialized with the provided parameters.

  Notes:
      - Adds default settings for `n_jobs` or `thread_count` where applicable for parallelism.
      - Evaluation metric is preset to AUC for tree-based models.
  """
  if model_name == 'lr':
    model = LogisticRegression(**model_params, n_jobs = 5)
  elif model_name == 'xgb':
    model = XGBClassifier(**model_params, eval_metric = 'auc', n_jobs = 5)
  elif model_name == 'lgbm':
    model = LGBMClassifier(**model_params, metric = 'auc', n_jobs = 5)
  elif model_name == 'cat':
    model = CatBoostClassifier(**model_params, eval_metric = 'AUC', thread_count = 5)
  else:
    # Fail with a clear message instead of an UnboundLocalError at the return below.
    raise ValueError(f"Unknown model_name: {model_name!r}. Expected 'lr', 'xgb', 'lgbm', or 'cat'.")
  return model

In [None]:
def calculate_metrics(y_true, y_pred_proba, y_pred, set = 'train'):
  """
  Calculates common classification evaluation metrics.

  Args:
      y_true (array-like): Ground truth binary labels (0 or 1).
      y_pred_proba (array-like): Predicted probabilities for the positive class.
      y_pred (array-like): Predicted binary class labels.
      set (str, optional): Identifier for the dataset split (e.g., 'train', 'val', 'test').
                            Used to prefix the returned metric keys. Default is 'train'.

  Returns:
      dict: A dictionary containing the following metrics with keys prefixed by `set`:
          - accuracy: Proportion of correct predictions.
          - precision: Proportion of positive predictions that are correct.
          - recall: Proportion of actual positives correctly predicted.
          - specificity: Proportion of actual negatives correctly predicted.
          - f1: Harmonic mean of precision and recall.
          - auroc: Area under the ROC curve.
          - auprc: Area under the Precision-Recall curve.
  """
  accuracy = accuracy_score(y_true, y_pred)
  precision = precision_score(y_true, y_pred)
  recall = recall_score(y_true, y_pred)
  specificity = recall_score(y_true, y_pred, pos_label = 0)
  f1 = f1_score(y_true, y_pred)
  auroc = roc_auc_score(y_true, y_pred_proba)
  auprc = average_precision_score(y_true, y_pred_proba)

  metrics = {
      f'{set}_accuracy': accuracy,
      f'{set}_precision': precision,
      f'{set}_recall': recall,
      f'{set}_specificity': specificity,
      f'{set}_f1': f1,
      f'{set}_auroc': auroc,
      f'{set}_auprc': auprc
  }

  return metrics
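
The specificity line in calculate_metrics relies on the fact that specificity is the recall of the negative class. A minimal check on made-up labels, comparing that shortcut with the definition TN / (TN + FP):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels, invented for this check.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity_from_counts = tn / (tn + fp)
specificity_from_recall = recall_score(y_true, y_pred, pos_label=0)

print(specificity_from_counts, specificity_from_recall)
```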

In [None]:
def plot_roc_curve(y_true, y_pred_proba, path, set = 'Train'):
    """
    Plots the ROC curve and computes the AUC.

    Args:
        y_true (array-like): True binary labels (0 or 1).
        y_pred_proba (array-like): Predicted probabilities for the positive class.
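        path (str): File path where the plot is saved.
        set (str, optional): Label for the dataset split, used in the plot title. Default is 'Train'.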

    Returns: None
    """
    fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize = (6, 5))
    plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.4f}')
    plt.plot([0, 1], [0, 1], linestyle = '--', color = 'gray')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{set} ROC Curve')
    plt.legend(loc = 'lower right')
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
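
plot_roc_curve derives the area from the curve points with auc(fpr, tpr), while calculate_metrics calls roc_auc_score directly. On a small made-up example the two routes agree, which is a quick way to convince yourself that the plotted label and the logged metric describe the same quantity:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four toy samples: three of the four (positive, negative) pairs are ranked correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)                # trapezoidal area under the ROC points
area_direct = roc_auc_score(y_true, y_score)   # same quantity computed in one call

print(area_from_curve, area_direct)
```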

In [None]:
def plot_pr_curve(y_true, y_pred_proba, path, set = 'Train'):
    """
    Plots the Precision-Recall curve and computes the average precision.

    Args:
        y_true (array-like): True binary labels (0 or 1).
        y_pred_proba (array-like): Predicted probabilities for the positive class.
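        path (str): File path where the plot is saved.
        set (str, optional): Label for the dataset split, used in the plot title. Default is 'Train'.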

    Returns: None
    """
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    ap_score = average_precision_score(y_true, y_pred_proba)

    plt.figure(figsize = (6, 5))
    plt.plot(recall, precision, label = f'AUPRC = {ap_score:.4f}')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'{set} Precision-Recall Curve')
    plt.legend(loc = 'lower left')
    plt.tight_layout()
    plt.savefig(path)
    plt.close()

In [None]:
def create_confusion_matrix(y_true, y_pred, path, set = 'Train'):
  """
  Generates and saves a confusion matrix plot for classification predictions.

  Args:
      y_true (array-like): Ground truth binary or multiclass labels.
      y_pred (array-like): Predicted class labels.
      path (str): File path to save the confusion matrix plot.
      set (str, optional): Label for the dataset split (e.g., 'Train', 'Val', 'Test').
                            Used in the plot title. Default is 'Train'.

  Returns:
      None. Saves the confusion matrix plot to the specified path.
  """
  ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
  plt.title(f'{set} Confusion Matrix')
  plt.savefig(path)
  plt.close()

In [None]:
def create_objective(tfidf_suggestions, model_suggestions, model_name, X_train, y_train, X_val, y_val, experiment_id):
  """
  Creates an Optuna objective function for hyperparameter optimization using a TF-IDF + ML model pipeline.

  The returned objective function builds a pipeline with TF-IDF and a classifier, fits it on training data,
  evaluates it on validation data, logs the run to MLflow, and returns the validation F1 score.

  Args:
      tfidf_suggestions (Dict[str, Callable[[optuna.Trial], Any]]):
          Dictionary of hyperparameter suggestion functions for TF-IDF vectorizer.
      model_suggestions (Dict[str, Callable[[optuna.Trial], Any]]):
          Dictionary of hyperparameter suggestion functions for the classifier.
      model_name (str):
          Name of the model to use in the pipeline ('lr', 'xgb', 'lgbm', 'cat').
      X_train (pd.Series or array-like):
          Training feature data (text).
      y_train (pd.Series or array-like):
          Training labels.
      X_val (pd.Series or array-like):
          Validation feature data (text).
      y_val (pd.Series or array-like):
          Validation labels.
      experiment_id (str):
          MLflow experiment ID to log the nested runs under.

  Returns:
      Callable[[optuna.Trial], float]:
          An objective function compatible with Optuna that returns validation F1 score.
  """
  def objective(trial):
    tfidf_params = {key: func(trial) for key, func in tfidf_suggestions.items()}
    model_params = {key: func(trial) for key, func in model_suggestions.items()}

    tfidf = TfidfVectorizer(**tfidf_params, lowercase = False, tokenizer = str.split)
    model = create_model(model_name, model_params)

    pipe = Pipeline([
        ('tfidf', tfidf),
        ('model', model)
    ])

    pipe.fit(X_train, y_train)

    y_pred_proba = pipe.predict_proba(X_val)[:, 1]
    y_pred = pipe.predict(X_val)

    metrics = calculate_metrics(y_val, y_pred_proba, y_pred, set = 'val')

    run_name = f'trial_{trial.number}'
    with mlflow.start_run(run_name = run_name, experiment_id = experiment_id, nested = True) as run:
      mlflow.log_param('tfidf_params', tfidf_params)
      mlflow.log_param('model_params', model_params)
      mlflow.log_metrics(metrics)
      trial.set_user_attr('mlflow_run_id', run.info.run_id)

    return metrics['val_f1']
  return objective
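
The objective receives its search space as a dictionary of callables, each of which asks the trial for one value. The sketch below isolates that pattern. StubTrial is a hypothetical stand-in written for this example so it runs without Optuna; it is not Optuna's API, it only mimics the two suggest methods used here.

```python
import random


class StubTrial:
    """Hypothetical stand-in for optuna.Trial, used only to show the pattern."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_float(self, name, low, high):
        self.params[name] = self._rng.uniform(low, high)
        return self.params[name]

    def suggest_categorical(self, name, choices):
        self.params[name] = self._rng.choice(choices)
        return self.params[name]


# Same shape as tfidf_suggestions: parameter name -> function of the trial.
suggestions = {
    "max_df": lambda trial: trial.suggest_float("max_df", 0.5, 1.0),
    "ngram_range": lambda trial: trial.suggest_categorical("ngram_range", [(1, 1), (1, 2)]),
}

trial = StubTrial(seed=0)

# The dict comprehension used inside objective(): one concrete value per key.
sampled = {key: func(trial) for key, func in suggestions.items()}
print(sampled)
```

Keeping the search space as data rather than hard-coding it in the objective is what lets the same create_objective serve all four classifiers.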

In [None]:
def log_mlflow(tfidf_suggestions, model_suggestions, model_name, X_train, y_train, X_val, y_val, X_test, y_test, n_trials, run_name, experiment_id):
  """
  Runs hyperparameter tuning using Optuna, evaluates the best model, and logs all results to MLflow.

  This function:
    - Defines and optimizes an Optuna objective using a TF-IDF + classifier pipeline.
    - Logs the best parameters and validation metrics.
    - Re-trains the model on the full training set using the best parameters.
    - Evaluates on train, validation, and test sets.
    - Logs metrics, ROC & PR curves, confusion matrices, and the final model to MLflow.

  Args:
      tfidf_suggestions (Dict[str, Callable[[optuna.Trial], Any]]):
          Dictionary of TF-IDF hyperparameter search spaces for Optuna.
      model_suggestions (Dict[str, Callable[[optuna.Trial], Any]]):
          Dictionary of model hyperparameter search spaces for Optuna.
      model_name (str):
          One of 'lr', 'xgb', 'lgbm', or 'cat' — specifies the classifier to use.
      X_train, y_train:
          Training data and labels.
      X_val, y_val:
          Validation data and labels.
      X_test, y_test:
          Test data and labels.
      n_trials (int):
          Number of Optuna trials to run.
      run_name (str):
          Name of the parent MLflow run.
      experiment_id (str):
          ID of the MLflow experiment where results should be logged.

  Returns:
      str: The MLflow run ID of the parent run.

  Notes:
      - Uses nested MLflow runs for each trial during tuning.
      - Computes the same metrics (AUROC, AUPRC, F1, etc.) separately for the train, validation, and test splits.
      - Saves visualizations and logs them as MLflow artifacts.
      - Logs the final trained model with input-output signature.
  """
  objective = create_objective(tfidf_suggestions, model_suggestions, model_name, X_train, y_train, X_val, y_val, experiment_id)
  direction = 'maximize'
  study = optuna.create_study(direction = direction)

  with mlflow.start_run(run_name = run_name, experiment_id = experiment_id) as run_outer:
    study.optimize(objective, n_trials = n_trials)

    mlflow.log_metric('best_optimize_metric', study.best_value)
    mlflow.set_tag('best_trial_number', study.best_trial.number)

    best_run_id = study.best_trial.user_attrs['mlflow_run_id']
    run = mlflow.get_run(best_run_id)

    best_tfidf_params = ast.literal_eval(run.data.params['tfidf_params'])
    best_model_params = ast.literal_eval(run.data.params['model_params'])
    mlflow.log_param('best_tfidf_params', best_tfidf_params)
    mlflow.log_param('best_model_params', best_model_params)

    tfidf = TfidfVectorizer(**best_tfidf_params, lowercase = False, tokenizer = str.split)
    model = create_model(model_name, best_model_params)

    pipe = Pipeline([
        ('tfidf', tfidf),
        ('model', model)
    ])

    pipe.fit(X_train, y_train)

    train_pred_proba = pipe.predict_proba(X_train)[:, 1]
    train_pred = pipe.predict(X_train)

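    # Make sure the artifact directory exists before any plot is saved into it.
    os.makedirs(f'./artifacts/{model_name}', exist_ok = True)

    # Lowercase prefix so the keys follow the 'train_*' naming used in calculate_metrics.
    train_metrics = calculate_metrics(y_train, train_pred_proba, train_pred, set = 'train')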
    plot_roc_curve(y_train, train_pred_proba, path = f'./artifacts/{model_name}/train_roc_curve.png', set = 'Train')
    plot_pr_curve(y_train, train_pred_proba, path = f'./artifacts/{model_name}/train_pr_curve.png', set = 'Train')
    create_confusion_matrix(y_train, train_pred, path = f'./artifacts/{model_name}/train_confusion_matrix.png', set = 'Train')

    mlflow.log_metrics(train_metrics)
    mlflow.log_artifact(f'./artifacts/{model_name}/train_roc_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/train_pr_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/train_confusion_matrix.png')

    val_pred_proba = pipe.predict_proba(X_val)[:, 1]
    val_pred = pipe.predict(X_val)

    # Lowercase prefix so the parent run logs 'val_f1', the key queried later via search_runs.
    val_metrics = calculate_metrics(y_val, val_pred_proba, val_pred, set = 'val')
    plot_roc_curve(y_val, val_pred_proba, path = f'./artifacts/{model_name}/val_roc_curve.png', set = 'Val')
    plot_pr_curve(y_val, val_pred_proba, path = f'./artifacts/{model_name}/val_pr_curve.png', set = 'Val')
    create_confusion_matrix(y_val, val_pred, path = f'./artifacts/{model_name}/val_confusion_matrix.png', set = 'Val')

    mlflow.log_metrics(val_metrics)
    mlflow.log_artifact(f'./artifacts/{model_name}/val_roc_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/val_pr_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/val_confusion_matrix.png')

    test_pred_proba = pipe.predict_proba(X_test)[:, 1]
    test_pred = pipe.predict(X_test)

    # Lowercase prefix so the keys follow the 'test_*' naming used in calculate_metrics.
    test_metrics = calculate_metrics(y_test, test_pred_proba, test_pred, set = 'test')
    plot_roc_curve(y_test, test_pred_proba, path = f'./artifacts/{model_name}/test_roc_curve.png', set = 'Test')
    plot_pr_curve(y_test, test_pred_proba, path = f'./artifacts/{model_name}/test_pr_curve.png', set = 'Test')
    create_confusion_matrix(y_test, test_pred, path = f'./artifacts/{model_name}/test_confusion_matrix.png', set = 'Test')

    mlflow.log_metrics(test_metrics)
    mlflow.log_artifact(f'./artifacts/{model_name}/test_roc_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/test_pr_curve.png')
    mlflow.log_artifact(f'./artifacts/{model_name}/test_confusion_matrix.png')

    signature = mlflow.models.infer_signature(X_train, y_train)
    mlflow.sklearn.log_model(pipe, run_name, signature = signature)

  return run_outer.info.run_id
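
log_mlflow reads the best trial's parameters back from MLflow, where each logged dictionary is held as a string, and rebuilds it with ast.literal_eval. The round trip can be sketched without MLflow, assuming the stored string is simply str() of the dictionary; the parameter values below are invented.

```python
import ast

# Invented TF-IDF parameters, similar in shape to what a trial would sample.
tfidf_params = {"max_df": 0.92, "min_df": 0.02, "ngram_range": (1, 2), "max_features": 3000}

# Assumption: the tracking store keeps the string form of the dictionary.
logged = str(tfidf_params)

# literal_eval safely parses Python literals (dicts, tuples, numbers, strings).
restored = ast.literal_eval(logged)

print(restored)
```

This works because every sampled value is a plain Python literal. Values whose string form is not a literal would make literal_eval raise an error.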

## Setting Up MLflow Tracking and Experiment

We configure MLflow to log all runs and artifacts to a custom directory (./experiments/) for better organization and portability. Then, we create or fetch an experiment named sentiment-analysis-amazon-reviews to group all related model runs.

This setup allows us to:

- Track hyperparameter tuning results

- Compare model performance

- Store artifacts like plots and trained models

The retrieved experiment_id is used to link all runs to this specific experiment.

In [None]:
mlflow.set_tracking_uri('./experiments/')

experiment_name = 'sentiment-analysis-amazon-reviews'
mlflow.set_experiment(experiment_name)
experiment = mlflow.get_experiment_by_name(experiment_name)

experiment_id = experiment.experiment_id
print('Experiment ID:', experiment_id)

Experiment ID: 617187074394259539


## Defining TF-IDF Hyperparameter Search Space (Optuna)

We define a dictionary tfidf_suggestions that maps TF-IDF hyperparameters to Optuna search strategies. These settings allow the optimizer to explore different preprocessing configurations during hyperparameter tuning:

- max_df: Upper bound on the document frequency for a term to be included (filters overly common words). Because the value is a float, it is read as a proportion of documents.

- min_df: Lower bound on the document frequency (filters very rare words), also read as a proportion of documents.

- ngram_range: Decides whether to use unigrams only, unigrams plus bigrams, or bigrams only.

- max_features: Keeps only the most frequent terms in the corpus, up to this count, helping control model complexity and speed.

These values will be sampled dynamically by Optuna during each trial.

In [None]:
tfidf_suggestions = {
    'max_df': lambda trial: trial.suggest_float('max_df', 0.5, 1.0),
    'min_df': lambda trial: trial.suggest_float('min_df', 0.01, 0.05),
    'ngram_range': lambda trial: trial.suggest_categorical('ngram_range', [(1, 1), (1, 2), (2, 2)]),
    'max_features': lambda trial: trial.suggest_int('max_features', 1000, 7000, step = 250)
}
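
To make the effect of these settings concrete, the sketch below fits TfidfVectorizer on a four-document toy corpus (invented for this example) and compares the resulting vocabularies. It uses the same lowercase = False and tokenizer = str.split settings as the pipeline; the vocabulary helper is hypothetical and exists only for the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "good" appears in all 4 documents, several words appear only once.
docs = [
    "good product fast delivery",
    "good product poor packaging",
    "good price fast delivery",
    "good quality",
]


def vocabulary(**params):
    """Hypothetical helper: fit a vectorizer and return its sorted vocabulary."""
    vec = TfidfVectorizer(lowercase=False, tokenizer=str.split, **params)
    vec.fit(docs)
    return sorted(vec.vocabulary_)


all_unigrams = vocabulary()                  # no filtering
no_rare = vocabulary(min_df=0.5)             # keep terms found in at least half the documents
no_common = vocabulary(max_df=0.9)           # drop terms found in more than 90% of documents
bigrams = vocabulary(ngram_range=(2, 2))     # bigrams only

print(no_rare)
print(no_common)
print(bigrams)
```

scikit-learn may emit a warning that token_pattern is unused when a custom tokenizer is supplied; it is harmless here.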

## Defining Logistic Regression Hyperparameter Search Space and Launching Optuna + MLflow Tuning

We define model_suggestions, a dictionary of hyperparameter options for Logistic Regression, which Optuna will explore to find the best performing model:

- solver: Specifies the optimization algorithm; here, we use 'saga', which supports both L1 and L2 penalties.

- penalty: Chooses between L1 and L2 regularization.

- C: Inverse regularization strength (smaller = stronger regularization); explored on a log scale.

- max_iter: Maximum number of solver iterations allowed for convergence; fixed at 300 in this search space.

- random_state: Random seed for reproducibility, varied across a range to increase robustness.
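
Why C is explored on a log scale can be seen by comparing uniform and log-uniform draws over the same range. The sketch below uses NumPy to imitate the two sampling schemes; it is an approximation for illustration, not Optuna's sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-4, 1e3   # same bounds as the C search range

# Uniform in C: almost every draw lands in the hundreds.
uniform_draws = rng.uniform(low, high, size=10_000)

# Uniform in log(C): each decade between 1e-4 and 1e3 is equally likely.
log_uniform_draws = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))

share_below_one_uniform = np.mean(uniform_draws < 1.0)
share_below_one_log = np.mean(log_uniform_draws < 1.0)

print(share_below_one_uniform, share_below_one_log)
```

With plain uniform sampling only about 0.1% of draws fall below 1, so strong regularization would barely be explored. On the log scale roughly four of the seven decades, a bit over half of the draws, fall below 1.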

We then call log_mlflow(...), which:

- Runs Optuna for hyperparameter optimization over 25 trials.

- Logs all runs, metrics, plots, and final models to MLflow.

- Returns the MLflow run_id for future reference.

In [None]:
model_suggestions = {
    'solver': lambda trial: trial.suggest_categorical('solver', ['saga']),
    'penalty': lambda trial: trial.suggest_categorical('penalty', ['l1', 'l2']),
    'C': lambda trial: trial.suggest_float('C', 1e-4, 1e3, log = True),
    'max_iter': lambda trial: trial.suggest_int('max_iter', 300, 300, step = 1),
    'random_state': lambda trial: trial.suggest_int('random_state', 100, 400, step = 1)
}

run_id = log_mlflow(tfidf_suggestions, model_suggestions, 'lr', X_train, y_train, X_val, y_val, X_test, y_test, 25, 'lr', experiment_id = experiment_id)

[I 2025-04-24 02:53:01,172] A new study created in memory with name: no-name-f1cb429f-dc3b-4518-b40a-f71a00c6258d
[I 2025-04-24 02:53:22,481] Trial 0 finished with value: 0.5834932821497121 and parameters: {'max_df': 0.9219462990764398, 'min_df': 0.021471461743854732, 'ngram_range': (1, 1), 'max_features': 3000, 'solver': 'saga', 'penalty': 'l1', 'C': 0.000609813504762927, 'max_iter': 300, 'random_state': 349}. Best is trial 0 with value: 0.5834932821497121.
[I 2025-04-24 02:54:32,594] Trial 1 finished with value: 0.0 and parameters: {'max_df': 0.7423305443660055, 'min_df': 0.025424715477950586, 'ngram_range': (1, 2), 'max_features': 5250, 'solver': 'saga', 'penalty': 'l1', 'C': 0.00010645182072009795, 'max_iter': 300, 'random_state': 125}. Best is trial 0 with value: 0.5834932821497121.
[I 2025-04-24 02:55:29,813] Trial 2 finished with value: 0.6710094756790903 and parameters: {'max_df': 0.9680012294032031, 'min_df': 0.016506066712895118, 'ngram_range': (2, 2), 'max_features': 6750, '

## Defining XGBoost Hyperparameter Search Space and Launching Optuna + MLflow Tuning

In this section, we define a set of hyperparameters for XGBoost that Optuna will optimize to maximize model performance. The parameters explored include:

- max_depth: Maximum depth of each tree; controls model complexity.

- learning_rate: Step size shrinkage to prevent overfitting; sampled on a log scale.

- subsample: Fraction of samples used for training each tree to introduce stochasticity.

- alpha & lambda: L1 and L2 regularization terms, respectively.

- gamma: Minimum loss reduction to make a further partition; helps with pruning.

- n_estimators: Number of boosting rounds (trees).

- random_state: Seed for reproducibility.

We then pass this configuration to log_mlflow(...), which:

- Tunes the hyperparameters over 25 trials using Optuna.

- Logs parameters, metrics, plots, and models to MLflow.

- Returns the run_id of the parent MLflow run for traceability.

In [None]:
model_suggestions = {
    'max_depth': lambda trial: trial.suggest_int('max_depth', 3, 12, step = 1),
    'learning_rate': lambda trial: trial.suggest_float('learning_rate', 1e-5, 0.1, log = True),
    'subsample': lambda trial: trial.suggest_float('subsample', 0.5, 1),
    'alpha': lambda trial: trial.suggest_float('alpha', 0, 10),
    'lambda': lambda trial: trial.suggest_float('lambda', 0, 10),
    'gamma': lambda trial: trial.suggest_float('gamma', 0, 10),
    'n_estimators': lambda trial: trial.suggest_int('n_estimators', 100, 500, step = 1),
    'random_state': lambda trial: trial.suggest_int('random_state', 100, 400, step = 1)
}

run_id = log_mlflow(tfidf_suggestions, model_suggestions, 'xgb', X_train, y_train, X_val, y_val, X_test, y_test, 25, 'xgboost', experiment_id = experiment_id)

[I 2025-04-24 03:14:16,317] A new study created in memory with name: no-name-c589146c-1a73-4524-be84-9da3756fe039
[I 2025-04-24 03:14:36,285] Trial 0 finished with value: 0.5704261704156715 and parameters: {'max_df': 0.6822703653483534, 'min_df': 0.03907488044770374, 'ngram_range': (1, 1), 'max_features': 2500, 'max_depth': 5, 'learning_rate': 0.0010807817127471916, 'subsample': 0.9082890809540323, 'alpha': 3.3264834166103885, 'lambda': 7.434273417768309, 'gamma': 9.683322075212534, 'n_estimators': 188, 'random_state': 358}. Best is trial 0 with value: 0.5704261704156715.
[I 2025-04-24 03:16:12,213] Trial 1 finished with value: 0.5938117978255801 and parameters: {'max_df': 0.7949532042967156, 'min_df': 0.02284011131188271, 'ngram_range': (1, 2), 'max_features': 7000, 'max_depth': 6, 'learning_rate': 0.00017874965991000134, 'subsample': 0.9697870464582811, 'alpha': 9.813420861221827, 'lambda': 8.930185799385065, 'gamma': 2.9025372676540817, 'n_estimators': 439, 'random_state': 241}. Bes

## Defining LightGBM Hyperparameter Search Space and Launching Optuna + MLflow Tuning

In this section, we define a hyperparameter search space for LightGBM using Optuna. This configuration includes regularization, sampling, and tree-building parameters:

- max_depth: Maximum depth of trees.

- learning_rate: Controls how much each tree contributes to the final prediction; lower values slow learning.

- feature_fraction: Fraction of features randomly selected in each boosting round (column sampling).

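- bagging_fraction: Fraction of data randomly selected for each iteration (row sampling). In LightGBM this only takes effect when bagging_freq is greater than 0, which this search space does not set.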

- lambda_l1 & lambda_l2: L1 and L2 regularization terms.

- boosting_type: The boosting method used (here we constrain to 'gbdt').

- n_estimators: Number of trees in the model.

- random_state: Seed to ensure reproducibility.

- verbose: Controls output verbosity during training.

We run log_mlflow(...) to:

- Optimize LightGBM hyperparameters over 25 Optuna trials.

- Log the best parameters, performance metrics, confusion matrices, and PR/ROC curves to MLflow.

- Save the trained model and evaluation artifacts to a tracked experiment for easy comparison.

In [None]:
model_suggestions = {
    'max_depth': lambda trial: trial.suggest_int('max_depth', 3, 12, step = 1),
    'learning_rate': lambda trial: trial.suggest_float('learning_rate', 1e-5, 0.1, log = True),
    'feature_fraction': lambda trial: trial.suggest_float('feature_fraction', 0.5, 1),
    'bagging_fraction': lambda trial: trial.suggest_float('bagging_fraction', 0.5, 1),
    'lambda_l1': lambda trial: trial.suggest_float('lambda_l1', 0, 100),
    'lambda_l2': lambda trial: trial.suggest_float('lambda_l2', 0, 100),
    'boosting_type': lambda trial: trial.suggest_categorical('boosting_type', ['gbdt']),
    'n_estimators': lambda trial: trial.suggest_int('n_estimators', 100, 500, step = 1),
    'random_state': lambda trial: trial.suggest_int('random_state', 100, 400, step = 1),
    'verbose': lambda trial: trial.suggest_int('verbose', -1, -1, step = 1)
}

run_id = log_mlflow(tfidf_suggestions, model_suggestions, 'lgbm', X_train, y_train, X_val, y_val, X_test, y_test, 25, 'lightgbm', experiment_id = experiment_id)

[I 2025-04-24 03:54:22,041] A new study created in memory with name: no-name-5d36e649-c416-4d4b-9b69-f61bb01f74b7
[I 2025-04-24 03:55:36,212] Trial 0 finished with value: 0.6015875388166525 and parameters: {'max_df': 0.5380768028772736, 'min_df': 0.04507304985198902, 'ngram_range': (1, 2), 'max_features': 3250, 'max_depth': 7, 'learning_rate': 3.472102811598749e-05, 'feature_fraction': 0.9368303649762355, 'bagging_fraction': 0.9297613781472126, 'lambda_l1': 0.4611955075296881, 'boosting_type': 'gbdt', 'n_estimators': 116, 'random_state': 174, 'verbose': -1}. Best is trial 0 with value: 0.6015875388166525.
[I 2025-04-24 03:56:34,260] Trial 1 finished with value: 0.6609538039365043 and parameters: {'max_df': 0.983253153030452, 'min_df': 0.0395784357721554, 'ngram_range': (2, 2), 'max_features': 3250, 'max_depth': 11, 'learning_rate': 0.045606450082927484, 'feature_fraction': 0.8651691330974973, 'bagging_fraction': 0.5666090697832743, 'lambda_l1': 26.326945753771458, 'boosting_type': 'gbd

## Defining CatBoost Hyperparameter Search Space and Launching Optuna + MLflow Tuning

This section sets up a hyperparameter search space for CatBoost, a gradient boosting library optimized for performance and efficiency. We define key tuning parameters:

- max_depth: Maximum depth of the trees.

- learning_rate: Step size used to shrink each tree’s contribution; smaller values learn more slowly and usually need more trees to reach the same fit.

- l2_leaf_reg: L2 regularization to prevent overfitting by penalizing leaf scores.

- bagging_temperature: Controls the amount of randomness in data sampling; higher = more randomness.

- n_estimators: Total number of boosting rounds (trees).

- random_state: Seed for reproducibility.

- verbose: Set to 0 to suppress training output.

We then call log_mlflow(...) to:

- Optimize the CatBoost model over 25 trials using Optuna.

- Automatically track the best trial, parameters, and performance metrics with MLflow.

- Log model artifacts, evaluation plots (ROC, PR curves), and the trained pipeline for later analysis or deployment.



In [None]:
model_suggestions = {
    'max_depth': lambda trial: trial.suggest_int('max_depth', 3, 12, step = 1),
    'learning_rate': lambda trial: trial.suggest_float('learning_rate', 1e-5, 0.1, log = True),
    'l2_leaf_reg': lambda trial: trial.suggest_float('l2_leaf_reg', 1, 100),
    'bagging_temperature': lambda trial: trial.suggest_float('bagging_temperature', 0, 1),
    'n_estimators': lambda trial: trial.suggest_int('n_estimators', 100, 500, step = 1),
    'random_state': lambda trial: trial.suggest_int('random_state', 100, 400, step = 1),
    'verbose': lambda trial: trial.suggest_int('verbose', 0, 0, step = 1)
}

run_id = log_mlflow(tfidf_suggestions, model_suggestions, 'cat', X_train, y_train, X_val, y_val, X_test, y_test, 25, 'catboost', experiment_id = experiment_id)

[I 2025-05-04 23:25:30,773] A new study created in memory with name: no-name-af156bfb-db13-498d-ac7d-3425677fc9c7
[I 2025-05-04 23:27:27,852] Trial 0 finished with value: 0.5694221125181432 and parameters: {'max_df': 0.5911102674733995, 'min_df': 0.041955522682820524, 'ngram_range': (1, 2), 'max_features': 1000, 'max_depth': 3, 'learning_rate': 0.0015626453780787069, 'l2_leaf_reg': 61.0088772088764, 'bagging_temperature': 0.1374729603907443, 'n_estimators': 345, 'random_state': 145, 'verbose': 0}. Best is trial 0 with value: 0.5694221125181432.
[I 2025-05-04 23:28:40,894] Trial 1 finished with value: 0.6569770335785607 and parameters: {'max_df': 0.7666792109441115, 'min_df': 0.024011779868490034, 'ngram_range': (2, 2), 'max_features': 3000, 'max_depth': 4, 'learning_rate': 0.014855037698110198, 'l2_leaf_reg': 41.38513738436608, 'bagging_temperature': 0.9788139085956903, 'n_estimators': 168, 'random_state': 133, 'verbose': 0}. Best is trial 1 with value: 0.6569770335785607.
[I 2025-05-0

## Loading the Best CatBoost Model and Evaluating on the Test Set

Once the training and hyperparameter tuning are complete, we retrieve the best-performing CatBoost model using mlflow.search_runs() by filtering on the run name ('catboost') and sorting by validation F1 score.

We then:

- Load the model from the local MLflow tracking directory.

- Generate predictions on the test set.

- Print evaluation metrics, including Accuracy, Precision, Recall, Specificity, F1 Score, AUROC, and AUPRC.

- Save the confusion matrix, ROC curve, and Precision-Recall curve.

In [None]:
# search through mlflow runs and select the run with the best F1 score
runs_df = mlflow.search_runs(
    experiment_ids = [experiment_id],
    filter_string = "tags.mlflow.runName = 'catboost'",
    order_by = ["metrics.val_f1 DESC"]
)

best_run_id = runs_df.iloc[0]["run_id"]

# load the model using the best run id
model_uri = f"runs:/{best_run_id}/catboost"
model = mlflow.sklearn.load_model(model_uri)

In [None]:
pred = model.predict(X_test)
pred_proba = model.predict_proba(X_test)[:, 1]

In [None]:
metrics = calculate_metrics(y_test, pred_proba, pred, set = 'test')

In [None]:
print('Test Accuracy:', metrics['test_accuracy'])
print('Test Precision:', metrics['test_precision'])
print('Test Recall:', metrics['test_recall'])
print('Test Specificity:', metrics['test_specificity'])
print('Test F1:', metrics['test_f1'])
print('Test AUROC:', metrics['test_auroc'])
print('Test AUPRC:', metrics['test_auprc'])

Test Accuracy: 0.8092925
Test Precision: 0.8118229431839377
Test Recall: 0.805235
Test Specificity: 0.81335
Test F1: 0.808515551851638
Test AUROC: 0.892993424675
Test AUPRC: 0.8918270833481999
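
The printed metrics can be cross-checked against each other with a little arithmetic. F1 is the harmonic mean of precision and recall, and accuracy is a weighted average of recall and specificity, with the share of positive samples as the weight. The sketch below plugs in the values printed above.

```python
import math

# Values copied from the printed test metrics.
precision = 0.8118229431839377
recall = 0.805235
specificity = 0.81335
accuracy = 0.8092925

# F1 = 2PR / (P + R)
f1 = 2 * precision * recall / (precision + recall)

# accuracy = share * recall + (1 - share) * specificity, solved for share.
positive_share = (accuracy - specificity) / (recall - specificity)

print(f1)
print(positive_share)
```

The recomputed F1 agrees with the reported Test F1, and the implied share of positive samples comes out at one half, which suggests the test split is balanced between the two classes.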


In [None]:
create_confusion_matrix(y_test, pred, path = './artifacts/cat/test_confusion_matrix.png', set = 'Test')
plot_roc_curve(y_test, pred_proba, path = './artifacts/cat/test_roc_curve.png', set = 'Test')
plot_pr_curve(y_test, pred_proba, path = './artifacts/cat/test_pr_curve.png', set = 'Test')