# Baymax - Model Training

### 📋 **Notebook Overview**

This notebook trains and evaluates six classification models on the **MHP Processed dataset**, whose target encodes mental health status as one of three classes:
- 0: Stable
- 1: Challenged
- 2: Critical

### 📊 **Notebook Structure**

This notebook is organized into **5 main parts**:
- **Part 1:** Imports & Configuration
- **Part 2:** Helper Functions
- **Part 3:** Model Definitions
- **Part 4:** Training Pipeline
- **Part 5:** Baymax Model Training Execution

### 🤖 **Models Trained**

**Traditional Machine Learning (6 models)**:
- Logistic Regression
- Gradient Boosting
- K-Nearest Neighbors (KNN)
- Random Forest
- Decision Tree
- Support Vector Machine (SVM)

## Part 1: Imports & Configuration

This section sets up the foundation for the entire training pipeline.

### What This Cell Does:

1. **Imports all required libraries:**
   - Standard libraries (`os`, `warnings`, `time`, `pathlib`)
   - Data science libraries (`numpy`, `pandas`)
   - Visualization libraries (`matplotlib`, `seaborn`)
   - Machine learning libraries (`scikit-learn`)
   - Utilities (`joblib`)

2. **Sets global configurations:**
   - Random seed (`RANDOM_STATE = 42`) for reproducibility
   - Cross-validation strategy (3-fold Stratified K-Fold, shuffled and seeded with `RANDOM_STATE`)

3. **Defines directory structure:**
   - Input path pointing to the Baymax features directory containing `train.csv` and `test.csv`
   - Output directories for results, models, and figures

4. **Configures display settings:**
   - Pandas display options (show all rows/columns)
   - Matplotlib plotting configuration (DPI, style)
   - Seaborn aesthetic settings
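
To make the cross-validation setting concrete, the sketch below shows what a shuffled 3-fold stratified splitter does on a small synthetic label vector. The toy labels are an assumption for illustration, not the MHP data. Each held-out fold keeps the overall class proportions. The pipeline in Part 4 evaluates on the fixed train/test split, so this only illustrates the configured splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption): 6 Stable, 3 Challenged, 3 Critical
y = np.array([0] * 6 + [1] * 3 + [2] * 3)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are drawn

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Class counts inside each held-out fold
fold_counts = [
    np.bincount(y[test_idx], minlength=3).tolist()
    for _, test_idx in cv.split(X, y)
]
print(fold_counts)
```

With class sizes that divide evenly by the number of splits, every fold receives the same 2:1:1 mix as the full label vector.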

In [1]:
# ============================================================================
# STANDARD LIBRARY IMPORTS
# ============================================================================
import os
import warnings
import time
from pathlib import Path

warnings.filterwarnings("ignore")

# ============================================================================
# DATA SCIENCE & NUMERICAL COMPUTING
# ============================================================================
import numpy as np
import pandas as pd

# ============================================================================
# VISUALIZATION
# ============================================================================
import matplotlib.pyplot as plt
import seaborn as sns

# ============================================================================
# MACHINE LEARNING - SKLEARN
# ============================================================================
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)
from sklearn.model_selection import StratifiedKFold

# ============================================================================
# MODEL PERSISTENCE
# ============================================================================
import joblib

# ============================================================================
# GLOBAL CONFIGURATION
# ============================================================================
# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Cross-validation configuration
CV = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

# ============================================================================
# PATH CONFIGURATION
# ============================================================================
# Base Directory
BASE_DIR = Path.cwd().parents[0]

# Feature Input Path (single train/test split from Baymax preprocessing)
FEATURES_DIR = BASE_DIR / "features"

# Output Base Directories
RESULTS_BASE = BASE_DIR / "results"
MODELS_BASE  = BASE_DIR / "models"
FIGURES_BASE = BASE_DIR / "figures"

# Create output directories if they don't exist
for p in [RESULTS_BASE, MODELS_BASE, FIGURES_BASE]:
    p.mkdir(parents=True, exist_ok=True)

# ============================================================================
# DISPLAY SETTINGS
# ============================================================================
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# Plotting Configuration
sns.set(style="whitegrid")
plt.rcParams.update({"figure.dpi": 120})

print("IMPORTS & CONFIGURATION LOADED SUCCESSFULLY")

IMPORTS & CONFIGURATION LOADED SUCCESSFULLY
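
The path configuration in the cell above can be sketched in isolation. The example below assumes the notebook runs from a subfolder of the project (the `notebooks` folder name is hypothetical) and uses a temporary directory so nothing is written to a real project. It shows that `parents[0]` resolves to the parent folder and that only the output folders are created.

```python
import tempfile
from pathlib import Path


def build_layout(notebook_dir: Path) -> dict:
    """Resolve the project layout relative to the notebook's folder.

    `notebook_dir` stands in for Path.cwd(); the project root is taken
    to be its parent, which is what `parents[0]` returns.
    """
    base_dir = notebook_dir.parents[0]
    layout = {
        "features": base_dir / "features",  # input: train.csv, test.csv
        "results": base_dir / "results",
        "models": base_dir / "models",
        "figures": base_dir / "figures",
    }
    # Only output folders are created; the features folder must already exist.
    for key in ("results", "models", "figures"):
        layout[key].mkdir(parents=True, exist_ok=True)
    return layout


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "Baymax"
    notebooks = project / "notebooks"  # hypothetical notebook folder
    notebooks.mkdir(parents=True)

    layout = build_layout(notebooks)
    created = sorted(p.name for p in project.iterdir() if p.is_dir())
    features_exists = layout["features"].exists()

print(created)
print(features_exists)
```

Because the features folder is never created here, a missing `train.csv` or `test.csv` surfaces later as the `FileNotFoundError` raised by the loader rather than as an empty directory.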


## Part 2: Helper Functions

This section defines all utility functions used throughout the training pipeline.

### What This Cell Does:

1. **Metrics Computation:**
   - `compute_metrics()`: Calculates Accuracy, Precision, Recall, and F1 scores (weighted average) for multi-class evaluation

2. **Visualization Functions:**
   - `plot_and_save_confusion()`: Generates and saves confusion matrix heatmaps as PNG files

3. **Data Loading & Preprocessing:**
   - `load_train_test_data()`: Loads `train.csv` and `test.csv` from the Baymax features directory, drops rows with missing target values (`Mental Health Status Encoded`), and returns `X_train`, `y_train`, `X_test`, `y_test` as NumPy arrays

4. **Model Persistence:**
   - `save_model()`: Saves a trained sklearn model to disk using `joblib`
   - `load_model()`: Loads a saved sklearn model from disk
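
To show what the weighted average means for these metrics, here is a plain-Python sketch that computes them by hand on a toy example. The function name and the toy labels are assumptions for illustration; the notebook itself relies on scikit-learn. Each class's score is weighted by its share of the true labels, and a zero denominator contributes 0.0, mirroring `zero_division=0`.

```python
from collections import Counter


def weighted_precision_recall_f1(y_true, y_pred):
    """Support-weighted precision, recall, and F1 across all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        # A zero denominator yields 0.0 for that class.
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy labels (assumption): 0 = Stable, 1 = Challenged, 2 = Critical
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

precision, recall, f1 = weighted_precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 4), round(recall, 4), round(f1, 4), accuracy)
```

Support-weighted recall always equals accuracy, because the per-class supports cancel in the weighted sum. The Recall column therefore carries no information beyond Accuracy when this averaging mode is used.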

In [2]:
# ============================================================================
# METRICS COMPUTATION
# ============================================================================
def compute_metrics(y_true, y_pred):
    """
    Calculate classification metrics for model evaluation.

    Args:
        y_true: array-like, true labels
        y_pred: array-like, predicted labels

    Returns:
        dict: Dictionary containing Accuracy, Precision, Recall, and F1 scores
    """
    return {
        "Accuracy":  accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall":    recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1":        f1_score(y_true, y_pred, average="weighted", zero_division=0)
    }


# ============================================================================
# VISUALIZATION FUNCTIONS
# ============================================================================
def plot_and_save_confusion(y_true, y_pred, path, title):
    """
    Generate and save a confusion matrix heatmap.

    Args:
        y_true: array-like, true labels
        y_pred: array-like, predicted labels
        path:   Path or str, file path to save the figure
        title:  str, title for the plot
    """
    # Pin the label order so the matrix is always 3x3 and matches the tick labels
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
                xticklabels=["Stable", "Challenged", "Critical"],
                yticklabels=["Stable", "Challenged", "Critical"])
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(path, dpi=300, bbox_inches="tight")
    plt.close(fig)


# ============================================================================
# DATA LOADING & PREPROCESSING
# ============================================================================
def load_train_test_data(features_dir):
    """
    Load and preprocess training and testing data from the Baymax features directory.

    Args:
        features_dir: Path, directory containing train.csv and test.csv

    Returns:
        tuple: (X_train, y_train, X_test, y_test) as NumPy arrays

    Raises:
        FileNotFoundError: If train.csv or test.csv does not exist
    """
    train_path = features_dir / "train.csv"
    test_path  = features_dir / "test.csv"

    if not train_path.exists() or not test_path.exists():
        raise FileNotFoundError(
            f"Missing train.csv or test.csv in: {features_dir}"
        )

    # Load data
    train_df = pd.read_csv(train_path)
    test_df  = pd.read_csv(test_path)

    # Drop rows with missing target values
    train_df = train_df.dropna(subset=["Mental Health Status Encoded"])
    test_df  = test_df.dropna(subset=["Mental Health Status Encoded"])

    # Separate features and target
    X_train = train_df.drop(columns=["Mental Health Status Encoded"]).values
    y_train = train_df["Mental Health Status Encoded"].astype(int).values
    X_test  = test_df.drop(columns=["Mental Health Status Encoded"]).values
    y_test  = test_df["Mental Health Status Encoded"].astype(int).values

    return X_train, y_train, X_test, y_test


# ============================================================================
# MODEL PERSISTENCE
# ============================================================================
def save_model(model, model_path):
    """
    Save a trained sklearn model to disk using joblib.

    Args:
        model:      trained sklearn estimator
        model_path: Path or str, file path to save the model
    """
    joblib.dump(model, model_path)


def load_model(model_path):
    """
    Load a saved sklearn model from disk.

    Args:
        model_path: Path or str, file path to the saved model

    Returns:
        Loaded sklearn estimator
    """
    return joblib.load(model_path)


print("HELPER FUNCTIONS LOADED SUCCESSFULLY")

HELPER FUNCTIONS LOADED SUCCESSFULLY
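
The heatmap helper labels both axes with three fixed class names, which only lines up if the matrix is 3x3 in the order Stable, Challenged, Critical. The plain-Python sketch below is a stand-in for illustration, not the notebook's `confusion_matrix` call. It counts against a fixed label list so the shape stays 3x3 even when a class never appears in a given split.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1, 2)):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy split (assumption) in which class 2 (Critical) is absent entirely
cm = confusion_counts([0, 0, 1, 1], [0, 1, 1, 1])
for row in cm:
    print(row)
```

The empty third row and column keep every axis label attached to the right class.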


## Part 3: Model Definitions

This section defines the six traditional machine learning models, keeping scikit-learn defaults except for the overrides noted below.

### What This Cell Does:

**Traditional Machine Learning Models:**
   - `get_ml_models()`: Returns a dictionary of 6 ML models. Each uses scikit-learn defaults apart from the overrides listed here, and every estimator that accepts a `random_state` receives a fixed seed for reproducibility (KNN has no such parameter):
     - **Logistic Regression**: `max_iter=1000`, other settings left at their defaults
     - **Gradient Boosting**: Boosted ensemble of decision trees with default settings
     - **K-Nearest Neighbors (KNN)**: Distance-based classifier with the default `n_neighbors=5`
     - **Random Forest**: Bagged ensemble of decision trees with default settings
     - **Decision Tree**: Single tree classifier with default settings
     - **Support Vector Machine (SVM)**: Default RBF kernel with `probability=True` to enable probability estimates
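
The defaults mentioned in this list can be checked directly with `get_params()`. This is a small sketch, assuming a reasonably recent scikit-learn release; it inspects two of the six estimators rather than the notebook's full model dictionary.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

knn = KNeighborsClassifier()
svm = SVC(random_state=42)

knn_params = knn.get_params()
svm_params = svm.get_params()

print(knn_params["n_neighbors"])      # default neighborhood size
print("random_state" in knn_params)   # KNN exposes no seed parameter
print(svm_params["kernel"])           # default kernel
```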

In [3]:
# ============================================================================
# TRADITIONAL MACHINE LEARNING MODELS
# ============================================================================
def get_ml_models(random_state=42):
    """
    Get a dictionary of traditional machine learning models with default parameters.

    Args:
        random_state: int, random seed for reproducibility

    Returns:
        dict: Model name -> initialized sklearn estimator
    """
    return {
        "Logistic Regression": LogisticRegression(
            max_iter=1000,
            random_state=random_state
        ),
        "Gradient Boosting": GradientBoostingClassifier(
            random_state=random_state
        ),
        "KNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(
            random_state=random_state
        ),
        "Decision Tree": DecisionTreeClassifier(
            random_state=random_state
        ),
        "SVM": SVC(
            probability=True,
            random_state=random_state
        )
    }


print("MODEL DEFINITIONS LOADED SUCCESSFULLY")

MODEL DEFINITIONS LOADED SUCCESSFULLY


## Part 4: Training Pipeline

This section defines the comprehensive training workflow for all six traditional ML models.

### What This Cell Does:

**Traditional Machine Learning Pipeline:**
   - `train_traditional_ml()`: Trains all 6 ML models on the MHP Processed dataset in a single pass:
     - Loads `train.csv` and `test.csv` from the features directory
     - Creates output subdirectories for results and confusion matrix figures
     - Iterates over each model and:
       1. Fits the model on `X_train` / `y_train`
       2. Predicts on `X_test`
       3. Computes Accuracy, Precision, Recall, and F1 (weighted)
       4. Saves the confusion matrix heatmap as a PNG
       5. Saves the trained model to disk as a `.pkl` file
     - Compiles all per-model metrics into a single results DataFrame
     - Saves the results to a CSV in the results directory
     - Returns the combined results DataFrame sorted by Accuracy (descending)
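
The control flow of these steps can be sketched without any ML dependencies. In the example below, `evaluate` is a hypothetical stand-in for fitting and scoring a model, and the accuracy values are made up for the demonstration. Only the file naming rule and the "skip on failure, then sort" behavior mirror the pipeline.

```python
def artifact_names(model_name):
    """Return (confusion-matrix PNG name, model file name) for a display name."""
    slug = model_name.lower().replace(" ", "_")
    return f"{slug}_confusion.png", f"{slug}.pkl"


def run_all(models, evaluate):
    """Evaluate every model; a failure in one does not stop the others."""
    results = []
    for name, model in models.items():
        try:
            results.append({"Model": name, "Accuracy": evaluate(model)})
        except Exception as exc:
            print(f"Error training {name}: {exc}")
    # Best model first, as in the returned results table
    return sorted(results, key=lambda row: row["Accuracy"], reverse=True)


def evaluate(model):
    """Hypothetical scorer: the 'model' is just its made-up accuracy."""
    if model is None:
        raise ValueError("no estimator")
    return model


toy_models = {"Decision Tree": 0.8, "Random Forest": 0.9, "Broken Model": None}
ranked = run_all(toy_models, evaluate)
names = [row["Model"] for row in ranked]
files = artifact_names("Random Forest")
print(names)
print(files)
```

One consequence of this design is that a model which raises during training is silently absent from the results table, so the row count is worth checking against the number of models requested.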

In [4]:
# ============================================================================
# TRADITIONAL MACHINE LEARNING TRAINING PIPELINE
# ============================================================================
def train_traditional_ml(features_dir, results_base, models_base, figures_base,
                          dataset_name="MHP_Processed", random_state=42):
    """
    Train all traditional ML models on the MHP Processed dataset.

    Args:
        features_dir:  Path, directory containing train.csv and test.csv
        results_base:  Path, base directory for saving result CSVs
        models_base:   Path, base directory for saving trained models
        figures_base:  Path, base directory for saving confusion matrix figures
        dataset_name:  str, name of the dataset (used in plot titles and logs)
        random_state:  int, random seed for reproducibility

    Returns:
        pd.DataFrame: Results table with Accuracy, Precision, Recall, and F1
                      for each model, sorted by Accuracy descending.
                      Returns None if data loading fails or if no model
                      produces results.
    """
    print("\n" + "=" * 60)
    print(f"▶  Training ML Models - {dataset_name}")
    print("=" * 60)

    # ------------------------------------------------------------------
    # Create output directories
    # ------------------------------------------------------------------
    results_out = results_base / "Machine Learning"
    models_out  = models_base
    figures_out = figures_base / "Machine Learning"

    for p in [results_out, models_out, figures_out]:
        p.mkdir(parents=True, exist_ok=True)

    # ------------------------------------------------------------------
    # Load data
    # ------------------------------------------------------------------
    try:
        X_train, y_train, X_test, y_test = load_train_test_data(features_dir)
        print(f"\n  Train samples : {X_train.shape[0]}")
        print(f"  Test  samples : {X_test.shape[0]}")
        print(f"  Features      : {X_train.shape[1]}")
        print(f"  Classes       : {sorted(set(y_train))}\n")
    except FileNotFoundError as e:
        print(f"⚠️  {e}")
        return None

    # ------------------------------------------------------------------
    # Get model definitions
    # ------------------------------------------------------------------
    models = get_ml_models(random_state)
    results = []

    # ------------------------------------------------------------------
    # Train each model
    # ------------------------------------------------------------------
    for name, model in models.items():
        print(f"  - Training {name} ...")

        try:
            # Train
            model.fit(X_train, y_train)

            # Predict
            y_pred = model.predict(X_test)

            # Metrics
            metrics = compute_metrics(y_test, y_pred)
            results.append({"Model": name, **metrics})

            # Confusion matrix
            cm_filename = f"{name.lower().replace(' ', '_')}_confusion.png"
            plot_and_save_confusion(
                y_test, y_pred,
                figures_out / cm_filename,
                f"{name} - {dataset_name}"
            )

            # Save model
            model_filename = f"{name.lower().replace(' ', '_')}.pkl"
            save_model(model, models_out / model_filename)

            print(f"    ✅ Accuracy: {metrics['Accuracy']:.4f}  |  "
                  f"F1: {metrics['F1']:.4f}")

        except Exception as e:
            print(f"    ⚠️  Error training {name}: {e}")
            continue

    # ------------------------------------------------------------------
    # Compile and save results
    # ------------------------------------------------------------------
    if not results:
        print("⚠️  No results were produced.")
        return None

    results_df = pd.DataFrame(results).sort_values("Accuracy", ascending=False)
    results_df.to_csv(results_out / "ml_results.csv", index=False)
    print(f"\n✅ Results saved → {results_out / 'ml_results.csv'}")

    return results_df


print("TRAINING PIPELINE LOADED SUCCESSFULLY")

TRAINING PIPELINE LOADED SUCCESSFULLY


## Part 5: Baymax Model Training Execution

This section executes the full training workflow for the MHP Processed dataset and displays a final summary.

### What This Cell Does:

1. **Prints directory paths**: Confirms the resolved feature input path and all output destinations before training begins.

2. **Executes `train_traditional_ml()`**: Triggers training of all 6 ML models using the preprocessed Baymax `train.csv` and `test.csv`.

3. **Displays the results table**: Prints the full results DataFrame (Model, Accuracy, Precision, Recall, F1) to the notebook output, sorted by Accuracy descending.
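
As a sketch of that descending-Accuracy ordering, the snippet below re-ranks the Accuracy and F1 values copied from the training log recorded at the end of this part. It uses plain Python rather than the notebook's DataFrame, and it omits Precision and Recall because the log does not print them.

```python
# Accuracy and weighted F1 as printed in the training log of this run
logged = [
    {"Model": "Logistic Regression", "Accuracy": 0.9111, "F1": 0.9110},
    {"Model": "Gradient Boosting",   "Accuracy": 0.9136, "F1": 0.9131},
    {"Model": "KNN",                 "Accuracy": 0.9086, "F1": 0.9074},
    {"Model": "Random Forest",       "Accuracy": 0.9235, "F1": 0.9231},
    {"Model": "Decision Tree",       "Accuracy": 0.8494, "F1": 0.8496},
    {"Model": "SVM",                 "Accuracy": 0.9259, "F1": 0.9256},
]

# Same ordering rule as the results table: highest Accuracy first
ranked = sorted(logged, key=lambda row: row["Accuracy"], reverse=True)
for row in ranked:
    print(f"{row['Model']:<20} {row['Accuracy']:.4f}  {row['F1']:.4f}")
```

On this single split the top three models sit within about one percentage point of each other (0.9259, 0.9235, 0.9136), so their ordering should not be read as a firm ranking.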

In [5]:
# ============================================================================
# BAYMAX - TRADITIONAL MACHINE LEARNING EXECUTION
# ============================================================================
print("\n" + "=" * 80)
print("=" * 80)
print("  BAYMAX - TRADITIONAL MACHINE LEARNING")
print("=" * 80)
print("=" * 80)

# ------------------------------------------------------------------
# Print resolved directory paths
# ------------------------------------------------------------------
print("\n📁 Directory Paths:")
print(f"   Features  : {FEATURES_DIR}")
print(f"   Results   : {RESULTS_BASE}")
print(f"   Models    : {MODELS_BASE}")
print(f"   Figures   : {FIGURES_BASE}")

# ------------------------------------------------------------------
# Run training pipeline
# ------------------------------------------------------------------
print("\n⏳ Starting Traditional ML Training for Baymax ...")

results_ml = train_traditional_ml(
    features_dir=FEATURES_DIR,
    results_base=RESULTS_BASE,
    models_base=MODELS_BASE,
    figures_base=FIGURES_BASE,
    dataset_name="MHP_Processed",
    random_state=RANDOM_STATE
)

# ------------------------------------------------------------------
# Display results table
# ------------------------------------------------------------------
if results_ml is not None:
    print("\n" + "=" * 80)
    print("  BAYMAX - ML RESULTS SUMMARY")
    print("=" * 80)
    print(results_ml.to_string(index=False))

else:
    print("\n⚠️  No ML results produced for Baymax.")


  BAYMAX - TRADITIONAL MACHINE LEARNING

📁 Directory Paths:
   Features  : d:\Programming\Projects\Baymax\features
   Results   : d:\Programming\Projects\Baymax\results
   Models    : d:\Programming\Projects\Baymax\models
   Figures   : d:\Programming\Projects\Baymax\figures

⏳ Starting Traditional ML Training for Baymax ...

▶  Training ML Models - MHP_Processed

  Train samples : 1617
  Test  samples : 405
  Features      : 26
  Classes       : [np.int64(0), np.int64(1), np.int64(2)]

  - Training Logistic Regression ...
    ✅ Accuracy: 0.9111  |  F1: 0.9110
  - Training Gradient Boosting ...
    ✅ Accuracy: 0.9136  |  F1: 0.9131
  - Training KNN ...
    ✅ Accuracy: 0.9086  |  F1: 0.9074
  - Training Random Forest ...
    ✅ Accuracy: 0.9235  |  F1: 0.9231
  - Training Decision Tree ...
    ✅ Accuracy: 0.8494  |  F1: 0.8496
  - Training SVM ...
    ✅ Accuracy: 0.9259  |  F1: 0.9256

✅ Results saved → d:\Programming\Projects\Baymax\results\Machine Learning\m