# Model Training (Traditional Machine Learning Models)

In this notebook, we train and evaluate six classical machine learning models on three different feature sets.

**Models used**
- Logistic Regression  
- Gradient Boosting Classifier  
- K-Nearest Neighbour (KNN)  
- Random Forest Classifier  
- Decision Tree Classifier  
- Support Vector Machine (SVM)

**Feature sets**
1️⃣ Feature Set 1 → Top 7 features from each of PSS-10, GAD-7, and PHQ-9 (21 total)  
2️⃣ Feature Set 2 → All PSS-10 + All PHQ-9 (19 total)  
3️⃣ Feature Set 3 → All GAD-7 + All PHQ-9 (17 total)

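Each model is evaluated on:
- Accuracy
- Precision (weighted average across classes)
- Recall (weighted average across classes)
- F1 Score (weighted average across classes)

A **confusion matrix** is also plotted and saved for each model for visual assessment.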
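
The helper functions in this notebook compute precision, recall, and F1 with `average="weighted"`. The sketch below uses made-up toy labels (not the study data) to show what that averaging does: each class's score is weighted by its number of true samples. One consequence is that weighted recall always equals accuracy, which is why those two columns match in every results table.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (hypothetical, not from the dataset): 3 classes with unequal support.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# Per-class precision, then weight each class by its share of true samples.
per_class = precision_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true)
manual_weighted = np.sum(per_class * support) / support.sum()

# The same quantity computed directly by scikit-learn.
weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)

# Weighted recall reduces to (total correct) / (total samples), i.e. accuracy.
weighted_recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
acc = accuracy_score(y_true, y_pred)

print(f"weighted precision: {weighted:.4f} (manual: {manual_weighted:.4f})")
print(f"weighted recall:    {weighted_recall:.4f}  accuracy: {acc:.4f}")
```

If per-class behaviour matters more than the aggregate, `average=None` returns one score per class instead of a single weighted number.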


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Directories
DATA_DIR = Path("../data/processed")
MODEL_DIR = Path("../models")
FIG_DIR = Path("../figures")

MODEL_DIR.mkdir(parents=True, exist_ok=True)
FIG_DIR.mkdir(parents=True, exist_ok=True)

# Load feature sets
fs1_train = pd.read_csv(DATA_DIR / "fs1_train.csv")
fs1_test  = pd.read_csv(DATA_DIR / "fs1_test.csv")
fs2_train = pd.read_csv(DATA_DIR / "fs2_train.csv")
fs2_test  = pd.read_csv(DATA_DIR / "fs2_test.csv")
fs3_train = pd.read_csv(DATA_DIR / "fs3_train.csv")
fs3_test  = pd.read_csv(DATA_DIR / "fs3_test.csv")

print("✅ Feature sets loaded successfully!")

✅ Feature sets loaded successfully!


## Helper Functions for Model Training, Evaluation, and Saving
We define functions to:
- Train all models,
- Evaluate performance (Accuracy, Precision, Recall, F1),
- Plot and save confusion matrices,
- Save trained model files.

In [2]:
def evaluate_model(model, X_test, y_test, model_name, feature_set_name):
    """Compute metrics, show & save confusion matrix."""
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average="weighted", zero_division=0)
    rec = recall_score(y_test, y_pred, average="weighted", zero_division=0)
    f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_xlabel("Predicted Label")
    ax.set_ylabel("True Label")
    ax.set_title(f"Confusion Matrix - {model_name} ({feature_set_name})")

    # Save confusion matrix
    fig_filename = f"confusion_{feature_set_name.lower()}_{model_name.lower().replace(' ', '_')}.png"
    fig_path = FIG_DIR / fig_filename
    fig.savefig(fig_path, bbox_inches="tight", dpi=300)
    plt.close(fig)
    print(f"🖼️ Saved confusion matrix: {fig_path}")

    return {
        "Model": model_name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1
    }


def train_and_evaluate_all_models(X_train, y_train, X_test, y_test, feature_set_name):
    """Train six ML models, evaluate, and save each model + confusion matrix."""
    models = {
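        # multi_class is omitted: "auto" was already the default, and the
        # parameter is deprecated in recent scikit-learn releases.
        "Logistic Regression": LogisticRegression(max_iter=1000, solver="lbfgs"),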
        "Gradient Boosting": GradientBoostingClassifier(random_state=42),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "SVM": SVC(kernel="rbf", probability=True, random_state=42)
    }

    results = []
    for name, model in models.items():
        print(f"\n🔹 Training {name} on {feature_set_name} ...")
        model.fit(X_train, y_train)

        # Save the trained model
        model_filename = f"{feature_set_name.lower()}_{name.lower().replace(' ', '_')}.joblib"
        model_path = MODEL_DIR / model_filename
        joblib.dump(model, model_path)
        print(f"💾 Saved model: {model_path}")

        metrics = evaluate_model(model, X_test, y_test, name, feature_set_name)
        results.append(metrics)

    return pd.DataFrame(results)
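
Each fitted model is written to disk with `joblib.dump`, so it can be reloaded later without retraining. The following is a minimal sketch of that round trip. It uses synthetic stand-in data and a temporary directory (both assumptions made for illustration) rather than the notebook's `fs*_train.csv` files and `MODEL_DIR`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data; the notebook trains on the fs*_train.csv files instead.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Same naming scheme as train_and_evaluate_all_models: <feature set>_<model>.joblib
    model_path = Path(tmp) / "fs1_logistic_regression.joblib"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded estimator should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print("Reloaded model reproduces predictions:", same_predictions)
```

Saved `.joblib` files are best loaded with the same scikit-learn version that produced them, and only from a trusted source, since loading executes pickled code.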

## Feature Set 1: Top 7 from each scale (21 features)

In [3]:
X_train1 = fs1_train.drop(columns=["DepressionEncoded"])
y_train1 = fs1_train["DepressionEncoded"]
X_test1  = fs1_test.drop(columns=["DepressionEncoded"])
y_test1  = fs1_test["DepressionEncoded"]

results_fs1 = train_and_evaluate_all_models(X_train1, y_train1, X_test1, y_test1, "FS1")

print("\n📊 Performance on Feature Set 1:")
display(results_fs1.sort_values(by="Accuracy", ascending=False).reset_index(drop=True))


🔹 Training Logistic Regression on FS1 ...
💾 Saved model: ..\models\fs1_logistic_regression.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_logistic_regression.png

🔹 Training Gradient Boosting on FS1 ...
💾 Saved model: ..\models\fs1_gradient_boosting.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_gradient_boosting.png

🔹 Training KNN on FS1 ...
💾 Saved model: ..\models\fs1_knn.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_knn.png

🔹 Training Random Forest on FS1 ...
💾 Saved model: ..\models\fs1_random_forest.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_random_forest.png

🔹 Training Decision Tree on FS1 ...
💾 Saved model: ..\models\fs1_decision_tree.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_decision_tree.png

🔹 Training SVM on FS1 ...
💾 Saved model: ..\models\fs1_svm.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs1_svm.png

📊 Performance on Feature Set 1:


|   | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.866667 | 0.866973 | 0.866667 | 0.866772 |
| 1 | SVM | 0.834568 | 0.836713 | 0.834568 | 0.835062 |
| 2 | Gradient Boosting | 0.817284 | 0.817421 | 0.817284 | 0.817095 |
| 3 | Random Forest | 0.814815 | 0.815327 | 0.814815 | 0.814704 |
| 4 | KNN | 0.750617 | 0.758258 | 0.750617 | 0.748932 |
| 5 | Decision Tree | 0.701235 | 0.698359 | 0.701235 | 0.699357 |


## Feature Set 2: All PSS-10 + All PHQ-9 (19 features)

In [4]:
X_train2 = fs2_train.drop(columns=["DepressionEncoded"])
y_train2 = fs2_train["DepressionEncoded"]
X_test2  = fs2_test.drop(columns=["DepressionEncoded"])
y_test2  = fs2_test["DepressionEncoded"]

results_fs2 = train_and_evaluate_all_models(X_train2, y_train2, X_test2, y_test2, "FS2")

print("\n📊 Performance on Feature Set 2:")
display(results_fs2.sort_values(by="Accuracy", ascending=False).reset_index(drop=True))


🔹 Training Logistic Regression on FS2 ...
💾 Saved model: ..\models\fs2_logistic_regression.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_logistic_regression.png

🔹 Training Gradient Boosting on FS2 ...
💾 Saved model: ..\models\fs2_gradient_boosting.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_gradient_boosting.png

🔹 Training KNN on FS2 ...
💾 Saved model: ..\models\fs2_knn.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_knn.png

🔹 Training Random Forest on FS2 ...
💾 Saved model: ..\models\fs2_random_forest.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_random_forest.png

🔹 Training Decision Tree on FS2 ...
💾 Saved model: ..\models\fs2_decision_tree.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_decision_tree.png

🔹 Training SVM on FS2 ...
💾 Saved model: ..\models\fs2_svm.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs2_svm.png

📊 Performance on Feature Set 2:


|   | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.987654 | 0.988137 | 0.987654 | 0.987555 |
| 1 | SVM | 0.933333 | 0.934411 | 0.933333 | 0.933285 |
| 2 | Gradient Boosting | 0.891358 | 0.89243 | 0.891358 | 0.891494 |
| 3 | Random Forest | 0.881481 | 0.883696 | 0.881481 | 0.882099 |
| 4 | KNN | 0.814815 | 0.820436 | 0.814815 | 0.815259 |
| 5 | Decision Tree | 0.728395 | 0.729909 | 0.728395 | 0.728001 |


## Feature Set 3: All GAD-7 + All PHQ-9 (17 features)

In [5]:
X_train3 = fs3_train.drop(columns=["DepressionEncoded"])
y_train3 = fs3_train["DepressionEncoded"]
X_test3  = fs3_test.drop(columns=["DepressionEncoded"])
y_test3  = fs3_test["DepressionEncoded"]

results_fs3 = train_and_evaluate_all_models(X_train3, y_train3, X_test3, y_test3, "FS3")

print("\n📊 Performance on Feature Set 3:")
display(results_fs3.sort_values(by="Accuracy", ascending=False).reset_index(drop=True))


🔹 Training Logistic Regression on FS3 ...
💾 Saved model: ..\models\fs3_logistic_regression.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_logistic_regression.png

🔹 Training Gradient Boosting on FS3 ...
💾 Saved model: ..\models\fs3_gradient_boosting.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_gradient_boosting.png

🔹 Training KNN on FS3 ...
💾 Saved model: ..\models\fs3_knn.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_knn.png

🔹 Training Random Forest on FS3 ...
💾 Saved model: ..\models\fs3_random_forest.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_random_forest.png

🔹 Training Decision Tree on FS3 ...
💾 Saved model: ..\models\fs3_decision_tree.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_decision_tree.png

🔹 Training SVM on FS3 ...
💾 Saved model: ..\models\fs3_svm.joblib
🖼️ Saved confusion matrix: ..\figures\confusion_fs3_svm.png

📊 Performance on Feature Set 3:


|   | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.985185 | 0.985368 | 0.985185 | 0.985106 |
| 1 | SVM | 0.948148 | 0.949366 | 0.948148 | 0.948229 |
| 2 | Gradient Boosting | 0.88642 | 0.888041 | 0.88642 | 0.886795 |
| 3 | Random Forest | 0.871605 | 0.872207 | 0.871605 | 0.871802 |
| 4 | KNN | 0.817284 | 0.818993 | 0.817284 | 0.817195 |
| 5 | Decision Tree | 0.733333 | 0.730368 | 0.733333 | 0.730842 |
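
With three separate result tables, a side-by-side view makes the feature sets easier to compare. The sketch below pivots the accuracy values into one row per model and one column per feature set. To stay self-contained it hard-codes the top three models' accuracies copied from the tables in this notebook. The variable names `long_df`, `comparison`, and `best_per_set` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Accuracy values copied from the three result tables in this notebook
# (top three models only, to keep the example short).
rows = [
    ("FS1", "Logistic Regression", 0.866667),
    ("FS1", "SVM", 0.834568),
    ("FS1", "Gradient Boosting", 0.817284),
    ("FS2", "Logistic Regression", 0.987654),
    ("FS2", "SVM", 0.933333),
    ("FS2", "Gradient Boosting", 0.891358),
    ("FS3", "Logistic Regression", 0.985185),
    ("FS3", "SVM", 0.948148),
    ("FS3", "Gradient Boosting", 0.886420),
]
long_df = pd.DataFrame(rows, columns=["Feature Set", "Model", "Accuracy"])

# One row per model, one column per feature set.
comparison = long_df.pivot(index="Model", columns="Feature Set", values="Accuracy")

# Name of the most accurate model within each feature set.
best_per_set = comparison.idxmax()

print(comparison.round(3))
print(best_per_set)
```

Inside the notebook, the same long-format frame could presumably be built from the existing results, for example by tagging each of `results_fs1`, `results_fs2`, and `results_fs3` with its feature set name and passing them to `pd.concat`.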


## Save Performance Results
The performance tables for all three feature sets are saved as CSV files in `../data/processed/` for documentation and for later comparison with deep learning models.

In [6]:
results_fs1.to_csv(DATA_DIR / "results_fs1_traditional.csv", index=False)
results_fs2.to_csv(DATA_DIR / "results_fs2_traditional.csv", index=False)
results_fs3.to_csv(DATA_DIR / "results_fs3_traditional.csv", index=False)

print("✅ All results saved to data/processed/")

✅ All results saved to data/processed/
