# Model Training (Machine Learning)

We train six traditional ML classifiers on each of nine datasets, one per feature-selection method:

1. **Recursive Feature Elimination**
2. **Select K Best**
3. **Fisher Score Chi-Square**
4. **Extra Trees Classifier**
5. **Pearson Correlation**
6. **Mutual Information**
7. **Mutual Info Regression**
8. **Manual Uniqueness**
9. **Variance Threshold**

On each dataset we train:
- Logistic Regression  
- Gradient Boosting Classifier  
- K-Nearest Neighbours  
- Random Forest Classifier  
- Decision Tree Classifier  
- Support Vector Machine  

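Metrics → `Accuracy`, `Precision`, `Recall`, `F1` (precision, recall, and F1 are support-weighted averages across classes)  
Visuals → Confusion matrices, one per model and feature set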
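
The helper functions later in this notebook compute precision, recall, and F1 with `average="weighted"`. One consequence worth knowing when reading the results: support-weighted recall is algebraically equal to accuracy, which is why the Recall and Accuracy columns coincide in the summary table. The sketch below checks this on a small set of toy labels (the labels are hypothetical and are not taken from the study's data).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy labels, used only to illustrate the metric definitions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)

# average="weighted" computes the metric per class, then averages the
# per-class values weighted by the number of true samples in each class.
prec_w = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec_w = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1_w = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={acc:.4f} precision={prec_w:.4f} recall={rec_w:.4f} f1={f1_w:.4f}")

# Weighted recall reduces to (total correct) / (total samples), i.e. accuracy.
recall_equals_accuracy = abs(acc - rec_w) < 1e-12
```

Because of this identity, the Recall column adds no information beyond Accuracy under weighted averaging; per-class or macro-averaged recall would be needed to see how each class fares separately.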

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

FEATURES_BASE = Path("../data/processed/features2")
PROC_BASE = Path("../data/processed/ml2")
MODEL_BASE = Path("../models/ml2")
FIG_BASE = Path("../figures/ml2")

for p in [PROC_BASE, MODEL_BASE, FIG_BASE]:
    p.mkdir(parents=True, exist_ok=True)

METHODS = ["rfe","skb","fscs","etc","pc","mi","mir","mu","vt"]
RANDOM_STATE = 42

## Helper Functions

In [2]:
def compute_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }

def save_confusion(y_true, y_pred, path, title):
    """Plot the confusion matrix as an annotated heatmap and save it to `path`."""
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    ax.set_title(title)
    fig.savefig(path, dpi=300, bbox_inches="tight")
    plt.close(fig)  # close the figure so repeated calls do not keep figures open
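
To read the saved heatmaps, it helps to recall how `confusion_matrix` lays out its result: rows are true labels and columns are predicted labels, matching the axis labels set in `save_confusion`. The following sketch uses hypothetical toy labels and assumes a binary target encoded as 0/1 (an assumption; the notebook does not state the number of classes in `DepressionEncoded`).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels; assumes a binary 0/1 target.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Fixing `labels` keeps the row/column order stable even if a class
# happens to be absent from a particular test split.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For the binary case, ravel() flattens row by row:
# [true 0 -> pred 0, true 0 -> pred 1, true 1 -> pred 0, true 1 -> pred 1]
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The four-way unpacking only applies to two classes; with more classes the matrix is still valid, but it has to be read cell by cell.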

## Model Definitions

In [3]:
MODELS = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    "Gradient Boosting": GradientBoostingClassifier(random_state=RANDOM_STATE),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=RANDOM_STATE),
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "SVM": SVC(probability=True, random_state=RANDOM_STATE)
}
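
The training loop below refits the same estimator objects in `MODELS` for every feature set. That works because `fit` re-estimates the model from scratch by default, but an optional variation is to treat the dictionary as a set of templates and take a fresh, unfitted copy per feature set with `sklearn.base.clone`. This is a sketch of that alternative, not what the notebook does; `TEMPLATES` and `fresh_models` are hypothetical names introduced here.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Hypothetical template dictionary (a two-model subset, for brevity).
TEMPLATES = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
}

def fresh_models(templates):
    """Return unfitted copies that share the templates' hyperparameters."""
    return {name: clone(est) for name, est in templates.items()}

models_a = fresh_models(TEMPLATES)
models_b = fresh_models(TEMPLATES)

# Fitting one copy leaves the other copies and the templates untouched.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]
models_a["Decision Tree"].fit(X, y)
```

With this pattern, each saved `.pkl` file holds an object that was fitted on exactly one feature set and never touched again.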

## Train All Six Models Across Nine Feature Sets

In [4]:
for method in METHODS:
    print("\n" + "="*60)
    print(f"Training for feature set: {method.upper()}")
    in_dir = FEATURES_BASE / method
    if not in_dir.exists():
        print("Missing feature folder:", in_dir)
        continue

    train_df = pd.read_csv(in_dir / "train.csv").dropna(subset=["DepressionEncoded"])
    test_df = pd.read_csv(in_dir / "test.csv").dropna(subset=["DepressionEncoded"])

    X_train = train_df.drop(columns=["DepressionEncoded"])
    y_train = train_df["DepressionEncoded"].astype(int)
    X_test = test_df.drop(columns=["DepressionEncoded"])
    y_test = test_df["DepressionEncoded"].astype(int)

    results = []
    proc_out = PROC_BASE / method
    model_out = MODEL_BASE / method
    fig_out = FIG_BASE / method
    for out_dir in [proc_out, model_out, fig_out]:
        out_dir.mkdir(parents=True, exist_ok=True)

    for name, model in MODELS.items():
        print(" -", name)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        metrics = compute_metrics(y_test, y_pred)
        results.append({"Model": name, **metrics})

        # Save the confusion-matrix figure and the fitted model under a shared file stem
        stem = name.lower().replace(" ", "_")
        save_confusion(y_test, y_pred, fig_out / f"{stem}_confusion.png", f"{name} Confusion ({method})")
        joblib.dump(model, model_out / f"{stem}.pkl")

    pd.DataFrame(results).to_csv(proc_out / "results_traditional_ml.csv", index=False)
    print("Saved metrics to:", proc_out / "results_traditional_ml.csv")


Training for feature set: RFE
 - Logistic Regression
 - Gradient Boosting
 - KNN
 - Random Forest
 - Decision Tree
 - SVM
Saved metrics to: ..\data\processed\ml2\rfe\results_traditional_ml.csv

Training for feature set: SKB
 - Logistic Regression
 - Gradient Boosting
 - KNN
 - Random Forest
 - Decision Tree
 - SVM
Saved metrics to: ..\data\processed\ml2\skb\results_traditional_ml.csv

Training for feature set: FSCS
 - Logistic Regression
 - Gradient Boosting
 - KNN
 - Random Forest
 - Decision Tree
 - SVM
Saved metrics to: ..\data\processed\ml2\fscs\results_traditional_ml.csv

Training for feature set: ETC
 - Logistic Regression
 - Gradient Boosting
 - KNN
 - Random Forest
 - Decision Tree
 - SVM
Saved metrics to: ..\data\processed\ml2\etc\results_traditional_ml.csv

Training for feature set: PC
 - Logistic Regression
 - Gradient Boosting
 - KNN
 - Random Forest
 - Decision Tree
 - SVM
Saved metrics to: ..\data\processed\ml2\pc\results_traditional_ml.csv

Training for feature set: MI
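
Each fitted estimator is written to disk with `joblib.dump`. A minimal sketch of the round trip is shown here: fit a model, save it, reload it with `joblib.load`, and confirm that the restored object predicts identically. The data is synthetic and the file lives in a temporary directory, both assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook reads train.csv / test.csv.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=1000, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logistic_regression.pkl"
    joblib.dump(model, path)       # same call the training loop uses
    restored = joblib.load(path)   # returns the fitted estimator

# The restored estimator should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Round trip preserved predictions:", same_predictions)
```

When reloading a model for real use, the input must contain the same feature columns, in the same order, as the feature set the model was trained on, and the pickle should be loaded with a compatible scikit-learn version.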


## Summary of Model Performance Across All Feature Sets

In [5]:
all_results = []
for method in METHODS:
    res_path = PROC_BASE / method / "results_traditional_ml.csv"
    if res_path.exists():
        df = pd.read_csv(res_path)
        df["Feature Set"] = method.upper()
        all_results.append(df)
    else:
        print("Missing:", method)
if all_results:
    combined = pd.concat(all_results, ignore_index=True).sort_values(["Feature Set","Accuracy"], ascending=[True,False])
    display(combined)
    combined.to_csv(PROC_BASE / "all_model_results_summary_v3.csv", index=False)
    print("Saved combined summary:", PROC_BASE / "all_model_results_summary_v3.csv")
else:
    print("No results found.")

|    | Model               | Accuracy | Precision | Recall   | F1       | Feature Set |
|----|---------------------|----------|-----------|----------|----------|-------------|
| 18 | Logistic Regression | 0.869136 | 0.87092   | 0.869136 | 0.868595 | ETC         |
| 23 | SVM                 | 0.854321 | 0.855699  | 0.854321 | 0.854574 | ETC         |
| 21 | Random Forest       | 0.844444 | 0.845512  | 0.844444 | 0.844046 | ETC         |
| 19 | Gradient Boosting   | 0.837037 | 0.838447  | 0.837037 | 0.837135 | ETC         |
| 20 | KNN                 | 0.785185 | 0.7873    | 0.785185 | 0.784895 | ETC         |
| 22 | Decision Tree       | 0.725926 | 0.720429  | 0.725926 | 0.720491 | ETC         |
| 12 | Logistic Regression | 0.869136 | 0.872062  | 0.869136 | 0.869049 | FSCS        |
| 17 | SVM                 | 0.844444 | 0.8474    | 0.844444 | 0.844418 | FSCS        |
| 13 | Gradient Boosting   | 0.82716  | 0.831728  | 0.82716  | 0.826697 | FSCS        |
| 15 | Random Forest       | 0.822222 | 0.823659  | 0.822222 | 0.821978 | FSCS        |


Saved combined summary: ..\data\processed\ml2\all_model_results_summary_v3.csv
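
A natural follow-up to the combined summary is to pull out the top-scoring model for each feature set. The sketch below does this with `groupby(...).idxmax()`. To stay self-contained it rebuilds a small frame from a few of the rows printed above instead of reading `all_model_results_summary_v3.csv`; the variable `best` is a name introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the summary output above (ETC and FSCS only).
combined = pd.DataFrame(
    {
        "Model": ["Logistic Regression", "SVM", "Random Forest",
                  "Logistic Regression", "SVM", "Random Forest"],
        "Accuracy": [0.869136, 0.854321, 0.844444,
                     0.869136, 0.844444, 0.822222],
        "Feature Set": ["ETC", "ETC", "ETC", "FSCS", "FSCS", "FSCS"],
    }
)

# idxmax returns, per feature set, the row label of the highest accuracy;
# .loc then selects those rows from the full frame.
best_idx = combined.groupby("Feature Set")["Accuracy"].idxmax()
best = combined.loc[best_idx].reset_index(drop=True)
print(best)
```

If two models tie on accuracy within a feature set, `idxmax` keeps the first one it encounters, so a secondary sort key such as F1 may be worth adding before drawing conclusions.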
