MODEL SELECTION

Student depression prediction is a classification problem, so it requires a supervised machine learning algorithm.

Things to consider when choosing a machine learning algorithm:
- Final goal: accuracy, speed, scalability
- Nature of the data: outliers, size, quality, characteristics
- Our data combines categorical features (e.g. Gender, Sleep Duration, ...) and numerical features (e.g. CGPA, Work Hour, ...)
- Constraints, such as computational limitations
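Mixed categorical and numerical columns usually need different handling before they reach a model. The sketch below shows one common way to do that with scikit-learn's ColumnTransformer. It is an illustration only: the toy rows and the category labels are invented, and the preprocessing actually applied to the dataset used later in this notebook may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows: the column names mirror the examples above,
# but every value here is invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Sleep Duration": ["5-6 hours", "7-8 hours", "Less than 5 hours", "7-8 hours"],
    "CGPA": [8.1, 6.5, 7.3, 9.0],
    "Work Hour": [3, 8, 10, 2],
})

categorical_cols = ["Gender", "Sleep Duration"]
numerical_cols = ["CGPA", "Work Hour"]

# One-hot encode the categorical columns and standardize the numerical ones.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numerical_cols),
    ],
    sparse_threshold=0,
)

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)  # 2 + 3 one-hot columns plus 2 scaled columns
```

Scaling matters most for distance-based and margin-based models such as K-Nearest Neighbors and Support Vector Machine, while tree-based models are largely insensitive to it.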



There are several supervised machine learning algorithms that can be used for our task, such as:
- Logistic Regression
- Decision tree
- Random forest
- Support Vector Machine
- Naive Bayes


As this is a classification problem, we use Accuracy, Precision, Recall, and F1 Score to evaluate our models.
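To make the four metrics concrete, here is a small worked example that computes them by hand from confusion-matrix counts and compares the result with scikit-learn. The labels are invented, and the encoding 1 = depressed, 0 = not depressed is an assumption made for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented labels; 1 = depressed, 0 = not depressed (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # depressed, predicted depressed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # not depressed, predicted depressed
fn = sum(t == 1 and p == 0 for t, p in pairs)  # depressed, predicted not depressed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # not depressed, predicted not depressed

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# scikit-learn's binary defaults (pos_label=1) give the same numbers.
sk_metrics = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
print(accuracy, precision, recall, f1)
```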



In [12]:
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [19]:
def evaluate_model(y_true, y_pred):
    """Return accuracy plus class-weighted precision, recall, and F1 for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average='weighted', zero_division=0),
        "recall": recall_score(y_true, y_pred, average='weighted', zero_division=0),
        "f1_score": f1_score(y_true, y_pred, average='weighted', zero_division=0),
    }

def train_and_evaluate_all_models(base_filename, target_column, n_folds=5):
    """Fit every model on each pre-split fold, print per-fold metrics, then print the fold averages."""
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000, n_jobs=-1),
        "K-Nearest Neighbors": KNeighborsClassifier(weights='distance', n_jobs=-1),
        "Gradient Boosting": GradientBoostingClassifier(),
        "AdaBoost": AdaBoostClassifier(),
        # "Voting Classifier": VotingClassifier(estimators=[
        #     ('lr', LogisticRegression(max_iter=1000, n_jobs=-1)),
        #     ('rf', RandomForestClassifier(n_estimators=100)),
        #     ('svc', SVC(probability=True))
        # ], voting='soft',n_jobs=-1),
        "Decision Tree": DecisionTreeClassifier(criterion='entropy'),
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "Support Vector Machine": SVC(),
        "Naive Bayes": GaussianNB()
    }

    results = {name: [] for name in models}

    for fold in range(1, n_folds + 1):
        print(f"\n📁 Fold {fold}")
        train_path = f"../datasets/train/train_fold{fold}_{base_filename}"
        test_path = f"../datasets/test/test_fold{fold}_{base_filename}"

        train_df = pd.read_csv(train_path)
        test_df = pd.read_csv(test_path)

        X_train = train_df.drop(columns=[target_column])
        y_train = train_df[target_column]
        X_test = test_df.drop(columns=[target_column])
        y_test = test_df[target_column]

        for model_name, model in models.items():
            # fit() retrains from scratch, so the same estimator object can be reused across folds
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            metrics = evaluate_model(y_test, y_pred)
            results[model_name].append(metrics)

            print(f"🔹 {model_name}")
            print(f"  Accuracy:  {metrics['accuracy']:.4f}")
            print(f"  Precision: {metrics['precision']:.4f}")
            print(f"  Recall:    {metrics['recall']:.4f}")
            print(f"  F1 Score:  {metrics['f1_score']:.4f}")

    # Calculate average metrics
    print("\n📊 Average Performance Across Folds:")
    for model_name in models:
        avg_metrics = {
            metric: sum(r[metric] for r in results[model_name]) / n_folds
            for metric in ["accuracy", "precision", "recall", "f1_score"]
        }
        print(f"\n🔹 {model_name}")
        for metric, score in avg_metrics.items():
            print(f"  {metric.capitalize()}: {score:.4f}")
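The function above only prints the averages. If a sortable table is more convenient, the per-fold metric dictionaries could be averaged with pandas, as in the sketch below. The `results` dictionary here is a placeholder with the same shape as the one built inside the function; the model names and numbers are invented and are not results from this experiment.

```python
import pandas as pd

# Placeholder per-fold metrics in the same shape as `results` above.
# The values are invented for illustration only.
results = {
    "Model A": [{"accuracy": 0.80, "recall": 0.70}, {"accuracy": 0.90, "recall": 0.80}],
    "Model B": [{"accuracy": 0.85, "recall": 0.90}, {"accuracy": 0.75, "recall": 0.80}],
}

# Average each metric over the folds, one row per model, best recall first.
summary = pd.DataFrame(
    {name: pd.DataFrame(folds).mean() for name, folds in results.items()}
).T.sort_values("recall", ascending=False)

print(summary)
```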

In [20]:
train_and_evaluate_all_models(
    base_filename="preprocessed_student_depression.csv",
    target_column="Depression",  # label column in the preprocessed CSV
    n_folds=5
)


📁 Fold 1
🔹 Logistic Regression
  Accuracy:  0.8357
  Precision: 0.8351
  Recall:    0.8357
  F1 Score:  0.8351
🔹 K-Nearest Neighbors
  Accuracy:  0.7866
  Precision: 0.7855
  Recall:    0.7866
  F1 Score:  0.7848
🔹 Gradient Boosting
  Accuracy:  0.8319
  Precision: 0.8313
  Recall:    0.8319
  F1 Score:  0.8313
🔹 AdaBoost
  Accuracy:  0.8324
  Precision: 0.8319
  Recall:    0.8324
  F1 Score:  0.8315
🔹 Decision Tree
  Accuracy:  0.7691
  Precision: 0.7695
  Recall:    0.7691
  F1 Score:  0.7692
🔹 Random Forest
  Accuracy:  0.8276
  Precision: 0.8269
  Recall:    0.8276
  F1 Score:  0.8269
🔹 Support Vector Machine
  Accuracy:  0.8362
  Precision: 0.8358
  Recall:    0.8362
  F1 Score:  0.8352
🔹 Naive Bayes
  Accuracy:  0.6649
  Precision: 0.7758
  Recall:    0.6649
  F1 Score:  0.6517

📁 Fold 2
🔹 Logistic Regression
  Accuracy:  0.8420
  Precision: 0.8414
  Recall:    0.8420
  F1 Score:  0.8412
🔹 K-Nearest Neighbors
  Accuracy:  0.7860
  Precision: 0.7847
  Recall:    0.7860
  F1 Score

CHOOSING THE METRIC TO DECIDE WHICH MODEL WORKS BEST FOR THE DATA

- As our goal is to predict whether a student has depression, we would rather tolerate wrong predictions for students who do not have depression (false positives) than leave actual depression cases predicted as 'not depression' (false negatives). This way, students who may have depression can receive an early mental health warning.
- So, to choose the most suitable model for the task, we pick the model with the highest "Recall".


===> We choose the Support Vector Machine model for the task.
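One point worth checking when applying the recall criterion: recall computed with average='weighted', as in evaluate_model, is mathematically equal to accuracy, which is why the Accuracy and Recall values in the output above match for every model. The false-negative argument is about recall on the depressed class specifically. The sketch below, using invented labels and assuming the depressed class is encoded as 1, shows the difference.

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented labels; 1 = depressed is an assumed encoding of the target column.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)

# Weighted recall averages per-class recall by class frequency,
# which reduces to (correct predictions) / (all samples), i.e. accuracy.
weighted_recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)

# Recall on the depressed class only: the share of depressed students detected.
positive_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(accuracy, weighted_recall, positive_recall)
```

Ranking models by the positive-class value matches the stated goal of missing as few depression cases as possible.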
