# Machine Learning: Predicting Mental Health Conditions & Suicide Risk

This notebook focuses on training predictive models to identify students struggling with depression and those at high risk for crisis.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('Student Mental health.csv')

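# Fill missing 'Age' values with the median.
# Assigning the result back avoids the chained inplace fillna, which pandas 2.x warns about.
df['Age'] = df['Age'].fillna(df['Age'].median())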

## 1. Feature Engineering & Pre-processing

To build a robust model, we must convert categorical text into numerical formats.

In [2]:
binary_map = {'Yes': 1, 'No': 0}
df['Anxiety_Bin'] = df['Do you have Anxiety?'].map(binary_map)
df['Panic_Bin'] = df['Do you have Panic attack?'].map(binary_map)
df['Specialist_Bin'] = df['Did you seek any specialist for a treatment?'].map(binary_map)
# Note: Treatment_History is read from the same column as Specialist_Bin, so the two are identical
df['Treatment_History'] = df['Did you seek any specialist for a treatment?'].map(binary_map)
df['Marital_Bin'] = df['Marital status'].map(binary_map)

# The same encoder object is refit for each column, so `le` ends up holding only the CGPA mapping
le = LabelEncoder()
df['Gender_Encoded'] = le.fit_transform(df['Choose your gender'])
df['Course_Encoded'] = le.fit_transform(df['What is your course?'])
df['Year_Encoded'] = le.fit_transform(df['Your current year of Study'])
df['CGPA_Encoded'] = le.fit_transform(df['What is your CGPA?'])

# Define Target 1: Mental Health Condition (Depression)
y_condition = df['Do you have Depression?'].map(binary_map)

# Define Target 2: Suicide Risk Level (proxy score: number of reported conditions among anxiety, panic attacks, and depression)
df['Risk_Score'] = df['Anxiety_Bin'] + df['Panic_Bin'] + y_condition
y_risk = df['Risk_Score'] # 0: Low, 1: Moderate, 2: Significant, 3: High

# Features for Model 1
X = df[['Gender_Encoded', 'Age', 'Course_Encoded', 'Year_Encoded', 'CGPA_Encoded', 'Marital_Bin', 'Anxiety_Bin', 'Panic_Bin']]
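
Survey text columns often contain inconsistent casing or stray whitespace, and a single reused `LabelEncoder` keeps only the mapping from its last fit. The sketch below works on a small hypothetical frame (the values are illustrative, not taken from the dataset). It normalises each column and keeps one encoder per column, so that codes can later be mapped back to labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; the real survey values may differ.
toy = pd.DataFrame({
    "Your current year of Study": ["year 1", "Year 1", "year 2 "],
    "Choose your gender": ["Female", "Male", "Female"],
})
text_cols = ["Your current year of Study", "Choose your gender"]

encoders = {}  # one fitted encoder per column, kept for inverse_transform
for col in text_cols:
    cleaned = toy[col].str.strip().str.lower()  # collapse case/whitespace variants
    enc = LabelEncoder()
    toy[col + "_Encoded"] = enc.fit_transform(cleaned)
    encoders[col] = enc

print(toy["Your current year of Study_Encoded"].tolist())
print(list(encoders["Choose your gender"].classes_))
```

Keeping the encoders in a dictionary makes `encoders[col].inverse_transform(...)` available for every column, which helps when reading model output back in terms of the original categories.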

## 2. Model 1: Predicting Depression (Mental Health Condition)

We use a **Random Forest Classifier** to predict whether a student reports depression, using demographics plus the anxiety and panic-attack indicators as features.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y_condition, test_size=0.2, random_state=42, stratify=y_condition)

model_dep = RandomForestClassifier(n_estimators=100, random_state=42)
model_dep.fit(X_train, y_train)
y_pred = model_dep.predict(X_test)

print("--- Depression Prediction Performance ---")
print(classification_report(y_test, y_pred, target_names=['No Depression', 'Depression']))
print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.2f}")

--- Depression Prediction Performance ---
               precision    recall  f1-score   support

No Depression       0.82      1.00      0.90        14
   Depression       1.00      0.57      0.73         7

     accuracy                           0.86        21
    macro avg       0.91      0.79      0.82        21
 weighted avg       0.88      0.86      0.84        21

Overall Accuracy: 0.86
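
With only 21 test rows, each individual prediction moves accuracy by roughly five points, so a single split gives a noisy estimate. One common alternative is stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y_condition` (an assumption, so that the block runs on its own) and scores recall on the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's X and y_condition (assumption):
# about 100 rows, 8 features, roughly one third positives.
X_demo, y_demo = make_classification(
    n_samples=101, n_features=8, n_informative=4,
    weights=[0.67, 0.33], random_state=42,
)

# Each fold keeps approximately the same class ratio as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Recall on the positive class, one score per fold
recall_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="recall")
print(f"Recall per fold: {np.round(recall_scores, 2)}")
print(f"Mean recall: {recall_scores.mean():.2f} (std {recall_scores.std():.2f})")
```

In the notebook itself, passing `X` and `y_condition` in place of the synthetic arrays would give a per-fold spread for the depression model.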


## 3. Model 2: Predicting Suicide Risk Level

This model predicts the proxy risk level (0-3) from student demographics and academic status alone.

In [4]:
# Features for Risk Prediction (demographics only to avoid leakage from indicators)
X_risk = df[['Gender_Encoded', 'Age', 'Course_Encoded', 'Year_Encoded', 'CGPA_Encoded', 'Marital_Bin']]
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_risk, y_risk, test_size=0.2, random_state=42)

model_risk = RandomForestClassifier(n_estimators=100, random_state=42)
model_risk.fit(X_train_r, y_train_r)
y_pred_r = model_risk.predict(X_test_r)

print("--- Suicide Risk Level Prediction Performance ---")
print(classification_report(y_test_r, y_pred_r))
print(f"Overall Accuracy: {accuracy_score(y_test_r, y_pred_r):.2f}")

--- Suicide Risk Level Prediction Performance ---
              precision    recall  f1-score   support

           0       0.38      0.38      0.38         8
           1       0.20      0.29      0.24         7
           2       0.67      0.33      0.44         6

    accuracy                           0.33        21
   macro avg       0.41      0.33      0.35        21
weighted avg       0.40      0.33      0.35        21

Overall Accuracy: 0.33
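
A useful sanity check for a multi-class result is a majority-class baseline. The sketch below rebuilds the test labels from the support column of the report above (8, 7, and 6 rows for levels 0, 1, and 2) and computes the accuracy of always predicting the most frequent level. One assumption to note: the baseline is chosen from the test labels for illustration, whereas in practice it would be chosen from the training labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Test labels rebuilt from the reported support: 8 x level 0, 7 x level 1, 6 x level 2
y_true = np.array([0] * 8 + [1] * 7 + [2] * 6)

# Baseline: always predict the most frequent level in these labels
majority_level = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_level)

baseline_acc = accuracy_score(y_true, y_baseline)  # 8 / 21
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
print(f"Reported model accuracy:          {0.33:.2f}")
```

Under this reconstruction the baseline comes out at about 0.38, slightly above the reported 0.33, which suggests the demographic features add little predictive signal for this target.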


## 📈 Explanation of Results

1. **Accuracy**: The share of predictions that are correct. It can mislead on imbalanced data: 14 of the 21 test students report no depression, so a model that always predicts 'No Depression' would already score 0.67.
2. **Precision**: Of the students the model flags as depressed, the share who actually are. Precision of 1.00 for 'Depression' means the test set produced no false alarms.
3. **Recall**: Of the students who are actually depressed, the share the model catches. This is the critical metric in mental health screening, and it is the weak point here: recall of 0.57 means the model found 4 of the 7 depressed students and missed 3.
4. **F1-Score**: The harmonic mean of precision and recall. It is more informative than accuracy on imbalanced classes such as these, because it drops when either precision or recall is low.
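
The depression report pins down the underlying counts: recall of 1.00 on 14 negatives and 0.57 on 7 positives implies 14 true negatives, 0 false positives, 3 false negatives, and 4 true positives. The sketch below rebuilds label arrays with those counts (the row order is arbitrary, an assumption that does not affect the metrics) and recomputes each metric from its definition.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels rebuilt from the report: 14 'No Depression' (0), 7 'Depression' (1)
y_true = np.array([0] * 14 + [1] * 7)
# Predictions: all negatives correct, 4 of 7 positives caught, 3 missed
y_pred = np.array([0] * 14 + [1] * 4 + [0] * 3)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

precision = tp / (tp + fp)  # flagged students who are actually depressed
recall = tp / (tp + fn)     # depressed students who were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The recomputed values match the report for the 'Depression' class (1.00, 0.57, 0.73) and the overall accuracy of 0.86.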

### Insights
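- The **Depression Model** performs better in this run (0.86 accuracy versus 0.33) because its features include direct symptom indicators (anxiety and panic attacks) in addition to demographics.
- The **Suicide Risk Level** model is harder to train because it relies purely on demographic factors. Its accuracy of 0.33 on a test set containing three risk levels is about what random guessing would give, which suggests that demographic profiling alone is insufficient for risk assessment without symptomatic data.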
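
One way to probe the first insight is to inspect which features a forest relies on. The sketch below is illustrative only: it trains on synthetic data in which a hypothetical `Anxiety_Bin` column drives the target, so the resulting numbers say nothing about the real survey. In the notebook, pairing `model_dep.feature_importances_` with `X.columns` would give the corresponding ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 200  # synthetic sample size (assumption)

# Synthetic features: two noise columns and one indicator tied to the target
X_demo = pd.DataFrame({
    "Age": rng.integers(18, 25, size=n),
    "Gender_Encoded": rng.integers(0, 2, size=n),
    "Anxiety_Bin": rng.integers(0, 2, size=n),
})
# Target copies Anxiety_Bin, with about 10% of labels flipped as noise
flip = rng.random(n) < 0.10
y_demo = np.where(flip, 1 - X_demo["Anxiety_Bin"], X_demo["Anxiety_Bin"])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).round(3))
```

Impurity-based importances can favour features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check.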