# Advanced Feature Engineering for Mental Health Prediction

In this notebook, we transform raw survey responses into numeric features for a mental health classifier. Each feature combines several related answers into a single value, with the aim of reducing noise and giving the machine learning model a stronger signal than the individual Yes/No columns.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('Student Mental health.csv')

# Pre-processing Helper: Convert Yes/No to 1/0
binary_map = {'Yes': 1, 'No': 0}
df['Depression_Bin'] = df['Do you have Depression?'].map(binary_map)
df['Anxiety_Bin'] = df['Do you have Anxiety?'].map(binary_map)
df['Panic_Bin'] = df['Do you have Panic attack?'].map(binary_map)
df['Specialist_Bin'] = df['Did you seek any specialist for a treatment?'].map(binary_map)

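# Map each CGPA band to an ordinal weight (1 = lowest band, 5 = highest band)
cgpa_weight = {
    '0 - 1.99': 1,
    '2.00 - 2.49': 2,
    '2.50 - 2.99': 3,
    '3.00 - 3.49': 4,
    '3.50 - 4.00': 5
}
# strip() guards against stray whitespace around the band labels before mapping
df['Academic_Pressure_Weight'] = df['What is your CGPA?'].str.strip().map(cgpa_weight)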
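
Because `.map()` returns `NaN` for any label that is missing from the dictionary, it can be worth checking the mappings before building features on top of them. The following is a minimal sketch, assuming a hypothetical helper `unmapped_values` and a toy series in place of the real CSV column:

```python
import pandas as pd

cgpa_weight = {
    '0 - 1.99': 1,
    '2.00 - 2.49': 2,
    '2.50 - 2.99': 3,
    '3.00 - 3.49': 4,
    '3.50 - 4.00': 5,
}

def unmapped_values(raw, mapping):
    """Return the distinct non-null labels that the mapping does not cover.

    Hypothetical helper for illustration; it applies the same strip()
    cleaning as the notebook before comparing against the mapping keys.
    """
    cleaned = raw.dropna().astype(str).str.strip()
    return sorted(set(cleaned) - set(mapping))

# Toy data (not taken from the survey): one label with trailing whitespace,
# one label written in a different format, and one missing answer.
toy = pd.Series(['3.00 - 3.49', '3.50 - 4.00 ', '3.5 - 4.0', None])
missing = unmapped_values(toy, cgpa_weight)
print(missing)  # ['3.5 - 4.0']
```

An empty list means every non-null label will receive a weight. Anything else would silently become `NaN` in `Academic_Pressure_Weight`.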

## 1. Stress Score

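**Logic**: `Stress_Score = (Anxiety_Bin + Panic_Bin + (6 - Academic_Pressure_Weight)) / 3`  
*Note: `Academic_Pressure_Weight` is the CGPA band (1 = lowest, 5 = highest), so `6 - Academic_Pressure_Weight` rises as CGPA falls. We use it because lower CGPA often correlates with higher academic stress.*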

### Why this improves accuracy:
Many models, such as Logistic Regression without interaction terms, weigh each feature separately. However, stress is a cumulative phenomenon. By combining psychological symptoms with academic pressure, we create a single graded metric that can represent the 'tipping point' for mental health crises more effectively than individual Yes/No flags.

In [2]:
df['Stress_Score'] = (df['Anxiety_Bin'] + df['Panic_Bin'] + (6 - df['Academic_Pressure_Weight'])) / 3
print("Sample Stress Scores:")
print(df[['Stress_Score']].head())

Sample Stress Scores:
   Stress_Score
0      1.000000
1      1.000000
2      1.333333
3      0.666667
4      0.666667
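
To make the scale of `Stress_Score` concrete, the sketch below enumerates every combination of the two symptom flags and the five CGPA bands in plain Python. The helper `stress_score` and the min-max rescaling at the end are illustrative assumptions, not part of the notebook's pipeline.

```python
from itertools import product

def stress_score(anxiety, panic, cgpa_band):
    """Same formula as the notebook, applied to scalar inputs."""
    return (anxiety + panic + (6 - cgpa_band)) / 3

# Every possible input: two binary flags and CGPA bands 1..5.
scores = sorted({
    stress_score(a, p, band)
    for a, p, band in product((0, 1), (0, 1), range(1, 6))
})
lowest, highest = scores[0], scores[-1]

print(len(scores))      # number of distinct score values
print(lowest, highest)  # smallest and largest possible score

# Optional rescaling to the 0..1 range (an assumption, not the notebook's method).
rescaled = [(s - lowest) / (highest - lowest) for s in scores]
```

The academic term `6 - cgpa_band` ranges from 1 to 5, while the two symptom flags together add at most 2, so the unscaled score is driven mostly by the CGPA band. Weighting the components differently is a design choice worth revisiting if the symptoms should carry more influence.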


## 2. Depression & Anxiety Levels (Ordinal Scale)

**Logic**: 
- 0: No Symptoms
- 1: Symptoms Present
- 2: Clinical Level (Symptoms + Specialist Treatment)

### Why this improves accuracy:
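Binary 'Yes/No' flags lack granularity. A student who has symptoms but doesn't seek help may have a different risk profile than one under specialist care. This ordinal mapping gives tree-based models like Random Forest an additional split point based on severity. Note that the specialist question in the survey is not tied to a specific condition, so the same `Specialist_Bin` flag raises both `Depression_Level` and `Anxiety_Level`.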

In [3]:
df['Depression_Level'] = df['Depression_Bin'] + (df['Depression_Bin'] * df['Specialist_Bin'])
df['Anxiety_Level'] = df['Anxiety_Bin'] + (df['Anxiety_Bin'] * df['Specialist_Bin'])

print("Frequency of Depression Levels:")
print(df['Depression_Level'].value_counts())

Frequency of Depression Levels:
Depression_Level
0    66
1    29
2     6
Name: count, dtype: int64
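
The expression `symptom + symptom * specialist` can be checked against the three-level scale with a small truth table. This is a sketch using a hypothetical scalar helper, `severity_level`, that mirrors the column arithmetic above:

```python
def severity_level(symptom, specialist):
    """Scalar version of the notebook's ordinal encoding."""
    return symptom + symptom * specialist

# All four combinations of the two binary inputs.
table = {
    (symptom, specialist): severity_level(symptom, specialist)
    for symptom in (0, 1)
    for specialist in (0, 1)
}

for (symptom, specialist), level in table.items():
    print(f"symptom={symptom} specialist={specialist} -> level {level}")
```

The case worth noticing is `symptom=0, specialist=1`: a student who saw a specialist but did not report the symptom stays at level 0, because the specialist flag only counts when multiplied by a positive symptom flag.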


## 3. Emotional Sentiment Score

**Logic**: A normalized inverse index of all negative mental state indicators.  
`Sentiment_Score = 1.0 - (Mean of Depression, Anxiety, and Panic binary flags)`

### Why this improves accuracy:
This feature acts as a 'Stability Index'. High values (near 1.0) represent emotional stability, while low values represent acute distress. With all three flags present, the score takes one of four values (0, 1/3, 2/3, 1). A graded numeric input like this suits Logistic Regression and Neural Networks, as it allows the model to learn a probability gradient rather than just a hard boundary.

In [4]:
df['Sentiment_Score'] = 1.0 - df[['Depression_Bin', 'Anxiety_Bin', 'Panic_Bin']].mean(axis=1)

print("Sample Sentiment Scores (1.0 = High Stability):")
print(df[['Sentiment_Score']].tail())

Sample Sentiment Scores (1.0 = High Stability):
     Sentiment_Score
96          0.666667
97          0.333333
98          0.333333
99          1.000000
100         1.000000
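
One detail of `mean(axis=1)` is that pandas skips missing values by default, so a row with an unanswered question is averaged over the remaining flags instead of becoming `NaN`. The sketch below uses a toy frame (not survey data) to compare the default with `skipna=False`:

```python
import numpy as np
import pandas as pd

# Toy flags: the last row has a missing anxiety answer.
flags = pd.DataFrame({
    'Depression_Bin': [0, 1, 1, 1],
    'Anxiety_Bin':    [0, 0, 1, np.nan],
    'Panic_Bin':      [0, 0, 1, 0],
})

# Default behaviour: missing flags are ignored in the row mean.
default_score = 1.0 - flags.mean(axis=1)

# Strict behaviour: any missing flag makes the whole score NaN.
strict_score = 1.0 - flags.mean(axis=1, skipna=False)

print(pd.DataFrame({'default': default_score, 'strict': strict_score}))
```

In the last row the default score is 0.5, a value that cannot occur when all three flags are present. Whether to keep such rows, drop them, or use the strict version depends on how missing answers should be treated.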


## 🚀 Conclusion

By moving from **categorical strings** to **engineered numerical metrics**, we have:
1. Created a cumulative **Stress Score**.
2. Provided **Severity Levels** for clinical conditions.
3. Established an **Emotional Stability Index**.

These features are now ready to be fed into a predictive model. The expected benefit is higher F1-scores and better recall of high-risk cases, which should be confirmed by comparing against a baseline trained on the raw Yes/No flags.
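
Before training, it helps to check which engineered features are built from the column being predicted, since such a feature would leak the label into the inputs. The notebook does not name a target, so the sketch below assumes `Depression_Bin` purely for illustration. `FEATURE_SOURCES` and `leak_free_features` are hypothetical names that record the dependencies defined in the cells above.

```python
# Which input columns each engineered feature is computed from.
FEATURE_SOURCES = {
    'Stress_Score': {'Anxiety_Bin', 'Panic_Bin', 'Academic_Pressure_Weight'},
    'Depression_Level': {'Depression_Bin', 'Specialist_Bin'},
    'Anxiety_Level': {'Anxiety_Bin', 'Specialist_Bin'},
    'Sentiment_Score': {'Depression_Bin', 'Anxiety_Bin', 'Panic_Bin'},
}

def leak_free_features(target_flag):
    """Return engineered features that do not depend on the target column."""
    return sorted(
        name
        for name, sources in FEATURE_SOURCES.items()
        if target_flag not in sources
    )

# Assumed target for this example only.
safe = leak_free_features('Depression_Bin')
print(safe)  # ['Anxiety_Level', 'Stress_Score']
```

With a different target, for example `Anxiety_Bin`, the usable set changes accordingly, so the check is worth rerunning whenever the prediction task changes.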