# Feature Engineering

This notebook creates three feature sets from the processed dataset:

1️⃣ **Feature Set 1:** Top 7 features from each of PSS-10, GAD-7 and PHQ-9 (21 total) → target = `Depression Label`  
2️⃣ **Feature Set 2:** All PSS-10 + all PHQ-9 (19 total) → target = `Depression Label`  
3️⃣ **Feature Set 3:** All GAD-7 + all PHQ-9 (16 total) → target = `Depression Label`

Each set is split 80% training / 20% testing, stratified on the encoded target so both parts keep the same class proportions.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from pathlib import Path

DATA_PATH = Path("../data/processed/mhp_processed.csv")
df = pd.read_csv(DATA_PATH)

# Define the expected item columns for each scale (names are assumed, not detected)
pss_cols = [f"PSS{i+1}" for i in range(10)]
gad_cols = [f"GAD{i+1}" for i in range(7)]
phq_cols = [f"PHQ{i+1}" for i in range(9)]

print("Data shape:", df.shape)
print("Columns expected:\nPSS:", pss_cols, "\nGAD:", gad_cols, "\nPHQ:", phq_cols)
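
Because the column names above are generated rather than read from the file, a quick check that they actually exist can catch a naming mismatch before any `KeyError` further down. The sketch below is one possible way to do that; `missing_columns` is a hypothetical helper and the toy frame only stands in for the processed dataset.

```python
import pandas as pd


def missing_columns(frame: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from the frame."""
    return [col for col in expected if col not in frame.columns]


# Toy frame standing in for the processed dataset (illustrative values only).
toy = pd.DataFrame(
    {"PSS1": [0, 1], "PSS2": [2, 3], "Depression Label": ["Mild", "Severe"]}
)

expected = ["PSS1", "PSS2", "PSS3", "Depression Label"]
missing = missing_columns(toy, expected)
print("Missing columns:", missing)
```

In the notebook itself the same helper could be called with `pss_cols + gad_cols + phq_cols + ["Depression Label"]`, raising an error if the returned list is not empty.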

## Encode Target Label and Define Feature Selector
We encode `Depression Label` into numeric values for modelling and use a Random Forest to rank feature importance.
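
As a reminder of how the encoding behaves, `LabelEncoder` assigns integer codes in sorted order of the class names, so the codes do not necessarily follow clinical severity. The short sketch below shows this on made-up label values; the actual categories in `Depression Label` may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, used only to illustrate the sorted-order encoding.
labels = pd.Series(["Mild", "Severe", "Minimal", "Moderate", "Mild"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Map each class name to its integer code.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes.
recovered = encoder.inverse_transform(codes)
print(list(recovered))
```

Keeping this mapping next to the saved splits makes it possible to translate model predictions back into label names later.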

In [None]:
# Encode the target
le = LabelEncoder()
df["DepressionEncoded"] = le.fit_transform(df["Depression Label"])

def top_features(feature_list, k=7, target="DepressionEncoded"):
    """Return the k largest RandomForest importances as a Series indexed by feature name.

    Note: the model is fit on every row of the global `df`, so the ranking also
    sees rows that later fall into the test split.
    """
    X = df[feature_list]
    y = df[target]
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=feature_list)
    return importances.sort_values(ascending=False).head(k)
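
The selector above ranks features using the whole dataset. If the test rows should play no part in feature selection, one alternative is to split first and rank on the training rows only. The following is a minimal sketch of that idea on synthetic data, not the notebook's method; `top_features_train_only`, the `Q1`..`Q5` columns and the `target` column are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def top_features_train_only(train_frame, feature_list, target, k=7, seed=42):
    """Rank features by RandomForest importance using training rows only."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(train_frame[feature_list], train_frame[target])
    importances = pd.Series(model.feature_importances_, index=feature_list)
    return importances.sort_values(ascending=False).head(k)


# Synthetic questionnaire-like data: five items scored 0-3 (illustrative only).
rng = np.random.default_rng(0)
feature_names = [f"Q{i + 1}" for i in range(5)]
toy = pd.DataFrame(rng.integers(0, 4, size=(200, 5)), columns=feature_names)
toy["target"] = (toy["Q1"] + toy["Q2"] >= 3).astype(int)

# Split first, so the held-out rows never influence the ranking.
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["target"]
)

ranked = top_features_train_only(train_df, feature_names, "target", k=2)
print(ranked)
```

The selected names (`ranked.index`) would then be used to subset both the training and the test frame.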

## Feature Set 1 — Top 7 from each of PSS-10, GAD-7 and PHQ-9 (21 features)

In [None]:
top_pss = top_features(pss_cols, k=7)
# GAD-7 has only seven items, so k=7 keeps all of them; the ranking only reorders them.
top_gad = top_features(gad_cols, k=7)
top_phq = top_features(phq_cols, k=7)

fs1_features = list(top_pss.index) + list(top_gad.index) + list(top_phq.index)
print("Selected features (21 total):")
print(fs1_features)

X = df[fs1_features]
y = df["DepressionEncoded"]

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Feature Set 1 Train :", X_train1.shape, " Test :", X_test1.shape)
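
To confirm that `stratify=y` keeps the class balance intact, the class proportions of the full data, the training part and the test part can be compared side by side. The sketch below does this on a toy frame with a 70/30 class split; the column names and values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a 70/30 class split (illustrative only).
toy = pd.DataFrame({"feature": range(100), "label": [0] * 70 + [1] * 30})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["feature"]],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

# Compare class proportions across the full data and both parts of the split.
proportions = pd.DataFrame(
    {
        "full": toy["label"].value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()
print(proportions)
```

The same comparison can be run on `y`, `y_train1` and `y_test1` from the cell above.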

## Feature Set 2 — All PSS-10 + All PHQ-9 (19 features)

In [None]:
fs2_features = pss_cols + phq_cols
X = df[fs2_features]
y = df["DepressionEncoded"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Feature Set 2 Train :", X_train2.shape, " Test :", X_test2.shape)

## Feature Set 3 — All GAD-7 + All PHQ-9 (16 features)

In [None]:
fs3_features = gad_cols + phq_cols
X = df[fs3_features]
y = df["DepressionEncoded"]

X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Feature Set 3 Train :", X_train3.shape, " Test :", X_test3.shape)

## Save all Feature Sets for Modelling
All train/test splits are saved in `../data/processed/` for later model development.

In [None]:
out_dir = Path("../data/processed")
out_dir.mkdir(parents=True, exist_ok=True)

# Save Feature Set 1
pd.concat([X_train1, y_train1], axis=1).to_csv(out_dir / "fs1_train.csv", index=False)
pd.concat([X_test1, y_test1], axis=1).to_csv(out_dir / "fs1_test.csv", index=False)

# Save Feature Set 2
pd.concat([X_train2, y_train2], axis=1).to_csv(out_dir / "fs2_train.csv", index=False)
pd.concat([X_test2, y_test2], axis=1).to_csv(out_dir / "fs2_test.csv", index=False)

# Save Feature Set 3
pd.concat([X_train3, y_train3], axis=1).to_csv(out_dir / "fs3_train.csv", index=False)
pd.concat([X_test3, y_test3], axis=1).to_csv(out_dir / "fs3_test.csv", index=False)

print(f"✅ All feature sets saved to {out_dir}/")
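
Each saved CSV holds the feature columns followed by the `DepressionEncoded` column, so a later notebook has to separate the two again. The following sketch shows one way to do that and checks the round trip on a toy file in a temporary directory; `load_split` is a hypothetical helper, not part of this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_split(csv_path, target="DepressionEncoded"):
    """Read a saved split and return (features, target) separately."""
    frame = pd.read_csv(csv_path)
    return frame.drop(columns=[target]), frame[target]


# Toy split standing in for one of the saved files (illustrative values only).
toy_X = pd.DataFrame({"PHQ1": [0, 1, 2], "PHQ2": [3, 2, 1]})
toy_y = pd.Series([0, 1, 1], name="DepressionEncoded")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_train.csv"
    # Same save pattern as above: features and target side by side, no index.
    pd.concat([toy_X, toy_y], axis=1).to_csv(path, index=False)
    X_loaded, y_loaded = load_split(path)

print(X_loaded.shape, y_loaded.tolist())
```

With the real files, a call such as `load_split(out_dir / "fs1_train.csv")` would return the Feature Set 1 training features and labels.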