<a href="https://colab.research.google.com/github/uchekalu/Mental-Health-Model/blob/main/Mental_Health_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

playground_series_s4e11_path = kagglehub.competition_download('playground-series-s4e11')

print('Data source import complete.')


## **Predicting depression based on mental health survey data**

In this notebook, my goal is to predict whether individuals experience depression based on mental health survey data. I build an XGBoost model and fit it to the data. The target is binary (0 for no depression, 1 for depression), so the model's continuous outputs are rounded to 0 or 1 before submission. Submissions are evaluated on accuracy; the validation step below reports mean absolute error.



In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

Load and preprocess data

In [None]:
# Load data
train_data = pd.read_csv("/kaggle/input/playground-series-s4e11/train.csv", index_col='id')
test_data = pd.read_csv("/kaggle/input/playground-series-s4e11/test.csv", index_col='id')

# Drop rows with missing target values
train_data.dropna(axis=0, subset=['Depression'], inplace=True)

# Separate target and predictors
y = train_data['Depression']
train_data.drop(['Depression'], axis=1, inplace=True)

# Split into training and validation sets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(
    train_data, y, train_size=0.8, test_size=0.2, random_state=0)


# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns
                        if X_train_full[cname].nunique() < 10
                        and X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test  = test_data[my_cols].copy()
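
The column selection above can be tried on a small frame first. The following is a minimal sketch on made-up data: the `toy` frame, its values, and the `max_categories` threshold are assumptions for illustration, not part of the competition data.

```python
import pandas as pd

# Hypothetical toy frame; column names echo the survey but the values are made up.
toy = pd.DataFrame({
    "Gender": pd.Series(["Male", "Female", "Female", "Male"], dtype="object"),
    "City": pd.Series(["Pune", "Delhi", "Agra", "Surat"], dtype="object"),
    "Age": pd.Series([29.0, 37.0, 53.0, 41.0], dtype="float64"),
    "Work Pressure": pd.Series([5, 1, 1, 3], dtype="int64"),
})

max_categories = 3  # hypothetical threshold; the notebook uses 10

# Text columns with fewer than max_categories distinct values
toy_categorical = [c for c in toy.columns
                   if toy[c].dtype == "object" and toy[c].nunique() < max_categories]

# Integer and float columns
toy_numerical = [c for c in toy.columns if toy[c].dtype in ["int64", "float64"]]

print(toy_categorical)  # ['Gender']
print(toy_numerical)    # ['Age', 'Work Pressure']
```

Here `City` is dropped because it has four distinct values, which is the same way high-cardinality text columns are excluded from `my_cols` in the notebook.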




In [None]:
X_train.head()

| id | Gender | Working Professional or Student | Have you ever had suicidal thoughts ? | Family History of Mental Illness | Age | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Work/Study Hours | Financial Stress |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125323 | Female | Student | Yes | No | 29.0 | 5.0 | NaN | 5.57 | 2.0 | NaN | 8.0 | 1.0 |
| 118204 | Female | Working Professional | Yes | Yes | 37.0 | NaN | 5.0 | NaN | NaN | 4.0 | 2.0 | 4.0 |
| 371 | Male | Working Professional | Yes | Yes | 53.0 | NaN | 1.0 | NaN | NaN | 4.0 | 12.0 | 2.0 |
| 132975 | Female | Working Professional | No | No | 41.0 | NaN | 1.0 | NaN | NaN | 5.0 | 5.0 | 3.0 |
| 36674 | Male | Working Professional | No | Yes | 44.0 | NaN | 3.0 | NaN | NaN | 3.0 | 10.0 | 3.0 |


Preprocessing

In [None]:
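# Numerical transformer: fill missing values with a constant (0 by default)
num_trans = SimpleImputer(strategy='constant')

# Categorical transformer: impute the most frequent value, then one-hot encode
cat_trans = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data.
# Note: this preprocessor is defined but not applied anywhere in this notebook;
# the encoding that follows uses pd.get_dummies.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_trans, numerical_cols),
        ('cat', cat_trans, low_cardinality_cols)
    ])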

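# One-hot encode the categorical columns with pandas. Missing numerical values
# stay as NaN, which XGBoost handles natively.
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)

# Align validation and test columns to the training columns. join='left' keeps
# the training columns; a column absent from the other frame is filled with NaN.
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)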
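
The `ColumnTransformer` built in this cell is never fitted or applied. As a rough sketch of how such a transformer could be used instead of `pd.get_dummies`, the example below fits one on made-up data. The `toy_*` names and values are assumptions, and `sparse_threshold=0` is added here only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the competition files
toy_train = pd.DataFrame({
    "Age": [29.0, np.nan, 53.0],
    "Gender": pd.Series(["Female", "Male", "Female"], dtype="object"),
})
toy_valid = pd.DataFrame({
    "Age": [41.0, 44.0],
    # "Other" never appears in the training rows
    "Gender": pd.Series(["Male", "Other"], dtype="object"),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="constant", fill_value=0), ["Age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn fill values and categories from the training rows only
train_matrix = toy_preprocessor.fit_transform(toy_train)
# Reuse the fitted transformer so validation columns match training
valid_matrix = toy_preprocessor.transform(toy_valid)

print(train_matrix.shape, valid_matrix.shape)
print(valid_matrix)
```

Because the encoder is fitted once on the training rows, the validation matrix always has the same columns, and the unseen category is encoded as all zeros rather than producing a new column. No `align` step is needed with this approach.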

Train and fit model

In [None]:
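# Define the model. early_stopping_rounds stops training once the validation
# score has not improved for 5 consecutive rounds.
model = XGBRegressor(n_estimators=900, early_stopping_rounds=5, learning_rate=0.05)

# Fit the model, using the validation set to monitor early stopping
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          verbose=False)

# Get validation predictions
predictions = model.predict(X_valid)

# Calculate MAE (the signature is y_true, y_pred)
mae = mean_absolute_error(y_valid, predictions)

print("Mean Absolute Error:", mae)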

Mean Absolute Error: 0.09848647864765923
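
Since the competition metric is accuracy rather than MAE, it may help to score the thresholded predictions as well. The sketch below uses made-up labels and scores (the `toy_*` arrays are assumptions, not outputs of the model above) and thresholds at 0.5, which agrees with `.round()` for scores in [0, 1] other than exactly 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical labels and regressor outputs, for illustration only
toy_y_true = np.array([0, 0, 1, 1, 0, 1])
toy_scores = np.array([0.05, 0.40, 0.90, 0.35, 0.10, 0.75])

# Threshold at 0.5 to turn continuous scores into class labels
toy_labels = (toy_scores >= 0.5).astype(int)

toy_mae = mean_absolute_error(toy_y_true, toy_scores)
toy_accuracy = accuracy_score(toy_y_true, toy_labels)

print("MAE:", toy_mae)
print("Accuracy:", toy_accuracy)
```

In the notebook, the same two calls could be applied to `y_valid` and `predictions` to see the validation accuracy alongside the MAE.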


In [None]:
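# Save test predictions to file. Clip to [0, 1] and cast to int so the
# submission contains 0/1 labels rather than floats.
test_preds = model.predict(X_test)
output = pd.DataFrame({'id': X_test.index,
                       'Depression': np.clip(test_preds, 0, 1).round().astype(int)})
output.to_csv('submission.csv', index=False)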
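
A regressor's raw outputs can fall slightly outside [0, 1], and rounding alone leaves them as floats. The following sketch, on made-up ids and predictions (`toy_ids` and `toy_preds` are assumptions), shows one way to turn such outputs into integer labels and inspect the CSV text that would be written.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and regressor outputs, for illustration only
toy_ids = [101, 102, 103]
toy_preds = np.array([0.93, -0.02, 0.41])

# Rounding alone keeps the float dtype
float_labels = toy_preds.round()

# Clip to the valid range, round, then cast to integer labels
int_labels = np.clip(toy_preds, 0, 1).round().astype(int)

# Write to an in-memory buffer instead of a file to inspect the result
buffer = io.StringIO()
pd.DataFrame({"id": toy_ids, "Depression": int_labels}).to_csv(buffer, index=False)
print(buffer.getvalue())
```

With integer labels, the `Depression` column is written as `1` and `0` rather than `1.0` and `0.0`.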