# Exploring Mental Health Data

### Introduction
Mental health research increasingly focuses on the factors that contribute to conditions such as depression. This 2024 Kaggle Playground Series competition provides a dataset derived from a mental health survey that explores the underlying causes of depression. Participants build machine learning models to predict depression outcomes and can draw on both the synthetic competition data and the original source data to improve model performance.



### Importing necessary libraries
- `pandas` and `numpy` for data manipulation and numerical operations.
- `matplotlib.pyplot` and `seaborn` for data visualization.
- `train_test_split` and `StratifiedKFold` from `sklearn.model_selection` to split the dataset into training and testing sets and to build stratified folds.
- `accuracy_score`, `classification_report`, and `confusion_matrix` from `sklearn.metrics` for model evaluation.
- `XGBClassifier` from `xgboost` for implementing the XGBoost classifier.
- `SimpleImputer` from `sklearn.impute` for handling missing values.
- `optuna` for hyperparameter optimization.
- `warnings` to suppress warnings during execution.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

import optuna
import warnings
warnings.filterwarnings("ignore")

### Loading datasets

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sub = pd.read_csv("sample_submission.csv")

In [3]:
# Keep the test ids for the submission file before the column is dropped
id_col_test = test['id']

### Dropping unnecessary columns

In [4]:
train = train.drop(columns=["Name", "id"])
test = test.drop(columns=["Name", "id"])

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 18 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   Gender                                 140700 non-null  object 
 1   Age                                    140700 non-null  float64
 2   City                                   140700 non-null  object 
 3   Working Professional or Student        140700 non-null  object 
 4   Profession                             104070 non-null  object 
 5   Academic Pressure                      27897 non-null   float64
 6   Work Pressure                          112782 non-null  float64
 7   CGPA                                   27898 non-null   float64
 8   Study Satisfaction                     27897 non-null   float64
 9   Job Satisfaction                       112790 non-null  float64
 10  Sleep Duration                         140700 non-null  

In [6]:
train.columns

Index(['Gender', 'Age', 'City', 'Working Professional or Student',
       'Profession', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction', 'Sleep Duration',
       'Dietary Habits', 'Degree', 'Have you ever had suicidal thoughts ?',
       'Work/Study Hours', 'Financial Stress',
       'Family History of Mental Illness', 'Depression'],
      dtype='object')

In [7]:
num_cols = [
    "Age", "Academic Pressure", "Work Pressure", "CGPA",
    "Study Satisfaction", "Job Satisfaction", "Work/Study Hours",
    "Financial Stress",
]

### Checking for missing values in dataset

In [8]:
train.isna().sum()

Gender                                        0
Age                                           0
City                                          0
Working Professional or Student               0
Profession                                36630
Academic Pressure                        112803
Work Pressure                             27918
CGPA                                     112802
Study Satisfaction                       112803
Job Satisfaction                          27910
Sleep Duration                                0
Dietary Habits                                4
Degree                                        2
Have you ever had suicidal thoughts ?         0
Work/Study Hours                              0
Financial Stress                              4
Family History of Mental Illness              0
Depression                                    0
dtype: int64

In [9]:
test.isna().sum()

Gender                                       0
Age                                          0
City                                         0
Working Professional or Student              0
Profession                               24632
Academic Pressure                        75033
Work Pressure                            18778
CGPA                                     75034
Study Satisfaction                       75033
Job Satisfaction                         18774
Sleep Duration                               0
Dietary Habits                               5
Degree                                       2
Have you ever had suicidal thoughts ?        0
Work/Study Hours                             0
Financial Stress                             0
Family History of Mental Illness             0
dtype: int64
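
The raw counts above are easier to compare as fractions of each frame. The following is a minimal sketch, assuming a hypothetical helper `missing_share` and a small toy frame standing in for `train`; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first.

    Hypothetical helper; columns without missing values are omitted.
    """
    share = df.isna().mean()
    return share[share > 0].sort_values(ascending=False)


# Toy frame standing in for `train` (values are illustrative only).
toy = pd.DataFrame({
    "Age": [49.0, 26.0, 33.0, 22.0],
    "CGPA": [np.nan, np.nan, 8.97, np.nan],
    "Profession": ["Chef", "Teacher", np.nan, "Teacher"],
})

shares = missing_share(toy)
print(shares)
```

Applied to the training counts above, `Academic Pressure` is missing in 112803 of 140700 rows, roughly 80%, which suggests that the academic fields may be populated only for students.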

In [10]:
def object_count(data):
    """Return the names of all columns whose dtype is 'object'."""
    return [column for column in data.columns if data[column].dtype == 'object']

# Categorical (object-typed) columns in each frame
object_columns_train = object_count(train)
object_columns_test = object_count(test)

In [11]:
object_columns_train

['Gender',
 'City',
 'Working Professional or Student',
 'Profession',
 'Sleep Duration',
 'Dietary Habits',
 'Degree',
 'Have you ever had suicidal thoughts ?',
 'Family History of Mental Illness']

### Imputing missing values

In [12]:
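# Fill numeric gaps with the column median
num_imputer = SimpleImputer(strategy='median')
train[num_cols] = num_imputer.fit_transform(train[num_cols])

# Fill categorical gaps with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
train[object_columns_train] = cat_imputer.fit_transform(train[object_columns_train])

# Note: `test` is not imputed here; its missing values reach XGBoost as NaN.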
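
The cell above imputes only the training frame. One common alternative is to fit each imputer on the training data and reuse the learned statistics on the test data, so both frames are filled the same way. The sketch below illustrates that pattern on toy frames; the names `toy_train`, `toy_test`, `num_cols_demo`, and `cat_cols_demo` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for `train` and `test` (illustrative values only).
toy_train = pd.DataFrame({
    "CGPA": [7.0, np.nan, 9.0, 8.0],
    "Degree": ["BBA", "BBA", np.nan, "LLB"],
})
toy_test = pd.DataFrame({
    "CGPA": [np.nan, 6.5],
    "Degree": [np.nan, "ME"],
})

num_cols_demo = ["CGPA"]
cat_cols_demo = ["Degree"]

# Fit on the training frame only, then reuse the learned median on the test frame.
num_imputer = SimpleImputer(strategy="median")
toy_train[num_cols_demo] = num_imputer.fit_transform(toy_train[num_cols_demo])
toy_test[num_cols_demo] = num_imputer.transform(toy_test[num_cols_demo])

# Same pattern for categorical columns, using the most frequent training value.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy_train[cat_cols_demo] = cat_imputer.fit_transform(toy_train[cat_cols_demo])
toy_test[cat_cols_demo] = cat_imputer.transform(toy_test[cat_cols_demo])

print(toy_test)
```

Whether imputing the test frame helps here is an open question, since XGBoost can also handle missing values on its own; the point of the sketch is only that train and test receive the same treatment.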

In [13]:
train

index,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,Female,49.0,Ludhiana,Working Professional,Chef,3.0,5.0,7.77,3.0,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,Male,26.0,Varanasi,Working Professional,Teacher,3.0,4.0,7.77,3.0,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,Male,33.0,Visakhapatnam,Student,Teacher,5.0,3.0,8.97,2.0,3.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,Male,22.0,Mumbai,Working Professional,Teacher,3.0,5.0,7.77,3.0,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,Female,30.0,Kanpur,Working Professional,Business Analyst,3.0,1.0,7.77,3.0,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140695,Female,18.0,Ahmedabad,Working Professional,Teacher,3.0,5.0,7.77,3.0,4.0,5-6 hours,Unhealthy,Class 12,No,2.0,4.0,Yes,1
140696,Female,41.0,Hyderabad,Working Professional,Content Writer,3.0,5.0,7.77,3.0,4.0,7-8 hours,Moderate,B.Tech,Yes,6.0,5.0,Yes,0
140697,Female,24.0,Kolkata,Working Professional,Marketing Manager,3.0,3.0,7.77,3.0,1.0,More than 8 hours,Moderate,B.Com,No,4.0,4.0,No,0
140698,Female,49.0,Srinagar,Working Professional,Plumber,3.0,5.0,7.77,3.0,2.0,5-6 hours,Moderate,ME,Yes,10.0,1.0,No,0


In [14]:
train.isna().sum()

Gender                                   0
Age                                      0
City                                     0
Working Professional or Student          0
Profession                               0
Academic Pressure                        0
Work Pressure                            0
CGPA                                     0
Study Satisfaction                       0
Job Satisfaction                         0
Sleep Duration                           0
Dietary Habits                           0
Degree                                   0
Have you ever had suicidal thoughts ?    0
Work/Study Hours                         0
Financial Stress                         0
Family History of Mental Illness         0
Depression                               0
dtype: int64

### Splitting the training dataset into features and labels
- `y`: Labels (target variable) for the training dataset.
- `X`: Features for the training dataset.
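- `X_train`, `X_test`, `y_train`, `y_test`: An 80/20 split of the training data into a fitting part and a held-out validation part. Here `X_test` and `y_test` are validation data, distinct from the competition `test` frame.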

In [18]:
y = train['Depression']
X = train.drop(columns=['Depression'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
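
The split above is random and unseeded, so the validation accuracy can shift between runs. A minimal sketch of a seeded, stratified variant follows, using toy data in place of `X` and `y`; the seed value and the names `X_demo`, `y_demo`, `X_tr`, and `X_val` are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for `X` and `y` (illustrative values only).
X_demo = pd.DataFrame({"Age": range(20, 40)})
y_demo = pd.Series([0] * 16 + [1] * 4, name="Depression")

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,
    stratify=y_demo,   # keep the class ratio the same in both parts
    random_state=42,   # arbitrary seed, used only to make the split repeatable
)

print(len(X_tr), len(X_val))
print(y_tr.mean(), y_val.mean())
```

With an imbalanced target, stratification keeps the share of positive labels roughly equal in both parts, which makes scores from different trials easier to compare.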

In [19]:
X_train

index,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness
88549,Male,53.0,Indore,Working Professional,Judge,3.0,1.0,7.77,3.0,4.0,5-6 hours,Unhealthy,LLM,No,11.0,2.0,No
102627,Female,41.0,Rajkot,Working Professional,Architect,3.0,2.0,7.77,3.0,3.0,7-8 hours,Healthy,BSc,Yes,5.0,1.0,Yes
100594,Female,37.0,Lucknow,Working Professional,Travel Consultant,3.0,2.0,7.77,3.0,4.0,7-8 hours,Moderate,BHM,Yes,10.0,2.0,Yes
19019,Male,56.0,Kolkata,Working Professional,Customer Support,3.0,4.0,7.77,3.0,1.0,More than 8 hours,Unhealthy,BA,No,7.0,1.0,Yes
21327,Female,35.0,Agra,Working Professional,Teacher,3.0,1.0,7.77,3.0,5.0,Less than 5 hours,Moderate,MBA,Yes,0.0,3.0,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51076,Female,45.0,Bhopal,Working Professional,Doctor,3.0,2.0,7.77,3.0,4.0,7-8 hours,Moderate,MD,No,6.0,1.0,Yes
129158,Female,50.0,Agra,Working Professional,Researcher,3.0,4.0,7.77,3.0,4.0,7-8 hours,Healthy,PhD,No,7.0,4.0,No
8044,Male,31.0,Srinagar,Student,Teacher,5.0,3.0,7.74,5.0,3.0,5-6 hours,Unhealthy,BBA,Yes,10.0,3.0,No
25106,Male,56.0,Meerut,Working Professional,Content Writer,3.0,2.0,7.77,3.0,2.0,More than 8 hours,Moderate,M.Ed,No,1.0,5.0,Yes


In [20]:
X_train[object_columns_train] = X_train[object_columns_train].astype("category")
X_test[object_columns_test] = X_test[object_columns_test].astype("category")
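
Calling `astype("category")` separately on each frame lets pandas infer a separate category list per frame, so the same value can end up with a different integer code in each. Depending on the XGBoost version, the model may rely on those codes. The sketch below shows the mismatch and one possible remedy, a shared `pd.CategoricalDtype`; the toy frames `part_a` and `part_b` are assumptions standing in for `X_train` and `X_test`.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test (illustrative values only).
part_a = pd.DataFrame({"City": ["Agra", "Mumbai", "Agra"]})
part_b = pd.DataFrame({"City": ["Mumbai", "Varanasi"]})

# Separate conversions: each frame gets its own category list and codes.
codes_a = part_a["City"].astype("category").cat.codes.tolist()
codes_b = part_b["City"].astype("category").cat.codes.tolist()

# Shared dtype built from the union of observed values keeps codes aligned.
all_cities = sorted(set(part_a["City"]) | set(part_b["City"]))
city_dtype = pd.CategoricalDtype(categories=all_cities)
shared_a = part_a["City"].astype(city_dtype).cat.codes.tolist()
shared_b = part_b["City"].astype(city_dtype).cat.codes.tolist()

print(codes_a, codes_b)    # "Mumbai" is 1 in part_a but 0 in part_b
print(shared_a, shared_b)  # "Mumbai" is 1 in both
```

In the notebook's setting, the same idea could be applied by converting `X` once before splitting, or by building one dtype per column from the training values and reusing it for the competition `test` frame.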

### Defining the objective function for hyperparameter optimization
- `objective`: Function to optimize the hyperparameters of the XGBoost classifier using Optuna.
- `study`: Creates an Optuna study object and optimizes the objective function.

In [21]:
def objective(trial):
    # Search space for the XGBoost hyperparameters.
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-4, 0.1, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.9),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 1.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-4, 1.0, log=True),
        # 'use_label_encoder' is omitted: it is deprecated in recent XGBoost releases.
        'eval_metric': 'logloss',
        'objective': 'binary:logistic',
        'enable_categorical': True,
    }

    # Fit on the training part and score on the held-out validation part
    bst = XGBClassifier(**param)
    bst.fit(X_train, y_train)
    y_pred = bst.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

# Create a study object and maximize validation accuracy over 400 trials.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=400)

[I 2024-11-16 22:12:01,784] A new study created in memory with name: no-name-250e8afb-017d-4a8a-bf78-5962f808d64a
[I 2024-11-16 22:12:02,473] Trial 0 finished with value: 0.8164889836531628 and parameters: {'n_estimators': 144, 'max_depth': 3, 'learning_rate': 0.0008019156661202953, 'subsample': 0.5504766092464237, 'colsample_bytree': 0.6213871032589693, 'gamma': 4.834972328386871, 'reg_alpha': 0.00314675570418986, 'reg_lambda': 0.0006120967619640058}. Best is trial 0 with value: 0.8164889836531628.
[I 2024-11-16 22:12:05,595] Trial 1 finished with value: 0.8164889836531628 and parameters: {'n_estimators': 474, 'max_depth': 6, 'learning_rate': 0.00034447545990187327, 'subsample': 0.9065718258524094, 'colsample_bytree': 0.6047957357420797, 'gamma': 0.020225676448903385, 'reg_alpha': 0.012627339558967516, 'reg_lambda': 0.09028147153227106}. Best is trial 0 with value: 0.8164889836531628.
[I 2024-11-16 22:12:06,138] Trial 2 finished with value: 0.8164889836531628 and parameters: {'n_estim
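
Each trial above is scored on a single hold-out split, and the imported `StratifiedKFold` is never used. A cross-validated score is one way to make trial comparisons less dependent on one particular split. The following is a sketch under stated assumptions: `cv_accuracy` is a hypothetical helper, the data is synthetic, and `LogisticRegression` is only a stand-in estimator so that the example runs without `xgboost`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def cv_accuracy(model, X, y, n_splits=5, seed=0):
    """Mean accuracy of `model` over stratified folds (hypothetical helper)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        preds = fold_model.predict(X[val_idx])
        scores.append(accuracy_score(y[val_idx], preds))
    return float(np.mean(scores))


# Synthetic, clearly separable two-class data (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

score = cv_accuracy(LogisticRegression(), X_demo, y_demo)
print(round(score, 3))
```

Inside `objective`, the single `accuracy_score` call could be replaced by a call such as `cv_accuracy(XGBClassifier(**param), ...)`. With pandas inputs, the fold indexing would need `.iloc` instead of plain brackets, and each trial would take roughly `n_splits` times longer.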

In [22]:
best_params = study.best_params
best_params

{'n_estimators': 478,
 'max_depth': 2,
 'learning_rate': 0.051029237233931025,
 'subsample': 0.8232781397271202,
 'colsample_bytree': 0.8502698875429519,
 'gamma': 1.9953821027696945,
 'reg_alpha': 0.00024838850253540304,
 'reg_lambda': 0.00010285754819084692}
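
The study optimizes accuracy only, and the imported `classification_report` and `confusion_matrix` are never called. Accuracy alone can hide weak performance on the minority class: several early trials report the same value of 0.8165, which may simply reflect the majority-class share of the validation labels. A minimal sketch of a fuller evaluation follows, using toy labels in place of `y_test` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels standing in for y_test and the model's predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

print(cm)
print(acc)
print(classification_report(y_true, y_hat, digits=3))
```

In this toy case the accuracy is 0.7, yet only two of the four positive cases are found, which the per-class recall in the report makes visible.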

In [24]:
X[object_columns_train] = X[object_columns_train].astype("category")
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 17 columns):
 #   Column                                 Non-Null Count   Dtype   
---  ------                                 --------------   -----   
 0   Gender                                 140700 non-null  category
 1   Age                                    140700 non-null  float64 
 2   City                                   140700 non-null  category
 3   Working Professional or Student        140700 non-null  category
 4   Profession                             140700 non-null  category
 5   Academic Pressure                      140700 non-null  float64 
 6   Work Pressure                          140700 non-null  float64 
 7   CGPA                                   140700 non-null  float64 
 8   Study Satisfaction                     140700 non-null  float64 
 9   Job Satisfaction                       140700 non-null  float64 
 10  Sleep Duration                         14070

### Training the XGBoost model with the best hyperparameters
- `xgb_model`: Trains the XGBoost classifier using the best hyperparameters.

In [25]:
xgb_model = XGBClassifier(**best_params, enable_categorical=True) 
xgb_model.fit(X, y)

In [27]:
test[object_columns_train] = test[object_columns_train].astype("category")
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93800 entries, 0 to 93799
Data columns (total 17 columns):
 #   Column                                 Non-Null Count  Dtype   
---  ------                                 --------------  -----   
 0   Gender                                 93800 non-null  category
 1   Age                                    93800 non-null  float64 
 2   City                                   93800 non-null  category
 3   Working Professional or Student        93800 non-null  category
 4   Profession                             69168 non-null  category
 5   Academic Pressure                      18767 non-null  float64 
 6   Work Pressure                          75022 non-null  float64 
 7   CGPA                                   18766 non-null  float64 
 8   Study Satisfaction                     18767 non-null  float64 
 9   Job Satisfaction                       75026 non-null  float64 
 10  Sleep Duration                         93800 non-null  cat

### Predicting labels for the testing dataset
- `y_test_pred`: Predicted labels for the testing dataset.

In [28]:
y_test_pred = xgb_model.predict(test)

### Adding predictions to the testing dataset and saving the results
- Adds the predicted labels to the testing dataset.
- Saves the results to a CSV file named 'submission.csv'.

In [29]:
test['predicted'] = y_test_pred
sub["id"] = id_col_test
sub["Depression"] = test['predicted']
sub.to_csv('submission.csv', index=False)

In [30]:
sub['Depression'].value_counts()

Depression
0    73633
1    20167
Name: count, dtype: int64
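
A few structural checks before uploading can catch a malformed submission early. The sketch below runs on a toy frame standing in for `sub`; the ids and labels are illustrative, and the checks are suggestions, not competition requirements.

```python
import pandas as pd

# Toy frames standing in for the saved test ids and `sub` (illustrative only).
ids = pd.Series([0, 1, 2, 3], name="id")
sub_demo = pd.DataFrame({"id": ids, "Depression": [0, 1, 0, 0]})

# Structural checks before writing the file.
assert list(sub_demo.columns) == ["id", "Depression"]
assert sub_demo["id"].is_unique
assert sub_demo["Depression"].isin([0, 1]).all()
assert len(sub_demo) == len(ids)

# Share of each predicted class.
shares = sub_demo["Depression"].value_counts(normalize=True)
print(shares)
```

For the counts shown above, 20167 of 93800 rows are predicted positive, a share of 21.5%.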