# Mental Health Prediction Using Machine Learning

The 2024 Kaggle Playground Series aims to provide engaging and practical datasets for machine learning enthusiasts to enhance their skills. This project focuses on Mental Health Prediction, using data from a mental health survey to analyze the factors that contribute to depression. The goal is to build a predictive model that determines whether an individual is experiencing depression based on various factors present in the dataset.

The dataset contains missing values, so preprocessing includes data imputation. Visualizations such as count plots, pie charts, and heatmaps are then used to identify the key factors that contribute to depression.

## Project Objectives

The main objectives of this project are:

- Understand the Dataset – Perform an in-depth exploration of the provided training, testing, sample submission, and original data to gain insights into its structure and attributes.
- Handle Missing Values – Identify and impute missing values to ensure data quality and improve model performance.
- Data Visualization & Analysis – Generate multiple visualization plots (count plots, pie charts, heatmaps, etc.) to analyze key factors affecting mental health and depression.
- Model Implementation – Utilize the CatBoost model with optimized parameters to predict depression based on survey responses.
- Enhance Model Performance – Implement Repeated Stratified K-Fold Cross-Validation to refine predictions and improve the reliability of the model.
- Evaluate Results – Measure the model’s accuracy and effectiveness in classifying individuals with or without depression based on survey responses.

## Project Scope

<b>In-Scope:</b>

- Dataset Exploration – Understanding the data, missing values, and feature distributions.
- Data Preprocessing – Cleaning the dataset, handling missing values, and preparing it for modeling.
- Feature Engineering – Creating meaningful features from the dataset to enhance predictions.
- Data Visualization – Using plots and statistical analysis to explore depression risk factors.
- Model Selection & Implementation – Implementing different models for prediction.
- Performance Improvement – Using Repeated Stratified K-Fold for better accuracy. <b>MIGHT BE REVISED</b>
- Prediction & Insights – Determining whether a person is at risk of depression based on analyzed factors.

<b>Out-of-Scope:</b>

- Medical Diagnosis – The project does not provide a medical diagnosis but rather a statistical analysis and prediction.
- Real-time Monitoring – The model will not be deployed for real-time monitoring of mental health conditions.
- Therapeutic Interventions – The project does not propose medical or psychological treatment solutions.

## Data Source

The dataset used in this project originates from the 2024 Kaggle Playground Series (Season 4, Episode 11) competition, titled "<a href="https://www.kaggle.com/competitions/playground-series-s4e11/overview">Exploring Mental Health Data</a>". The data was derived from the <a href="https://www.kaggle.com/datasets/sumansharmadataworld/depression-surveydataset-for-analysis">Depression Survey/Dataset</a> and has been augmented with synthetic data to increase its size.

The dataset consists of 234,500 observations, with a 6:4 train-test split. It contains 20 features, each representing different attributes related to an individual's mental health and well-being. The target variable, "Depression," is a binary flag (0 or 1) indicating whether an individual is experiencing depression.

<b>Understanding the Features</b>

<table border="1">
  <tr>
    <th><b>Column Name</b></th>
    <th><b>Description</b></th>
  </tr>
  <tr>
    <td>ID</td>
    <td>Unique identifier for each participant in the dataset</td>
  </tr>
  <tr>
    <td>Name</td>
    <td>Name of the participant</td>
  </tr>
  <tr>
    <td>Gender</td>
    <td>Gender of participant (listed as Male or Female)</td>
  </tr>
  <tr>
    <td>Age</td>
    <td>Age of the participant</td>
  </tr>
  <tr>
    <td>City</td>
    <td>City where the participant resides</td>
  </tr>
  <tr>
    <td>Working Professional or Student</td>
    <td>Indicates whether the participant is a working professional or a student</td>
  </tr>
  <tr>
    <td>Profession</td>
    <td>Participant's profession or field of study</td>
  </tr>
  <tr>
    <td>Academic Pressure</td>
    <td>Level of pressure the participant experiences in academics (on a scale of 1-5)</td>
  </tr>
  <tr>
    <td>Work Pressure</td>
    <td>Level of pressure the participant experiences at their job (on a scale of 1-5)</td>
  </tr>
  <tr>
    <td>CGPA</td>
    <td>Cumulative Grade Point Average of the participant</td>
  </tr>
  <tr>
    <td>Study Satisfaction</td>
    <td>The participant's satisfaction with their studies (on a scale of 1-5)</td>
  </tr>
  <tr>
    <td>Job Satisfaction</td>
    <td>The participant's satisfaction with their job (on a scale of 1-5)</td>
  </tr>
  <tr>
    <td>Sleep Duration</td>
    <td>Average duration of sleep per night</td>
  </tr>
  <tr>
    <td>Dietary Habits</td>
    <td>Dietary habits of the participant (listed mainly as healthy, moderate and unhealthy)</td>
  </tr>
  <tr>
    <td>Degree</td>
    <td>Level of education the participant is pursuing or has completed</td>
  </tr>
  <tr>
    <td>Have you ever had suicidal thoughts?</td>
    <td>Indicates whether the participant has ever had suicidal thoughts (listed as yes or no)</td>
  </tr>
  <tr>
    <td>Work/Study Hours</td>
    <td>Number of hours the participant spends working or studying per day on average</td>
  </tr>
  <tr>
    <td>Financial Stress</td>
    <td>Level of financial stress the participant experiences (on a scale of 1-5)</td>
  </tr>
  <tr>
    <td>Family History of Mental Illness</td>
    <td>Indicates whether the participant has a family history of mental illness (listed as yes or no)</td>
  </tr>
  <tr>
    <td>Depression</td>
    <td>The participant's depression status (listed as 0 or 1)</td>
  </tr>
</table>


## Step 1. Environment Set-Up and Data Import

```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```

```python
# Import the training and test data
train = pd.read_csv("/workspaces/myfolder/MentalHealth_Workbench/data/train.csv")
test = pd.read_csv("/workspaces/myfolder/MentalHealth_Workbench/data/test.csv")
```

## Step 2. Exploratory Data Analysis

Exploratory data analysis (EDA) is crucial in data science projects because it helps us understand the structure and characteristics of the data we're working with. By exploring variables, identifying patterns, detecting anomalies, and visualizing relationships, EDA enables us to make informed decisions about data preprocessing, feature engineering, and model selection. It also plays a key role in uncovering insights and formulating hypotheses, laying the groundwork for more accurate modeling and impactful conclusions.
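
As a minimal sketch of the first checks described above, the snippet below summarizes missing values, the target balance, and pairwise correlations (the numbers a heatmap would display). The toy rows and the helper name `missing_summary` are assumptions made for illustration; only the column names come from the feature table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [21, 35, 28, 44, 19, 31],
    "Academic Pressure": [4.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "Work Pressure": [np.nan, 2.0, np.nan, 4.0, np.nan, 3.0],
    "Financial Stress": [3.0, 1.0, 4.0, 2.0, 5.0, np.nan],
    "Depression": [1, 0, 0, 0, 1, 0],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, most affected first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)

# Share of each class in the target; a strong imbalance would affect metric choice.
target_balance = toy["Depression"].value_counts(normalize=True)

# Pairwise correlations between the numeric columns.
corr = toy.corr()

print(summary)
print(target_balance)
```

On the real data, the same calls would be applied to `train` instead of `toy`, and `corr` could be passed to a heatmap plot.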

## Step 3. Data Transformation/Wrangling

Data wrangling is essential in the model creation cycle: it ensures data quality, prepares data for modeling techniques, uncovers insights, and supports reproducibility. It forms the foundation on which accurate, reliable, and actionable models can be built. The steps are chosen based on what the exploratory data analysis (EDA) uncovered. In this case, we will impute missing values, encode categorical variables, and split the dataset into training and testing sets.
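
The three steps named above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the toy rows, the `prepare` helper, median imputation, the `"Missing"` category label, and one-hot encoding are all choices made for the example. Splitting before imputing is also a design choice here, so that the imputation statistics come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the survey data (values are made up for illustration).
toy = pd.DataFrame({
    "Working Professional or Student": ["Student", "Working Professional"] * 4,
    "Academic Pressure": [4.0, np.nan, 3.0, np.nan, 5.0, np.nan, 2.0, np.nan],
    "Dietary Habits": ["Healthy", "Moderate", None, "Unhealthy",
                       "Moderate", "Healthy", "Moderate", "Unhealthy"],
    "Depression": [1, 0, 0, 1, 1, 0, 0, 1],
})

X = toy.drop(columns="Depression")
y = toy["Depression"]

# Stratified split keeps the class ratio the same in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

num_cols = ["Academic Pressure"]
cat_cols = ["Working Professional or Student", "Dietary Habits"]

# Imputation statistics are learned from the training rows only.
medians = X_train[num_cols].median()


def prepare(frame, columns=None):
    """Impute missing values, one-hot encode, and optionally align columns."""
    out = frame.copy()
    out[num_cols] = out[num_cols].fillna(medians)
    out[cat_cols] = out[cat_cols].fillna("Missing")
    out = pd.get_dummies(out, columns=cat_cols, dtype=int)
    if columns is not None:
        # Categories unseen in this frame become all-zero columns.
        out = out.reindex(columns=columns, fill_value=0)
    return out


X_train_prepared = prepare(X_train)
X_valid_prepared = prepare(X_valid, columns=X_train_prepared.columns)
print(X_train_prepared.head())
```

Models that handle categorical features natively may not need the one-hot step; it is included here only to show one common encoding.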

## Step 4: Modeling

Modeling in the data science process involves the application of machine learning algorithms to analyze data, make predictions, or uncover patterns. It is a pivotal phase where the insights gleaned from data are translated into actionable decisions and solutions.

Machine learning models are employed to address various tasks, such as classification, regression, clustering, and recommendation systems, depending on the nature of the problem at hand. These models learn from historical data to generalize patterns and make predictions on new, unseen data.

## Step 5: Model Tuning

## Step 6: Model Evaluation and Selection

Model evaluation and comparison are indispensable in the data science process as they validate the effectiveness and reliability of predictive models. By systematically evaluating models against relevant metrics such as accuracy, precision, recall, and F1-score, data scientists can assess which models perform best for specific tasks and datasets. This process not only ensures the chosen model meets desired performance criteria but also identifies potential weaknesses or biases that could impact its real-world application. Moreover, comparing different models allows data scientists to make informed decisions, selecting the most suitable model that balances accuracy, interpretability, and computational efficiency.
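
The metrics named above can be computed with scikit-learn as in the short sketch below. The label vectors are made up for illustration and do not come from the project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (1 = depression, 0 = no depression).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
print(cm)
```

On these made-up labels there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.80 and precision, recall, and F1 are each 0.75.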

## Step 7: Model Registration and Deployment

Models built on Workbench, whether they are scikit-learn models or SAS Viya ML models, can be registered into the model repository on Viya (SAS Model Manager). This is a crucial step in ensuring that models can be governed properly as corporate assets before being pushed into production.

Let's look at examples of how to register a SAS Viya ML model.

## Future Work