# Behind the Grades: Mental Health Trends in Indian Students


### Motivation

Our project explores the Student Depression Dataset from Kaggle, which contains survey responses from approximately 28,000 university students across India. The dataset provides comprehensive insights into student mental health, including 18 different metrics for each student: depression status, academic performance (CGPA), lifestyle habits (sleep patterns, social media usage), basic demographics (age, gender), and important mental health indicators (presence of suicidal thoughts, family history of mental illness). This well-structured information allows us to analyze both the prevalence and potential causes of depression among Indian students.

We selected this dataset because student mental health represents a critical yet frequently overlooked issue. With more than 60% of students in the dataset reporting depression symptoms, we identified an important opportunity to illuminate the challenges that exist beneath the surface of academic achievement. Being students ourselves, we relate to these difficulties and wanted to develop visualizations that could help others identify warning signs and better understand the various factors that influence mental wellbeing in educational environments.

Our main objective was to create an accessible narrative that effectively communicates the complex nature of student mental health. Rather than presenting only statistics, we aimed to convey the real human experiences behind the numbers. Through clear visualizations and thoughtful analysis, we sought to demonstrate how factors such as academic pressure, sleep quality, and social media habits connect with depression. By designing our website as a progressive journey from basic statistics to more nuanced relationships, we intended to create a resource valuable to educators, parents, and students alike, potentially contributing to increased awareness and improved support systems.

### Dataset:
This comprehensive dataset was obtained from Kaggle's public repository and loaded using the kagglehub API for reproducible analysis. The data is structured in tabular format with each row representing an individual student's survey response.

Source: https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset

In [1]:
import kagglehub
import pandas as pd
import os

path = kagglehub.dataset_download("adilshamim8/student-depression-dataset")

print("Path to dataset files:", path)


print("Files in dataset:", os.listdir(path))
csv_path = os.path.join(path, "student_depression_dataset.csv")
df = pd.read_csv(csv_path)

  from .autonotebook import tqdm as notebook_tqdm


Path to dataset files: C:\Users\UPASANA\.cache\kagglehub\datasets\adilshamim8\student-depression-dataset\versions\1
Files in dataset: ['student_depression_dataset.csv']


### Basic stats

In [2]:
print(df.info())
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27901 entries, 0 to 27900
Data columns (total 18 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     27901 non-null  int64  
 1   Gender                                 27901 non-null  object 
 2   Age                                    27901 non-null  float64
 3   City                                   27901 non-null  object 
 4   Profession                             27901 non-null  object 
 5   Academic Pressure                      27901 non-null  float64
 6   Work Pressure                          27901 non-null  float64
 7   CGPA                                   27901 non-null  float64
 8   Study Satisfaction                     27901 non-null  float64
 9   Job Satisfaction                       27901 non-null  float64
 10  Sleep Duration                         27901 non-null  object 
 11  Di


The dataset contains **27,901 entries** and **18 columns**. Each row represents an individual student's response to a mental health survey, covering aspects such as academic stress, lifestyle habits, and mental health indicators. There are **no missing values** in any of the columns, which simplifies preprocessing.

**Numerical (`float64` and `int64`)**:
- `Age`, `Academic Pressure`, `Work Pressure`, `CGPA`, `Study Satisfaction`, `Job Satisfaction`, `Work/Study Hours`
- `Depression` (target: binary 0/1), `id`

**Categorical (`object`)**:
- `Gender`, `City`, `Profession`, `Sleep Duration`, `Dietary Habits`, `Degree`
- `Have you ever had suicidal thoughts ?`, `Financial Stress`, `Family History of Mental Illness`

This combination of quantitative and qualitative data supports both statistical analysis and rich visual exploration.

#### Data Cleaning and Preprocessing

While the dataset is complete, several preprocessing steps were needed to prepare it for analysis:

1. **Binary Conversion**  
   To enable numerical comparison, `Yes`/`No` responses were mapped to binary values:
   - `Have you ever had suicidal thoughts ?` → `Suicidal_Thoughts`
   - `Family History of Mental Illness`  
   ```python
   {"Yes": 1, "No": 0}
   ```

2. **Ordinal Encoding**
   - **Financial stress**
   ```python
   {"Low": 1, "Medium": 2, "High": 3}
   ```
   - **Sleep duration**
   ```python
   {
       "Less than 5 hours": 4,
       "5-6 hours": 5.5,
       "7-8 hours": 7.5,
       "More than 8 hours": 9
   }
   ```
   - **Financial stress**
   ```python
   {"Low": 1, "Medium": 2, "High": 3}
   ```

3. **Column Renaming**  
   Long column names like "Have you ever had suicidal thoughts ?" were renamed to Suicidal_Thoughts for simplicity in coding and plotting.

In [3]:
df_clean = df.copy()

# binary conversion: convert 'Yes'/'No' to 1/0
df_clean['Suicidal_Thoughts'] = df_clean['Have you ever had suicidal thoughts ?'].map({'Yes': 1, 'No': 0})
df_clean['Family_History'] = df_clean['Family History of Mental Illness'].map({'Yes': 1, 'No': 0})

# ordinal encoding: map 'Financial Stress' to numeric 
financial_stress_map = {'Low': 1, 'Medium': 2, 'High': 3}
df_clean['Financial_Stress_Score'] = df_clean['Financial Stress'].map(financial_stress_map)

# map 'Sleep Duration' to estimated hours
sleep_duration_map = {
    'Less than 5 hours': 4,
    '5-6 hours': 5.5,
    '7-8 hours': 7.5,
    'More than 8 hours': 9
}
df_clean['Sleep_Hours'] = df_clean['Sleep Duration'].map(sleep_duration_map)

# rename columns
df_clean.rename(columns={
    'Work/Study Hours': 'Work_Study_Hours',
    'Job Satisfaction': 'Job_Satisfaction',
    'Study Satisfaction': 'Study_Satisfaction',
    'Academic Pressure': 'Academic_Pressure',
    'Work Pressure': 'Work_Pressure',
    'Have you ever had suicidal thoughts ?': 'Suicidal_Thoughts',
    'Sleep Duration': 'Sleep_Duration',
    'Dietary Habits': 'Dietary_Habits',
    'Financial Stress': 'Financial_Stress',
    'Family History of Mental Illness': 'Family_History',
}, inplace=True)


print(df_clean.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27901 entries, 0 to 27900
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      27901 non-null  int64  
 1   Gender                  27901 non-null  object 
 2   Age                     27901 non-null  float64
 3   City                    27901 non-null  object 
 4   Profession              27901 non-null  object 
 5   Academic_Pressure       27901 non-null  float64
 6   Work_Pressure           27901 non-null  float64
 7   CGPA                    27901 non-null  float64
 8   Study_Satisfaction      27901 non-null  float64
 9   Job_Satisfaction        27901 non-null  float64
 10  Sleep_Duration          27901 non-null  object 
 11  Dietary_Habits          27901 non-null  object 
 12  Degree                  27901 non-null  object 
 13  Suicidal_Thoughts       27901 non-null  object 
 14  Work_Study_Hours        27901 non-null

In [4]:
# one-hot encode 'Gender' and 'Profession'
df_encoded = pd.get_dummies(df_clean, columns=['Gender', 'Profession'], drop_first=True)
print(df_encoded.columns)
print(df_encoded.info())
print(df_encoded.head())
#df_encoded.to_csv("student_depression_cleaned.csv", index=False)

Index(['id', 'Age', 'City', 'Academic_Pressure', 'Work_Pressure', 'CGPA',
       'Study_Satisfaction', 'Job_Satisfaction', 'Sleep_Duration',
       'Dietary_Habits', 'Degree', 'Suicidal_Thoughts', 'Work_Study_Hours',
       'Financial_Stress', 'Family_History', 'Depression', 'Suicidal_Thoughts',
       'Family_History', 'Financial_Stress_Score', 'Sleep_Hours',
       'Gender_Male', 'Profession_'Content Writer'',
       'Profession_'Digital Marketer'', 'Profession_'Educational Consultant'',
       'Profession_'UX/UI Designer'', 'Profession_Architect',
       'Profession_Chef', 'Profession_Doctor', 'Profession_Entrepreneur',
       'Profession_Lawyer', 'Profession_Manager', 'Profession_Pharmacist',
       'Profession_Student', 'Profession_Teacher'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27901 entries, 0 to 27900
Data columns (total 34 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               ---

### EDA (Exploratory Data Analysis)

In [5]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(
    data=df_clean,
    x='Suicidal_Thoughts',
    y='Financial_Stress_Score',
    ci='sd'
)
plt.title("Financial Stress vs Suicidal Thoughts")
plt.xlabel("Suicidal Thoughts (0=No, 1=Yes)")
plt.ylabel("Average Financial Stress Score")
plt.show()



The `ci` parameter is deprecated. Use `errorbar='sd'` for the same effect.

  sns.barplot(


ValueError: 2