In [1]:
import os
print(os.getcwd())

C:\Users\ADMIN


PROJECT TITLE: DEPRESSION RISK CLASSIFICATION USING MACHINE LEARNING.

GOAL: To predict whether an individual is at a high risk of depression based on survey responses.

DATASET: PHQ-9 Dataset from Mendeley

In [16]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv(r"C:\Users\ADMIN\Downloads\PHQ-9_Dataset_5th Edition.csv")
df.head()


Unnamed: 0,Age,Gender,Little interest or pleasure in doing things,"Feeling down, depressed, or hopeless","Trouble falling or staying asleep, or sleeping too much",Feeling tired or having little energy,Poor appetite or overeating,Feeling bad about yourself—or that you are a failure or have let yourself or your family down,"Trouble concentrating on things, such as reading the newspaper or watching television",Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual,Thoughts that you would be better off dead or of hurting yourself in some way,PHQ_Total,PHQ_Severity,Sleep Quality,Study Pressure,Financial Pressure
0,22,Male,More than half the days,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,More than half the days,Not at all,4,Minimal,Good,Good,Average
1,25,Male,Not at all,Not at all,Nearly every day,Nearly every day,Nearly every day,Not at all,More than half the days,More than half the days,More than half the days,15,Moderately severe,Worst,Bad,Average
2,22,Female,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,Several days,Not at all,Not at all,1,Minimal,Average,Bad,Average
3,18,Female,Nearly every day,Nearly every day,Not at all,Nearly every day,More than half the days,Not at all,Not at all,Not at all,Not at all,11,Moderate,Average,Bad,Worst
4,24,Male,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,More than half the days,Not at all,2,Minimal,Good,Average,Good


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 16 columns):
 #   Column                                                                                                                                                                    Non-Null Count  Dtype 
---  ------                                                                                                                                                                    --------------  ----- 
 0   Age                                                                                                                                                                       682 non-null    int64 
 1   Gender                                                                                                                                                                    682 non-null    object
 2   Little interest or pleasure in doing things                                                       

The dataset contains 682 individuals assessed with the PHQ-9 depression screening questionnaire, along with contextual factors: age, gender, sleep quality, study pressure, and financial pressure. The nine PHQ-9 items below are the heart of our model:
Little interest or pleasure in doing things

Feeling down, depressed, or hopeless

Trouble sleeping

Low energy

Appetite changes

Negative self-perception

Concentration difficulties

Psychomotor agitation/retardation

Suicidal ideation

Each response is categorical, with four levels: Not at all, Several days, More than half the days, and Nearly every day. Because these levels have a natural order, the variables are ordinal, not nominal.

Next comes data preparation and encoding.

We start by defining the prediction task: predicting depression risk as a binary target variable. To do this, we create a new column called HighRisk_Depression, defined as follows:

0: PHQ_Total < 10 (Minimal/Mild)

1: PHQ_Total >= 10 (Moderate/Moderately severe/Severe)

In [17]:
# Binary target: 1 if PHQ_Total is 10 or above, otherwise 0
df['HighRisk_Depression'] = (df['PHQ_Total'] >= 10).astype(int)

# Class proportions
df['HighRisk_Depression'].value_counts(normalize=True)

HighRisk_Depression
0    0.529326
1    0.470674
Name: proportion, dtype: float64
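To see how the cutoff of 10 lines up with the severity labels, here is a small sketch on toy scores. It assumes the standard PHQ-9 bands (0-4, 5-9, 10-14, 15-19, 20-27); the toy values and the exact label spellings are illustrative and are not read from the dataset.

```python
import pandas as pd

# Toy PHQ_Total scores (illustrative values, not taken from the dataset)
toy = pd.DataFrame({"PHQ_Total": [0, 4, 5, 9, 10, 14, 15, 19, 20, 27]})

# Standard PHQ-9 severity bands; bins are right-inclusive: (-1, 4], (4, 9], ...
bins = [-1, 4, 9, 14, 19, 27]
labels = ["Minimal", "Mild", "Moderate", "Moderately severe", "Severe"]
toy["Severity"] = pd.cut(toy["PHQ_Total"], bins=bins, labels=labels)

# Binary target using the same cutoff as the notebook (PHQ_Total >= 10)
toy["HighRisk_Depression"] = (toy["PHQ_Total"] >= 10).astype(int)

# Every high-risk row should fall in the Moderate band or above
print(pd.crosstab(toy["Severity"], toy["HighRisk_Depression"]))
```

Running the same crosstab on the real PHQ_Severity column would be a quick way to confirm that the binary target agrees with the labels shipped in the file.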

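We now drop the leakage columns, PHQ_Total and PHQ_Severity. The target is derived directly from PHQ_Total, and PHQ_Severity is a banded version of the same score, so leaving either one in the features would hand the model the answer.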

In [10]:
df= df.drop(columns=['PHQ_Total','PHQ_Severity'])
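Because notebook cells are often rerun out of order, it can help to make the drop idempotent. The sketch below uses a hypothetical helper, drop_leakage, and a toy frame; neither is part of the original notebook.

```python
import pandas as pd

# Columns that encode the target (names taken from the dataset)
LEAKAGE_COLUMNS = ["PHQ_Total", "PHQ_Severity"]

def drop_leakage(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the leakage columns; safe to call more than once."""
    return frame.drop(columns=LEAKAGE_COLUMNS, errors="ignore")

# Toy frame (illustrative rows only)
toy = pd.DataFrame({
    "Age": [22, 25, 18],
    "PHQ_Total": [4, 15, 11],
    "PHQ_Severity": ["Minimal", "Moderately severe", "Moderate"],
})
toy["HighRisk_Depression"] = (toy["PHQ_Total"] >= 10).astype(int)

# PHQ_Total alone reproduces the label exactly, which is why it must go
assert ((toy["PHQ_Total"] >= 10).astype(int) == toy["HighRisk_Depression"]).all()

clean = drop_leakage(toy)
clean = drop_leakage(clean)  # second call is a no-op rather than a KeyError
print(clean.columns.tolist())
```

With errors="ignore", rerunning the cell after the columns are already gone does not raise, which a plain drop would.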

We now encode the PHQ-9 questionnaire responses. Because they are ordinal rather than nominal, we map them to integers that preserve their order:

Response                  Value
Not at all                0
Several days              1
More than half the days   2
Nearly every day          3

In [21]:
# Start by stripping stray whitespace from the column names
df.columns = df.columns.str.strip()

print(df.columns.tolist())

['Age', 'Gender', 'Little interest or pleasure in doing things', 'Feeling down, depressed, or hopeless', 'Trouble falling or staying asleep, or sleeping too much', 'Feeling tired or having little energy', 'Poor appetite or overeating', 'Feeling bad about yourself—or that you are a failure or have let yourself or your family down', 'Trouble concentrating on things, such as reading the newspaper or watching television', 'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual', 'Thoughts that you would be better off dead or of hurting yourself in some way', 'PHQ_Total', 'PHQ_Severity', 'Sleep Quality', 'Study Pressure', 'Financial Pressure', 'HighRisk_Depression']


In [22]:
phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3
}

phq_columns = [
    'Little interest or pleasure in doing things',
    'Feeling down, depressed, or hopeless',
    'Trouble falling or staying asleep, or sleeping too much',
    'Feeling tired or having little energy',
    'Poor appetite or overeating',
    'Feeling bad about yourself—or that you are a failure or have let yourself or your family down',
    'Trouble concentrating on things, such as reading the newspaper or watching television',
    'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual',
    'Thoughts that you would be better off dead or of hurting yourself in some way'
]

for col in phq_columns:
    df[col] = df[col].map(phq_mapping)
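One thing to watch with Series.map is that any label missing from the dictionary silently becomes NaN. A minimal sketch of a check for this, using a toy frame with two hypothetical item columns (item_1 and item_2 are placeholders, not dataset columns):

```python
import pandas as pd

phq_mapping = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}

# Toy frame; the last row of item_1 has a stray trailing space
# that the mapping does not recognise
toy = pd.DataFrame({
    "item_1": ["Not at all", "Several days", "Nearly every day "],
    "item_2": ["More than half the days", "Not at all", "Several days"],
})

encoded = toy.copy()
for col in ["item_1", "item_2"]:
    # Strip whitespace first, then map; unmatched labels would become NaN
    encoded[col] = toy[col].str.strip().map(phq_mapping)

# Count labels the mapping failed to cover, per column
unmapped = encoded.isna().sum()
print(unmapped)

# Without stripping, the stray space produces a NaN in item_1
raw_unmapped = toy["item_1"].map(phq_mapping).isna().sum()
print(raw_unmapped)
```

A further sanity check, where PHQ_Total is still available, is that the nine encoded items sum to it; this holds for the five rows shown by df.head() above.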


Next, we encode the remaining categorical variable, Gender:

In [23]:
# Any label other than 'Male' or 'Female' would become NaN here
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

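We now encode the ordinal lifestyle variables. All three are mapped with the same Good/Average/Bad/Worst scale, with higher values indicating a worse rating.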

In [25]:
pressure_mapping = {
    'Good': 0,
    'Average': 1,
    'Bad': 2,
    'Worst': 3
}

df['Sleep Quality'] = df['Sleep Quality'].map(pressure_mapping)
df['Study Pressure'] = df['Study Pressure'].map(pressure_mapping)
df['Financial Pressure'] = df['Financial Pressure'].map(pressure_mapping)

We now split the data into features (X) and target (y).

In [26]:
X = df.drop(columns=['HighRisk_Depression'])

y = df['HighRisk_Depression']

print(X.shape,y.shape)

(682, 16) (682,)


We now perform a stratified train-test split. Stratifying on y preserves the class distribution in both the training and test sets.

In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

HighRisk_Depression
0    0.528376
1    0.471624
Name: proportion, dtype: float64
HighRisk_Depression
0    0.532164
1    0.467836
Name: proportion, dtype: float64


From the results above, we can see that the dataset exhibits a slight class imbalance, with approximately 53% low-risk and 47% high-risk cases, and that the stratified split preserves these proportions in both sets. A classifier that always predicts the majority class (low risk) would therefore score about 53% accuracy. This is our baseline: any model we build must outperform it to be meaningful, and it gives us a reference point for comparing models and demonstrating improvement.
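The majority-class baseline can also be computed explicitly with scikit-learn's DummyClassifier. The sketch below uses synthetic labels with roughly the same 53/47 split; the arrays are made up for illustration and are not the notebook's y_train and y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly a 53/47 split (illustrative only)
y_train_toy = np.array([0] * 53 + [1] * 47)
y_test_toy = np.array([0] * 16 + [1] * 14)

# DummyClassifier ignores the features, so a column of zeros is enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

# Accuracy equals the share of the majority class in the test labels
baseline_accuracy = baseline.score(X_test_toy, y_test_toy)
print(round(baseline_accuracy, 3))
```

In the notebook itself, the same two lines of fit and score on X_train, y_train, X_test, and y_test would give the reference number to beat.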

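We now rebuild the feature matrix so that it excludes the label and both leakage columns. The shape (682, 16) printed above shows that PHQ_Total and PHQ_Severity were still present in X, so the train-test split must be rerun on this corrected matrix.

In [29]:
# Feature matrix: drop the target and both leakage columns.
# errors='ignore' keeps this cell safe if the columns were already dropped.
X = df.drop(columns=[
    'PHQ_Total',
    'PHQ_Severity',
    'HighRisk_Depression'
], errors='ignore')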

In [31]:
# Next step: standardize the features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
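A common way to use the scaler, sketched here on toy arrays, is to fit it on the training rows only and then reuse those statistics for the test rows, so that nothing about the test set influences preprocessing. The arrays below are illustrative stand-ins for X_train and X_test.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for X_train / X_test (illustrative values)
X_train_toy = np.array([[18.0, 0.0], [22.0, 1.0], [26.0, 3.0]])
X_test_toy = np.array([[24.0, 2.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_train_toy)

# Reuse the training statistics on the test rows; do not refit here
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))  # approximately zero per column
print(scaler.mean_)                 # training means applied to both sets
```

Scaling matters for distance-based and linear models; tree-based models are generally insensitive to it, so whether this step is needed depends on the classifier chosen later.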