PROJECT TITLE: DEPRESSION RISK CLASSIFICATION USING MACHINE LEARNING.

GOAL: To predict whether an individual is at a high risk of depression based on survey responses.

DATASET: PHQ-9 Dataset from Mendeley

In [45]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv(r"C:\Users\ADMIN\Downloads\PHQ-9_Dataset_5th Edition.csv")
df.head()


Unnamed: 0,Age,Gender,Little interest or pleasure in doing things,"Feeling down, depressed, or hopeless","Trouble falling or staying asleep, or sleeping too much",Feeling tired or having little energy,Poor appetite or overeating,Feeling bad about yourself—or that you are a failure or have let yourself or your family down,"Trouble concentrating on things, such as reading the newspaper or watching television",Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual,Thoughts that you would be better off dead or of hurting yourself in some way,PHQ_Total,PHQ_Severity,Sleep Quality,Study Pressure,Financial Pressure
0,22,Male,More than half the days,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,More than half the days,Not at all,4,Minimal,Good,Good,Average
1,25,Male,Not at all,Not at all,Nearly every day,Nearly every day,Nearly every day,Not at all,More than half the days,More than half the days,More than half the days,15,Moderately severe,Worst,Bad,Average
2,22,Female,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,Several days,Not at all,Not at all,1,Minimal,Average,Bad,Average
3,18,Female,Nearly every day,Nearly every day,Not at all,Nearly every day,More than half the days,Not at all,Not at all,Not at all,Not at all,11,Moderate,Average,Bad,Worst
4,24,Male,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,Not at all,More than half the days,Not at all,2,Minimal,Good,Average,Good


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 16 columns):
 #   Column                                                                                                                                                                    Non-Null Count  Dtype 
---  ------                                                                                                                                                                    --------------  ----- 
 0   Age                                                                                                                                                                       682 non-null    int64 
 1   Gender                                                                                                                                                                    682 non-null    object
 2   Little interest or pleasure in doing things                                                       

The dataset contains 682 individuals assessed with the PHQ-9 depression screening questionnaire, along with contextual factors: age, gender, sleep quality, study pressure, and financial pressure. The nine PHQ-9 items listed below are the heart of our model:
Little interest or pleasure in doing things

Feeling down, depressed, or hopeless

Trouble sleeping

Low energy

Appetite changes

Negative self-perception

Concentration difficulties

Psychomotor agitation/retardation

Suicidal ideation

Each response falls into one of four categories: not at all, several days, more than half the days, or nearly every day. Because these categories have a natural order, the variables are ordinal, not nominal.

We now move on to data preparation and encoding.

We start by defining the prediction task: predicting depression risk as a binary target variable. We create a new column called HighRisk_Depression, coded as follows:

0: PHQ_Total < 10 (minimal/mild)

1: PHQ_Total >= 10 (moderate/moderately severe/severe)

In [46]:
df['HighRisk_Depression'] = (df['PHQ_Total'] >= 10).astype(int)

df['HighRisk_Depression'].value_counts(normalize=True)

HighRisk_Depression
0    0.529326
1    0.470674
Name: proportion, dtype: float64

Next, we drop the leakage columns, PHQ_Total and PHQ_Severity, because the target was derived directly from PHQ_Total.

In [47]:
df = df.drop(columns=['PHQ_Total', 'PHQ_Severity'])

We now encode the PHQ-9 questionnaire responses. Because they are ordinal rather than nominal, we map them to integers that preserve their order:

Response                  Value
Not at all                0
Several days              1
More than half the days   2
Nearly every day          3

In [75]:
# 1. Strip stray whitespace from every column name in the dataframe
df.columns = df.columns.str.strip()

# Map each PHQ-9 response to its ordinal score; defined here so that this cell
# does not depend on a later cell having been run first
phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3
}

# 2. List the nine PHQ-9 item columns, without leading or trailing spaces
phq_columns = [
    'Little interest or pleasure in doing things',
    'Feeling down, depressed, or hopeless',
    'Trouble falling or staying asleep, or sleeping too much',
    'Feeling tired or having little energy',
    'Poor appetite or overeating',
    'Feeling bad about yourself—or that you are a failure or have let yourself or your family down',
    'Trouble concentrating on things, such as reading the newspaper or watching television',
    'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual',
    'Thoughts that you would be better off dead or of hurting yourself in some way'
]

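# 3. Clean the text inside the cells, then map it to integers
for col in phq_columns:
    # Skip columns that are absent or already numeric: mapping encoded values
    # a second time would turn every entry into NaN
    if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
        # Convert to string and strip spaces so that 'Not at all' matches exactly
        df[col] = df[col].astype(str).str.strip()
        df[col] = df[col].map(phq_mapping)

# 4. Sanity check: with no missing answers, the total should equal the number
#    of rows times nine; 0 means that no value matched a key in phq_mapping
print("Total non-null values after mapping:", df[phq_columns].notnull().sum().sum())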

Total non-null values after mapping: 0

A total of 0 means that every PHQ-9 value became NaN. The most likely cause is that the mapping was applied to columns that had already been encoded in an earlier run: the integers 0 to 3 are cast to the strings '0' to '3', none of which is a key in phq_mapping, so each lookup returns NaN.


In [76]:
phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3
}

phq_columns = [
    'Little interest or pleasure in doing things',
    'Feeling down, depressed, or hopeless',
    'Trouble falling or staying asleep, or sleeping too much',
    'Feeling tired or having little energy',
    'Poor appetite or overeating',
    'Feeling bad about yourself—or that you are a failure or have let yourself or your family down',
    'Trouble concentrating on things, such as reading the newspaper or watching television',
    'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual',
    'Thoughts that you would be better off dead or of hurting yourself in some way'
]

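for col in phq_columns:
    # Only map columns that still hold text; re-mapping encoded integers would yield NaN
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].map(phq_mapping)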
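
One way to make the encoding step safe to re-run is to wrap it in a small helper that skips columns which are already numeric. The sketch below is only an illustration on a toy frame: the function name `encode_phq_items` and the toy column `item` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3,
}

def encode_phq_items(frame, columns, mapping):
    """Return a copy of `frame` with text responses in `columns` mapped to integers.

    Hypothetical helper: columns that are already numeric are left untouched,
    so calling it twice gives the same result as calling it once.
    """
    out = frame.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already encoded; mapping again would produce NaN
        cleaned = out[col].astype(str).str.strip()
        out[col] = cleaned.map(mapping)
    return out

# Toy data with stray whitespace, as can occur in survey exports
toy = pd.DataFrame({'item': ['Not at all ', ' Several days', 'Nearly every day']})
once = encode_phq_items(toy, ['item'], phq_mapping)
twice = encode_phq_items(once, ['item'], phq_mapping)

print(once['item'].tolist())
print(twice['item'].tolist())
```

Both calls should print the same encoded values, which is the property that matters when notebook cells are executed out of order.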


We now encode the remaining categorical variables:

In [64]:
df['Gender'] = df['Gender'].map({'Male': 0,'Female': 1})

Next, we apply ordinal mappings to the sleep quality and pressure variables.

In [66]:
ordinal_mappings = {
    'Sleep Quality': {'Worst': 0, 'Average': 1, 'Good': 2},
    'Study Pressure': {'Bad': 0, 'Average': 1, 'Good': 2},
    'Financial Pressure': {'Bad': 0, 'Average': 1, 'Good': 2}
}
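# Apply the mappings. Labels that are missing from a mapping become NaN; note that
# the data preview above shows 'Worst' under Financial Pressure, which is not a key here.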
for col, mapping in ordinal_mappings.items():
    df[col] = df[col].map(mapping)
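
Because `Series.map` with a dictionary returns NaN for any value that is not one of its keys, it can help to list the labels a mapping does not cover before applying it. The following is a minimal sketch on toy data; the helper name `unmapped_labels` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def unmapped_labels(series, mapping):
    """Return the sorted labels in `series` that have no key in `mapping`."""
    present = set(series.dropna().astype(str).str.strip().unique())
    return sorted(present - set(mapping))

# Toy column; 'Worst' is deliberately absent from the mapping
toy = pd.Series(['Good', 'Average', 'Worst', 'Bad', 'Good'])
toy_mapping = {'Bad': 0, 'Average': 1, 'Good': 2}

missing = unmapped_labels(toy, toy_mapping)
print(missing)
```

An empty list would indicate that every label present in the column has a code; anything else would be silently turned into NaN by the mapping.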


Next, we split the data into features and target.

In [67]:
X = df.drop(columns=['HighRisk_Depression'])

y = df['HighRisk_Depression']

print(X.shape,y.shape)

(682, 14) (682,)


We now perform a stratified train-test split. Stratified sampling preserves the class distribution in both the training and test sets.

In [68]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

HighRisk_Depression
0    0.528376
1    0.471624
Name: proportion, dtype: float64
HighRisk_Depression
0    0.532164
1    0.467836
Name: proportion, dtype: float64


The results above show that the dataset exhibits a slight class imbalance, with approximately 53% low-risk and 47% high-risk cases, and that stratification preserved this ratio in both splits. A classifier that always predicts the majority class would therefore reach about 53% accuracy. This is our baseline: any model we build must outperform it to be meaningful, and it gives us a reference point for comparing models and demonstrating improvement.
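
The majority-class baseline can be computed explicitly with scikit-learn's `DummyClassifier`. The sketch below uses toy labels with roughly the same 53/47 split rather than the real data; the names `y_toy`, `X_toy`, and `baseline_accuracy` are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with roughly the same 53/47 split as the dataset (not the real data)
y_toy = np.array([0] * 53 + [1] * 47)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

In the notebook itself, the same idea would apply by fitting on `y_train` and scoring on `y_test`.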

The ordinal variables were encoded using domain-informed mappings and nominal variables were one-hot encoded.

In [69]:
#One-Hot Encoding nominal columns
nominal_cols = ['Gender']

X_train = pd.get_dummies(X_train, columns=nominal_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=nominal_cols, drop_first=True)

# Align columns
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
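
To see why the alignment step matters, consider a toy case in which the test set lacks a category that the training set contains. This is a sketch under assumed data; the column `color` and its values are invented for illustration.

```python
import pandas as pd

# Toy frames: category 'C' appears only in the training data
train = pd.DataFrame({'color': ['A', 'B', 'C']})
test = pd.DataFrame({'color': ['A', 'B']})

train_enc = pd.get_dummies(train, columns=['color'], drop_first=True)
test_enc = pd.get_dummies(test, columns=['color'], drop_first=True)

# Before alignment the two frames have different columns
print(list(train_enc.columns), list(test_enc.columns))

# join='left' keeps the training columns; columns missing in test are filled with 0
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)
print(list(test_enc.columns))
```

With `join='left'`, a category that appears only in the test set would be dropped, which keeps the test matrix compatible with a model fitted on the training columns.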

We now rebuild the feature matrix X by dropping the label column.

In [70]:
#Feature Matrix
X = df.drop(columns=[
    'HighRisk_Depression'
])


In [62]:
# Check how many non-null values are in these columns
print(df[phq_columns].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 9 columns):
 #   Column                                                                                                                                                                  Non-Null Count  Dtype  
---  ------                                                                                                                                                                  --------------  -----  
 0   Little interest or pleasure in doing things                                                                                                                             682 non-null    int64  
 1   Feeling down, depressed, or hopeless                                                                                                                                    682 non-null    int64  
 2   Trouble falling or staying asleep, or sleeping too much                                                

We now handle the missing values (NaNs) in our training data using a SimpleImputer.

In [77]:
from sklearn.impute import SimpleImputer

# 1. Initialize the imputer (the median is a reasonable choice for ordinal data)
imputer = SimpleImputer(strategy='median')

# 2. Fit and transform the training data, and transform the test data
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

print("Missing values handled.")

Missing values handled.


 'Feeling down, depressed, or hopeless'
 'Trouble falling or staying asleep, or sleeping too much'
 'Feeling tired or having little energy' 'Poor appetite or overeating'
 'Feeling bad about yourself—or that you are a failure or have let yourself or your family down'
 'Trouble concentrating on things, such as reading the newspaper or watching television'
 'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual'
 'Thoughts that you would be better off dead or of hurting yourself in some way']. At least one non-missing value is needed for imputation with strategy='median'.
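
A warning like the one above arises when a column contains no observed values at all. A quick check before imputing can surface such columns. The sketch below uses a toy frame with hypothetical columns `age` and `item_1`; it also shows that, with default settings, `SimpleImputer` drops an entirely empty column from its output.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    'age': [22, 25, np.nan],
    'item_1': [np.nan, np.nan, np.nan],  # entirely missing
})

# Columns in which every value is NaN
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# With default settings the imputer discards such columns from its output
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    imputed = SimpleImputer(strategy='median').fit_transform(toy)
print(imputed.shape)
```

If the list of empty columns is not empty, the encoding step upstream is worth revisiting before any model is fitted, since those features would not reach the model.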

We now scale the features, because the PHQ scores, age, and pressure variables are on different scales. Scaling also prevents models from being biased towards features with large values.

In [78]:
# Scale the numeric features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the imputed training data only, then apply it to the test data
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)



We now fit a logistic regression model, which will serve as our benchmark.

In [73]:
from sklearn.linear_model import LogisticRegression

# Initializing with 'balanced' weights to handle the slight class imbalance
log_reg = LogisticRegression(class_weight='balanced', random_state=42)


log_reg.fit(X_train_scaled, y_train)
print("Logistic Regression Model Fitted Successfully!")

Logistic Regression Model Fitted Successfully!
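
The imputation, scaling, and model-fitting steps can also be chained in a single scikit-learn pipeline, so that each preprocessing step is fitted on the training data only and cells cannot be applied in the wrong order. The sketch below runs on synthetic data generated for the purpose; it is not the notebook's dataset, and the variable names are placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # sprinkle in missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Impute, scale, and classify in one estimator
model = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    LogisticRegression(class_weight='balanced', random_state=42),
)
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

In the notebook, the same pipeline could be fitted on `X_train` and `y_train` in place of the separate imputer, scaler, and classifier objects.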


We now evaluate the model on the test set.

In [74]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = log_reg.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7426900584795322

Classification Report:

              precision    recall  f1-score   support

           0       0.81      0.68      0.74        91
           1       0.69      0.81      0.75        80

    accuracy                           0.74       171
   macro avg       0.75      0.75      0.74       171
weighted avg       0.75      0.74      0.74       171

Confusion Matrix:

[[62 29]
 [15 65]]
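
The figures in the classification report can be reproduced by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following sketch does the arithmetic with the matrix printed above; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[62, 29],
               [15, 65]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (65 + 62) / 171
precision_1 = tp / (tp + fp)      # 65 / 94
recall_1 = tp / (tp + fn)         # 65 / 80
precision_0 = tn / (tn + fn)      # 62 / 77
recall_0 = tn / (tn + fp)         # 62 / 91

print(round(accuracy, 4))
print(round(precision_1, 2), round(recall_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```

Read this way, the model misses 15 of the 80 high-risk cases in the test set and flags 29 of the 91 low-risk cases as high risk.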


In [79]:
import os
print(os.getcwd())

C:\Users\ADMIN
