PROJECT TITLE: DEPRESSION RISK CLASSIFICATION USING MACHINE LEARNING.

GOAL: To build a machine learning model that automatically classifies depression severity (Minimal, Mild, Moderate, Moderately Severe, Severe) based on patient responses to the PHQ-9 questionnaire.

DATASET: PHQ-9 Dataset from Mendeley

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv(r"C:\Users\ADMIN\Downloads\PHQ-9_Dataset_5th Edition.csv")


# Clean Headers
df.columns = df.columns.str.strip()

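# Drop duplicated column names, keeping only the first instance of each.
# (groupby(..., axis=1) is deprecated in recent pandas releases.)
# sort_index(axis=1) preserves the alphabetical column order that groupby produced.
df = df.loc[:, ~df.columns.duplicated()].sort_index(axis=1)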

# Define Questions
phq_questions = [
    'Little interest or pleasure in doing things',
    'Feeling down, depressed, or hopeless',
    'Trouble falling or staying asleep, or sleeping too much',
    'Feeling tired or having little energy',
    'Poor appetite or overeating',
    'Feeling bad about yourself—or that you are a failure or have let yourself or your family down',
    'Trouble concentrating on things, such as reading the newspaper or watching television',
    'Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual',
    'Thoughts that you would be better off dead or of hurting yourself in some way'
]

# Define Mapping
phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3
}

# Apply Cleaning & Mapping
for col in phq_questions:
    # Check if column exists to avoid KeyError
    if col in df.columns:
        df[col] = df[col].astype(str).str.strip().map(phq_mapping)

print("Success! Missing values in questions:", df[phq_questions].isna().sum().sum())
print(df[phq_questions].head())


Success! Missing values in questions: 0
   Little interest or pleasure in doing things  \
0                                            2   
1                                            0   
2                                            0   
3                                            3   
4                                            0   

   Feeling down, depressed, or hopeless  \
0                                     0   
1                                     0   
2                                     0   
3                                     3   
4                                     0   

   Trouble falling or staying asleep, or sleeping too much  \
0                                                  0         
1                                                  3         
2                                                  0         
3                                                  0         
4                                                  0         

   Feeling tired or having 



In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 16 columns):
 #   Column                                                                                                                                                                  Non-Null Count  Dtype 
---  ------                                                                                                                                                                  --------------  ----- 
 0   Age                                                                                                                                                                     682 non-null    int64 
 1   Feeling bad about yourself—or that you are a failure or have let yourself or your family down                                                                           682 non-null    int64 
 2   Feeling down, depressed, or hopeless                                                                      

The output above shows that the dataset contains 682 individuals and 16 columns. Each individual was assessed with the PHQ-9 depression screening questionnaire, and the data also records contextual factors such as age, gender, sleep quality, study pressure, and financial pressure.

Each PHQ-9 response takes one of four categories: Not at all, Several days, More than half the days, or Nearly every day. These are ordinal variables, not nominal ones, which is why they were mapped to the integers 0 to 3 above.
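As a minimal sketch of the ordinal encoding described above, the snippet below maps response labels to 0-3 and reports any label that fails to map, rather than letting it silently become NaN. The helper name `encode_ordinal` and the toy responses are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import pandas as pd

# Same label-to-score mapping as used in the notebook.
phq_mapping = {
    'Not at all': 0,
    'Several days': 1,
    'More than half the days': 2,
    'Nearly every day': 3,
}

def encode_ordinal(series, mapping):
    """Hypothetical helper: map labels to ordinal codes and list unmapped labels."""
    cleaned = series.str.strip()              # remove stray whitespace around labels
    encoded = cleaned.map(mapping)            # labels missing from the mapping become NaN
    unmapped = sorted(cleaned[encoded.isna() & cleaned.notna()].unique())
    return encoded, unmapped

# Toy responses (made up): extra whitespace, a wrongly cased label, a missing value.
toy = pd.Series([' Not at all', 'Several days ', 'nearly every day', None])
encoded, unmapped = encode_ordinal(toy, phq_mapping)

print(encoded.tolist())
print("Unmapped labels:", unmapped)
```

Listing the unmapped labels makes spelling or casing variations visible before they turn into missing values.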


We will now prepare the target (PHQ_Severity).

Our target variable is categorical (e.g. Severe, Mild). We map these categories to the following ordered scale:

 0: None-Minimal

 1: Mild

 2: Moderate

 3: Moderately Severe

 4: Severe
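For context, the ordered scale above lines up with the standard PHQ-9 scoring bands on the 0-27 total score (0-4, 5-9, 10-14, 15-19, 20-27). The sketch below shows that banding with `pd.cut`. It is an assumption that the dataset's PHQ_Severity labels were derived with exactly these cutoffs; the notebook itself does not state this, and the total scores used here are made-up boundary values.

```python
import pandas as pd

# Standard PHQ-9 bands on the total score (assumed, not confirmed for this dataset).
bins = [-1, 4, 9, 14, 19, 27]      # right-inclusive edges: (-1, 4], (4, 9], ...
codes = [0, 1, 2, 3, 4]            # 0: None-Minimal ... 4: Severe

# Made-up totals sitting on each band boundary.
totals = pd.Series([0, 4, 5, 9, 10, 14, 15, 19, 20, 27])
severity = pd.cut(totals, bins=bins, labels=codes).astype(int)

print(severity.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```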

In [6]:
# Check the unique labels to make sure the mapping covers all of them
print("Unique Severity Labels:", df['PHQ_Severity'].unique())

# Define the ordinal target mapping
severity_mapping = {
    'Minimal': 0,
    'Mild': 1,
    'Moderate': 2,
    'Moderately severe': 3,
    'Severe': 4,
    'None-minimal': 0 # Handling potential variation in data
}

# Apply the mapping
df['Severity_Encoded'] = df['PHQ_Severity'].str.strip().map(severity_mapping)

# Drop rows where the target is missing
df = df.dropna(subset=['Severity_Encoded'])

print("Target distribution:")
print(df['Severity_Encoded'].value_counts().sort_index())

Unique Severity Labels: ['Minimal' 'Moderately severe' 'Moderate' 'Mild' 'Severe']
Target distribution:
Severity_Encoded
0    206
1    155
2    128
3    125
4     68
Name: count, dtype: int64



Next come feature selection and the train-test split. The split is stratified on the target so that all severity levels are represented proportionally in both sets.

We select only the nine PHQ-9 question responses as our input features (X) and exclude demographic data (Age, Gender) and derived scores (PHQ_Total) so that the model relies solely on the symptoms.

In [7]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df[phq_questions]           # Only the 9 questions
y = df['Severity_Encoded']      # The encoded severity level

# Train-test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(f"Training Data Shape: {X_train.shape}")
print(f"Testing Data Shape: {X_test.shape}")

Training Data Shape: (545, 9)
Testing Data Shape: (137, 9)
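One way to confirm that stratification worked is to compare class proportions across the full, training, and test sets. The sketch below rebuilds a target vector from the class counts printed earlier (206, 155, 128, 125, 68) and uses a placeholder feature column, so it runs without the original CSV; the names `y_demo`, `X_demo`, and `proportions` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts taken from the target distribution printed above.
counts = {0: 206, 1: 155, 2: 128, 3: 125, 4: 68}
y_demo = pd.Series(
    np.repeat(list(counts.keys()), list(counts.values())),
    name="Severity_Encoded",
)
# Placeholder feature column; only the labels matter for this check.
X_demo = pd.DataFrame({"placeholder": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Share of each severity level in the full, training, and test sets.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()

print(len(y_tr), len(y_te))
print(proportions.round(3))
```

With `stratify` set, the three columns should agree to within about one sample per class; without it, the smallest class (Severe) could end up over- or under-represented in the test set.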


We use Logistic Regression as our baseline model. The features are standardized first, which helps the solver converge, and the scaler is placed in a pipeline with the classifier so that it is fitted on the training data only. The model is then trained and evaluated on the held-out test set.

In [None]:
#  LOGISTIC REGRESSION BASELINE MODEL
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

# 1. Create a Pipeline
# Scaling is crucial for Logistic Regression convergence
# We use a pipeline to ensure scaling happens automatically before training
log_reg_pipeline = make_pipeline(
    StandardScaler(),
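    # lbfgs already fits a multinomial model for multiclass targets;
    # the explicit multi_class argument is deprecated since scikit-learn 1.5.
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)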
)

# 2. Train the Baseline
log_reg_pipeline.fit(X_train, y_train)
print("Baseline Logistic Regression trained successfully.")

# 3. Evaluate on Test Data
y_pred_baseline = log_reg_pipeline.predict(X_test)

# 4. Print Results
baseline_acc = accuracy_score(y_test, y_pred_baseline)
print(f"Baseline Accuracy: {baseline_acc:.4f}")

print("\nClassification Report (Baseline):")
print(classification_report(y_test, y_pred_baseline))
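Because the five severity levels are unevenly sized, a confusion matrix and a macro-averaged F1 score can complement plain accuracy. The sketch below runs the same kind of scaling-plus-logistic-regression pipeline on synthetic stand-in data, not the real PHQ-9 dataset, so that it is runnable on its own. The data generator, the variable names, and the use of the standard PHQ-9 total-score bands to create labels are all assumptions for illustration; the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the nine PHQ-9 items (scores 0-3); NOT the real dataset.
n = 600
propensity = rng.uniform(0, 1, size=n)                     # per-respondent symptom level
X_syn = rng.binomial(3, propensity[:, None], size=(n, 9))  # nine item scores per respondent
# Labels from the standard total-score bands (an assumption for this demo):
# 0-4 -> 0, 5-9 -> 1, 10-14 -> 2, 15-19 -> 3, 20-27 -> 4
y_syn = np.digitize(X_syn.sum(axis=1), bins=[5, 10, 15, 20])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=42
)

# Scaling lives inside the pipeline, so it is fitted on the training split only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

labels = [0, 1, 2, 3, 4]
cm = confusion_matrix(y_te, y_hat, labels=labels)  # rows: true class, columns: predicted
macro_f1 = f1_score(y_te, y_hat, average="macro")  # unweighted mean of per-class F1

print(cm)
print(f"Macro F1: {macro_f1:.3f}")
```

In the notebook itself, the same two calls could be applied to `y_test` and `y_pred_baseline`; off-diagonal counts next to the diagonal would indicate confusion between adjacent severity levels.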