# Data Preprocessing

This notebook cleans and standardizes the **MHP dataset** for modelling.  

Steps performed:
1. Rename demographic columns and normalize their values
2. Rename and recode **PSS-10 (Stress)** items
3. Rename and recode **GAD-7 (Anxiety)** items
4. Rename and recode **PHQ-9 (Depression)** items
5. Create `Depression Value` (PHQ-9 sum) and categorical `Depression Label`
6. Check for missing and duplicate values; impute and de-duplicate where needed
7. Drop the derived `Depression Value` column
8. Save the processed dataset

In [1]:
import pandas as pd
from pathlib import Path

RAW_PATH = Path("../data/raw/mhp_dataset.csv")
PROCESSED_PATH = Path("../data/processed/mhp_processed.csv")

df = pd.read_csv(RAW_PATH)
print("Shape before processing:", df.shape)
df.head()

Shape before processing: (2028, 33)


Unnamed: 0,1. Age,2. Gender,3. University,4. Department,5. Academic Year,6. Current CGPA,7. Did you receive a waiver or scholarship at your university?,"1. In a semester, how often have you felt upset due to something that happened in your academic affairs?","2. In a semester, how often you felt as if you were unable to control important things in your academic affairs?","3. In a semester, how often you felt nervous and stressed because of academic pressure?",...,"7. In a semester, how often have you felt afraid, as if something awful might happen?","1. In a semester, how often have you had little interest or pleasure in doing things?","2. In a semester, how often have you been feeling down, depressed or hopeless?","3. In a semester, how often have you had trouble falling or staying asleep, or sleeping too much?","4. In a semester, how often have you been feeling tired or having little energy?","5. In a semester, how often have you had poor appetite or overeating?","6. In a semester, how often have you been feeling bad about yourself - or that you are a failure or have let yourself or your family down?","7. In a semester, how often have you been having trouble concentrating on things, such as reading the books or watching television?","8. In a semester, how often have you moved or spoke too slowly for other people to notice? Or you've been moving a lot more than usual because you've been restless?","9. In a semester, how often have you had thoughts that you would be better off dead, or of hurting yourself?"
0,18-22,Female,"Independent University, Bangladesh (IUB)",Engineering - CS / CSE / CSC / Similar to CS,Second Year or Equivalent,2.50 - 2.99,No,3 - Fairly Often,4 - Very Often,3 - Fairly Often,...,2 - More than half the days,2 - More than half the days,2 - More than half the days,3 - Nearly every day,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days,3 - Nearly every day,2 - More than half the days
1,18-22,Male,"Independent University, Bangladesh (IUB)",Engineering - CS / CSE / CSC / Similar to CS,Third Year or Equivalent,3.00 - 3.39,No,3 - Fairly Often,3 - Fairly Often,4 - Very Often,...,2 - More than half the days,3 - Nearly every day,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days,2 - More than half the days
2,18-22,Male,American International University Bangladesh (...,Engineering - CS / CSE / CSC / Similar to CS,Third Year or Equivalent,3.00 - 3.39,No,0 - Never,0 - Never,0 - Never,...,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all,0 - Not at all
3,18-22,Male,American International University Bangladesh (...,Engineering - CS / CSE / CSC / Similar to CS,Third Year or Equivalent,3.00 - 3.39,No,3 - Fairly Often,1 - Almost Never,2 - Sometimes,...,2 - More than half the days,2 - More than half the days,1 - Several days,2 - More than half the days,1 - Several days,2 - More than half the days,1 - Several days,2 - More than half the days,2 - More than half the days,1 - Several days
4,18-22,Male,North South University (NSU),Engineering - CS / CSE / CSC / Similar to CS,Second Year or Equivalent,2.50 - 2.99,No,4 - Very Often,4 - Very Often,4 - Very Often,...,3 - Nearly every day,1 - Several days,3 - Nearly every day,3 - Nearly every day,3 - Nearly every day,1 - Several days,3 - Nearly every day,0 - Not at all,3 - Nearly every day,3 - Nearly every day
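The cells that follow rename columns purely by position (`[:7]`, `[7:17]`, `[17:24]`, `[24:33]`), so a reordered or extended CSV would be mislabeled without any error. A minimal sketch of a guard, assuming the 7 + 10 + 7 + 9 layout implied by those slices; `check_layout` and `EXPECTED_BLOCKS` are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Expected positional layout of the raw survey columns, taken from the
# slices used in this notebook: [:7], [7:17], [17:24], [24:33].
EXPECTED_BLOCKS = {"demographic": 7, "PSS": 10, "GAD": 7, "PHQ": 9}


def check_layout(frame: pd.DataFrame, blocks: dict = EXPECTED_BLOCKS) -> None:
    """Raise ValueError if the frame does not have the expected column count."""
    expected = sum(blocks.values())
    if frame.shape[1] != expected:
        raise ValueError(
            f"Expected {expected} columns {blocks}, found {frame.shape[1]}"
        )


# A toy frame with 33 placeholder columns stands in for the raw CSV.
toy = pd.DataFrame(columns=[f"col{i}" for i in range(33)])
check_layout(toy)  # passes silently
```

In the notebook itself this could be called as `check_layout(df)` right after `pd.read_csv`. It only checks the column count, not the column wording.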


## Demographic columns cleanup

In [2]:
new_demo_names = [
    "Age", "Gender", "University", "Department",
    "Year", "CGPA", "Scholarship"
]
df.rename(columns=dict(zip(df.columns[:7], new_demo_names)), inplace=True)

df["Gender"] = df["Gender"].replace({
    "Prefer not to say": "Other",
    "prefer not to say": "Other"
}).str.title()

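import re

def extract_initials(text):
    """Return the abbreviation in parentheses, or the first word if there is none."""
    if isinstance(text, str):
        m = re.search(r"\(([^)]+)\)", text)
        if m:
            return m.group(1).strip()
        return text.strip().split()[0]
    return text  # leave NaN and other non-string values untouched

df["University"] = df["University"].apply(extract_initials)

# Keep only the first word (e.g. "Engineering", "Second"). The .str accessor
# leaves missing values as NaN; astype(str) would turn them into the string
# "nan" and hide them from the missing-value check later on.
df["Department"] = df["Department"].str.split().str[0]

df["Year"] = df["Year"].str.split().str[0]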

df["Scholarship"] = df["Scholarship"].replace({
    "Yes, full waiver": "Yes",
    "Yes, partial waiver": "Yes",
    "No waiver": "No"
}).fillna("No")

df[new_demo_names].head()

Unnamed: 0,Age,Gender,University,Department,Year,CGPA,Scholarship
0,18-22,Female,IUB,Engineering,Second,2.50 - 2.99,No
1,18-22,Male,IUB,Engineering,Third,3.00 - 3.39,No
2,18-22,Male,AIUB,Engineering,Third,3.00 - 3.39,No
3,18-22,Male,AIUB,Engineering,Third,3.00 - 3.39,No
4,18-22,Male,NSU,Engineering,Second,2.50 - 2.99,No
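To make the behaviour of `extract_initials` concrete, the sketch below re-declares the function and runs it on two university names taken from the raw data, plus two hypothetical inputs (a name without parentheses and a missing value):

```python
import re


def extract_initials(text):
    """Return the abbreviation in parentheses, or the first word if there is none."""
    if isinstance(text, str):
        m = re.search(r"\(([^)]+)\)", text)
        if m:
            return m.group(1).strip()
        return text.strip().split()[0]
    return text


samples = [
    "Independent University, Bangladesh (IUB)",  # from the raw data
    "North South University (NSU)",              # from the raw data
    "Example University",                        # hypothetical, no abbreviation
    None,                                        # non-string passes through
]
results = [extract_initials(s) for s in samples]
print(results)
```

One edge case to be aware of: an empty or whitespace-only string has no first word, so the fallback branch would raise `IndexError`.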


## PSS-10 (Stress) columns

In [3]:
pss_cols = df.columns[7:17]
df.rename(columns=dict(zip(pss_cols, [f"PSS{i+1}" for i in range(10)])), inplace=True)

pss_map = {
    "0 - Never": 0,
    "1 - Almost Never": 1,
    "2 - Sometimes": 2,
    "3 - Fairly Often": 3,
    "4 - Very Often": 4
}

for c in [f"PSS{i+1}" for i in range(10)]:
    df[c] = df[c].replace(pss_map)
    df[c] = pd.to_numeric(df[c], errors="coerce")

df[[f"PSS{i+1}" for i in range(10)]].head()


Unnamed: 0,PSS1,PSS2,PSS3,PSS4,PSS5,PSS6,PSS7,PSS8,PSS9,PSS10
0,3,4,3,2,2,1,2,2,4,4
1,3,3,4,2,3,2,2,2,2,3
2,0,0,0,0,0,1,0,0,0,0
3,3,1,2,1,4,3,2,2,3,2
4,4,4,4,2,2,2,0,2,4,4
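Because `pd.to_numeric(..., errors="coerce")` turns any label that is not in `pss_map` into `NaN` without complaint, a misspelled or differently cased label would be lost silently. A minimal sketch of a recode helper that also reports such labels; `recode_with_report` is a hypothetical name and the toy series is made up for illustration:

```python
import pandas as pd

pss_map = {
    "0 - Never": 0,
    "1 - Almost Never": 1,
    "2 - Sometimes": 2,
    "3 - Fairly Often": 3,
    "4 - Very Often": 4,
}


def recode_with_report(series, mapping):
    """Map labels to scores and return (recoded, unmapped_labels)."""
    recoded = series.map(mapping)
    # Labels that were present but missing from the mapping become NaN.
    unmapped = series[recoded.isna() & series.notna()].unique().tolist()
    return recoded, unmapped


# Toy input: two valid labels, one label with different casing, one missing.
toy = pd.Series(["0 - Never", "4 - Very Often", "3 - fairly often", None])
recoded, unmapped = recode_with_report(toy, pss_map)
print(unmapped)
```

Note that `Series.map` sends every unmapped value to `NaN`, whereas the `replace` call used in the notebook leaves unmapped values in place until `to_numeric` coerces them.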


## GAD-7 (Anxiety) columns

In [4]:
gad_cols = df.columns[17:24]
df.rename(columns=dict(zip(gad_cols, [f"GAD{i+1}" for i in range(7)])), inplace=True)

gad_map = {
    "0 - Not at all": 0,
    "1 - Several days (less than 15 days)": 1,
    "1 - Several days": 1,
    "2 - More than half the semester": 2,
    "2 - More than half the days": 2,
    "3 - Nearly every day": 3
}

for c in [f"GAD{i+1}" for i in range(7)]:
    df[c] = df[c].replace(gad_map)
    df[c] = pd.to_numeric(df[c], errors="coerce")

df[[f"GAD{i+1}" for i in range(7)]].head()


Unnamed: 0,GAD1,GAD2,GAD3,GAD4,GAD5,GAD6,GAD7
0,2,2,3,2,2,2,2
1,1,2,2,1,1,3,2
2,0,0,0,0,0,0,0
3,2,1,1,1,2,1,2
4,3,0,3,3,1,1,3
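`gad_map` has to list each wording variant of a label ("Several days" with and without the parenthetical, "days" versus "semester"). Since every label starts with its score, an alternative is to parse the leading digit. This is a sketch of that idea, not what the notebook does; `leading_score` is a hypothetical helper:

```python
import pandas as pd


def leading_score(series):
    """Extract the leading integer from labels such as '2 - More than half the days'."""
    digits = series.str.extract(r"^\s*(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce")


# Label variants taken from gad_map in the cell above.
toy = pd.Series([
    "0 - Not at all",
    "1 - Several days (less than 15 days)",
    "1 - Several days",
    "2 - More than half the semester",
    "3 - Nearly every day",
])
scores = leading_score(toy)
print(scores.tolist())
```

The trade-off is that this trusts the numeric prefix and does not check that scores fall in the 0 to 3 range, so an explicit mapping remains the stricter option.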


## PHQ-9 (Depression) columns

In [5]:
phq_cols = df.columns[24:33]
df.rename(columns=dict(zip(phq_cols, [f"PHQ{i+1}" for i in range(9)])), inplace=True)

phq_map = {
    "0 - Not at all": 0,
    "1 - Several days": 1,
    "2 - More than half the days": 2,
    "3 - Nearly every day": 3
}

for c in [f"PHQ{i+1}" for i in range(9)]:
    df[c] = df[c].replace(phq_map)
    df[c] = pd.to_numeric(df[c], errors="coerce")

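# Sum the nine items. skipna=False keeps the total missing when any item is
# missing, so the NaN branch in phq_label can actually trigger; the default
# would silently sum the remaining items and understate the score.
df["Depression Value"] = df[[f"PHQ{i+1}" for i in range(9)]].sum(axis=1, skipna=False)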

def phq_label(val):
    if pd.isna(val):
        return None
    if val <= 4:
        return "Minimal"
    elif val <= 9:
        return "Mild"
    elif val <= 14:
        return "Moderate"
    elif val <= 19:
        return "Moderately Severe"
    else:
        return "Severe"

df["Depression Label"] = df["Depression Value"].apply(phq_label)

display_cols = [f"PHQ{i+1}" for i in range(9)] + ["Depression Value", "Depression Label"]
df[display_cols].head(10)


Unnamed: 0,PHQ1,PHQ2,PHQ3,PHQ4,PHQ5,PHQ6,PHQ7,PHQ8,PHQ9,Depression Value,Depression Label
0,2,2,3,2,2,2,2,3,2,20,Severe
1,3,2,2,2,2,2,2,2,2,19,Moderately Severe
2,0,0,0,0,0,0,0,0,0,0,Minimal
3,2,1,2,1,2,1,2,2,1,14,Moderate
4,1,3,3,3,1,3,0,3,3,20,Severe
5,1,1,1,2,2,1,2,0,0,10,Moderate
6,1,0,0,1,0,0,0,1,0,3,Minimal
7,0,0,0,2,0,0,1,0,0,3,Minimal
8,3,3,2,1,2,3,2,1,2,19,Moderately Severe
9,1,0,0,1,0,0,0,0,0,2,Minimal
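The banding in `phq_label` (0-4, 5-9, 10-14, 15-19, 20 and above) can also be expressed with `pd.cut`, which is vectorized and yields an ordered categorical. A minimal sketch on made-up scores that sit on each band boundary, assuming 27 (nine items scored 0 to 3) as the upper limit:

```python
import pandas as pd

# Right-closed bins: (-1, 4], (4, 9], (9, 14], (14, 19], (19, 27]
bins = [-1, 4, 9, 14, 19, 27]
labels = ["Minimal", "Mild", "Moderate", "Moderately Severe", "Severe"]

# Toy scores placed on both edges of every band.
scores = pd.Series([0, 4, 5, 9, 10, 14, 15, 19, 20, 27])
banded = pd.cut(scores, bins=bins, labels=labels)
print(pd.DataFrame({"score": scores, "label": banded}))
```

Unlike `phq_label`, this version returns `NaN` for any score above 27 instead of labelling it "Severe", which may be preferable as a sanity check.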


## Data Quality Checks — Missing & Duplicate Values

In [6]:
print("Dataset info before cleaning:\n")
print(df.info())

missing_counts = df.isna().sum()
missing_total = missing_counts.sum()

print("\nMissing values summary:")
print(missing_counts[missing_counts > 0].sort_values(ascending=False))

if missing_total > 0:
    print(f"\n⚠️ Found {missing_total} missing values. Handling them now...")

    num_cols = df.select_dtypes(include=["int64", "float64"]).columns
    cat_cols = df.select_dtypes(include=["object"]).columns

    df[num_cols] = df[num_cols].apply(lambda col: col.fillna(col.median()))
    df[cat_cols] = df[cat_cols].apply(lambda col: col.fillna(col.mode()[0] if not col.mode().empty else "Unknown"))

    print("✅ Missing values handled (numeric → median, categorical → mode).")
else:
    print("\n✅ No missing values found.")

dup_count = df.duplicated().sum()
print(f"\nDuplicate rows found: {dup_count}")

if dup_count > 0:
    df = df.drop_duplicates().reset_index(drop=True)
    print(f"✅ Removed {dup_count} duplicate rows.")
else:
    print("✅ No duplicate rows found.")

print("\nAfter cleaning:")
print(f"Shape: {df.shape}")
print("\nDepression label distribution:")
print(df["Depression Label"].value_counts())

Dataset info before cleaning:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2028 entries, 0 to 2027
Data columns (total 35 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               2028 non-null   object
 1   Gender            2028 non-null   object
 2   University        2028 non-null   object
 3   Department        2028 non-null   object
 4   Year              2028 non-null   object
 5   CGPA              2028 non-null   object
 6   Scholarship       2028 non-null   object
 7   PSS1              2028 non-null   int64 
 8   PSS2              2028 non-null   int64 
 9   PSS3              2028 non-null   int64 
 10  PSS4              2028 non-null   int64 
 11  PSS5              2028 non-null   int64 
 12  PSS6              2028 non-null   int64 
 13  PSS7              2028 non-null   int64 
 14  PSS8              2028 non-null   int64 
 15  PSS9              2028 non-null   int64 
 16  PSS10             2028 non-nu
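The imputation and de-duplication rules in this cell (numeric to median, categorical to mode, then `drop_duplicates`) can be checked on a small made-up frame. This sketch selects categorical columns with `exclude="number"` rather than `include=["object"]`, which is an assumption made here so the example does not depend on how the installed pandas version types string columns:

```python
import pandas as pd

# Toy data: one missing numeric value and one missing categorical value.
toy = pd.DataFrame({
    "PHQ1": [0, 2, None, 2, 2],
    "Gender": ["Male", None, "Female", "Male", "Male"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numeric columns get the column median, categorical columns the mode.
toy[num_cols] = toy[num_cols].apply(lambda col: col.fillna(col.median()))
toy[cat_cols] = toy[cat_cols].apply(lambda col: col.fillna(col.mode()[0]))

# Duplicates are removed only after imputation, as in the notebook.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

In this toy example the second row becomes identical to the last two once its gender is imputed, which shows that imputing before de-duplicating can create duplicates that were not in the raw data.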

## Remove Derived Columns to Prevent Target Leakage

In [7]:
# "Depression Label" is derived directly from "Depression Value", so keeping
# the sum as a feature would leak the target. PHQ1..PHQ9 are kept, and their
# sum still determines the label.
if "Depression Value" in df.columns:
    df.drop(columns=["Depression Value"], inplace=True)
    print("✅ 'Depression Value' column removed to prevent target leakage.")
else:
    print("ℹ️ 'Depression Value' column already removed or not found.")

print(f"Remaining columns ({len(df.columns)}):")
print(df.columns.tolist())

✅ 'Depression Value' column removed to prevent target leakage.
Remaining columns (34):
['Age', 'Gender', 'University', 'Department', 'Year', 'CGPA', 'Scholarship', 'PSS1', 'PSS2', 'PSS3', 'PSS4', 'PSS5', 'PSS6', 'PSS7', 'PSS8', 'PSS9', 'PSS10', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5', 'GAD6', 'GAD7', 'PHQ1', 'PHQ2', 'PHQ3', 'PHQ4', 'PHQ5', 'PHQ6', 'PHQ7', 'PHQ8', 'PHQ9', 'Depression Label']
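Dropping `Depression Value` removes the precomputed sum, but the nine PHQ items that remain still determine `Depression Label` exactly. The sketch below rebuilds the label from the items for three rows copied from the PHQ output table above. Whether the PHQ items should stay in the feature set depends on the modelling goal, which this notebook does not state.

```python
import pandas as pd

phq_items = [f"PHQ{i+1}" for i in range(9)]


def phq_label(val):
    """Same PHQ-9 bands as used earlier in this notebook."""
    if val <= 4:
        return "Minimal"
    elif val <= 9:
        return "Mild"
    elif val <= 14:
        return "Moderate"
    elif val <= 19:
        return "Moderately Severe"
    return "Severe"


# Three rows (index 0, 2 and 3) copied from the PHQ output table.
toy = pd.DataFrame(
    [
        [2, 2, 3, 2, 2, 2, 2, 3, 2],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [2, 1, 2, 1, 2, 1, 2, 2, 1],
    ],
    columns=phq_items,
)
toy["Depression Label"] = ["Severe", "Minimal", "Moderate"]

# Recompute the label from the items alone and compare with the stored one.
rebuilt = toy[phq_items].sum(axis=1).apply(phq_label)
print((rebuilt == toy["Depression Label"]).all())
```

A model given `PHQ1` to `PHQ9` can in principle learn this rule, so near-perfect accuracy with those features would reflect the label definition rather than predictive skill.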


## Save processed dataset

In [8]:
PROCESSED_PATH.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(PROCESSED_PATH, index=False)
print(f"Processed dataset saved to: {PROCESSED_PATH.resolve()}")

Processed dataset saved to: D:\Study\CSE299\Depression Assessment\data\processed\mhp_processed.csv
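A quick way to confirm the save step is to read the file back and compare shape and column names. The sketch below does this with a made-up three-column frame and a temporary directory, so it does not touch the real `PROCESSED_PATH`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed dataset.
toy = pd.DataFrame({
    "Age": ["18-22", "18-22"],
    "PHQ1": [2, 0],
    "Depression Label": ["Severe", "Minimal"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "toy_processed.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```

Writing with `index=False` matters here: if the index were saved, `read_csv` would bring it back as an extra `Unnamed: 0` column and the shapes would no longer match.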
