# Feature Engineering (NLP)

This notebook loads the cleaned text dataset (mhp_processed_text.csv), takes the "Student Information" text column as the input for NLP-based model training and "Depression Label" as the classification target, and makes a stratified 80/20 train/test split. The two splits are then saved as train.csv and test.csv.

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

## Load Dataset

In [2]:
data_path = "../data/processed/mhp_processed_text.csv"

df = pd.read_csv(data_path)
df.head()

Unnamed: 0,Student Information,Depression Label
0,"The student is around 18-22 years old, female, studying at IUB, pursuing Engineering degree, currently in their second year. They do not have a scholarship. For Emotional Response to Setbacks, they report fairly often. For Sense of Control Over Academics, they report very often. For Overall Academic Stress Level, they report fairly often. For Confidence in Coping Abilities, they report sometimes. For Problem-Solving Self-Efficacy, they report sometimes. For Perception of Academic Progress, they report almost never. For Tolerance for Academic Frustration, they report sometimes. For Academic Self-Confidence, they report sometimes. For Frustration With Academic Results, they report very often. For Sense of Academic Helplessness, they report very often. For Feeling Nervous or On Edge, they report more than half the days. For Uncontrollable Worry, they report more than half the days. For Difficulty Relaxing, they report nearly every day. For Irritability Due to Anxiety, they report more than half the days. For Frequency of Excessive Worry, they report more than half the days. For Physical Symptoms of Anxiety, they report more than half the days. For Fear of Something Bad Happening, they report more than half the days. For Loss of Interest, they report more than half the days. For Low Mood or Hopelessness, they report more than half the days. For Sleep Difficulties, they report nearly every day. For Fatigue or Low Energy, they report more than half the days. For Appetite or Weight Changes, they report more than half the days. For Feelings of Worthlessness, they report more than half the days. For Difficulty Concentrating, they report more than half the days. For Psychomotor Changes, they report nearly every day. For Suicidal Thoughts, they report more than half the days.",Severe
1,"The student is around 18-22 years old, male, studying at IUB, pursuing Engineering degree, currently in their third year. They do not have a scholarship. For Emotional Response to Setbacks, they report fairly often. For Sense of Control Over Academics, they report fairly often. For Overall Academic Stress Level, they report very often. For Confidence in Coping Abilities, they report sometimes. For Problem-Solving Self-Efficacy, they report fairly often. For Perception of Academic Progress, they report sometimes. For Tolerance for Academic Frustration, they report sometimes. For Academic Self-Confidence, they report sometimes. For Frustration With Academic Results, they report sometimes. For Sense of Academic Helplessness, they report fairly often. For Feeling Nervous or On Edge, they report several days. For Uncontrollable Worry, they report more than half the days. For Difficulty Relaxing, they report more than half the days. For Irritability Due to Anxiety, they report several days. For Frequency of Excessive Worry, they report several days. For Physical Symptoms of Anxiety, they report nearly every day. For Fear of Something Bad Happening, they report more than half the days. For Loss of Interest, they report nearly every day. For Low Mood or Hopelessness, they report more than half the days. For Sleep Difficulties, they report more than half the days. For Fatigue or Low Energy, they report more than half the days. For Appetite or Weight Changes, they report more than half the days. For Feelings of Worthlessness, they report more than half the days. For Difficulty Concentrating, they report more than half the days. For Psychomotor Changes, they report more than half the days. For Suicidal Thoughts, they report more than half the days.",Moderately Severe
2,"The student is around 18-22 years old, male, studying at AIUB, pursuing Engineering degree, currently in their third year. They do not have a scholarship. For Emotional Response to Setbacks, they report never. For Sense of Control Over Academics, they report never. For Overall Academic Stress Level, they report never. For Confidence in Coping Abilities, they report never. For Problem-Solving Self-Efficacy, they report never. For Perception of Academic Progress, they report almost never. For Tolerance for Academic Frustration, they report never. For Academic Self-Confidence, they report never. For Frustration With Academic Results, they report never. For Sense of Academic Helplessness, they report never. For Feeling Nervous or On Edge, they report not at all. For Uncontrollable Worry, they report not at all. For Difficulty Relaxing, they report not at all. For Irritability Due to Anxiety, they report not at all. For Frequency of Excessive Worry, they report not at all. For Physical Symptoms of Anxiety, they report not at all. For Fear of Something Bad Happening, they report not at all. For Loss of Interest, they report not at all. For Low Mood or Hopelessness, they report not at all. For Sleep Difficulties, they report not at all. For Fatigue or Low Energy, they report not at all. For Appetite or Weight Changes, they report not at all. For Feelings of Worthlessness, they report not at all. For Difficulty Concentrating, they report not at all. For Psychomotor Changes, they report not at all. For Suicidal Thoughts, they report not at all.",Minimal
3,"The student is around 18-22 years old, male, studying at AIUB, pursuing Engineering degree, currently in their third year. They do not have a scholarship. For Emotional Response to Setbacks, they report fairly often. For Sense of Control Over Academics, they report almost never. For Overall Academic Stress Level, they report sometimes. For Confidence in Coping Abilities, they report almost never. For Problem-Solving Self-Efficacy, they report very often. For Perception of Academic Progress, they report fairly often. For Tolerance for Academic Frustration, they report sometimes. For Academic Self-Confidence, they report sometimes. For Frustration With Academic Results, they report fairly often. For Sense of Academic Helplessness, they report sometimes. For Feeling Nervous or On Edge, they report more than half the days. For Uncontrollable Worry, they report several days. For Difficulty Relaxing, they report several days. For Irritability Due to Anxiety, they report several days. For Frequency of Excessive Worry, they report more than half the days. For Physical Symptoms of Anxiety, they report several days. For Fear of Something Bad Happening, they report more than half the days. For Loss of Interest, they report more than half the days. For Low Mood or Hopelessness, they report several days. For Sleep Difficulties, they report more than half the days. For Fatigue or Low Energy, they report several days. For Appetite or Weight Changes, they report more than half the days. For Feelings of Worthlessness, they report several days. For Difficulty Concentrating, they report more than half the days. For Psychomotor Changes, they report more than half the days. For Suicidal Thoughts, they report several days.",Moderate
4,"The student is around 18-22 years old, male, studying at NSU, pursuing Engineering degree, currently in their second year. They do not have a scholarship. For Emotional Response to Setbacks, they report very often. For Sense of Control Over Academics, they report very often. For Overall Academic Stress Level, they report very often. For Confidence in Coping Abilities, they report sometimes. For Problem-Solving Self-Efficacy, they report sometimes. For Perception of Academic Progress, they report sometimes. For Tolerance for Academic Frustration, they report never. For Academic Self-Confidence, they report sometimes. For Frustration With Academic Results, they report very often. For Sense of Academic Helplessness, they report very often. For Feeling Nervous or On Edge, they report nearly every day. For Uncontrollable Worry, they report not at all. For Difficulty Relaxing, they report nearly every day. For Irritability Due to Anxiety, they report nearly every day. For Frequency of Excessive Worry, they report several days. For Physical Symptoms of Anxiety, they report several days. For Fear of Something Bad Happening, they report nearly every day. For Loss of Interest, they report several days. For Low Mood or Hopelessness, they report nearly every day. For Sleep Difficulties, they report nearly every day. For Fatigue or Low Energy, they report nearly every day. For Appetite or Weight Changes, they report several days. For Feelings of Worthlessness, they report nearly every day. For Difficulty Concentrating, they report not at all. For Psychomotor Changes, they report nearly every day. For Suicidal Thoughts, they report nearly every day.",Severe


## Inspect Dataset Structure

In [3]:
print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nMissing Values:\n", df.isnull().sum())

Dataset Shape: (2022, 2)

Columns: ['Student Information', 'Depression Label']

Missing Values:
 Student Information    0
Depression Label       0
dtype: int64
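
Because the split later in this notebook is stratified on Depression Label, it can be worth checking the class balance at this point. The following is a minimal sketch on a small made-up frame: the rows are illustrative assumptions, not values from mhp_processed_text.csv, and `class_balance` is a hypothetical helper name.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Return per-class counts and proportions for label_col."""
    counts = frame[label_col].value_counts()
    shares = frame[label_col].value_counts(normalize=True).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Student Information": ["a", "b", "c", "d", "e"],
    "Depression Label": ["Severe", "Minimal", "Severe", "Moderate", "Severe"],
})

balance = class_balance(toy, "Depression Label")
print(balance)
```

On the real data the same call would be `class_balance(df, "Depression Label")`. A class with very few rows is worth knowing about before stratifying.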


## Verify Required Columns

In [4]:
required_columns = ["Student Information", "Depression Label"]

for col in required_columns:
    if col not in df.columns:
        raise ValueError(f"Required column missing: {col}")

print("All required columns found.")

All required columns found.
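
The loop above stops at the first missing column. A possible variant, sketched here under the assumption that a single error listing every missing column is more useful, collects them first. `check_required_columns` is a hypothetical helper name and the frame is toy data.

```python
import pandas as pd


def check_required_columns(frame: pd.DataFrame, required: list[str]) -> None:
    """Raise ValueError naming every required column that is absent."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"Required columns missing: {missing}")


# Toy frame for illustration only; it deliberately lacks the label column.
toy = pd.DataFrame({"Student Information": ["text"]})

try:
    check_required_columns(toy, ["Student Information", "Depression Label"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```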


## Prepare Inputs and Target Variables

In [5]:
X = df["Student Information"]
y = df["Depression Label"]

print("Sample Text:\n", X.iloc[0])
print("\nSample Label:", y.iloc[0])

Sample Text:
 The student is around 18-22 years old, female, studying at IUB, pursuing Engineering degree, currently in their second year. They do not have a scholarship. For Emotional Response to Setbacks, they report fairly often. For Sense of Control Over Academics, they report very often. For Overall Academic Stress Level, they report fairly often. For Confidence in Coping Abilities, they report sometimes. For Problem-Solving Self-Efficacy, they report sometimes. For Perception of Academic Progress, they report almost never. For Tolerance for Academic Frustration, they report sometimes. For Academic Self-Confidence, they report sometimes. For Frustration With Academic Results, they report very often. For Sense of Academic Helplessness, they report very often. For Feeling Nervous or On Edge, they report more than half the days. For Uncontrollable Worry, they report more than half the days. For Difficulty Relaxing, they report nearly every day. For Irritability Due to Anxiety, they r

## Train/Test Split

In [6]:
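# Stratify on the label so both splits keep the same class proportions;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=42,
    stratify=y
)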

print("Train size:", len(X_train))
print("Test size:", len(X_test))

Train size: 1617
Test size: 405
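
The sizes above follow from the 2022 rows: 20% is 404.4, which scikit-learn rounds up to 405 test rows when given a fractional test_size, leaving 1617 for training. The sketch below reproduces that arithmetic and then checks, on made-up labels (an illustrative assumption, not the real data), that a stratified split keeps the class proportions.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Size arithmetic for the dataset in this notebook: 2022 rows, test_size=0.20.
n_rows = 2022
n_test = math.ceil(0.20 * n_rows)  # 404.4 rounds up to 405
n_train = n_rows - n_test          # 1617
print(n_train, n_test)

# Stratification check on toy labels (illustrative only).
toy_y = pd.Series(["Minimal"] * 50 + ["Severe"] * 50)
toy_X = pd.Series([f"text {i}" for i in range(100)])

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

# Both splits should show the same 50/50 class shares as the full toy set.
train_share = toy_y_train.value_counts(normalize=True)
test_share = toy_y_test.value_counts(normalize=True)
print(train_share.to_dict(), test_share.to_dict())
```

The same comparison can be run on `y_train` and `y_test` from the cell above.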


## Convert Splits Into DataFrames

In [7]:
train_df = pd.DataFrame({
    "Student Information": X_train,
    "Depression Label": y_train
})

test_df = pd.DataFrame({
    "Student Information": X_test,
    "Depression Label": y_test
})

train_df.head(), test_df.head()


## Output Directory

In [8]:
output_dir = "../data/processed/nlpfeatures"

# Create the directory if needed; exist_ok avoids an error when it already exists.
os.makedirs(output_dir, exist_ok=True)

print("Output directory confirmed:", output_dir)

Output directory confirmed: ../data/processed/nlpfeatures


## Save Train/Test Files

In [9]:
train_path = os.path.join(output_dir, "train.csv")
test_path = os.path.join(output_dir, "test.csv")

train_df.to_csv(train_path, index=False)
test_df.to_csv(test_path, index=False)

print("Saved:")
print("Train ->", train_path)
print("Test  ->", test_path)

Saved:
Train -> ../data/processed/nlpfeatures\train.csv
Test  -> ../data/processed/nlpfeatures\test.csv
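
The mixed separators in the printed paths come from os.path.join, which uses the operating system's separator (a backslash on Windows). The files are still written correctly. One possible alternative, sketched below with toy data and a temporary directory (both assumptions made for illustration), builds the paths with pathlib, prints them with forward slashes, and reloads the file as a quick sanity check.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy split for illustration only; not the real training data.
toy_train = pd.DataFrame({
    "Student Information": ["first text", "second text"],
    "Depression Label": ["Minimal", "Severe"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "nlpfeatures"
    out_dir.mkdir(parents=True, exist_ok=True)

    train_path = out_dir / "train.csv"
    toy_train.to_csv(train_path, index=False)

    # as_posix() renders the path with forward slashes on every platform.
    print("Train ->", train_path.as_posix())

    # Reload the file to confirm it was written as expected.
    reloaded = pd.read_csv(train_path)

# Compare shape and column names of the reloaded frame with the original.
round_trip_ok = (
    reloaded.shape == toy_train.shape
    and list(reloaded.columns) == list(toy_train.columns)
)
print("Round trip matches:", round_trip_ok)
```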
