# Detecting Mental Health Distress in Online Text

This notebook implements our course project: a text classifier that detects
suicide- and depression-related posts. We compare two approaches:
1. A traditional ML baseline (TF-IDF + logistic regression).
2. A transformer-based model (DistilBERT) fine-tuned on the Kaggle dataset.



## 1. Data loading & preprocessing
We load the Kaggle *Suicide and Depression Detection* dataset, inspect the columns,
clean the text (lowercasing, removing backslashes, normalizing punctuation spacing),
drop rows with missing labels, and map string labels to integer IDs for modeling.

*Slide: `nnintro-prompt` (motivation and NLP pipeline).*

### 1.1 Imports

We import the required libraries (PyTorch, NumPy, pandas, tqdm, scikit-learn).
The computation device and the random seed are set in the next cell.
(Slide: `nnintro-prompt`.)

In [1]:
import random
import torch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from collections import defaultdict

### 1.2 Device and random seed setup

We enable tqdm support, choose whether to use GPU or CPU, and set a fixed random
seed for reproducibility. This makes our experiments easier to rerun and compare.
(Slide: `nnintro-prompt`.)


In [2]:
# enable tqdm in pandas
tqdm.pandas()

# set to True to use the GPU (if one is available)
use_gpu = True

# select device
device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
print(f'device: {device.type}')

# random seed
seed = 1234

# set random seed
if seed is not None:
    print(f'random seed: {seed}')
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

device: cuda
random seed: 1234
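
As a minimal sketch of why the fixed seed matters, the snippet below reseeds the standard-library generator and checks that the same draws come back. The helper `draw` is hypothetical and only for illustration; the same idea applies to the NumPy and PyTorch generators seeded above, although some GPU kernels can still be nondeterministic.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers after seeding the stdlib generator."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(1234)
second = draw(1234)  # same seed, so the sequence is repeated exactly

print(first == second)
```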


### 1.3 Load the dataset and preview rows

We load the Kaggle *Suicide and Depression Detection* CSV file from disk and
show the first few rows to understand the structure of the data.
(Slide: `nnintro-prompt`.)


In [3]:
df = pd.read_csv("../data/SuicideAndDepression_Detection.csv")

df.head()

|   | text | class |
|---|------|-------|
| 0 | Does life actually work for most / non-depress... | depression |
| 1 | I found my friend's bodyIt was almost nine yea... | depression |
| 2 | Ex Wife Threatening SuicideRecently I left my ... | SuicideWatch |
| 3 | Am I weird I don't get affected by compliments... | teenagers |
| 4 | Finally 2020 is almost over... So I can never ... | teenagers |


### 1.4 Check label distribution

We compute the value counts of the `class` column to see how many examples we
have for each label and whether the dataset is imbalanced.
(Slides: `nnintro-prompt`, `nnintro-ch8-dist`.)


In [4]:
df['class'].value_counts()



class
SuicideWatch    116037
teenagers       116037
depression      116036
Name: count, dtype: int64

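### 1.5 Inspect an example post

We print the first post in full to see what the raw text looks like before
cleaning.


In [5]: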
print(df.loc[0, 'text'])

Does life actually work for most / non-depressed people?It doesn't seem possible to me that everyone isn't miserable. What do you think? My boyfriend told me the other week that in reality we are the minority. Most people are fine, if not happy. Oddball.


### 1.6 Basic text normalization

We normalize the text by lowercasing it, stripping backslashes, and applying a
regex meant to insert a space after sentence-ending punctuation that is fused to
the next word. This is a simple cleanup step before modeling.
(Slide: `nnintro-prompt` – basic NLP preprocessing.)


In [6]:
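# lowercase all text
df['text'] = df['text'].str.lower()
# remove literal backslashes
df['text'] = df['text'].str.replace('\\', '', regex=False)
# intended to insert a space after ?, . or ! when a capitalized word follows directly.
# NOTE: the text is already lowercase here, so [A-Z] never matches and this line
# leaves the text unchanged; it would need to run before .str.lower() to take effect.
df['text'] = df['text'].str.replace(r'([?.!])([A-Z])', r'\1 \2', regex=True)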

### 1.7 Inspect the example after cleaning

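We print the same example again after normalization to check the effect of the
cleaning step. The text is now lowercase, but `people?it` is still fused: the
punctuation regex looks for an uppercase letter and runs after lowercasing.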
(Slide: `nnintro-prompt`.)


In [7]:
print(df.loc[0, 'text'])

does life actually work for most / non-depressed people?it doesn't seem possible to me that everyone isn't miserable. what do you think? my boyfriend told me the other week that in reality we are the minority. most people are fine, if not happy. oddball.
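
The order of the normalization steps matters. The sketch below applies the same punctuation pattern as cell `In [6]` to a fragment of the example post, once after lowercasing (the order used in this notebook) and once before. The function names are hypothetical, and the second order is shown only as a possible alternative; the results reported in this notebook were produced with the first.

```python
import re

# same pattern as in the normalization cell: punctuation directly followed by a capital
PUNCT_THEN_UPPER = re.compile(r'([?.!])([A-Z])')


def lower_then_space(text):
    # order used in the notebook: after lowercasing, [A-Z] can no longer match
    return PUNCT_THEN_UPPER.sub(r'\1 \2', text.lower())


def space_then_lower(text):
    # alternative order: insert the space while capitals are still present
    return PUNCT_THEN_UPPER.sub(r'\1 \2', text).lower()


raw = "non-depressed people?It doesn't seem possible"
print(lower_then_space(raw))  # non-depressed people?it doesn't seem possible
print(space_then_lower(raw))  # non-depressed people? it doesn't seem possible
```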


### 1.8 Map string labels to integer IDs

We map each label string to an integer ID (`teenagers` → 0, `depression` → 1,
`SuicideWatch` → 2), which is the format required by our models.
(Slide: `nnintro-prompt` – basic ML setup.)


In [8]:
label_mapping = {
    "teenagers": 0,
    "depression": 1,
    "SuicideWatch": 2
}
df['class'] = df['class'].map(label_mapping)

In [9]:
print("Label distribution after mapping (full dataset):")
print(df["class"].value_counts())

df["text_len"] = df["text"].str.len()
print("\nText length statistics:")
print(df["text_len"].describe())


Label distribution after mapping (full dataset):
class
2.0    116037
0.0    116037
1.0    116036
Name: count, dtype: int64

Text length statistics:
count    348123.000000
mean        897.267836
std        1322.454725
min           3.000000
25%         183.000000
50%         472.000000
75%        1102.000000
max       40297.000000
Name: text_len, dtype: float64


### 1.9 Check for missing labels

We count how many rows have missing (`NaN`) values in the `class` column to
avoid training on incomplete labels.
(Slide: `nnintro-ch8-dist` – data quality and evaluation.)


In [10]:
nan_count = df['class'].isna().sum()
print(f"Number of NaN rows in 'class': {nan_count}")

Number of NaN rows in 'class': 14


### 1.10 Inspect rows with missing labels

We inspect the rows with missing labels to see what kind of data is being
removed and confirm that this cleanup step is reasonable.
(Slide: `nnintro-ch8-dist`.)


In [11]:
df[df['class'].isna()]

|   | text | class | text_len |
|---|------|-------|----------|
| 11557 | i feel like im in a nightmare.something happen... | NaN | 520.0 |
| 11558 | it's like i'm living in a nightmare and everyt... | NaN | 67.0 |
| 11559 | (view post history for more info on my dad) | NaN | 43.0 |
| 41048 | a doodle of my struggle with depressionhttp://... | NaN | 63.0 |
| 47570 | thinking of putting this as my profile picture... | NaN | 107.0 |
| 61160 | if i told you i want to move on with my life a... | NaN | 223.0 |
| 141715 | i think i might need someone to talk me down f... | NaN | 802.0 |
| 141716 | i've known that i'll never get any love outsid... | NaN | 82.0 |
| 156657 | a clip that describes how i feel when i'm tryi... | NaN | 119.0 |
| 156658 | depression | NaN | 10.0 |


### 1.11 Drop rows with missing labels

We drop all rows whose `class` label is missing and reset the index. This keeps
only fully labeled examples for training and evaluation.
(Slide: `nnintro-ch8-dist`.)


In [12]:
df = df.dropna(subset=['class']).reset_index(drop=True)

### 1.12 Train/dev split

We split the cleaned dataset into a training set and a development (validation)
set, stratifying by label so that the class distribution is similar in both
splits.
(Slide: `nnintro-ch8-dist` – train/dev splits.)


In [13]:
train_df, dev_df = train_test_split(df, train_size=0.8, random_state=seed, stratify=df['class'])
train_df.reset_index(inplace=True, drop=True)
dev_df.reset_index(inplace=True, drop=True)

print(f"Train rows: {len(train_df):,}, Dev rows: {len(dev_df):,}")

Train rows: 278,488, Dev rows: 69,622
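
To illustrate what stratifying by label does, here is a minimal pure-Python sketch: indices are grouped by class, shuffled within each class, and cut at the same ratio. The function `stratified_split` and the toy label list are assumptions for illustration; scikit-learn's `train_test_split` allocates examples with its own procedure, so exact counts can differ.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.8, seed=1234):
    """Split indices so that each class keeps roughly the same train/dev ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, dev_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within the class
        cut = round(len(indices) * train_size)    # same ratio for every class
        train_idx.extend(indices[:cut])
        dev_idx.extend(indices[cut:])
    return train_idx, dev_idx


labels = [0] * 50 + [1] * 30 + [2] * 20           # toy, imbalanced label list
train_idx, dev_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
dev_counts = Counter(labels[i] for i in dev_idx)
print(train_counts, dev_counts)
```

Each class contributes 80% of its examples to the training split, so the class proportions are the same in both splits.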


### 1.13 Ensure text column is string

We ensure the `text` column is stored as strings in both the training and
development dataframes, which avoids type issues in later preprocessing.
(Slide: `nnintro-prompt`.)


In [14]:
train_df['text'] = train_df['text'].astype(str)
dev_df['text'] = dev_df['text'].astype(str)

---

## 2. Baseline: TF-IDF + Logistic Regression

We build a baseline classifier using TF-IDF features and multinomial logistic
regression. TF-IDF turns each post into a sparse feature vector, and logistic
regression learns a linear decision boundary with softmax and cross-entropy.

*Slides: `nnintro-ch3-lr`, `nnintro-ch5-ffnn`.*
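
The softmax and cross-entropy mentioned above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, with made-up logits; it is not how scikit-learn implements or optimizes logistic regression.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, gold):
    """Negative log-probability assigned to the gold class index."""
    return -math.log(softmax(logits)[gold])


logits = [2.0, 0.5, -1.0]                         # hypothetical scores for the three classes
probs = softmax(logits)
print(probs)
print(cross_entropy(logits, gold=0), cross_entropy(logits, gold=2))
```

The loss is small when the gold class has the highest score and grows as the model puts probability mass on other classes.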

### 2.1 Prepare texts and labels

We extract the raw texts and integer labels from the train and dev splits to
use as input and targets for the baseline classifier.
(Slides: `nnintro-ch3-lr`, `nnintro-ch5-ffnn`.)


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_texts = train_df["text"].astype(str)
dev_texts   = dev_df["text"].astype(str)
train_labels = train_df["class"].values
dev_labels   = dev_df["class"].values


### 2.2 Build TF-IDF features

We convert each document into a sparse TF-IDF vector over unigrams and
bigrams. These vectors are the input features to our linear classifier.
(Slides: `nnintro-ch3-lr`, `nnintro-ch5-ffnn`.)


In [16]:
vectorizer = TfidfVectorizer(
    max_features=20000,      # vocab size
    ngram_range=(1, 2),      # unigram + bigram
    min_df=5                 # drop very rare terms
)
X_train = vectorizer.fit_transform(train_texts)
X_dev   = vectorizer.transform(dev_texts)

print("TF-IDF train shape:", X_train.shape)
print("TF-IDF dev   shape:", X_dev.shape)


TF-IDF train shape: (278488, 20000)
TF-IDF dev   shape: (69622, 20000)
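
For intuition about what the vectorizer computes, the sketch below builds TF-IDF weights by hand for three toy documents. It assumes whitespace tokenization and unigrams only, and uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults; the real `TfidfVectorizer` additionally applies its own tokenizer, bigrams, `min_df`, and `max_features`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})  # L2-normalize
    return vectors, idf


docs = ["i feel tired", "i feel fine", "i hate school"]  # toy corpus
vectors, idf = tfidf_vectors(docs)
print(idf)
print(vectors[0])
```

Terms that occur in every document (here `i`) get the lowest weight, while terms unique to one document get the highest.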


### 2.3 Train multinomial logistic regression

We train a multinomial logistic regression model on the TF-IDF features, which
is equivalent to a one-layer neural network with a softmax output.
(Slides: `nnintro-ch3-lr`, `nnintro-ch5-ffnn`.)


In [17]:
lr_clf = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)
lr_clf.fit(X_train, train_labels)


LogisticRegression(max_iter=1000, n_jobs=-1)


### 2.4 Evaluate the baseline on the dev set

We evaluate the TF-IDF + logistic regression baseline on the dev set using the
classification report (precision, recall, and F1 for each class).
(Slides: `nnintro-ch3-lr`, `nnintro-ch8-dist`.)


In [18]:
dev_pred = lr_clf.predict(X_dev)

id_to_label = {v: k for k, v in label_mapping.items()}
target_names = [id_to_label[i] for i in sorted(id_to_label.keys())]

print("\n=== Baseline: Logistic Regression on TF-IDF features ===")
print(classification_report(dev_labels, dev_pred, target_names=target_names))



=== Baseline: Logistic Regression on TF-IDF features ===
              precision    recall  f1-score   support

   teenagers       0.89      0.93      0.91     23207
  depression       0.77      0.75      0.76     23207
SuicideWatch       0.78      0.76      0.77     23208

    accuracy                           0.82     69622
   macro avg       0.81      0.82      0.81     69622
weighted avg       0.81      0.82      0.81     69622
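
The per-class and macro-averaged numbers in the report follow directly from true-positive, false-positive, and false-negative counts. The sketch below recomputes them for a toy example; the helper names and the toy labels are hypothetical, and `classification_report` remains the tool used in this notebook.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(per_class_prf(gold, pred, lab)[2] for lab in labels) / len(labels)


gold = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 1]
print(per_class_prf(gold, pred, 0))  # (1.0, 1.0, 1.0)
print(per_class_prf(gold, pred, 1))  # (0.5, 0.5, 0.5)
print(macro_f1(gold, pred))
```

Because the macro average weights every class equally, a weak class pulls the score down regardless of how many examples it has.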



### 2.5 Free memory used by the baseline

After running the TF-IDF + logistic regression baseline, we delete the large
TF-IDF matrices and related variables to free RAM before loading the transformer
model.


In [19]:
# Free large baseline objects to save RAM
del X_train, X_dev
del train_texts, dev_texts
del train_labels, dev_labels
del lr_clf, vectorizer

import gc
gc.collect()


27

---

## 3. Transformer-based model (DistilBERT fine-tuning)

We fine-tune a pre-trained DistilBERT model for three-way text classification.
Posts are tokenized into subwords and encoded by the transformer, and the [CLS]
representation is passed to a small classification head (a pre-classifier layer
followed by a linear output layer) that is newly initialized and trained here.

*Slide: `nnintro-ch12-transformer`.*

### 3.1 Tokenization with DistilBERT

We load the DistilBERT tokenizer and set the maximum sequence length. The texts
are not tokenized here; each example is tokenized into subword IDs, with padding
and truncation to `max_length`, when the dataset class in the next section
retrieves it.
(Slide: `nnintro-ch12-transformer`.)


In [20]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# shorter sequence length to save memory
max_length = 128

print("Tokenizer loaded. max_length =", max_length)


Tokenizer loaded. max_length = 128
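
The effect of truncating and padding to a fixed `max_length` can be sketched without the real tokenizer. The helper below is hypothetical and works on already-computed token IDs; the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed to match the `distilbert-base-uncased` vocabulary, and the other IDs are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids (see lead-in)


def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], cut long inputs, and pad short ones to max_length."""
    body = token_ids[: max_length - 2]            # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, and the attention mask tells the model which positions are padding. With `max_length = 128`, anything beyond the first 126 subword tokens of a post is discarded.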


### 3.2 Torch Dataset wrapper

We define an `OnTheFlyDataset` class that stores the raw texts and labels and
tokenizes each example inside `__getitem__`, so that we can iterate over
examples with a DataLoader without holding all encodings in memory.
(Slides: `nnintro-ch5-ffnn`, `nnintro-ch12-transformer`.)


In [21]:
import torch
from torch.utils.data import Dataset

class OnTheFlyDataset(Dataset):
    """
    Tokenize each example on the fly inside __getitem__.
    This avoids storing all tokenized encodings in memory at once
    and avoids a long blocking pre-processing step.
    """
    def __init__(self, df, tokenizer, max_length=128):
        self.texts = df["text"].tolist()
        self.labels = df["class"].tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text  = self.texts[idx]
        label = self.labels[idx]

        enc = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(label, dtype=torch.long)
        return item

train_dataset = OnTheFlyDataset(train_df, tokenizer, max_length=max_length)
dev_dataset   = OnTheFlyDataset(dev_df, tokenizer, max_length=max_length)

print("Example item from train_dataset:")
sample = train_dataset[0]
for key in sample:
    if key != "labels":
        print(f"{key} shape: {sample[key].shape}, dtype: {sample[key].dtype}")
print("label:", sample["labels"])


Example item from train_dataset:
input_ids shape: torch.Size([128]), dtype: torch.int64
attention_mask shape: torch.Size([128]), dtype: torch.int64
label: tensor(0)


### 3.3 DataLoaders for training and dev

We create PyTorch `DataLoader` objects for the training and development sets,
which handle batching and shuffling of examples during training.
(Slide: `nnintro-ch5-ffnn` – minibatch training.)


In [29]:
from torch.utils.data import DataLoader

batch_size = 128
num_workers = 0  

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True
)
dev_loader = DataLoader(
    dev_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    pin_memory=True
)

print(f"# train batches: {len(train_loader)}")
print(f"# dev   batches: {len(dev_loader)}")


# train batches: 2176
# dev   batches: 544
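
The batch counts above follow from the split sizes and `batch_size = 128`. A small sketch of the arithmetic, using a hypothetical helper `num_batches` that mirrors the `drop_last=False` default of `DataLoader`:

```python
import math


def num_batches(n_examples, batch_size, drop_last=False):
    """Number of minibatches per epoch for a given dataset size."""
    if drop_last:
        return n_examples // batch_size           # incomplete final batch is discarded
    return math.ceil(n_examples / batch_size)     # incomplete final batch is kept


train_batches = num_batches(278_488, 128)
dev_batches = num_batches(69_622, 128)
last_train_batch = 278_488 % 128                  # size of the final, smaller batch

print(train_batches, dev_batches, last_train_batch)  # 2176 544 88
```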


### 3.4 Model and optimizer setup

We load a pre-trained `DistilBertForSequenceClassification` model, move it to
the chosen device, and set up the optimizer and learning-rate scheduler for
fine-tuning.
(Slide: `nnintro-ch12-transformer`.)


In [30]:
from transformers import DistilBertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW

num_labels = 3  # teenagers / depression / SuicideWatch

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels
)
model.to(device)

epochs = 3  # you can change this
optimizer = AdamW(model.parameters(), lr=2e-5)

total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

print("Model loaded on:", device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded on: cuda
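
With `num_warmup_steps=0`, the linear schedule simply decays the learning rate from `2e-5` to zero over all training steps. The function below is a hand-written sketch of that rule, not the library implementation; `linear_schedule_lr` is a hypothetical name, and the step count assumes the 2176 training batches per epoch reported above.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


total_steps = 2176 * 3                            # batches per epoch * epochs
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

The learning rate is at its maximum on the first step, half of it midway through training, and zero at the end.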


### 3.5 Training loop and dev metrics

We fine-tune the DistilBERT model for several epochs. After each epoch we
evaluate on the dev set and report accuracy, macro precision, macro recall,
and macro F1.
(Slides: `nnintro-ch12-transformer`, `nnintro-ch8-dist`.)


Install `ipywidgets` if it is not already available on your system (it is needed
for the tqdm notebook progress bars).

In [24]:
# %pip install ipywidgets 

In [31]:
import torch

# mixed precision (float16 autocast + loss scaling) is only enabled on GPU
use_amp = (device.type == "cuda")
scaler = torch.amp.GradScaler(enabled=use_amp)

In [32]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

for epoch in range(epochs):
    model.train()
    total_loss = 0.0

    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} - train"):
        optimizer.zero_grad()

        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}

        with torch.autocast(
            device_type=device.type,   
            dtype=torch.float16,       
            enabled=use_amp           
        ):
            outputs = model(**batch)
            loss = outputs.loss

        if use_amp:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()

        scheduler.step()
        total_loss += loss.item()


    
    avg_train_loss = total_loss / len(train_loader)
    
    # ---- eval on dev ----
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for batch in tqdm(dev_loader, desc=f"Epoch {epoch+1}/{epochs} - dev"):
            labels = batch["labels"].cpu().numpy()

            batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}

            outputs = model(**batch)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()

            all_preds.extend(preds)
            all_labels.extend(labels)

    
    acc = accuracy_score(all_labels, all_preds)
    prec, rec, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average="macro", zero_division=0
    )
    
    print(f"\nEpoch {epoch+1}/{epochs}")
    print(f"  train loss        : {avg_train_loss:.4f}")
    print(f"  dev accuracy      : {acc:.4f}")
    print(f"  dev macro-prec.   : {prec:.4f}")
    print(f"  dev macro-recall  : {rec:.4f}")
    print(f"  dev macro-F1      : {f1:.4f}")



Epoch 1/3
  train loss        : 0.3910
  dev accuracy      : 0.8560
  dev macro-prec.   : 0.8558
  dev macro-recall  : 0.8560
  dev macro-F1      : 0.8559



Epoch 2/3
  train loss        : 0.3133
  dev accuracy      : 0.8599
  dev macro-prec.   : 0.8603
  dev macro-recall  : 0.8599
  dev macro-F1      : 0.8601



Epoch 3/3
  train loss        : 0.2795
  dev accuracy      : 0.8604
  dev macro-prec.   : 0.8604
  dev macro-recall  : 0.8604
  dev macro-F1      : 0.8604


---

## 4. Evaluation & Error Analysis

We evaluate both models on a held-out development set using accuracy, precision,
recall, and macro F1. We also inspect misclassified examples to understand
typical errors and limitations.

*Slide: `nnintro-ch8-dist` (distributions and evaluation).*



### 4.1 Collect misclassified examples

Using the fine-tuned transformer and the dev DataLoader, we collect predictions,
compare them with the gold labels, and build a DataFrame of misclassified
examples for qualitative error analysis.
(Slide: `nnintro-ch8-dist` – evaluation and model analysis.)


In [33]:
# build id_to_label mapping (inverse of label_mapping)
id_to_label = {v: k for k, v in label_mapping.items()}

model.eval()
all_preds, all_labels, all_texts = [], [], []

with torch.no_grad():
    for batch in tqdm(dev_loader, desc="Collecting predictions for error analysis"):
        labels = batch["labels"].numpy()
        input_ids = batch["input_ids"]
        
        batch_device = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch_device)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1).cpu().numpy()
        
        all_labels.extend(labels)
        all_preds.extend(preds)
        
        texts = tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        all_texts.extend(texts)

error_df = pd.DataFrame({
    "text": all_texts,
    "gold": [id_to_label[int(y)] for y in all_labels],
    "pred": [id_to_label[int(y)] for y in all_preds]
})

errors = error_df[error_df["gold"] != error_df["pred"]]
print(f"# misclassified examples: {len(errors)}")

# show a few misclassified examples
errors.head(10)


# misclassified examples: 9722


|   | text | gold | pred |
|---|------|------|------|
| 2 | the only psychiatrist i could get in contact w... | SuicideWatch | depression |
| 5 | 29 m just not good enough. i ' m off my antide... | SuicideWatch | depression |
| 23 | i ' m ready! haha yay!... please be happy. i h... | SuicideWatch | depression |
| 30 | decided to quit my major, buying gun sooni pos... | SuicideWatch | depression |
| 31 | i don ' t know anymoreman, i am just really mi... | SuicideWatch | depression |
| 35 | i am cripplingly addicted to custard edit : r ... | teenagers | depression |
| 51 | so depressed right nowit always happens i do s... | SuicideWatch | depression |
| 52 | just 2 hours left : ) my parents are going to ... | depression | SuicideWatch |
| 55 | everything is falling apart i originally poste... | SuicideWatch | teenagers |
| 79 | i ' m tempted to kill my self. i keep telling ... | SuicideWatch | depression |
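
A quick way to summarize such errors is to count (gold, pred) pairs. The sketch below does this for the ten rows shown above, copied by hand, and also checks that the error count is consistent with the dev accuracy reported after the last epoch. Ten rows are far too few to characterize all 9722 errors; on the full `error_df`, something like `pd.crosstab(error_df["gold"], error_df["pred"])` would give the complete confusion table.

```python
from collections import Counter

# (gold, pred) pairs of the ten misclassified examples displayed above
shown_errors = [
    ("SuicideWatch", "depression"),
    ("SuicideWatch", "depression"),
    ("SuicideWatch", "depression"),
    ("SuicideWatch", "depression"),
    ("SuicideWatch", "depression"),
    ("teenagers", "depression"),
    ("SuicideWatch", "depression"),
    ("depression", "SuicideWatch"),
    ("SuicideWatch", "teenagers"),
    ("SuicideWatch", "depression"),
]

pair_counts = Counter(shown_errors)
most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)

# consistency check: 9722 errors out of 69622 dev examples
dev_accuracy = 1 - 9722 / 69622
print(round(dev_accuracy, 4))  # 0.8604
```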


---

## 5. Save model & conclusions

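We save the fine-tuned model and tokenizer for future use. On the dev set, the
TF-IDF + logistic regression baseline reaches 0.82 accuracy (macro F1 0.81),
while the fine-tuned DistilBERT model reaches 0.8604 accuracy and macro F1 after
three epochs. Seven of the ten misclassified examples shown in Section 4.1 are
`SuicideWatch` posts predicted as `depression`, which suggests that separating
these two classes is the main remaining difficulty.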

*Slide: `nnintro-prompt` (research mindset and applications).*



In [34]:
save_dir = "saved_distilbert_model"

model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"Model and tokenizer saved to: {save_dir}")


Model and tokenizer saved to: saved_distilbert_model


---

## References to course slides

- `nnintro-prompt`: Motivation, NLP applications, research mindset.
- `nnintro-ch3-lr`: Logistic regression and softmax classification.
- `nnintro-ch5-ffnn`: Feed-forward neural networks.
- `nnintro-ch8-dist`: Data distributions and evaluation metrics.
- `nnintro-ch12-transformer`: Transformer architecture and self-attention.