# Text Classification by DistilBERT

Purpose: 
Explore using DistilBERT, a pretrained Transformer model distributed through Hugging Face, to classify the anxiety level (low or high) of Reddit posts.

In [1]:
import pandas as pd
import numpy as np


In [2]:
raw_data = pd.read_csv('../data/processed/reddit_submission.csv')
raw_data.dropna(subset=['selftext'], inplace=True)
raw_data['full_text'] = raw_data['title'] + ' ' + raw_data['selftext']
print(raw_data.info())
raw_data.head()

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/reddit_submission.csv'

## Extract and Label Sample Data

**Purpose**: Use stratified sampling with a floor-and-proportional-remainders allocation to extract a small sample, so the DistilBERT pipeline can be tested efficiently.

In [None]:
# Extract Sample
# display(raw_data['subreddit'].value_counts())

# B = 50
# F = 5
# K = len(raw_data['subreddit'].unique())

# num_samples_dict = {}

# for subreddit in raw_data['subreddit'].unique():
#     N_i = len(raw_data[raw_data['subreddit'] == subreddit])
#     num_samples = F + round((B - F * K) * N_i / raw_data.shape[0])
#     num_samples_dict[subreddit] = num_samples


# sample_data = []
# for key, value in num_samples_dict.items():
#     print(f"{key}: {value}")
#     sample = raw_data[raw_data['subreddit'] == key].sample(value, random_state=42)
#     sample_data.append(sample)
# sample_data = pd.concat(sample_data)
# display(sample_data.head(10))
# print(sample_data.shape)
# sample_data.to_csv('../data/processed/reddit_sample_for_labeling.csv', index=False)

# -- Delete this cell, function moved to an individual python file --- 
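
The allocation rule in the commented-out cell above can be sketched as a small standalone function. This is a minimal illustration, not the code that was moved to the separate Python file. The function name `allocate_samples`, its plain-dict input, and the cap at the group size are assumptions. The budget, floor, and number of groups correspond to `B`, `F`, and `K` in the cell. Because each share is rounded independently, the total can differ slightly from the budget.

```python
def allocate_samples(group_sizes, budget, floor):
    """Give each group `floor` rows, then split the rest of `budget`
    in proportion to group size (hypothetical helper mirroring the cell above)."""
    k = len(group_sizes)
    remainder = budget - floor * k
    if remainder < 0:
        raise ValueError("budget is too small to give every group its floor")
    total = sum(group_sizes.values())
    allocation = {}
    for name, size in group_sizes.items():
        share = floor + round(remainder * size / total)
        # Assumption beyond the original cell: never request more rows than exist.
        allocation[name] = min(share, size)
    return allocation


# Made-up group sizes, used only to show the arithmetic.
example = allocate_samples({"a": 100, "b": 400, "c": 500}, budget=45, floor=5)
print(example)  # {'a': 8, 'b': 17, 'c': 20}
```

With the notebook's data, `group_sizes` could be built from `raw_data['subreddit'].value_counts().to_dict()`.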

**Instruction for Labeling**
- 0 – None: You don’t feel anxious when reading the full_text.
- 1 – Minimal: The full_text raises brief concern, but with no impairment.
- 2 – Mild: The full_text triggers occasional or situational worry, but daily functioning remains intact.
- 3 – Moderate: The full_text leads to frequent worry or rumination, with some impact on focus or sleep.
- 4 – Severe: The full_text evokes strong fear, anticipation, or catastrophizing language, showing clear impairment.
- 5 – Crisis-level Severe: The full_text reflects major impairment.
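
The purpose statement mentions a low/high classification, while the rubric above has six levels. If the two are to be reconciled, one option is to collapse the scale with a cutoff. The sketch below is only an illustration: the helper name `to_binary_level` and the cutoff of 3 are assumptions, not choices made in this notebook.

```python
# Level names copied from the labeling instructions above.
ANXIETY_LEVEL_NAMES = {
    0: "None",
    1: "Minimal",
    2: "Mild",
    3: "Moderate",
    4: "Severe",
    5: "Crisis-level Severe",
}


def to_binary_level(level, cutoff=3):
    """Map a 0-5 label to 'low' or 'high' (hypothetical helper; cutoff is an assumption)."""
    if level not in ANXIETY_LEVEL_NAMES:
        raise ValueError(f"unknown anxiety level: {level}")
    return "high" if level >= cutoff else "low"


binary = [to_binary_level(level) for level in range(6)]
print(binary)  # ['low', 'low', 'low', 'high', 'high', 'high']
```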

In [None]:
label_data = pd.read_csv('../data/processed/reddit_sample_for_labeled.csv')
print(label_data['anxiety_level'].value_counts())

print('-'*20)

for i in range(6):
    print(f"Anxiety Level {i}: {len(label_data[label_data['anxiety_level'] == i])} samples")
    display(label_data[label_data['anxiety_level'] == i][['full_text']].values[:3])

anxiety_level
0    24
3     8
2     6
4     5
1     5
5     3
Name: count, dtype: int64
--------------------
Anxiety Level 0: 24 samples


array([['Health anxiety people who here is diagnosed with health ocd? \n\n[View Poll](https://www.reddit.com/poll/1gyab28)'],
       ["Equity X-Ray: In-Depth Research #23 I feel the market is misunderstanding the transitional shift in the Aebi Schmidt Group (AEBI). After its reverse merger with The Shyft Group, the market is stuck on the ghost of the old company and has yet to shift its view to a new global industrial leader, with a competitive business model, not to mention a remarkable valuation inconsistency. I feel investors are diluting their view on the company's strong, and non-discretionary revenue drivers, and historically powerful free-cash-flow-generative business model, because they view the merger as so complicated. Quite frankly, in my review, I believe the business model is simple, and the long-term growth prospects are fair, and because of my purposely conservative valuation model, I also feel that there is significant mispricing.\n\nI am officially covering the stock w

Anxiety Level 1: 5 samples


array([['Mirtazapine Side Effects On day 4 of 15 mg.  The hunger has already gotten better, along with the sedation.  However, I am now getting very vivid dreams, vivid to the point my sleep isn’t feeling very restful.  I also experience body zaps at night and my muscles, especially in my legs, are much tighter than they used to be, to the point where I’m getting cramps where I rarely got cramps in the past.  Curious if anyone else has experienced these side effects and if they go away with time.  I would like to stay on it l, as I can already feel a noticeable positive shift in my mood and I expect that to get better up until the 4-6 week mark.'],
       ["Psychologist's advice for me and you. So, I've been going through a medical whirlwind, and I cry about it to my psych often. Here's what she said. Rules of thumb, if you will.\n\nGoogling is bad, especially in specific side effects and symptoms! Because it compiles EVERY possible aspect, especially meds side effects! (because compan

Anxiety Level 2: 6 samples


array([["Discussing/Disclosing your HA to a doctor? Hello all! I have had some pretty intense health anxiety that developed from a combination of life situations and health situations. I'm curious if anyone has had success discussing their health anxiety with a doctor. I read stories all the time about how doctors are dismissive of symptoms and write them off as anxiety, or how doctors treat patients who have health anxiety like they're just paranoid and don't listen to their concerns. If you have, I'd love to know how that experience went for you. If it wasn't a good experience, is there anything you would do differently if you knew what you know now? I'm considering bringing it up with my doctor, but I'm concerned I'll be written off since I already have anxiety issues.\n\nThanks for any insight you can give!"],
       ['How do you calm HA when big events are approaching?    Hi everyone. I’ve been experiencing health anxiety for a while, but I often find it spiraling more when I have

Anxiety Level 3: 8 samples


array([['How to reduce anxiety during this specific occasion So I\'m in uni and it\'s exam period. I have always had stress before an exam but it never caused me any damage. But this exam period, on the first day I almost fainted in the bus (because of the hot weather, I hadn\'t eaten well etc) but it kinda left me traumatized. During these last days my mind has really been torturing me and I\'ve started being anxious without a reason. It\'s like my mind has been stuck. So I wanted to ask if you have some advice on how to distract my mind while being in the bus, because I know for sure my mind won\'t let me relax and I\'m afraid it will happen again. It\'s also a time when Im the most "alone with myself" because I don\'t have anyone to talk with to distract myself or do things with my phone (except for listening to music) so I don\'t really know what to do. '],
       ['Reminder, take your medicine even if you’re afraid As someone who has been dealing with high stress and health anxiet

Anxiety Level 4: 5 samples


array([['I had a panic attack just looking at jobs last night.. I’ve been working from home for the past 3 years, and I feel like it’s really caused my social skills to lack. I thought marijuana was causing it, but I’m 5 months sober and I still forget what I’m saying sometimes or forget to enunciate my words lol. I need a new job soon because I won’t be able to afford to stay at my current job, and I’m bored with it anyway. I know I’ve needed to look for jobs for a while now, but I’ve been avoiding to an extent that I never have before. After finally getting my resume mostly together after MONTHS, I figured I would just do a simple search on indeed to see what’s out there. Well, I saw a job that I thought would be cool and then I had a panic attack shortly after. Part of it excited me, but then I started thinking about having to interview, having to go into the office, am I capable of being a professional, what if people don’t like me, bad boss PTSD, etc. I feel so pathetic because I 

Anxiety Level 5: 3 samples


array([["I don't know what to do So I'm seeking help can't go to professional therapist so I often search and seek reassurance from @i or Google and ik it's bad but atp I simply don't know... I found reddit and it's been helpful to know I'm not alone going through this mental turmoil but there is just so much going on inside my head all the intrusive thoughts all the excessive feeling if I start writing down each and everything it would take me nice 20 mins the least and it's just to much confusing to even begin with, all the fucking things i have done in past eats me everyday and yes sucidial thoughts r always there but i am not gonna act upon it cuz I have someone who I need to live for and yes I do have people who love me but I feel like a disappointment of their life and I might be overthinking all this but one thing is for sure that even if I keep all my relations aside I can't handle myself anymore it's just getting more difficult and I always pray to god to do smthg about it may

## Train DistilBERT Model

In [None]:
import numpy as np, pandas as pd, torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)                         
from sklearn.model_selection import train_test_split
from datasets import Dataset

In [None]:
# Drop columns that are not used as model input.
processed_data = label_data.drop(columns=['post_id', 'subreddit', 'title'])

test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(processed_data['full_text'], processed_data['anxiety_level'], test_size=test_size, stratify=processed_data['anxiety_level'], random_state=42)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

# Build Hugging Face datasets from the split; plain lists avoid carrying the pandas index along.
train_ds = Dataset.from_dict({"text": X_train.tolist(), "label": y_train.tolist()})
test_ds  = Dataset.from_dict({"text": X_test.tolist(),  "label": y_test.tolist()})

Train size: 40, Test size: 11


In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
test_ds  = test_ds.map(tokenize, batched=True)


cols = ["input_ids", "attention_mask", "label"]
train_ds.set_format(type="torch", columns=cols)
test_ds.set_format(type="torch", columns=cols)

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

In [None]:
num_labels = len(np.unique(y_train))
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="distilbert-anxiety",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

def compute_metrics(eval_pred):
    from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, root_mean_squared_error
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "mean_squared_error": mean_squared_error(labels, preds),
        "root_mean_squared_error": root_mean_squared_error(labels, preds)
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,   # or a separate validation set if you have one
    tokenizer=tokenizer,    # recent transformers releases deprecate this in favor of processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
print(metrics)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 | Mean Squared Error | Root Mean Squared Error |
|---|---|---|---|---|---|---|
| 1 | No log | 1.717561 | 0.454545 | 0.104167 | 5.818182 | 2.412091 |
| 2 | No log | 1.695269 | 0.454545 | 0.104167 | 5.818182 | 2.412091 |
| 3 | No log | 1.684683 | 0.454545 | 0.104167 | 5.818182 | 2.412091 |
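
The `No log` entries in the Training Loss column are expected for a run this short. The following is a rough check, assuming a single device, no gradient accumulation, and the `TrainingArguments` default of logging every 500 optimizer steps (`logging_steps` is not set in the arguments above):

```python
import math

train_rows = 40               # from the "Train size" output
batch_size = 16               # per_device_train_batch_size
epochs = 3                    # num_train_epochs
default_logging_steps = 500   # TrainingArguments default, assumed unchanged here

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)          # 3 9
print(total_steps < default_logging_steps)   # True
```

Setting a small `logging_steps` value in `TrainingArguments` would be one way to surface the training loss on a dataset of this size.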


{'eval_loss': 1.6846829652786255, 'eval_accuracy': 0.45454545454545453, 'eval_macro_f1': 0.10416666666666667, 'eval_mean_squared_error': 5.818181818181818, 'eval_root_mean_squared_error': 2.412090756622109, 'eval_runtime': 17.226, 'eval_samples_per_second': 0.639, 'eval_steps_per_second': 0.058, 'epoch': 3.0}
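
Accuracy, macro F1, and MSE are identical across all three epochs, which is what a classifier that always predicts level 0 would produce. The check below is a sketch under one assumption: the 11 test labels are taken to be five 0s, two 3s, and one each of 1, 2, 4, and 5. That breakdown is plausible for a stratified 20% split of the 51 labeled rows, but it is not printed in the notebook.

```python
import math

# Assumed test-set labels (not shown in the notebook).
labels = [0] * 5 + [1, 2, 3, 3, 4, 5]
preds = [0] * len(labels)  # constant prediction: always level 0

pairs = list(zip(preds, labels))
accuracy = sum(p == y for p, y in pairs) / len(labels)


def f1_for(cls):
    """Per-class F1; returns 0.0 when the class is never predicted or present."""
    tp = sum(p == cls and y == cls for p, y in pairs)
    fp = sum(p == cls and y != cls for p, y in pairs)
    fn = sum(p != cls and y == cls for p, y in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


classes = sorted(set(labels) | set(preds))
macro_f1 = sum(f1_for(c) for c in classes) / len(classes)
mse = sum((p - y) ** 2 for p, y in pairs) / len(labels)
rmse = math.sqrt(mse)

print(round(accuracy, 6), round(macro_f1, 6), round(mse, 6), round(rmse, 6))
# 0.454545 0.104167 5.818182 2.412091
```

Under that assumption the numbers match the reported metrics, which suggests the fine-tuned model is predicting the majority class for every test post. With only 40 training rows and three epochs, that would not be surprising.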
