Paper Title: Identifying Depression on Reddit: The Effect of Training Data

Code Link: https://github.com/Inusette/Identifying-depression/tree/master/Data_Collector

# Data

Eight datasets were created/used:

DSF: Depression Support Forums (Ramirez-Esparza et al., 2008)

DND: Non-Depression Forums (Gorbunova, 2017)

DS: Reddit Depression Support subreddit

BC: Reddit Breast Cancer subreddit (control)

FF: Reddit Family/Friendship advice subreddits (control)

DO: Posts by diagnosed depressed authors in other subreddits

ND: Posts by non-depressed authors in other subreddits

AllD and AllND: Combinations of all positive (depression) sets and all negative (non-depression) sets, respectively

Each dataset contained ~400 training posts; DO/ND also had separate test sets.
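
As a rough illustration of how the combined AllD and AllND sets relate to the individual datasets, the sketch below groups the dataset codes by class and concatenates them. The label strings, the `combine` helper, and the toy posts are assumptions made for this example; the notes here do not say exactly how the authors built the combined sets.

```python
# Dataset codes from the list above, mapped to their class.
# The label strings are assumptions made for this sketch.
DATASET_LABELS = {
    "DSF": "depression",
    "DS": "depression",
    "DO": "depression",
    "DND": "non-depression",
    "BC": "non-depression",
    "FF": "non-depression",
    "ND": "non-depression",
}


def combine(datasets, label):
    """Concatenate the posts of every dataset that carries `label`."""
    combined = []
    for code, posts in datasets.items():
        if DATASET_LABELS[code] == label:
            combined.extend(posts)
    return combined


# Toy placeholder posts; the real sets hold roughly 400 training posts each.
toy = {code: [f"{code} post {i}" for i in range(2)] for code in DATASET_LABELS}
all_d = combine(toy, "depression")       # hypothetical AllD
all_nd = combine(toy, "non-depression")  # hypothetical AllND
print(len(all_d), len(all_nd))
```

With three positive and four negative source sets, the combined negative set ends up larger than the positive one unless it is rebalanced afterwards.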

# How are the depression and non-depression classes separated?

Depression (Positive Class)

Depression Support Forums (DSF)
Posts explicitly written in online forums by users self-identifying as depressed (Ramirez-Esparza et al., 2008).

Reddit Depression Support Subreddit (DS)
Posts from r/depression, where authors seek community support.

Diagnosed Depression in Other Subreddits (DO)
This set was built in two steps. First, the authors identified users who explicitly mention a depression diagnosis in r/depression (e.g., “I was just diagnosed with depression”).

Then they collected these same users’ posts in other, unrelated subreddits, excluding mental-health communities such as r/Anxiety, r/mentalhealth, and r/depression_help.

This produces depression-positive examples outside of overt depression discussions.
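
The two-step DO selection can be sketched as follows. This is a minimal illustration, not the authors' collector: the `Post` tuple, the regular expression, and the exact exclusion set are assumptions, and the text only gives examples of excluded subreddits rather than a complete list.

```python
import re
from collections import namedtuple

Post = namedtuple("Post", ["author", "subreddit", "text"])

# Hypothetical pattern; the authors' exact matching rule is not given here.
DIAGNOSIS_PATTERN = re.compile(r"\bdiagnosed with depression\b", re.IGNORECASE)

# Subreddits named in the text as excluded, plus r/depression itself.
# The real exclusion list may be longer (assumption).
EXCLUDED = {"depression", "anxiety", "mentalhealth", "depression_help"}


def select_do_posts(posts):
    """Return posts by self-reported diagnosed users outside excluded subreddits."""
    # Step 1: authors who mention a diagnosis in r/depression.
    diagnosed = {
        p.author
        for p in posts
        if p.subreddit.lower() == "depression" and DIAGNOSIS_PATTERN.search(p.text)
    }
    # Step 2: those authors' posts in other, non-excluded subreddits.
    return [
        p for p in posts
        if p.author in diagnosed and p.subreddit.lower() not in EXCLUDED
    ]


posts = [
    Post("alice", "depression", "I was just diagnosed with depression."),
    Post("alice", "cooking", "Any tips for a quick weeknight dinner?"),
    Post("alice", "Anxiety", "Feeling on edge again today."),
    Post("bob", "depression", "Rough week, just need to vent."),
    Post("bob", "gaming", "Finally finished the campaign."),
]
do_posts = select_do_posts(posts)
print([(p.author, p.subreddit) for p in do_posts])  # [('alice', 'cooking')]
```

Only the cooking post survives: it is the one post by a self-reported diagnosed user that lies outside the excluded communities.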

Non-Depression (Negative Class)

Non-Depression Forums (DND)
Control group texts from forums unrelated to depression (Gorbunova, 2017).

Reddit Breast Cancer Subreddit (BC)
Chosen as a comparison group since users discuss illness but not depression.

Reddit Family/Friendship Advice Subreddits (FF)
Posts topically closer to depression but not about mental health (e.g., seeking advice on family or friendship).

No Depression in Other Subreddits (ND)
Posts from Reddit users who never posted in depression-related communities during the same timeframe.

This group serves as the true control for DO (so DO vs. ND is the most realistic classification setting).

Important Note

For DO and ND, the authors kept only one post per author so that prolific writers would not bias the data.

Unlike Yates et al. (2017), the authors did not manually validate every diagnosis claim, so some false positives are possible (users might exaggerate or misuse the phrase “diagnosed with depression”).
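
The one-post-per-author rule for DO and ND could be applied along these lines. How the authors chose which post to keep is not stated here, so the seeded random choice below is an assumption, as are the `one_post_per_author` helper and the toy data.

```python
import random


def one_post_per_author(posts, seed=0):
    """Keep a single post per author from (author, text) pairs.

    The random pick is an assumption; any fixed rule (first post,
    longest post) would limit prolific writers in the same way.
    """
    rng = random.Random(seed)
    by_author = {}
    for author, text in posts:
        by_author.setdefault(author, []).append(text)
    return {author: rng.choice(texts) for author, texts in by_author.items()}


posts = [
    ("alice", "a1"),
    ("alice", "a2"),
    ("alice", "a3"),
    ("bob", "b1"),
]
sample = one_post_per_author(posts)
print(len(sample))  # 2: one entry per author, however many posts each wrote
```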

```python
from pathlib import Path
import pandas as pd

# Folders relative to the notebook location (Data_Process)
project_dir = Path.cwd()
source_dir = project_dir / "Data_Lake" / "Dataset_1"
warehouse_dir = project_dir / "Data_Warehouse"
warehouse_dir.mkdir(parents=True, exist_ok=True)  # ensure target exists

rows = []

# Walk all subfolders under Dataset_1 and collect .txt files
for txt_path in source_dir.rglob("*.txt"):
    try:
        text = txt_path.read_text(encoding="utf-8", errors="ignore").strip()
        if text:  # skip empty files
            # optional: keep the immediate subfolder name as source
            source_folder = txt_path.parent.name
            rows.append({"text": text, "label": "depression", "source": source_folder})
    except Exception as e:
        print(f"Could not read {txt_path}: {e}")

# Build DataFrame and save
df = pd.DataFrame(rows, columns=["text", "label", "source"])
out_path = warehouse_dir / "depression_dataset.csv"
df.to_csv(out_path, index=False, encoding="utf-8")

print(f"Found {len(rows)} texts")
print(f"Saved to: {out_path}")
```

```
Found 3023 texts
Saved to: d:\Sajjad-Workspace\XAI\Data_Process\Data_Warehouse\depression_dataset.csv
```
