Paper Title: Identifying Depression on Reddit: The Effect of Training Data

Paper Link (code repository): https://github.com/Inusette/Identifying-depression/tree/master/Data_Collector

# Data

Eight datasets were created/used:

DSF: Depression Support Forums (Ramirez-Esparza et al., 2008)

DND: Non-Depression Forums (Gorbunova, 2017)

DS: Reddit Depression Support subreddit

BC: Reddit Breast Cancer subreddit (control)

FF: Reddit Family/Friendship advice subreddits (control)

DO: Posts by diagnosed depressed authors in other subreddits

ND: Posts by non-depressed authors in other subreddits

AllD–AllND: Combination of all positive and negative sets

Each dataset contained ~400 training posts; DO/ND also had separate test sets.

# How are depression and non-depression posts separated?

Depression (Positive Class)

Depression Support Forums (DSF)
Posts explicitly written in online forums by users self-identifying as depressed (Ramirez-Esparza et al., 2008).

Reddit Depression Support Subreddit (DS)
Posts from r/depression, where authors seek community support.

Diagnosed Depression in Other Subreddits (DO)
Users are first identified in r/depression by posts that explicitly mention a diagnosis (e.g., “I was just diagnosed with depression”).

The paper's authors then collected these same users’ posts in other subreddits, excluding mental-health-related ones such as r/Anxiety, r/mentalhealth, and r/depression_help.

This produces depression-positive examples outside of overt depression discussions.
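The DO selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`author`, `subreddit`, `text`), the matching phrase, and the exact exclusion list are hypothetical and are not taken from the paper's collector code.

```python
import pandas as pd

# Hypothetical phrase; the paper's exact matching rules are not given here.
DIAGNOSIS_PHRASE = r"\bdiagnosed with depression\b"

# Assumed exclusion list: the three subreddits named above plus r/depression itself.
EXCLUDED_SUBREDDITS = {"depression", "anxiety", "mentalhealth", "depression_help"}


def select_do_posts(posts: pd.DataFrame) -> pd.DataFrame:
    """Return posts by self-reported diagnosed authors outside excluded subreddits.

    Assumes columns 'author', 'subreddit', and 'text' (illustrative names).
    """
    subreddit = posts["subreddit"].str.lower()

    # Step 1: authors who mention a diagnosis inside r/depression
    in_depression = subreddit == "depression"
    mentions_diagnosis = posts["text"].str.contains(
        DIAGNOSIS_PHRASE, case=False, na=False, regex=True
    )
    diagnosed_authors = set(posts.loc[in_depression & mentions_diagnosis, "author"])

    # Step 2: the same authors' posts in subreddits that are not excluded
    by_diagnosed = posts["author"].isin(diagnosed_authors)
    elsewhere = ~subreddit.isin(EXCLUDED_SUBREDDITS)
    return posts[by_diagnosed & elsewhere].reset_index(drop=True)


posts = pd.DataFrame({
    "author": ["a", "a", "a", "b", "b"],
    "subreddit": ["depression", "cooking", "Anxiety", "depression", "movies"],
    "text": [
        "I was just diagnosed with depression",
        "Any tips for sourdough?",
        "Feeling on edge today",
        "Rough week",
        "Loved this film",
    ],
})
do_posts = select_do_posts(posts)
print(do_posts[["author", "subreddit"]])
```

In this toy frame only author `a` states a diagnosis, so only that author's post in `cooking` is kept; the post in `Anxiety` is dropped by the exclusion list.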

Non-Depression (Negative Class)

Non-Depression Forums (DND)
Control group texts from forums unrelated to depression (Gorbunova, 2017).

Reddit Breast Cancer Subreddit (BC)
Chosen as a comparison group since users discuss illness but not depression.

Reddit Family/Friendship Advice Subreddits (FF)
Posts topically closer to depression but not about mental health (e.g., seeking advice on family or friendship).

No Depression in Other Subreddits (ND)
Posts from Reddit users who never posted in depression-related communities during the same timeframe.

This group serves as the true control for DO (so DO vs. ND is the most realistic classification setting).
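The class assignment above can be written down as a small lookup table. This is a sketch for bookkeeping only; the dictionary and the helper name `combined_sets` are illustrative, and the label strings are assumptions rather than values from the paper.

```python
# Dataset code -> class, following the grouping described above.
DATASET_CLASS = {
    "DSF": "depression",
    "DS": "depression",
    "DO": "depression",
    "DND": "non-depression",
    "BC": "non-depression",
    "FF": "non-depression",
    "ND": "non-depression",
}


def combined_sets(dataset_class):
    """Group dataset codes by class to form the AllD and AllND combinations."""
    all_d = sorted(k for k, v in dataset_class.items() if v == "depression")
    all_nd = sorted(k for k, v in dataset_class.items() if v == "non-depression")
    return {"AllD": all_d, "AllND": all_nd}


combos = combined_sets(DATASET_CLASS)
print(combos)
# {'AllD': ['DO', 'DS', 'DSF'], 'AllND': ['BC', 'DND', 'FF', 'ND']}
```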

Important Note

For DO and ND, the paper's authors kept one post per author to avoid bias from prolific writers.

Unlike Yates et al. (2017), they did not manually validate every diagnosis claim, so some false positives are possible (users might exaggerate or misuse “diagnosed with depression”).
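The one-post-per-author rule noted above could be applied along these lines. This is a minimal sketch: the `author` column name is hypothetical, and picking the post at random is an assumption, since how the single post was chosen is not stated here.

```python
import pandas as pd


def one_post_per_author(posts: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep a single randomly chosen post for each author.

    Assumes an 'author' column (illustrative name). Random choice is an
    assumption, not the paper's documented procedure.
    """
    # Shuffle first so drop_duplicates keeps a random post, not the earliest one
    shuffled = posts.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset="author").sort_index()


author_posts = pd.DataFrame({
    "author": ["a", "a", "b", "c", "c", "c"],
    "text": ["p1", "p2", "p3", "p4", "p5", "p6"],
})
sampled = one_post_per_author(author_posts)
print(sampled)
```

Fixing `seed` keeps the selection reproducible across notebook runs.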

In [4]:
from pathlib import Path
import pandas as pd

# Folders relative to the notebook location (Data_Process)
project_dir = Path.cwd()
source_dir = project_dir / "Data_Lake" / "Dataset_1"
warehouse_dir = project_dir / "Data_Warehouse"
warehouse_dir.mkdir(parents=True, exist_ok=True)  # ensure target exists

rows = []

# Walk all subfolders under Dataset_1 and collect .txt files
for txt_path in source_dir.rglob("*.txt"):
    try:
        text = txt_path.read_text(encoding="utf-8", errors="ignore").strip()
        if text:  # skip empty files
            # optional: keep the immediate subfolder name as source
            source_folder = txt_path.parent.name
            rows.append({"text": text, "label": "depression", "source": source_folder})
    except OSError as e:  # unreadable file (permissions, I/O); bad bytes are already ignored above
        print(f"Could not read {txt_path}: {e}")

# Build DataFrame and save
df = pd.DataFrame(rows, columns=["text", "label", "source"])
out_path = warehouse_dir / "depression_dataset.csv"
df.to_csv(out_path, index=False, encoding="utf-8")

print(f"Found {len(rows)} texts")
print(f"Saved to: {out_path}")


Found 3023 texts
Saved to: d:\Sajjad-Workspace\XAI\Data_Process\Data_Warehouse\depression_dataset.csv


## Temporary code to rename columns and add new column

In [None]:

import pandas as pd
from pathlib import Path

# Correct relative path (since notebook is in Data_Process)
in_path = Path("Data_Warehouse/depression_dataset.csv")

# Load the CSV
df = pd.read_csv(in_path)

# Rename 'source' → 'sub-source'
df = df.rename(columns={"source": "sub-source"})

# Add new column 'source' with constant value
df["source"] = "dataset_1"

# Save updated CSV
out_path = Path("Data_Warehouse/depression_dataset_final.csv")
df.to_csv(out_path, index=False, encoding="utf-8")

print(f"Updated CSV saved at {out_path}")


Updated CSV saved at Data_Warehouse\depression_dataset_final.csv


In [1]:
from pathlib import Path
import pandas as pd

# ===== CONFIG =====
project_dir = Path.cwd()              # assumes you are in Data_Process
warehouse_dir = project_dir / "Data_Warehouse"

# ===== find depression datasets =====
depression_files = sorted(warehouse_dir.glob("depression_dataset*.csv"))
if not depression_files:
    raise FileNotFoundError("No depression_dataset*.csv files found in Data_Warehouse")

# ===== check each file =====
summary = []

for file in depression_files:
    df = pd.read_csv(file)

    if "text" not in df.columns:
        print(f"Skipping {file.name} (no 'text' column found)")
        continue

    # Vectorized word count; missing text counts as 0 words instead of the literal "nan"
    df["word_count"] = df["text"].fillna("").astype(str).str.split().str.len()
    long_posts = df[df["word_count"] > 400]

    summary.append({
        "file": file.name,
        "total_posts": len(df),
        "posts_gt_400_words": len(long_posts)
    })

# ===== summary table =====
summary_df = pd.DataFrame(summary)
print(summary_df.to_string(index=False))


                    file  total_posts  posts_gt_400_words
  depression_dataset.csv         3023                 371
depression_dataset_2.csv          915                  88


In [2]:
from pathlib import Path
import pandas as pd

# ===== CONFIG =====
project_dir = Path.cwd()                  # run this from Data_Process
warehouse_dir = project_dir / "Data_Warehouse"
pattern = "depression_dataset*.csv"
WORD_CAP = 400

def get_unique_path(base_dir: Path, base_name: str) -> Path:
    """Return a unique path by adding _2, _3, ... if needed."""
    out_path = base_dir / base_name
    if not out_path.exists():
        return out_path
    # stem/suffix also handle names without an extension, where rsplit(".") would fail
    i = 2
    while True:
        candidate = out_path.with_name(f"{out_path.stem}_{i}{out_path.suffix}")
        if not candidate.exists():
            return candidate
        i += 1

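# ===== find input files =====
# Skip earlier outputs of this cell, which also match the pattern on a re-run
files = sorted(p for p in warehouse_dir.glob(pattern) if "small_merged" not in p.name)
if not files:
    raise FileNotFoundError(f"No files matched {pattern} in {warehouse_dir}")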

# ===== process and merge =====
merged_parts = []
total_rows = 0
total_kept = 0
total_removed = 0

for fp in files:
    df = pd.read_csv(fp)
    if "text" not in df.columns:
        print(f"Skipping {fp.name} because 'text' column not found")
        continue

    # Vectorized word count; missing text counts as 0 words instead of the literal "nan"
    df["word_count"] = df["text"].fillna("").astype(str).str.split().str.len()
    kept = df[df["word_count"] <= WORD_CAP].drop(columns=["word_count"])
    removed = len(df) - len(kept)

    merged_parts.append(kept)
    total_rows += len(df)
    total_kept += len(kept)
    total_removed += removed

# combine
if not merged_parts:
    raise ValueError("No valid datasets to merge after filtering")

merged = pd.concat(merged_parts, ignore_index=True)
merged = merged.drop_duplicates().reset_index(drop=True)

# ===== save =====
out_path = get_unique_path(warehouse_dir, "depression_dataset_small_merged.csv")
merged.to_csv(out_path, index=False, encoding="utf-8")

# ===== report =====
print(f"Input files: {[fp.name for fp in files]}")
print(f"Total rows in input: {total_rows}")
print(f"Rows kept (<= {WORD_CAP} words): {total_kept}")
print(f"Rows removed (> {WORD_CAP} words): {total_removed}")
print(f"Deduplicated rows in output: {len(merged)}")
print(f"Saved to: {out_path.relative_to(project_dir)}")


Input files: ['depression_dataset.csv', 'depression_dataset_2.csv']
Total rows in input: 3938
Rows kept (<= 400 words): 3479
Rows removed (> 400 words): 459
Deduplicated rows in output: 3476
Saved to: Data_Warehouse\depression_dataset_small_merged.csv
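One caveat on the deduplication step: `drop_duplicates()` with no arguments compares whole rows, so the same text stored under different `source` values would survive. The sketch below counts text-only repeats; the helper name `count_text_duplicates` and the whitespace normalization are assumptions added for illustration, not part of the original pipeline.

```python
import pandas as pd


def count_text_duplicates(df: pd.DataFrame, text_col: str = "text") -> int:
    """Count rows whose whitespace-normalized text repeats an earlier row."""
    normalized = df[text_col].fillna("").astype(str).str.split().str.join(" ")
    return int(normalized.duplicated().sum())


demo = pd.DataFrame({
    "text": ["I feel tired", "I  feel tired", "Different post"],
    "source": ["dataset_1", "dataset_2", "dataset_2"],
})
full_row_count = len(demo.drop_duplicates())
text_repeat_count = count_text_duplicates(demo)
print(full_row_count)     # 3: full-row comparison keeps every row
print(text_repeat_count)  # 1: text-only comparison finds one repeat
```

If the count is non-zero on the merged file, `drop_duplicates(subset="text")` would be one option to consider, at the cost of discarding the extra rows' metadata.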
