# 01. Data Structure & Synthetic Placeholder Construction

## Public Release (Sanitized Version)

This notebook defines the **data schema and structural assumptions** used throughout the study, without exposing any raw or sensitive text data.

### Rationale
- The original datasets (AI Hub Wellness, Everytime) contain sensitive
  mental-health-related text and cannot be redistributed.
- To ensure reproducibility, we provide **synthetic placeholder data**
  that preserves:
  - column names
  - data types
  - multi-label structure
  - class imbalance characteristics

This notebook corresponds to the *Dataset Description* and *Preprocessing Overview* sections of the paper.
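
The preserved properties listed under *Rationale* can be checked mechanically. The sketch below is one way to do so, assuming the column layout defined later in this notebook (a `text` column followed by binary labels `A1` to `A9`). The helper name `validate_schema` and the small hand-made frame are hypothetical and not part of the original pipeline.

```python
import numpy as np
import pandas as pd

NUM_LABELS = 9
TEXT_COLUMN = "text"
LABEL_COLUMNS = [f"A{i}" for i in range(1, NUM_LABELS + 1)]


def validate_schema(frame):
    """Return a list of schema problems; an empty list means the frame conforms.

    Hypothetical helper, for illustration only.
    """
    problems = []
    expected = [TEXT_COLUMN] + LABEL_COLUMNS
    if list(frame.columns) != expected:
        problems.append(f"columns differ from {expected}")
        return problems  # the remaining checks assume the expected columns exist
    if not frame[TEXT_COLUMN].map(lambda v: isinstance(v, str)).all():
        problems.append("text column contains non-string values")
    for col in LABEL_COLUMNS:
        if not pd.api.types.is_integer_dtype(frame[col]):
            problems.append(f"{col} is not an integer column")
        elif not frame[col].isin([0, 1]).all():
            problems.append(f"{col} contains values other than 0 and 1")
    return problems


# Tiny hand-made frame: 3 rows, all labels 0 except A1 in the first row.
example = pd.DataFrame({TEXT_COLUMN: ["SYNTHETIC_TEXT"] * 3})
for col in LABEL_COLUMNS:
    example[col] = np.zeros(3, dtype=int)
example.loc[0, "A1"] = 1

print(validate_schema(example))
```

Returning a list of messages rather than raising on the first failure makes it easier to see every deviation at once when adapting the placeholder to a new dataset.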

In [None]:
# ==================================================
# 1. Imports and Global Configuration
# ==================================================
import numpy as np
import pandas as pd

# Global configuration mirrored from Notebook 00
# (redefined here rather than imported)
NUM_LABELS = 9
TEXT_COLUMN = 'text'
LABEL_COLUMNS = [f'A{i}' for i in range(1, NUM_LABELS + 1)]


In [None]:
# ==================================================
# 2. Dataset Schema Definition
# ==================================================
# Each row corresponds to a single utterance or post.
# Multi-label annotation follows the nine Criterion A
# symptoms of DSM-5 major depressive disorder (A1–A9).

SCHEMA_DESCRIPTION = {
    'text': 'Anonymized user utterance (synthetic placeholder)',
    'A1': 'Depressed mood',
    'A2': 'Loss of interest or pleasure',
    'A3': 'Weight/appetite change',
    'A4': 'Sleep disturbance',
    'A5': 'Psychomotor agitation or retardation',
    'A6': 'Fatigue or loss of energy',
    'A7': 'Feelings of worthlessness or guilt',
    'A8': 'Diminished ability to think or concentrate',
    'A9': 'Recurrent thoughts of death or suicidal ideation'
}

SCHEMA_DESCRIPTION
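
`SCHEMA_DESCRIPTION` spells out its keys by hand, while `LABEL_COLUMNS` is generated from `NUM_LABELS`, so the two can drift apart if one is edited without the other. A minimal consistency check is sketched below. It is not part of the original notebook, the function name `schema_matches_config` is hypothetical, and the stand-in schema uses abbreviated descriptions.

```python
NUM_LABELS = 9
TEXT_COLUMN = "text"
LABEL_COLUMNS = [f"A{i}" for i in range(1, NUM_LABELS + 1)]


def schema_matches_config(schema, text_column, label_columns):
    """True if the schema keys are exactly the text column followed by the labels, in order."""
    return list(schema) == [text_column, *label_columns]


# Stand-in with the same keys as SCHEMA_DESCRIPTION; descriptions are abbreviated.
schema_stub = {"text": "utterance"}
schema_stub.update({f"A{i}": f"criterion A{i}" for i in range(1, 10)})

print(schema_matches_config(schema_stub, TEXT_COLUMN, LABEL_COLUMNS))
```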


In [None]:
# ==================================================
# 3. Synthetic Placeholder Data Generation
# ==================================================
# Note:
# - The values below do NOT correspond to real users.
# - Label probabilities are chosen to mimic realistic
#   class imbalance observed in mental health datasets.

np.random.seed(42)
NUM_SAMPLES = 2000

data = {
    TEXT_COLUMN: ['SYNTHETIC_TEXT'] * NUM_SAMPLES,
}

# Approximate label prevalence (example values)
label_prevalence = {
    'A1': 0.35,
    'A2': 0.30,
    'A3': 0.12,
    'A4': 0.20,
    'A5': 0.10,
    'A6': 0.25,
    'A7': 0.18,
    'A8': 0.15,
    'A9': 0.07,
}

for label, prob in label_prevalence.items():
    data[label] = np.random.binomial(1, prob, NUM_SAMPLES)

df = pd.DataFrame(data)
df.head()
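
Each label above is drawn independently of the others, so the placeholder reproduces per-label prevalence but not any co-occurrence between symptoms. Under that independence, two summary quantities follow directly from `label_prevalence`: the expected number of positive labels per row (the sum of the probabilities) and the probability that a row has no positive label (the product of their complements). The sketch below computes both; the variable names are illustrative.

```python
import numpy as np

label_prevalence = {
    "A1": 0.35, "A2": 0.30, "A3": 0.12,
    "A4": 0.20, "A5": 0.10, "A6": 0.25,
    "A7": 0.18, "A8": 0.15, "A9": 0.07,
}
p = np.array(list(label_prevalence.values()))

# Expected label cardinality: E[sum of labels] = sum of per-label probabilities.
expected_cardinality = p.sum()

# Probability that all nine independent labels are 0 for a given row.
prob_no_label = np.prod(1.0 - p)

print(f"Expected positive labels per row: {expected_cardinality:.2f}")
print(f"Probability of an all-zero row:   {prob_no_label:.3f}")
```

With the example prevalences this gives about 1.72 positive labels per row, and roughly 14% of rows carry no label at all.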


In [None]:
# ==================================================
# 4. Basic Dataset Integrity Checks
# ==================================================
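# Verify the expected shape: one text column plus
# NUM_LABELS label columns, with NUM_SAMPLES rows
assert df.shape == (NUM_SAMPLES, 1 + NUM_LABELS), df.shape
df.shape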


In [None]:
# --------------------------------------------------
# Label distribution (for verification only)
# --------------------------------------------------
label_distribution = df[LABEL_COLUMNS].mean()
label_distribution
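
Reading the means by eye works for a quick look; a tolerance check makes the verification explicit. The sketch below regenerates a placeholder frame and compares each empirical prevalence with its target, allowing a deviation of five binomial standard errors, `sqrt(p * (1 - p) / n)`. The five-standard-error threshold and the helper name `prevalence_within_tolerance` are assumptions chosen for illustration. The sketch also uses NumPy's `default_rng` generator, so its draws differ from those produced with `np.random.seed(42)`.

```python
import numpy as np
import pandas as pd

label_prevalence = {
    "A1": 0.35, "A2": 0.30, "A3": 0.12,
    "A4": 0.20, "A5": 0.10, "A6": 0.25,
    "A7": 0.18, "A8": 0.15, "A9": 0.07,
}
NUM_SAMPLES = 2000

# Independent Bernoulli draws per label, as in the generation cell.
rng = np.random.default_rng(42)
df_check = pd.DataFrame(
    {label: rng.binomial(1, prob, NUM_SAMPLES) for label, prob in label_prevalence.items()}
)


def prevalence_within_tolerance(frame, targets, n_se=5.0):
    """Boolean Series: True where |empirical - target| <= n_se standard errors."""
    target = pd.Series(targets)
    empirical = frame[target.index].mean()
    standard_error = np.sqrt(target * (1.0 - target) / len(frame))
    return (empirical - target).abs() <= n_se * standard_error


result = prevalence_within_tolerance(df_check, label_prevalence)
print(result.all())
```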


In [None]:
# ==================================================
# 5. Notes for Downstream Processing
# ==================================================
# - This DataFrame serves as a drop-in replacement
#   for the real dataset in public notebooks.
# - All preprocessing, labeling, modeling, and
#   evaluation logic operates on this structure.
# - No downstream notebook assumes access to
#   original raw text.
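
As one illustration of how a downstream notebook might consume this structure, the sketch below splits a placeholder frame into a list of texts and a multi-hot label matrix. The function name `to_model_inputs` and the `float32` target dtype are assumptions, not taken from the original pipeline.

```python
import numpy as np
import pandas as pd

NUM_LABELS = 9
TEXT_COLUMN = "text"
LABEL_COLUMNS = [f"A{i}" for i in range(1, NUM_LABELS + 1)]


def to_model_inputs(frame):
    """Split a schema-conforming frame into (texts, multi-hot label matrix)."""
    texts = frame[TEXT_COLUMN].tolist()
    labels = frame[LABEL_COLUMNS].to_numpy(dtype=np.float32)
    return texts, labels


# Small placeholder frame with the same columns as the notebook's DataFrame.
rng = np.random.default_rng(0)
placeholder = pd.DataFrame({TEXT_COLUMN: ["SYNTHETIC_TEXT"] * 4})
for col in LABEL_COLUMNS:
    placeholder[col] = rng.binomial(1, 0.2, 4)

texts, labels = to_model_inputs(placeholder)
print(len(texts), labels.shape, labels.dtype)
```

Because the function depends only on `TEXT_COLUMN` and `LABEL_COLUMNS`, the same call would apply unchanged to a real dataset that follows the schema.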
