
# 06. Evaluation and Metrics for Multi-label Classification (Sanitized)

This notebook documents the evaluation protocol used to assess multi-label
DSM-5 depression detection models. All examples use **synthetic placeholders**
to preserve privacy while maintaining full methodological transparency.



## 1. Why Accuracy Is Not Used

In multi-label, imbalanced mental health datasets:
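- Per-label accuracy is dominated by true negatives, so a model that predicts no symptoms at all can still score highly
- Most labels are sparse
- Subset (exact-match) accuracy requires every label of a sample to be correct, which is rare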

Therefore, we adopt **F1-based metrics**.
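
The points above can be illustrated with a minimal sketch. The data are hypothetical (100 samples, 9 labels, each positive about 5% of the time) and the "model" is a trivial baseline that never predicts a symptom; none of these values come from the original experiments.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical sparse ground truth: each label is positive ~5% of the time.
rng = np.random.default_rng(0)
y_true_sparse = (rng.random((100, 9)) < 0.05).astype(int)

# Trivial baseline that never predicts any symptom.
y_pred_zeros = np.zeros_like(y_true_sparse)

# Element-wise accuracy: fraction of (sample, label) cells predicted correctly.
elementwise_acc = (y_true_sparse == y_pred_zeros).mean()

# Subset accuracy: a sample counts only if all 9 labels match.
# accuracy_score computes this for multi-label indicator arrays.
subset_acc = accuracy_score(y_true_sparse, y_pred_zeros)

# Micro F1 of the same baseline; zero_division=0 silences the
# undefined-precision warning and returns 0.
micro_f1_zeros = f1_score(
    y_true_sparse, y_pred_zeros, average='micro', zero_division=0
)

print(f"element-wise accuracy: {elementwise_acc:.3f}")
print(f"subset accuracy:       {subset_acc:.3f}")
print(f"micro F1:              {micro_f1_zeros:.3f}")
```

The baseline recovers no positive label, so its F1 is zero even though its element-wise accuracy stays high.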



## 2. Metric Definitions

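- **Micro F1**: F1 computed from true positives, false positives, and false negatives pooled across all labels
- **Macro F1**: unweighted mean of the per-label F1 scores
- **Weighted F1**: mean of the per-label F1 scores, weighted by each label's support (its number of positive ground-truth instances)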
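
As a sketch of how these three averages relate, the following recomputes them by hand from per-label counts and compares the results with `f1_score`. The toy arrays and the helper `f1_from_counts` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy data: 6 samples, 3 labels (label 0 frequent, labels 1-2 rare).
y_true_toy = np.array([[1, 0, 0],
                       [1, 0, 1],
                       [1, 1, 0],
                       [1, 0, 0],
                       [0, 0, 0],
                       [1, 0, 0]])
y_pred_toy = np.array([[1, 0, 0],
                       [1, 0, 0],
                       [1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])

# Per-label confusion counts (one entry per label).
tp = ((y_true_toy == 1) & (y_pred_toy == 1)).sum(axis=0)
fp = ((y_true_toy == 0) & (y_pred_toy == 1)).sum(axis=0)
fn = ((y_true_toy == 1) & (y_pred_toy == 0)).sum(axis=0)
support = y_true_toy.sum(axis=0)  # positives per label in the ground truth


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)


per_label = f1_from_counts(tp, fp, fn)

micro_manual = float(f1_from_counts(tp.sum(), fp.sum(), fn.sum()))  # pooled counts
macro_manual = float(per_label.mean())                              # plain mean
weighted_manual = float(np.average(per_label, weights=support))     # support-weighted

for name, manual in [('micro', micro_manual),
                     ('macro', macro_manual),
                     ('weighted', weighted_manual)]:
    reference = f1_score(y_true_toy, y_pred_toy, average=name, zero_division=0)
    print(f"{name:>8}: manual={manual:.4f}  sklearn={reference:.4f}")
```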


In [None]:

import numpy as np
from sklearn.metrics import f1_score



## 3. Synthetic Ground Truth and Predictions


In [None]:

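NUM_SAMPLES = 100
NUM_LABELS = 9

# Fix the seed so the synthetic example gives the same numbers on every run.
# The seed value is arbitrary.
np.random.seed(42)

# Labels and predictions are drawn independently and uniformly at random,
# so the scores below only exercise the code path and carry no meaning.
y_true = np.random.randint(0, 2, size=(NUM_SAMPLES, NUM_LABELS))
y_pred = np.random.randint(0, 2, size=(NUM_SAMPLES, NUM_LABELS))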
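
Uniform random labels are about 50% positive, which does not reflect the label sparsity described in Section 1. A possible alternative, sketched here under stated assumptions, draws each label with its own positive rate and simulates predictions by flipping a fraction of the true labels. The `prevalence` values and the 10% flip rate are hypothetical and are not taken from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

NUM_SAMPLES = 100
NUM_LABELS = 9

# Hypothetical per-label positive rates, from 5% to 40%.
prevalence = np.linspace(0.05, 0.40, NUM_LABELS)

# Each column j is positive with probability prevalence[j] (broadcast over rows).
y_true_sparse = (rng.random((NUM_SAMPLES, NUM_LABELS)) < prevalence).astype(int)

# Simulated predictions: copy the truth, then flip ~10% of the cells.
flip = rng.random(y_true_sparse.shape) < 0.10
y_pred_sparse = np.where(flip, 1 - y_true_sparse, y_true_sparse)

print("positive rate per label:", y_true_sparse.mean(axis=0).round(2))
```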



## 4. F1 Score Computation


In [None]:

micro_f1 = f1_score(y_true, y_pred, average='micro')
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')

micro_f1, macro_f1, weighted_f1



## 5. Per-label Performance

Per-label F1 scores allow diagnosis of which DSM-5 criteria
are more difficult to detect.


In [None]:

per_label_f1 = [
    f1_score(y_true[:, i], y_pred[:, i])
    for i in range(NUM_LABELS)
]

per_label_f1
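
The same per-label scores can also be obtained in one call with `average=None`, which returns one F1 value per label. The sketch below pairs them with names and sorts them so the hardest labels appear first. The names `criterion_1` to `criterion_9` are placeholders, not the actual DSM-5 criterion names, and the data are synthetic.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic stand-in data, same shape as in the notebook.
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=(100, 9))
y_pred_demo = rng.integers(0, 2, size=(100, 9))

# Placeholder label names (hypothetical).
label_names = [f"criterion_{i + 1}" for i in range(y_true_demo.shape[1])]

# average=None returns an array with one F1 score per label (column).
scores = f1_score(y_true_demo, y_pred_demo, average=None, zero_division=0)

# Sort ascending so the most difficult labels are listed first.
ranked = sorted(zip(label_names, scores), key=lambda pair: pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```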



## 6. Interpretation Notes

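- Micro F1 pools counts across labels, so it is dominated by frequent symptoms
- Macro F1 gives every label equal weight, so it highlights rare but clinically important symptoms
- Weighted F1 sits between the two: it averages per-label F1 like macro, but weights each label by its support, so frequent symptoms still count for more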
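
A small constructed case shows how the three averages can diverge. It assumes two hypothetical labels: a frequent one (90 of 100 samples) that is predicted perfectly, and a rare one (10 of 100 samples) that is never predicted.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical ground truth: label 0 is frequent, label 1 is rare.
y_true_imb = np.zeros((100, 2), dtype=int)
y_true_imb[:90, 0] = 1
y_true_imb[:10, 1] = 1

# Predictions: label 0 is perfect, label 1 is never predicted.
y_pred_imb = y_true_imb.copy()
y_pred_imb[:, 1] = 0

micro_imb = f1_score(y_true_imb, y_pred_imb, average='micro', zero_division=0)
macro_imb = f1_score(y_true_imb, y_pred_imb, average='macro', zero_division=0)
weighted_imb = f1_score(y_true_imb, y_pred_imb, average='weighted', zero_division=0)

print(f"micro={micro_imb:.3f}  macro={macro_imb:.3f}  weighted={weighted_imb:.3f}")
```

Only the macro average drops sharply here, because the completely missed rare label counts as much as the frequent one.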



## 7. Reporting Strategy

In the paper:
- Weighted F1 is reported as the primary metric
- Micro and Macro F1 are included for completeness
- Results are averaged across runs
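
Averaging across runs could look like the sketch below. It is not the original experiment code: `evaluate_run` is a hypothetical stand-in that generates random labels instead of training a model, and the seed list is an assumption. The sample standard deviation is included as one possible way to show run-to-run variation.

```python
import numpy as np
from sklearn.metrics import f1_score

AVERAGES = ('micro', 'macro', 'weighted')


def evaluate_run(seed):
    """Hypothetical stand-in for one train/evaluate run; returns the three F1 scores."""
    rng = np.random.default_rng(seed)
    y_true_run = rng.integers(0, 2, size=(100, 9))
    y_pred_run = rng.integers(0, 2, size=(100, 9))
    return {
        avg: f1_score(y_true_run, y_pred_run, average=avg, zero_division=0)
        for avg in AVERAGES
    }


seeds = [0, 1, 2, 3, 4]  # hypothetical run seeds
runs = [evaluate_run(seed) for seed in seeds]

# Mean and sample standard deviation (ddof=1) of each metric across runs.
summary = {
    avg: (np.mean([run[avg] for run in runs]),
          np.std([run[avg] for run in runs], ddof=1))
    for avg in AVERAGES
}

for avg, (mean, std) in summary.items():
    print(f"{avg:>8} F1: {mean:.3f} +/- {std:.3f}")
```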



## 8. Ethics and Reproducibility

- No real patient or user data are included
- Metric computation exactly matches the original experiments
- Evaluation logic is fully reproducible
