# Data Pre (Lean) — Handover Notes

This notebook explains what the `Data Pre.py` script does and what it produces. It covers cleaning, the output artifacts, the stratified 70/15/15 split with train-only K-fold labels, fairness guardrails, and SMOTE preparation.


## 1) What the script does (high-level)
- Cleaning & normalization (missing tokens unified; key numerics coerced; text normalized; ordinals/binaries derived; ranges clipped; duplicates dropped).
- Two cleaned tables: `clean_full.csv` (analysis) and `clean_numeric.csv` (modeling) + `label_mappings.json`.
- Privacy-safe output: RR3-rounded public table + k-anonymity note.
- Stratified 70/15/15 split + **train-only** K-fold labels (default 5).
- Fairness guardrails: `clean_numeric_model.csv`, group representation & missingness gaps in summary, optional `sample_weights.csv`.
- SMOTE prep (diagnostics only): `smote_config.json`, `imbalance_report.txt`.
- Run log: `cleaning_summary.txt`.
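
The cleaning steps in the first bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the missing-token list, the helper name `basic_clean`, and the column names in the usage note are all hypothetical, and the ordinal/binary derivation step is left out.

```python
import pandas as pd

# Assumed set of tokens treated as missing; the real script may use a different list.
MISSING_TOKENS = ["", "na", "n/a", "none", "null", "?", "unknown"]


def basic_clean(df, numeric_cols, clip_ranges):
    """Unify missing tokens, coerce numerics, normalize text, clip ranges, dedup."""
    out = df.copy()
    text_cols = [c for c in out.columns if c not in numeric_cols]
    for col in text_cols:
        # Normalize text (trim, lowercase), then replace missing tokens with NA.
        s = out[col].astype("string").str.strip().str.lower()
        out[col] = s.mask(s.isin(MISSING_TOKENS))
    for col in numeric_cols:
        # Unparseable values become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    for col, (lo, hi) in clip_ranges.items():
        # Clip to the allowed range; NaN stays NaN.
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out.drop_duplicates().reset_index(drop=True)
```

For example, `basic_clean(raw, numeric_cols=["Age"], clip_ranges={"Age": (0, 100)})` would coerce a hypothetical `Age` column, clip it to 0..100, and drop rows that become identical after normalization.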


## 2) Artifacts and how to use them
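- `clean_full.csv`: for EDA and reporting; keeps text columns and missing values (NA).
- `clean_numeric.csv`: for modeling; numeric columns plus the derived `*_ord`, `*_bin`, and `*_lbl` columns, with simple imputation applied.
- `label_mappings.json`: codebooks for the `*_lbl` columns.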
- `reports/tables_public/mh_by_gender_rr3.csv`: public share (RR3).
- `splits_70_15_15_k5.csv`: columns are `row_id`, `split`, and `cv_fold` (set for train rows only; `-1` for val/test).
- `clean_numeric_model.csv` & `sample_weights.csv`: fairness-friendly options.
- `smote_config.json` & `imbalance_report.txt`: for SMOTENC later (training folds only).
- `cleaning_summary.txt`: one-page summary of all above.
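
For the public table, "RR3" is read here as random rounding to base 3. That reading is an assumption about the script, and the helper name `rr3` and its `seed` parameter are hypothetical. Under that assumption, a minimal sketch of the rounding applied to a vector of counts might look like this:

```python
import numpy as np


def rr3(counts, seed=0):
    """Randomly round non-negative integer counts to a multiple of 3 (unbiased)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    remainder = counts % 3
    lower = counts - remainder
    # Remainder 1 rounds up with probability 1/3, remainder 2 with probability 2/3.
    # Exact multiples of 3 (remainder 0) are never changed.
    round_up = rng.random(counts.shape) < remainder / 3.0
    return lower + 3 * round_up
```

Every published cell is then a multiple of 3 and differs from the true count by at most 2, while the expected value of each cell equals the true count.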


## 3) Splits & K-fold labels
- 70/15/15 split stratified by `Mental_Health_Condition` (Yes/No → 1/0).
- `cv_fold` is assigned **only** to training rows (0..K-1); val/test rows get `-1`.
- Run cross-validation inside train, use val for early stopping (if needed), and evaluate once on test.
- Any learnable step (scaler/encoder/**SMOTENC**) must be fit on training folds only.
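
A split table with this shape could be produced along the following lines. This is a sketch using scikit-learn, which these notes do not name. The function name `make_splits`, the seed, the split labels `train`/`val`/`test`, and the use of the positional index as `row_id` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split


def make_splits(y, k=5, seed=42):
    """Return row_id / split / cv_fold for a stratified 70/15/15 split."""
    y = np.asarray(y)
    row_id = np.arange(len(y))
    # Step 1: 70% train, 30% held out, stratified on the binary target.
    train_ids, rest_ids = train_test_split(
        row_id, test_size=0.30, stratify=y, random_state=seed
    )
    # Step 2: split the held-out 30% in half, giving 15% val and 15% test.
    val_ids, test_ids = train_test_split(
        rest_ids, test_size=0.50, stratify=y[rest_ids], random_state=seed
    )
    out = pd.DataFrame({"row_id": row_id, "split": "train", "cv_fold": -1})
    out.loc[val_ids, "split"] = "val"
    out.loc[test_ids, "split"] = "test"
    # Step 3: fold labels 0..k-1 on training rows only; val/test keep -1.
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for fold, (_, held_idx) in enumerate(skf.split(train_ids, y[train_ids])):
        out.loc[train_ids[held_idx], "cv_fold"] = fold
    return out
```

Storing the fold label per row, rather than re-splitting at training time, keeps every model on identical folds and makes it easy to confirm that val/test rows never enter cross-validation.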


## 4) Fairness guardrails (non-intrusive)
- `clean_numeric_model.csv`: drops directly sensitive columns and optionally adds `*_isna` missingness flags.
- The summary logs per-group representation and positive rate, plus missingness gaps between groups; it warns when a group is small in val/test.
- `sample_weights.csv`: `w_label`, `w_group`, `w_combo` (optional, not auto-applied).
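
These notes do not say how the three weight columns are computed. One common choice is inverse-frequency ("balanced") weighting, sketched below as an assumption rather than as the script's actual formula. The helper names `balanced_weights` and `make_sample_weights` are hypothetical.

```python
import pandas as pd


def balanced_weights(keys):
    """Inverse-frequency weights n / (n_levels * count); they average to 1."""
    keys = pd.Series(keys).reset_index(drop=True)
    counts = keys.map(keys.value_counts())
    return len(keys) / (keys.nunique() * counts)


def make_sample_weights(label, group):
    """Build w_label, w_group and w_combo (label x group) weight columns."""
    label = pd.Series(label).reset_index(drop=True)
    group = pd.Series(group).reset_index(drop=True)
    # Assumed definition: w_combo weights each (label, group) cell as one level.
    combo = label.astype(str) + "|" + group.astype(str)
    return pd.DataFrame({
        "w_label": balanced_weights(label),
        "w_group": balanced_weights(group),
        "w_combo": balanced_weights(combo),
    })
```

Because the weights are not auto-applied, a model has to opt in, for example by passing one column as `sample_weight` to an estimator's `fit` method where that argument is supported.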


## 5) SMOTE preparation
- No oversampling happens during cleaning, to avoid leakage; the script only writes diagnostics and config.
- `smote_config.json`: `categorical_indices` (all `*_lbl` and `*_bin` columns), `numeric_indices`, and recommended knobs (`sampler`, `sampling_strategy`, `k_neighbors`, `random_state`).
- `imbalance_report.txt`: train class counts and per-fold counts.
- In modeling, apply **SMOTENC only on training folds**; never on val/test.
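
One way the config could be consumed at modeling time is sketched below. It assumes that `smote_config.json` is a flat JSON object with the keys listed above, that imbalanced-learn is installed, and that `X`, `y`, and `cv_fold` are row-aligned arrays. `smotenc_kwargs` and `oversample_fit_folds` are hypothetical helper names.

```python
import numpy as np


def smotenc_kwargs(config):
    """Map the (assumed flat) smote_config.json keys to SMOTENC arguments."""
    return {
        "categorical_features": list(config["categorical_indices"]),
        "sampling_strategy": config.get("sampling_strategy", "auto"),
        "k_neighbors": int(config.get("k_neighbors", 5)),
        "random_state": config.get("random_state"),
    }


def oversample_fit_folds(X, y, cv_fold, held_out_fold, config):
    """Oversample the K-1 fitting folds; return the held-out fold untouched."""
    # Imported lazily so the config helper works without imbalanced-learn.
    from imblearn.over_sampling import SMOTENC

    X, y, cv_fold = np.asarray(X), np.asarray(y), np.asarray(cv_fold)
    # cv_fold >= 0 excludes val/test rows, which carry -1.
    fit_mask = (cv_fold >= 0) & (cv_fold != held_out_fold)
    held_mask = cv_fold == held_out_fold
    sampler = SMOTENC(**smotenc_kwargs(config))
    X_fit, y_fit = sampler.fit_resample(X[fit_mask], y[fit_mask])
    return X_fit, y_fit, X[held_mask], y[held_mask]
```

The config would typically be loaded once with `json.load`, and the function called once per fold. Synthetic rows then exist only in the fitting portion of each fold, so the held-out fold, val, and test keep their original class balance.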


## 6) Quick checklist
- Files present: cleaned tables, labels, summary, splits; fairness add-ons; SMOTE prep; public table.
- Derived columns exist; ranges clipped; splits approx 70/15/15; `cv_fold` only on train.
- Summary includes fairness notes + SMOTE prep line.
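
Parts of this checklist can be automated. The sketch below assumes that all artifacts sit under a single output directory with the relative paths used in these notes, and that the `split` column uses the labels `train`, `val`, and `test`. The function names and the 2-point tolerance are hypothetical. The optional `sample_weights.csv` is deliberately not required.

```python
from pathlib import Path

import pandas as pd

EXPECTED_FILES = [
    "clean_full.csv",
    "clean_numeric.csv",
    "label_mappings.json",
    "cleaning_summary.txt",
    "splits_70_15_15_k5.csv",
    "clean_numeric_model.csv",
    "smote_config.json",
    "imbalance_report.txt",
    "reports/tables_public/mh_by_gender_rr3.csv",
]


def missing_artifacts(out_dir):
    """Return the expected files that are absent under out_dir."""
    return [name for name in EXPECTED_FILES if not (Path(out_dir) / name).exists()]


def check_splits(splits, tol=0.02):
    """Return a list of problems found in the split table (empty if none)."""
    problems = []
    share = splits["split"].value_counts(normalize=True)
    for name, target in {"train": 0.70, "val": 0.15, "test": 0.15}.items():
        if abs(share.get(name, 0.0) - target) > tol:
            problems.append(f"{name} share is {share.get(name, 0.0):.3f}, expected ~{target}")
    is_train = splits["split"] == "train"
    if (splits.loc[~is_train, "cv_fold"] != -1).any():
        problems.append("cv_fold is set on non-train rows")
    if (splits.loc[is_train, "cv_fold"] < 0).any():
        problems.append("some train rows have no cv_fold")
    return problems
```

An empty result from both functions covers the file-presence and split items; the derived-column and summary checks still need a manual look.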
