# Research Data Cleaning & Privacy – Team Handout 

**Audience:** teammates and presenters  
**Goal:** explain *what* our preprocessing does, *why* we do it, and *how* to reproduce it.  
**Artifacts produced:** `clean_full.csv`, `clean_numeric.csv`, `label_mappings.json`, `cleaning_summary.txt` (+ privacy-safe tables in `reports/tables_public/`).


## 1) Folder Layout & How to Run
```
repo/
  data/                             # put the original CSV here
    processed/                      # script writes here
  reports/
    tables_public/                  # privacy-safe tables (RR3)
  Data Pre.py                       # relative-path script
```

**Run (default paths):**
```bash
python "Data Pre.py"
```
**Override paths (optional):**
```bash
python "Data Pre.py" --input data/raw/your.csv --out_dir data/processed --pub_dir reports/tables_public
```
**Strictness toggle:** set `STRICT=True` in the script to make the run fail on unmapped tokens (useful after a first pass).


## 2) Transformations – What, Why, How

### 2.1 Missingness normalization
**What:** Turn visual blanks and placeholders into true missing values.  
**Why:** Avoid treating blanks like real categories.  
**Rule:** Blank/whitespace → `NA`; case-insensitive placeholders `NA|N/A|null|none|-|--` → `NA`.
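
A minimal sketch of this rule in plain Python, assuming `None` stands in for `NA`. The function name `normalize_missing` is illustrative and not taken from the script:

```python
import re

# Placeholder tokens from the rule above, matched case-insensitively.
_PLACEHOLDERS = re.compile(r"(?:na|n/a|null|none|-|--)", re.IGNORECASE)

def normalize_missing(value):
    """Return None for blanks and placeholder tokens; otherwise return the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str):
        stripped = value.strip()
        if stripped == "" or _PLACEHOLDERS.fullmatch(stripped):
            return None
    return value
```

Only whole-cell matches count, so a real value such as `Nairobi` is left alone even though it starts with `na`.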

### 2.2 Token normalization (string hygiene)
**What:** Lowercase, trim, collapse non-alphanumerics to spaces; standardize common variants.  
**Why:** Maximizes vocabulary hit-rate and prevents silent mapping failures.
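
The steps above could be sketched as follows. The `VARIANTS` table is hypothetical (the script's real synonym list is not reproduced here), and the pattern assumes ASCII input:

```python
import re

# Hypothetical variant table; extend it with whatever spellings the raw data contains.
VARIANTS = {"nonsmoker": "non smoker", "nondrinker": "non drinker"}

def normalize_token(text):
    """Lowercase, turn each run of non-alphanumerics into one space, trim, then standardize variants."""
    token = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return VARIANTS.get(token, token)
```

Under this sketch a hyphenated vocabulary entry such as `non-smoker` normalizes to `non smoker`, so vocabulary keys need the same treatment before lookup.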

### 2.3 Binary encoding
**What:** Map yes/no-like fields to `1`/`0`; any token outside both sets becomes `NA`.  
**Why:** Consolidates many spellings (`yes,y,true,1,...`) into one signal.  
**Rule:** `{yes,y,true,t,1,ever,present,positive}→1`; `{no,n,false,f,0,never,absent,negative}→0`.
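
A small sketch of the rule, using the two token sets listed above and `None` for `NA`. The function name is an assumption for illustration:

```python
TRUE_TOKENS = {"yes", "y", "true", "t", "1", "ever", "present", "positive"}
FALSE_TOKENS = {"no", "n", "false", "f", "0", "never", "absent", "negative"}

def encode_binary(value):
    """Return 1 or 0 for recognized tokens and None for anything else."""
    if value is None:
        return None
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return 1
    if token in FALSE_TOKENS:
        return 0
    return None  # unknown spellings stay missing rather than being guessed
```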

### 2.4 Ordinal encoding
**What:** Map ordered categories to integers; also derive binary thresholds.  
**Why:** Preserve order information (more signal than one-hot for these).  
**Vocabularies:**
- `Smoking_Habit`: `{non-smoker:0, occasional smoker:1, regular smoker:2, heavy smoker:3}` → binary `>=2`
- `Alcohol_Consumption`: `{non-drinker:0, social drinker:1, regular drinker:2, heavy drinker:3}` → binary `>=2`
- `Diet_Quality`: `{unhealthy:0, average:1, healthy:2}`
- `Severity`: `{low:0, medium:1, high:2}`
- `Stress_Level`: `{low:0, medium:1, high:2}` (**treated as ordinal, not a 1–10 score**)
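
As an illustration, the `Smoking_Habit` vocabulary and its `>=2` threshold might be applied like this. This is a sketch: the helper names and the `strict` parameter (mirroring the `STRICT` toggle) are assumptions, and `None` stands in for `NA`:

```python
import re

SMOKING_HABIT = {"non-smoker": 0, "occasional smoker": 1, "regular smoker": 2, "heavy smoker": 3}

def _norm(text):
    # Same hygiene for data and vocabulary keys, so "Non smoker" matches "non-smoker".
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def encode_ordinal(value, vocab, strict=False):
    """Map a category to its integer code; unmapped tokens give None, or raise when strict."""
    if value is None:
        return None
    lookup = {_norm(key): code for key, code in vocab.items()}
    code = lookup.get(_norm(value))
    if code is None and strict:
        raise ValueError(f"unmapped token: {value!r}")
    return code

def at_least(code, threshold=2):
    """Derive the binary flag (code >= threshold), keeping missing values missing."""
    return None if code is None else int(code >= threshold)
```

For example, `at_least(encode_ordinal("Heavy Smoker", SMOKING_HABIT))` yields `1`, while an occasional smoker yields `0`.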

### 2.5 Continuous range clipping
**What:** Replace out-of-range numeric values with `NA` using the ranges in the data dictionary. Values are not clamped to the nearest bound.  
**Why:** Remove obvious errors/outliers to stabilize learning.  
**Ranges:** Age [10,100], Sleep [0,16], Work [0,100], Physical Activity [0,20], Social Media [0,24].
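
A sketch of the range check, assuming the bounds are inclusive and `None` represents `NA`. The dictionary keys reuse the short names above and may not match the real column headers:

```python
# Inclusive (low, high) bounds; keys are shorthand, not necessarily the real headers.
RANGES = {
    "Age": (10, 100),
    "Sleep": (0, 16),
    "Work": (0, 100),
    "Physical Activity": (0, 20),
    "Social Media": (0, 24),
}

def null_out_of_range(value, low, high):
    """Return the value as a float if it lies in [low, high]; otherwise None."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # non-numeric text cannot be range-checked
    return number if low <= number <= high else None
```

Because non-numeric text also comes back as `None`, running this on a textual ordinal column blanks the whole column.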

### 2.6 De-duplication
Drop duplicate rows. If an `ID` column is present, de-duplicate on `ID`; otherwise compare full rows.
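
One way to read this rule, sketched on a list of row dictionaries. Keeping the first occurrence and falling back to full-row comparison when `ID` is absent are assumptions:

```python
def drop_duplicates(rows, id_key="ID"):
    """Keep the first occurrence of each row, keyed on ID when every row has one."""
    use_id = bool(rows) and all(id_key in row for row in rows)
    seen = set()
    kept = []
    for row in rows:
        # Without an ID, the whole row (as sorted key/value pairs) is the identity.
        key = row[id_key] if use_id else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```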

### 2.7 Label encoding (nominal)
Encode nominal fields into integers and save mapping JSON:  
`Gender_norm`, `Occupation`, `Country`, `Relationship_Status_norm`, `Mental_Health_Condition` → `*_lbl` + `label_mappings.json`.
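
A rough sketch of how such a mapping could be built and serialized. Assigning codes in sorted order is an assumption made for reproducibility; the script's actual ordering is not documented here:

```python
import json

def fit_label_mapping(values):
    """Assign integer codes to the distinct non-missing categories, in sorted order."""
    categories = sorted({v for v in values if v is not None})
    return {category: code for code, category in enumerate(categories)}

def apply_label_mapping(values, mapping):
    """Replace each category by its code; missing or unseen values become None."""
    return [mapping.get(v) if v is not None else None for v in values]

occupation = ["nurse", "engineer", None, "nurse"]
mapping = fit_label_mapping(occupation)
occupation_lbl = apply_label_mapping(occupation, mapping)

# One entry per encoded column, in the shape label_mappings.json might take.
mappings_json = json.dumps({"Occupation": mapping}, indent=2)
```

Saving the mapping alongside the data is what lets the `*_lbl` integers be decoded later.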

### 2.8 Drop constant/low-variance columns
Remove columns with fewer than 2 distinct values, counting `NA` as a value.
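
Sketched on a column-oriented dictionary, under the reading that `NA` (here `None`) counts as one distinct value:

```python
def drop_constant_columns(table):
    """Split a {column: values} table into kept columns and the names of dropped ones."""
    kept, dropped = {}, []
    for name, values in table.items():
        if len(set(values)) < 2:  # None is counted like any other value
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped
```

Under this reading a column holding one real value plus some `NA` cells is kept, while an all-`NA` column is dropped.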

### 2.9 Simple imputation (numeric view only)
On `clean_numeric.csv`: integer/boolean → **mode**; other numeric → **median**.  
Keep NA in `clean_full.csv` for honest analysis.
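
A minimal sketch of the mode/median rule for a single column, with `None` as `NA`. Breaking mode ties by taking the smallest value is an assumption, chosen only to keep the result deterministic:

```python
from statistics import median, multimode

def impute(values):
    """Fill None with the mode for integer/boolean columns, else with the median."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from; leave the column as-is
    if all(isinstance(v, int) for v in observed):  # bool is a subclass of int
        fill = min(multimode(observed))
    else:
        fill = median(observed)
    return [fill if v is None else v for v in values]
```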


## 3) Privacy by Design

### 3.1 De-identification & generalization
- **Age banding:** add `AgeBand` (20-year buckets) for public views.  
- **Low-frequency merge:** rare `Occupation`/`Country` → `"Other"` for public release.
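
Both generalizations can be sketched in a few lines. The band edges (0-19, 20-39, ...) and the `min_count` threshold are assumptions; only the 20-year width and the `"Other"` label come from this handout:

```python
from collections import Counter

def age_band(age, width=20):
    """Return a label such as '20-39' for the bucket containing the age."""
    if age is None:
        return None
    low = int(age) // width * width
    return f"{low}-{low + width - 1}"

def merge_rare(values, min_count=5, other="Other"):
    """Replace categories seen fewer than min_count times with a catch-all label."""
    counts = Counter(v for v in values if v is not None)
    return [v if v is None or counts[v] >= min_count else other for v in values]
```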

### 3.2 k-Anonymity checks
Group by `{Gender_norm × Country_pub × Occupation_pub × AgeBand}`; flag groups with count `< k` (default `k=5`).  
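If groups are flagged, widen the age bands or raise the low-frequency threshold (so more rare values merge into `"Other"`), then re-run the check until no group falls below `k`.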
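
The check itself amounts to counting rows per quasi-identifier combination. A sketch on a list of row dictionaries, with an illustrative function name:

```python
from collections import Counter

QUASI_IDENTIFIERS = ("Gender_norm", "Country_pub", "Occupation_pub", "AgeBand")

def k_anonymity_violations(rows, keys=QUASI_IDENTIFIERS, k=5):
    """Return {group: count} for every quasi-identifier combination with fewer than k rows."""
    counts = Counter(tuple(row.get(key) for key in keys) for row in rows)
    return {group: n for group, n in counts.items() if n < k}
```

An empty result means every group has at least `k` members for the chosen columns.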

### 3.3 RR3 randomized rounding (public tables)
First suppress cells `<3`, then stochastically round remaining counts to multiples of 3 (RR3).  
We publish examples to `reports/tables_public/`.
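
A sketch of suppression followed by unbiased random rounding to base 3. It assumes the common scheme in which a count with remainder `r` rounds up with probability `r/3`; the script's exact probabilities are not stated here, and the `SEED` value is a placeholder:

```python
import random

SEED = 42  # placeholder value; use whatever SEED the script defines

def rr3(count, rng):
    """Suppress counts below 3 (None); otherwise round randomly to a multiple of 3."""
    if count < 3:
        return None
    remainder = count % 3
    if remainder == 0:
        return count  # already a multiple of 3
    lower = count - remainder
    # Round up with probability remainder/3, so the expected value equals the count.
    return lower + 3 if rng.random() < remainder / 3 else lower

def round_table(counts, seed=SEED):
    """Apply RR3 to a list of counts with a seeded RNG for repeatable output."""
    rng = random.Random(seed)
    return [rr3(count, rng) for count in counts]
```

With a fixed seed, repeated calls on the same counts give identical tables, which is what makes demos reproducible.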


## 4) Quality Assurance & Reproducibility
- **STRICT mode:** after a trial run, set `STRICT=True` to surface unmapped tokens; add synonyms or fix data.  
- **Summary file:** `cleaning_summary.txt` logs rows/cols, dropped constants, NA rates for key fields, k-anonymity status.  
- **Determinism:** RR3 uses a fixed RNG seed (set `SEED`) for reproducible demos.  
- **No leakage:** encoders/transformers are defined independently of any train/test split; when modeling, fit them on the **training** split only and apply the fitted versions to the test split.
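
To make the no-leakage point concrete, here is a tiny sketch using median imputation. The statistic is learned from the training split only and then reused on the test split; the function names and sample values are illustrative:

```python
from statistics import median

def fit_median(train_values):
    """Learn the fill value from the training split only."""
    return median(v for v in train_values if v is not None)

def apply_fill(values, fill):
    """Apply a previously fitted fill value to any split."""
    return [fill if v is None else v for v in values]

train = [6.0, None, 8.0]
test = [None, 20.0]

fill = fit_median(train)             # the test value 20.0 never influences this
train_filled = apply_fill(train, fill)
test_filled = apply_fill(test, fill)
```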


## 5) Common Pitfalls (and quick fixes)
- **Columns not found** (`Smoking Habit` vs `Smoking_Habit`): the script resolves case/spaces, but always verify raw headers.  
- **Everything becomes NA after clipping:** you may be applying numeric ranges to textual ordinal fields (e.g., `Stress_Level`). Treat them as ordinal instead.  
- **Unknown category explosion:** add synonyms to the mapping extension, or switch to `STRICT=True` to get the exact tokens to fix.
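
For the first pitfall, header resolution that ignores case, spaces, and underscores could look like the sketch below. This is a hypothetical helper, not the script's actual logic:

```python
import re

def _canon(name):
    # Compare headers on lowercase letters and digits only.
    return re.sub(r"[^a-z0-9]", "", name.lower())

def resolve_column(wanted, headers):
    """Return the single header matching `wanted` after canonicalization, or raise KeyError."""
    matches = [h for h in headers if _canon(h) == _canon(wanted)]
    if len(matches) != 1:
        raise KeyError(f"{wanted!r} matched {len(matches)} headers: {matches}")
    return matches[0]
```

Raising on zero or multiple matches makes a misnamed column fail loudly instead of being skipped silently.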
