# Phase-2 Notebook Reference Cards

---

## `02_make_dataset.ipynb` — *SBP-Only Mini Pipeline*
**Purpose:**  
Create a small, beginner-friendly preprocessing pipeline that cleans only the systolic blood pressure (SBP) column.

**Inputs:**  
- `../data/raw/heart_attack_china.csv`

**Steps:**  
1. Rename `Blood_Pressure` → `SBP`  
2. Convert `SBP` to numeric  
3. Add `SBP_missing` flag  
4. Add `SBP_hypertensive` flag (1 if `SBP ≥ 140`, else 0)
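
The four steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_sbp`, the constant `HYPERTENSION_SBP`, the 0/1 encoding of `SBP_missing`, and the demo values are all assumptions made for the example.

```python
import pandas as pd

HYPERTENSION_SBP = 140  # threshold from step 4; change here if the cutoff changes


def clean_sbp(df: pd.DataFrame, source_col: str = "Blood_Pressure") -> pd.DataFrame:
    """Rename the blood-pressure column to SBP and add the two flag columns."""
    out = df.rename(columns={source_col: "SBP"})
    # Entries that cannot be parsed as numbers become NaN.
    out["SBP"] = pd.to_numeric(out["SBP"], errors="coerce")
    # 1 where the reading is missing or unparseable, else 0 (encoding is an assumption).
    out["SBP_missing"] = out["SBP"].isna().astype(int)
    # NaN >= 140 evaluates to False, so missing readings get 0 here.
    out["SBP_hypertensive"] = (out["SBP"] >= HYPERTENSION_SBP).astype(int)
    return out


# Made-up rows for illustration; the notebook reads the raw CSV instead.
demo = pd.DataFrame({"Blood_Pressure": ["120", "140", "n/a", 155]})
cleaned = clean_sbp(demo)
print(cleaned)
```

Note that in this sketch a missing reading is flagged as not hypertensive, so `SBP_missing` should be checked alongside `SBP_hypertensive` when interpreting zeros.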

**Outputs:**  
- `../data/processed/heart_attack_china_clean.csv`  
- `../data/processed/heart_attack_china_final.csv`

**Edit if:**  
- Column name differs → adjust `"Blood_Pressure"` rename  
- Hypertension threshold changes → update `140`

---

## `phase2_all_in_notebook.ipynb` — *Full Phase 2 Pipeline (Updated 12/5)*

**Purpose:**  
Produce all Phase-2 datasets used in the project:  
- Analysis-ready  
- Model-ready  
- WHO-enhanced  
- Air-quality-enhanced  

This pipeline now also merges OpenAQ PM2.5 data.

**Inputs:**  
- `../data/raw/heart_attack_china.csv`  
- WHO dataset: `../data/raw/who_health_china.csv`  
- Province → city mapping (the `cities` dictionary defined in the notebook)  
- OpenAQ API (PM2.5)  
- OSM (Nominatim) for geocoding (city coordinates)

**Steps:**  
1. Load and clean the raw heart-attack dataset  
2. Standardize string fields, normalize categories  
3. Derive features:
   - `Gender_simple`  
   - Yes/No → boolean  
   - Hypertension / Diabetes / Obesity flags  
   - Risk-factor count  
4. Clean WHO dataset:
   - Keep latest-year values per indicator × sex  
   - Pivot wide  
   - Merge with patient-level data
5. Geocode each province's mapped city with OSM (Nominatim) to get its coordinates  
6. Query OpenAQ API:
   - Find air-quality stations near each city  
   - Retrieve latest PM2.5  
   - Merge PM2.5 into dataset (urban/rural split)
7. Generate and save all Phase-2 CSV outputs
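
Step 4 (latest year per indicator × sex, pivot wide, merge) is the part most likely to need editing, so a rough sketch may help. The WHO column names (`indicator`, `sex`, `year`, `value`), the `Patient_ID` column, the join on `Gender_simple`, and all values below are assumptions for illustration; the real file and notebook may differ.

```python
import pandas as pd


def who_latest_wide(who: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recent year per indicator x sex, then pivot to one row per sex."""
    latest = (
        who.sort_values("year")
           # After sorting by year, the last duplicate is the latest observation.
           .drop_duplicates(subset=["indicator", "sex"], keep="last")
    )
    wide = latest.pivot(index="sex", columns="indicator", values="value")
    wide.columns.name = None  # drop the leftover "indicator" axis label
    return wide.reset_index()


# Made-up WHO rows in long format (hypothetical column names and values).
who = pd.DataFrame({
    "indicator": ["tobacco_use", "tobacco_use", "tobacco_use", "obesity"],
    "sex": ["Male", "Male", "Female", "Male"],
    "year": [2018, 2020, 2019, 2016],
    "value": [48.0, 47.0, 2.0, 6.0],
})
patients = pd.DataFrame({"Patient_ID": [1, 2], "Gender_simple": ["Male", "Female"]})

wide = who_latest_wide(who)
# Left join so every patient row is kept even when an indicator is missing for that sex.
merged = (
    patients.merge(wide, how="left", left_on="Gender_simple", right_on="sex")
            .drop(columns="sex")
)
print(merged)
```

A left join keeps the patient row count unchanged; indicators with no value for a given sex simply come through as `NaN`.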

**Outputs (Updated):**  
- `../data/processed/heart_attack_china_analysis_ready.csv`  
- `../data/processed/heart_attack_china_model_ready.csv`  
- `../data/processed/heart_attack_china_with_who_latest_by_sex.csv`  
- `../data/processed/heart_attack_china_with_air_quality.csv`  
- *(Intermediate)* `../data/processed/province_air_quality.csv`

**Edit if:**  
- WHO file structure changes → update the latest-year filter and pivot logic  
- PM2.5 radius or query limits change → update OpenAQ URL parameters  
- Province/city mapping changes → update the `cities = {}` dictionary  
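
To get a feel for what changing the PM2.5 radius does before touching the query, the effect can be approximated offline. The notebook passes the radius to OpenAQ in the URL; the sketch below does not call the API and only illustrates which stations a given radius would include, using great-circle (haversine) distance. The function names, station names, and station coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stations_within(city, stations, radius_km):
    """Return the names of stations no farther than radius_km from the city."""
    return [
        name
        for name, (lat, lon) in stations.items()
        if haversine_km(city[0], city[1], lat, lon) <= radius_km
    ]


city = (39.90, 116.40)  # approximate coordinates of Beijing
stations = {  # hypothetical stations
    "station_a": (39.95, 116.45),  # a few kilometres away
    "station_b": (40.90, 116.40),  # one degree of latitude north
}
near = stations_within(city, stations, radius_km=25)
wide_search = stations_within(city, stations, radius_km=150)
print(near, wide_search)
```

A small radius can leave some cities with no station at all, which would show up as missing PM2.5 after the merge, so it is worth checking coverage whenever the radius is tightened.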
