# Phase 2

This notebook **cleans and prepares** the Heart Attack China dataset and WHO dataset, producing three processed outputs:

- **Analysis-ready:** `../data/processed/heart_attack_china_analysis_ready.csv`  
- **Model-ready:** `../data/processed/heart_attack_china_model_ready.csv`  
- **Merged (with WHO context):** `../data/processed/heart_attack_china_with_who_latest_by_sex.csv`

---

# Raw Inputs and Processing Steps

**Raw Input Paths**
- Heart Attack data: `../data/raw/heart_attack_china.csv`  
- WHO data: `../data/raw/who_health_china.csv`  
- Output directory: `../data/processed`

**Processing Overview**
- Clean heart-attack data and engineer key features (e.g., gender, SBP flags, yes/no mappings).  
- Clean WHO data, keep the *latest year* per indicator × sex, and pivot it to a wide format.  
- Map `Gender_simple` → WHO sex codes and join datasets.  
- Save three outputs: analysis-ready, model-ready, and merged with WHO context.
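
The WHO steps above can be illustrated on a tiny made-up table. This is a minimal sketch rather than the notebook's actual code: the indicator name `life_expectancy`, the values, and the helper `latest_wide_by_gender` are hypothetical, and the sex codes `SEX_MLE`, `SEX_FMLE`, and `SEX_BTSX` are assumed to match those in the WHO file.

```python
import pandas as pd

# Toy WHO-style rows (hypothetical indicator name and values, for illustration only)
who_toy = pd.DataFrame({
    "indicator": ["life_expectancy"] * 6,
    "sex": ["SEX_MLE", "SEX_MLE", "SEX_FMLE", "SEX_FMLE", "SEX_BTSX", "SEX_BTSX"],
    "year": [2018, 2019, 2018, 2019, 2018, 2019],
    "value": [74.0, 74.5, 79.0, 79.5, 76.5, 77.0],
})


def latest_wide_by_gender(who: pd.DataFrame) -> pd.DataFrame:
    """Keep the latest year per indicator x sex, then build one row per gender."""
    latest = (
        who.sort_values("year")
           .drop_duplicates(subset=["indicator", "sex"], keep="last")
    )
    rows = []
    for gender, sex_code in [("M", "SEX_MLE"), ("F", "SEX_FMLE")]:
        # Each gender gets its own sex-specific values plus the both-sexes values
        subset = latest[latest["sex"].isin([sex_code, "SEX_BTSX"])]
        row = {"Gender_simple": gender}
        for rec in subset.itertuples(index=False):
            row[f"WHO_{rec.indicator}_{rec.sex}_latest"] = rec.value
        rows.append(row)
    return pd.DataFrame(rows)


who_by_gender_toy = latest_wide_by_gender(who_toy)
print(who_by_gender_toy)
```

Each patient row can then pick up the WHO columns for its own gender through a left join on `Gender_simple`; columns that belong to the other sex are simply left empty.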



## 1) Combining `heart_attack_china` and `who_health_china`

In [29]:

# Import and file paths
import pandas as pd
from pathlib import Path

RAW_PATH = "../data/raw/heart_attack_china.csv"
WHO_PATH = "../data/raw/who_health_china.csv"
OUTDIR   = "../data/processed"

print("RAW:", RAW_PATH)
print("WHO:", WHO_PATH)
print("OUTDIR:", OUTDIR)


# Load the two CSVs: heart attack data (ha) and WHO data (who)
ha = pd.read_csv(RAW_PATH, low_memory=False)
who = pd.read_csv(WHO_PATH, low_memory=False)


#~~~~~~~~~ Heart attack csv ~~~~~~~~~~~~

# Basic cleaning on the heart attack data:
# strip spaces, normalize column names, and trim text columns
ha = ha.copy()
ha.columns = (
    ha.columns.str.strip()
              .str.replace(r"\s+", "_", regex=True)
              .str.replace(r"[^\w_]", "", regex=True)
)

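for col in ha.select_dtypes(include="object").columns:
    # Trim text but keep missing values missing
    # (astype(str) alone would turn NaN into the string "nan")
    ha[col] = ha[col].where(ha[col].isna(), ha[col].astype(str).str.strip())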


# Make a simple gender column as M, F, or NA
# If the CSV does not have a Gender column, create an all-NA column
if "Gender" in ha.columns:
    # First letter, upper-cased: "Male"/"m" -> "M", "Female"/"f" -> "F"
    g = ha["Gender"].astype(str).str.upper().str[0]
    ha["Gender_simple"] = g.where(g.isin(["M", "F"]), pd.NA)
else:
    ha["Gender_simple"] = pd.NA


# Rename the blood pressure column to SBP,
# then create the missing + hypertensive flags
if "Blood_Pressure" in ha.columns and "SBP" not in ha.columns:
    ha = ha.rename(columns={"Blood_Pressure": "SBP"})

if "SBP" in ha.columns:
    ha["SBP"] = pd.to_numeric(ha["SBP"], errors="coerce")
    ha["SBP_missing"] = ha["SBP"].isna()
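    # Flag SBP >= 140; rows with missing SBP stay NA instead of being counted as 0
    ha["SBP_hypertensive"] = (ha["SBP"] >= 140).astype("Int64").where(ha["SBP"].notna())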


# Simple yes or no mapping for a few columns
yes_no_map = {"yes": 1, "no": 0, "y": 1, "n": 0, "true": 1, "false": 0}

# Map yes/no, y/n, and true/false to 1/0 in these columns;
# any other value becomes NA.
# The column list can be adjusted later if needed.
for col in ["Hypertension", "Diabetes", "Obesity", "Heart_Attack"]:
    if col in ha.columns:
        ser = ha[col].astype(str).str.lower().str.strip()
        ha[col] = ser.map(yes_no_map).astype("Int64")

#~~~~~~~~~ WHO csv ~~~~~~~~~~~~

# Clean and prep the WHO csv
who = who.copy()

# Normalize column names: trim, replace whitespace with underscores,
# and drop any other non-word characters
who.columns = (
    who.columns.str.strip()
               .str.replace(r"\s+", "_", regex=True)
               .str.replace(r"[^\w_]", "", regex=True)
)

# Keep only China rows, in case other countries are present
if "country" in who.columns:
    who = who[who["country"].astype(str).str.lower() == "china"].copy()


# Check that the required columns are present
needed = ["year", "indicator", "sex", "value"]
for col in needed:
    if col not in who.columns:
        raise ValueError(f"WHO file missing column: {col}")


# Convert year to numeric and keep the latest year per indicator x sex
who["year"] = pd.to_numeric(who["year"], errors="coerce")
who = who.dropna(subset=["year"])
who = (
    who.sort_values(["indicator", "sex", "year"])
       .groupby(["indicator", "sex"], as_index=False)
       .tail(1)
)


# Create a WHO table with one row per gender
# SEX_BTSX (both sexes) attaches to both M and F
# SEX_MLE attaches to M
# SEX_FMLE attaches to F

# Build the wide table first
who["colname"] = (
    "WHO_" + who["indicator"].astype(str).str.replace(r"\s+", "_", regex=True)
    + "_" + who["sex"].astype(str) + "_latest"
)

# Pivot to wide format (like a pivot table in Excel)
wide = who.pivot_table(index=[], columns="colname", values="value", aggfunc="first").reset_index(drop=True)

all_cols = wide.columns.tolist()
btsx_cols = [c for c in all_cols if "_SEX_BTSX_" in c]
m_cols    = [c for c in all_cols if "_SEX_MLE_"  in c]
f_cols    = [c for c in all_cols if "_SEX_FMLE_" in c]

# If we have no rows at all, create empty shells
if wide.shape[0] == 0:
    wide_m = pd.DataFrame([{c: pd.NA for c in (btsx_cols + m_cols)}])
    wide_f = pd.DataFrame([{c: pd.NA for c in (btsx_cols + f_cols)}])
else:
    wide_m = wide[btsx_cols + m_cols].copy()
    wide_f = wide[btsx_cols + f_cols].copy()

wide_m["Gender_simple"] = "M"
wide_f["Gender_simple"] = "F"

who_by_gender = pd.concat([wide_m, wide_f], ignore_index=True)


#~~~~~~~~~ Merge part on gender ~~~~~~~~~~~

df_merged = ha.merge(
    who_by_gender,
    on="Gender_simple",
    how="left"
)


# Create the analysis-ready table
# For now this keeps every merged column
# (patient data plus WHO context);
# this is the CSV we will run analysis on
analysis_ready = df_merged.copy()


# Model-ready table: only the columns we expect to model on
# Can be adjusted if needed
keep_cols = [
    "Patient_ID", "Age", "Gender_simple", "SBP", "SBP_missing",
    "SBP_hypertensive", "Hypertension", "Diabetes", "Obesity",
    "Heart_Attack"
]
keep_cols = [c for c in keep_cols if c in df_merged.columns]

model_ready = df_merged[keep_cols].copy()


#~~~~~~~~~ Saving Portion ~~~~~~~~~~~

# Save everything in case we want to adjust
outdir = Path(OUTDIR)
outdir.mkdir(parents=True, exist_ok=True)

analysis_path = outdir / "heart_attack_china_analysis_ready.csv"
model_path    = outdir / "heart_attack_china_model_ready.csv"
merged_path   = outdir / "heart_attack_china_with_who_latest_by_sex.csv"

analysis_ready.to_csv(analysis_path, index=False, encoding="utf-8")
model_ready.to_csv(model_path, index=False, encoding="utf-8")
df_merged.to_csv(merged_path, index=False, encoding="utf-8")

print("Wrote:")
print(" - analysis:", analysis_path)
print(" - model:   ", model_path)
print(" - merged:  ", merged_path)


RAW: ../data/raw/heart_attack_china.csv
WHO: ../data/raw/who_health_china.csv
OUTDIR: ../data/processed
Wrote:
 - analysis: ../data/processed/heart_attack_china_analysis_ready.csv
 - model:    ../data/processed/heart_attack_china_model_ready.csv
 - merged:   ../data/processed/heart_attack_china_with_who_latest_by_sex.csv
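
Because the WHO side of the join has only one row per gender, the left merge should not change the number of patient rows. A minimal sketch of that check on made-up frames (the column `WHO_example_latest` and all values are hypothetical). `validate="many_to_one"` is an existing `DataFrame.merge` option that raises an error if the right-hand table has duplicate keys:

```python
import pandas as pd

# Toy patient table; the last patient has no usable gender
patients = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Gender_simple": ["M", "F", None],
})

# Toy WHO table with exactly one row per gender (hypothetical column and values)
who_side = pd.DataFrame({
    "Gender_simple": ["M", "F"],
    "WHO_example_latest": [1.0, 2.0],
})

merged = patients.merge(
    who_side,
    on="Gender_simple",
    how="left",
    validate="many_to_one",  # fail loudly if a gender appears twice on the WHO side
)

# A left join against unique keys must preserve the patient row count
assert len(merged) == len(patients)
print(merged)
```

Patients without a usable `Gender_simple` keep their row but get empty WHO columns, which is worth remembering when summarizing those columns later.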


In [27]:
# Print the first 5 rows of each saved table.
# We can choose any of them for analysis,
# depending on what we want to do.

print("Analysis Ready CSV First 5:")
print()
print(analysis_ready.head(5))
print()
print("Model Ready CSV First 5:")
print()
print(model_ready.head(5))
print()
print("DF Merged CSV First 5:")
print()
print(df_merged.head(5))


Analysis Ready CSV First 5:

   Patient_ID  Age  Gender Smoking_Status  Hypertension  Diabetes  Obesity  \
0           1   55    Male     Non-Smoker             0         0        1   
1           2   66  Female         Smoker             1         0        0   
2           3   69  Female         Smoker             0         0        0   
3           4   45  Female         Smoker             0         1        0   
4           5   39  Female         Smoker             0         0        0   

  Cholesterol_Level Air_Pollution_Exposure Physical_Activity  ...  \
0            Normal                   High              High  ...   
1               Low                 Medium              High  ...   
2               Low                 Medium              High  ...   
3            Normal                 Medium               Low  ...   
4            Normal                 Medium            Medium  ...   

  Chronic_Kidney_Disease Previous_Heart_Attack CVD_Risk_Score Heart_Attack  \
0        

### With this new merged dataset we can:
- See which regions are most heart attack prone
- See which ages heart attacks occur at
- See which gender they occur in
- See what the contributing factors are

### In more detail:
- Regional Risk Patterns – Identify which areas or provinces in China show the highest prevalence or risk of heart attacks.
- Age Distribution – Analyze the age groups most affected by heart attacks, helping to understand vulnerable populations.
- Gender Differences – Compare how heart attack rates differ between men and women.
- Contributing Factors – Examine how lifestyle and clinical factors (hypertension, smoking, obesity, diabetes, cholesterol levels) correlate with heart attack occurrence.
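
As a starting point for the gender and age questions above, heart attack rates can be computed with a group-by. This is a minimal sketch on made-up rows: the column names follow the model-ready file (`Age`, `Gender_simple`, `Heart_Attack`), while the age bands and all values are arbitrary assumptions for illustration.

```python
import pandas as pd

# Toy patient rows (made-up values); Heart_Attack is 1 for yes, 0 for no
toy = pd.DataFrame({
    "Age": [35, 42, 58, 61, 67, 73],
    "Gender_simple": ["M", "F", "M", "F", "M", "F"],
    "Heart_Attack": [0, 0, 1, 0, 1, 1],
})

# Hypothetical age bands; adjust the cut points to the analysis at hand
toy["Age_band"] = pd.cut(
    toy["Age"],
    bins=[0, 45, 60, 120],
    labels=["<=45", "46-60", ">60"],
)

# The mean of a 0/1 column is the share of patients with a heart attack
rate_by_gender = toy.groupby("Gender_simple")["Heart_Attack"].mean()
rate_by_age = toy.groupby("Age_band", observed=True)["Heart_Attack"].mean()

print(rate_by_gender)
print(rate_by_age)
```

The same pattern applies to the other 0/1 columns such as `Hypertension`, `Diabetes`, or `Obesity`: group by the factor and take the mean of `Heart_Attack`.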

### Which file should we use?
- If we are focusing purely on individual patient-level trends, I would suggest using `heart_attack_china_analysis_ready.csv`.
- If we are focusing on population-level context, such as the gender-specific indicators in the WHO CSV, I would suggest using `heart_attack_china_with_who_latest_by_sex.csv`.