# DSCI 511 – Term Project (Phase 1): Scoping a Dataset

> **Team:** Roy Phelps, Shad Scarboro, Leland Weeks, Evan Wessel  
> **Dataset:** `heart_attack_china.csv`  
> **Repo:** ()

## 1) Abstract (diverse audience)
A short 3–5 sentence explanation (non-technical) about the topic, why it matters, and what your dataset enables.

## 2) Team Background & Roles
- Member → skill focus → growth goals  
- Note who will work on data cleaning, enrichment, documentation, and notebooks.

## 3) Topic & Intended Uses
Briefly describe:
- What questions or tasks the dataset supports
- Why it's a good fit for Phase 1


## 4) Data Sample & Dictionary
Below is a 10-row preview of the dataset. The data dictionary is located at:

`../docs/data_dictionary.md`


In [31]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Dataset import and first 10 rows
import pandas as pd

df = pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)
pd.set_option("display.max_columns", None)

df.head(10)



Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack
0,1,55,Male,Non-Smoker,No,No,Yes,Normal,High,High,Moderate,Low,Yes,No,Good,Rural,Eastern,Beijing,Low,Yes,Unemployed,Primary,Low,104,Yes,No,78,No
1,2,66,Female,Smoker,Yes,No,No,Low,Medium,High,Healthy,Medium,No,Yes,Poor,Urban,Eastern,Qinghai,High,No,Unemployed,Secondary,Middle,142,No,No,49,No
2,3,69,Female,Smoker,No,No,No,Low,Medium,High,Moderate,Low,No,No,Poor,Rural,Eastern,Henan,Low,No,Unemployed,Primary,High,176,No,No,31,No
3,4,45,Female,Smoker,No,Yes,No,Normal,Medium,Low,Healthy,Medium,Yes,No,Poor,Rural,Central,Qinghai,Medium,Yes,Employed,Primary,Low,178,No,Yes,23,No
4,5,39,Female,Smoker,No,No,No,Normal,Medium,Medium,Healthy,Low,No,No,Moderate,Urban,Western,Guangdong,Low,No,Retired,Higher,Middle,146,Yes,No,79,No
5,6,76,Male,Smoker,No,No,No,Low,Low,Low,Poor,Medium,No,Yes,Moderate,Urban,Eastern,Sichuan,Low,Yes,Employed,Higher,Middle,92,No,No,49,No
6,7,37,Male,Smoker,No,No,No,Normal,Medium,Low,Poor,Medium,No,Yes,Poor,Urban,Eastern,Shanghai,Low,Yes,Employed,Higher,High,144,Yes,No,81,No
7,8,88,Male,Non-Smoker,Yes,No,Yes,Low,High,Low,Moderate,High,No,No,Moderate,Rural,Western,Shandong,High,No,Retired,,Low,162,No,No,27,No
8,9,54,Female,Smoker,No,No,Yes,Normal,Medium,Medium,Poor,Medium,No,No,Poor,Urban,Northern,Gansu,Medium,Yes,Unemployed,Secondary,Low,93,No,No,62,No
9,10,47,Female,Smoker,No,No,Yes,Low,High,Low,Moderate,Medium,Yes,Yes,Moderate,Rural,Eastern,Beijing,Low,No,Employed,,High,125,No,No,67,Yes
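A first draft of the data dictionary can be generated from the DataFrame itself. The sketch below is only one possible way to seed `../docs/data_dictionary.md`; the helper name `dictionary_skeleton` is hypothetical, and the three-row sample is copied from the preview above (a subset of columns) so the cell runs without the CSV.

```python
import pandas as pd


def dictionary_skeleton(df: pd.DataFrame, max_examples: int = 3) -> pd.DataFrame:
    """Summarize each column: dtype, missing count, distinct count, example values."""
    rows = []
    for col in df.columns:
        values = df[col].dropna()
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_missing": int(df[col].isna().sum()),
            "n_unique": int(values.nunique()),
            "examples": ", ".join(map(str, values.unique()[:max_examples])),
        })
    return pd.DataFrame(rows)


# Rows 0, 1, and 7 of the preview, limited to a few columns.
sample = pd.DataFrame({
    "Patient_ID": [1, 2, 8],
    "Age": [55, 66, 88],
    "Gender": ["Male", "Female", "Male"],
    "Education_Level": ["Primary", "Secondary", None],
    "Blood_Pressure": [104, 142, 162],
})

skeleton = dictionary_skeleton(sample)
print(skeleton.to_string(index=False))
```

Descriptions and units would still need to be written by hand; the skeleton only captures what can be read from the data.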


## 5) Derived Risk Flags (Optional but Planned)
To support analysis and enrichment in Phase 2, we will generate new columns such as:

- `has_hypertension`
- `has_diabetes`
- `has_dyslipidemia`
- `lifestyle_risk_score`
- optional: `record_date` (synthetic for trend visuals)

All logic is documented in:  
`../docs/derived_risk_flags.md`

These flags will be created in a later section and saved to:  
`data/processed/heart_attack_china_enriched.csv`


In [19]:
# Placeholder: derived risk flags will be implemented after initial Phase 1 write-up.
# Logic is documented in ../docs/derived_risk_flags.md

# Example structure:
# import pandas as pd
# df = pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)
# ... flag creation here ...
# df.to_csv("../data/processed/heart_attack_china_enriched.csv", index=False)
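As an illustration of how `lifestyle_risk_score` might be built from columns visible in the preview, here is a minimal sketch. The point values in `LIFESTYLE_POINTS` are illustrative assumptions, not the logic in `../docs/derived_risk_flags.md`, and the function name is hypothetical. The sample rows are rows 0, 5, and 7 of the preview.

```python
import pandas as pd

# Hypothetical scoring table: one point per unfavorable lifestyle category.
LIFESTYLE_POINTS = {
    "Smoking_Status": {"Smoker": 1, "Non-Smoker": 0},
    "Physical_Activity": {"Low": 1, "Medium": 0, "High": 0},
    "Diet_Score": {"Poor": 1, "Moderate": 0, "Healthy": 0},
    "Stress_Level": {"High": 1, "Medium": 0, "Low": 0},
    "Alcohol_Consumption": {"Yes": 1, "No": 0},
}


def lifestyle_risk_score(df: pd.DataFrame) -> pd.Series:
    """Sum the points per row; unknown or missing categories yield <NA>."""
    score = pd.Series(0, index=df.index, dtype="Int64")
    for col, points in LIFESTYLE_POINTS.items():
        # Unmapped categories become NaN, which the nullable Int64 dtype keeps as <NA>.
        score = score + df[col].map(points).astype("Int64")
    return score


sample = pd.DataFrame({
    "Smoking_Status": ["Non-Smoker", "Smoker", "Non-Smoker"],
    "Physical_Activity": ["High", "Low", "Low"],
    "Diet_Score": ["Moderate", "Poor", "Moderate"],
    "Stress_Level": ["Low", "Medium", "High"],
    "Alcohol_Consumption": ["Yes", "No", "No"],
})

scores = lifestyle_risk_score(sample)
print(scores.tolist())
```

Returning `<NA>` for unrecognized categories, rather than silently scoring them as zero, keeps data-quality problems visible in the enriched file.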


## 6) Provenance & Access

The dataset `heart_attack_china.csv` is stored locally in:

`data/raw/heart_attack_china.csv`

For this phase, we are treating it as a publicly usable dataset for educational purposes.  
We will include a short statement about:

- **Where it originated** (e.g., synthetic sample, aggregate release, public health source, Kaggle-like dataset).
- **Any usage notes or licenses** (if applicable).
- **How others can access it** once the repository is pushed to GitHub.

When the repo is published, others will be able to pull the dataset from:
`data/raw/heart_attack_china.csv`

If additional sources are used in Phase 2 (e.g., WHO, NHANES, air quality, socioeconomic indicators), those sources and access methods will also be documented here.


## 7) Limitations & Improvements

At this stage, we recognize the following limitations:

- **No time dimension**: The dataset does not include explicit dates for patient events or records.
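- **Geographic alignment**: `Region` and `Province` fields exist, but they may need standardization before enrichment or joins. In the preview above, for example, Qinghai appears under both the Eastern and Central regions.
- **Potential missingness or imbalance**: Some columns may have null values or skewed distributions. In the 10-row preview, `Education_Level` is blank in two rows and only one row has `Heart_Attack` equal to `Yes`.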
- **No explicit data dictionary from the source**: We created our own in `../docs/data_dictionary.md`.
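The missingness and imbalance concerns above can be quantified with a short audit. This is a sketch, not part of the project code: the helper names `missing_share` and `class_balance` are hypothetical, and the sample reproduces only the `Education_Level` and `Heart_Attack` columns of the 10-row preview.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def class_balance(df: pd.DataFrame, target: str = "Heart_Attack") -> pd.Series:
    """Relative frequency of each target value, counting missing values too."""
    return df[target].value_counts(normalize=True, dropna=False)


sample = pd.DataFrame({
    "Education_Level": ["Primary", "Secondary", "Primary", "Primary", "Higher",
                        "Higher", "Higher", None, "Secondary", None],
    "Heart_Attack": ["No"] * 9 + ["Yes"],
})

missing = missing_share(sample)
balance = class_balance(sample)
print(missing)
print(balance)
```

It may also be worth checking whether blanks are truly missing or are a category label that `pd.read_csv` parsed as missing; its default missing-value markers include strings such as `NA` and `None`, and passing `keep_default_na=False` is one way to inspect the raw strings.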

### Planned Improvements (Phase 2 or later Phase 1 work):
- Add `record_date` (synthetic or real, if available).
- Use the logic in `derived_risk_flags.md` to generate flags such as hypertension, diabetes, and lifestyle risk.
- Validate and standardize geographic categories for optional external merges.
- Consider enrichment with public sources (e.g., WHO, NHANES, socioeconomic indicators).
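If a synthetic `record_date` is added, a fixed random seed keeps it reproducible. The sketch below is one possible approach under stated assumptions: the date range, the seed, the function name `add_synthetic_record_date`, and the companion column `record_date_is_synthetic` are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd


def add_synthetic_record_date(df, start="2020-01-01", end="2023-12-31", seed=511):
    """Return a copy of df with a uniformly sampled, clearly labeled synthetic date."""
    rng = np.random.default_rng(seed)
    days = pd.date_range(start, end, freq="D")
    picks = rng.integers(0, len(days), size=len(df))  # one random day index per row
    out = df.copy()
    out["record_date"] = days[picks].to_numpy()
    out["record_date_is_synthetic"] = True  # keeps the provenance explicit downstream
    return out


sample = pd.DataFrame({"Patient_ID": [1, 2, 3, 4, 5]})
dated = add_synthetic_record_date(sample)
print(dated)
```

Because the dates are sampled independently of every other column, they are suitable for demonstrating trend visuals but carry no real temporal signal.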


## 8) Enrichment Plan (Optional / Planned)

Because the dataset includes geographic fields (e.g., province or region), we may add external data 
to enhance analysis in later phases. Possible enrichment sources:

- **Regional health indicators** (e.g., WHO, national health statistics)
- **Environmental factors** (e.g., air quality, PM2.5, population density)
- **Socioeconomic data** (e.g., GDP per capita, insurance coverage)
- **Clinical benchmarks** (e.g., global cholesterol or hypertension prevalence)

Any external data used will:
1. Be documented clearly in the Phase 2 write-up.
2. Rely on consistent join keys (like standardized region/province names).
3. Be stored separately in `data/external/` or processed and merged into `data/processed/`.
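The join-key requirement above can be sketched as follows. Everything here is illustrative: `standardize_province` is a hypothetical helper, the messy spellings in `patients` are invented to show the normalization, and `indicator_value` holds placeholder numbers rather than real external data.

```python
import pandas as pd


def standardize_province(names: pd.Series) -> pd.Series:
    """Trim whitespace and normalize casing so province names can act as a join key."""
    return names.astype("string").str.strip().str.title()


# Invented spellings to demonstrate the cleanup step.
patients = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Province": ["Beijing", " qinghai", "HENAN"],
})
# Placeholder external table (values are not real measurements).
external = pd.DataFrame({
    "Province": ["Beijing", "Qinghai"],
    "indicator_value": [1.0, 2.0],
})

patients["province_key"] = standardize_province(patients["Province"])
external["province_key"] = standardize_province(external["Province"])

merged = patients.merge(
    external.drop(columns="Province"),
    on="province_key",
    how="left",              # keep every patient row
    validate="many_to_one",  # raise if the external table repeats a province
    indicator=True,          # adds a _merge column for auditing matches
)

unmatched = merged.loc[merged["_merge"] == "left_only", "province_key"].unique().tolist()
print("Provinces without external data:", unmatched)
```

Listing unmatched keys after each merge gives a quick check that standardization worked before the merged file is written to `data/processed/`.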


## 9) Reproducibility

This section explains how someone can rebuild our processed dataset from raw files and re-run the notebook.

**Data locations**
- Raw: `../data/raw/heart_attack_china.csv`
- Processed (output): `../data/processed/heart_attack_china_enriched.csv`
- Docs: `../docs/data_dictionary.md`, `../docs/derived_risk_flags.md`

**Steps**
1. Clone the repo.
2. Open `notebooks/Phase1_Report.ipynb`.
3. Run the rebuild code cell in this section to create `data/processed/heart_attack_china_enriched.csv` (the “Derived Risk Flags” cell in Section 5 is currently a placeholder).
4. Re-run remaining cells as needed.


In [34]:
# Rebuild processed dataset from raw (minimal version for Phase 1)
# For detailed logic, see ../docs/derived_risk_flags.md

import pandas as pd
from pathlib import Path

RAW = Path("../data/raw/heart_attack_china.csv")
OUT = Path("../data/processed/heart_attack_china_enriched.csv")
OUT.parent.mkdir(parents=True, exist_ok=True)

df = pd.read_csv(RAW, low_memory=False)


def numeric_or_nan(frame, col):
    """Return `col` as floats, or an all-NaN Series when the column is absent."""
    if col in frame.columns:
        return pd.to_numeric(frame[col], errors="coerce")
    return pd.Series(float("nan"), index=frame.index)


# --- Optional: split SBP/DBP if "Blood Pressure" is a string like "130/85"
if "Blood Pressure" in df.columns and df["Blood Pressure"].dtype == object:
    bp_split = df["Blood Pressure"].str.extract(r"(?P<SBP>\d+)[^0-9]+(?P<DBP>\d+)").astype(float)
    df["SBP"] = bp_split["SBP"]
    df["DBP"] = bp_split["DBP"]

# --- Flags (keep it simple for Phase 1)
# NaN compares as False, so a flag is False wherever its source columns are absent or missing.
df["has_hypertension"] = (numeric_or_nan(df, "SBP") >= 130) | (numeric_or_nan(df, "DBP") >= 80)
df["has_diabetes"] = (numeric_or_nan(df, "HbA1c") >= 6.5) | (numeric_or_nan(df, "Glucose") >= 126)
df["has_dyslipidemia"] = (numeric_or_nan(df, "Cholesterol") >= 240) | (numeric_or_nan(df, "Triglycerides") >= 200)

df.to_csv(OUT, index=False)
print("Wrote:", OUT.resolve())


Wrote: /Users/royphelps/Library/CloudStorage/OneDrive-DrexelUniversity/DSCI 511-900/Colab Notebooks/DSCI-511-Project/data/processed/heart_attack_china_enriched.csv


In [36]:
# Verify the enriched data is saved 
import pandas as pd

df_enriched = pd.read_csv("../data/processed/heart_attack_china_enriched.csv")
df_enriched.head()


Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack,has_hypertension,has_diabetes,has_dyslipidemia
0,1,55,Male,Non-Smoker,No,No,Yes,Normal,High,High,Moderate,Low,Yes,No,Good,Rural,Eastern,Beijing,Low,Yes,Unemployed,Primary,Low,104,Yes,No,78,No,False,False,False
1,2,66,Female,Smoker,Yes,No,No,Low,Medium,High,Healthy,Medium,No,Yes,Poor,Urban,Eastern,Qinghai,High,No,Unemployed,Secondary,Middle,142,No,No,49,No,False,False,False
2,3,69,Female,Smoker,No,No,No,Low,Medium,High,Moderate,Low,No,No,Poor,Rural,Eastern,Henan,Low,No,Unemployed,Primary,High,176,No,No,31,No,False,False,False
3,4,45,Female,Smoker,No,Yes,No,Normal,Medium,Low,Healthy,Medium,Yes,No,Poor,Rural,Central,Qinghai,Medium,Yes,Employed,Primary,Low,178,No,Yes,23,No,False,False,False
4,5,39,Female,Smoker,No,No,No,Normal,Medium,Medium,Healthy,Low,No,No,Moderate,Urban,Western,Guangdong,Low,No,Retired,Higher,Middle,146,Yes,No,79,No,False,False,False
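
In the preview above, all three flags are `False`, including rows with `Blood_Pressure` values of 142 to 178. That is consistent with the rebuild cell looking for columns (`Blood Pressure`, `HbA1c`, `Glucose`, `Cholesterol`, `Triglycerides`) that do not appear in this file's header, so every comparison falls back to a default. A minimal sketch that instead uses columns that do exist is shown below. It rests on assumptions the source does not confirm: that `Blood_Pressure` is a systolic reading in mmHg, that a `Cholesterol_Level` of `High` is a reasonable proxy for dyslipidemia, and that the 130 threshold from the rebuild cell still applies. The function name `add_flags` is hypothetical, and the sample is the first five preview rows.

```python
import pandas as pd


def add_flags(df: pd.DataFrame, sbp_threshold: float = 130) -> pd.DataFrame:
    """Derive boolean flags from columns present in heart_attack_china.csv."""
    out = df.copy()
    # Assumption: Blood_Pressure holds a systolic value in mmHg.
    sbp = pd.to_numeric(out["Blood_Pressure"], errors="coerce")
    out["has_hypertension"] = out["Hypertension"].eq("Yes") | (sbp >= sbp_threshold)
    out["has_diabetes"] = out["Diabetes"].eq("Yes")
    # Assumption: "High" is one of the Cholesterol_Level categories.
    out["has_dyslipidemia"] = out["Cholesterol_Level"].eq("High")
    return out


sample = pd.DataFrame({
    "Hypertension": ["No", "Yes", "No", "No", "No"],
    "Diabetes": ["No", "No", "No", "Yes", "No"],
    "Cholesterol_Level": ["Normal", "Low", "Low", "Normal", "Normal"],
    "Blood_Pressure": [104, 142, 176, 178, 146],
})

flagged = add_flags(sample)
print(flagged[["has_hypertension", "has_diabetes", "has_dyslipidemia"]])
```

Whichever definition is adopted, checking that each flag takes both values in the full dataset is a cheap guard against a column-name mismatch going unnoticed.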
