# DSCI 511 – Term Project (Phase 1): Scoping a Dataset

## 1) Proposed Dataset of Interest

We want to construct a dataset that would enable a thorough analysis cardiovascular health and lifestyle score across nations. We expect to aggregate information from many government and private healthcare datasets resulting in a singular dataset representing many demographics (e.g., age, gender, air pollution exposure), health vitals (e.g., cholesterol, blood pressure, diabetes) and activities (e.g., smoking, diet, physical activity). The data will most likely need to be collected via several access methods (e.g., direct download, open and protected APIs) and formats (e.g. CSV, relational database, unstructred data). We plan to present the final dataset as a CSV to minimize ingestion friction and maximize simplicity and readability. 

## 2) Team Background & Roles
Our team consists of four members with complementary strengths. Below are our self-identified skills and skills identified individually for growth:

**• Roy Phelps** rp994@drexel.edu  
Current Skills: Python, Jupyter Notebooks, basic data cleaning, visualization  
Targeted Growth Skills:  
Contribution: Code development, documentation, Git/GitHub organization  

**• Shad Scarboro** srs359@drexel.edu  
Skills: Data sourcing, research, presentation formatting  
Targeted Growth Skills:  
Contribution: Lead on data acquisition planning and writeup support

**• Leland Weeks** lhw22@drexel.edu  
Current Skills: Product Management, Data Analytics, Business Intelligence  
Targeted Growth Skills: Big Data, Data Mining, Predictive Analytics  
Contribution: Support on initial dataset exploration and preprocessing

**• Evan Wessel** ew594@drexel.edu   
Skills: Writing, data summaries, editing  
Targeted Growth Skills:  
Contribution: Drafting sections of the proposal and summarizing findings

As a team, we will collaborate across tasks and take responsibility in these areas to ensure efficiency and clarity.

> **Repo:** (https://github.com/royphelps1/DSCI-511-Project)

## 3) Potential Users and Applications
**Users:**

Studying cardiovascular health is critical for understanding disease risk and developing effective treatments, so the potential users of the dataset include hospitals, cardiology departments, public health analysts, professors, and students.

**Applications:**

Applications for this dataset include developing predictive models for risk assessment and/or improving existing models. The dataset can assist with early diagnosis and treatment methods. It can also assist health care workers and patients understand the risk factors and to tailor recommendations for lifestyle changes. 

## 4) Data Sample & Dictionary
Below is a 10-row preview of the dataset (acquired programmatically) and a reference to the data dictionary located at:

`../docs/data_dictionary.md`

In [6]:
# ─── One-time setup, if needed ───
# Install dependencies:
# pip install kagglehub

In [12]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
from pathlib import Path

# Set the path to the file you'd like to load
file_path = "heart_attack_china.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "ankushpanday2/heart-attack-risk-dataset-of-china",
  file_path,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documenation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)
display(df.head(10))

# write the file to the local disk
output_path = Path("../data/raw/heart_attack_china.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack
0,1,55,Male,Non-Smoker,No,No,Yes,Normal,High,High,Moderate,Low,Yes,No,Good,Rural,Eastern,Beijing,Low,Yes,Unemployed,Primary,Low,104,Yes,No,78,No
1,2,66,Female,Smoker,Yes,No,No,Low,Medium,High,Healthy,Medium,No,Yes,Poor,Urban,Eastern,Qinghai,High,No,Unemployed,Secondary,Middle,142,No,No,49,No
2,3,69,Female,Smoker,No,No,No,Low,Medium,High,Moderate,Low,No,No,Poor,Rural,Eastern,Henan,Low,No,Unemployed,Primary,High,176,No,No,31,No
3,4,45,Female,Smoker,No,Yes,No,Normal,Medium,Low,Healthy,Medium,Yes,No,Poor,Rural,Central,Qinghai,Medium,Yes,Employed,Primary,Low,178,No,Yes,23,No
4,5,39,Female,Smoker,No,No,No,Normal,Medium,Medium,Healthy,Low,No,No,Moderate,Urban,Western,Guangdong,Low,No,Retired,Higher,Middle,146,Yes,No,79,No
5,6,76,Male,Smoker,No,No,No,Low,Low,Low,Poor,Medium,No,Yes,Moderate,Urban,Eastern,Sichuan,Low,Yes,Employed,Higher,Middle,92,No,No,49,No
6,7,37,Male,Smoker,No,No,No,Normal,Medium,Low,Poor,Medium,No,Yes,Poor,Urban,Eastern,Shanghai,Low,Yes,Employed,Higher,High,144,Yes,No,81,No
7,8,88,Male,Non-Smoker,Yes,No,Yes,Low,High,Low,Moderate,High,No,No,Moderate,Rural,Western,Shandong,High,No,Retired,,Low,162,No,No,27,No
8,9,54,Female,Smoker,No,No,Yes,Normal,Medium,Medium,Poor,Medium,No,No,Poor,Urban,Northern,Gansu,Medium,Yes,Unemployed,Secondary,Low,93,No,No,62,No
9,10,47,Female,Smoker,No,No,Yes,Low,High,Low,Moderate,Medium,Yes,Yes,Moderate,Rural,Eastern,Beijing,Low,No,Employed,,High,125,No,No,67,Yes


## 5) Derived Risk Flags (Optional but Planned)
To support analysis and enrichment in Phase 2, we will generate new columns such as:

- `has_hypertension`
- `has_diabetes`
- `has_dyslipidemia`
- `lifestyle_risk_score`
- optional: `record_date` (synthetic for trend visuals)

All logic is documented in:  
`../docs/derived_risk_flags.md`

These flags will be created in a later section and saved to:  
`data/processed/heart_attack_china_enriched.csv`


In [20]:
# Placeholder: derived risk flags will be implemented after initial Phase 1 write-up.
# Logic is documented in ../docs/derived_risk_flags.md

# Example structure:
# import pandas as pd
# df = pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)
# ... flag creation here ...
# df.to_csv("../data/processed/heart_attack_china_enriched.csv", index=False)


## 6) Provenance & Access

The dataset `heart_attack_china.csv` is stored locally in:

`data/raw/heart_attack_china.csv`

For this phase, we are treating it as a publicly usable dataset for educational purposes.  
We will include a short statement about:

- **Where it originated** (Kaggle).
- **Licenses** (MIT).
- **How others can access** once the repository is pushed to GitHub.

When the repo is published, others will be able to pull the dataset from:
`data/raw/heart_attack_china.csv`

If additional sources are used in Phase 2 (e.g., WHO, NHANES, air quality, socioeconomic indicators), those sources and access methods will also be documented here.


## 7) Limitations & Improvements

At this stage, we recognize the following limitations:

- **No time dimension**: The dataset does not include explicit dates for patient events or records.
- **Geographic alignment**: While region/province fields exist, they may need standardization for enrichment or joins.
- **Potential missingness or imbalance**: Some columns may have null values or skewed distributions.
- **No explicit data dictionary from the source**: We created our own in `../docs/data_dictionary.md`.

### Planned Improvements (Phase 2 or later Phase 1 work):
- Add `record_date` (synthetic or real, if available).
- Use `derived_risk_flags.md` to generate flags such as hypertension, diabetes, lifestyle risk.
- Validate and standardize geographic categories for optional external merges.
- Consider enrichment with public sources (e.g., WHO, NHANES, socioeconomic indicators).


## 8) Enrichment Plan (Optional / Planned)

Because the dataset includes geographic fields (e.g., province or region), we may add external data 
to enhance analysis in later phases. Possible enrichment sources:

- **Regional health indicators** (e.g., WHO, national health statistics)
- **Environmental factors** (e.g., air quality, PM2.5, population density)
- **Socioeconomic data** (e.g., GDP per capita, insurance coverage)
- **Clinical benchmarks** (e.g., global cholesterol or hypertension prevalence)

Any external data used will:
1. Be documented clearly in the Phase 2 write-up.
2. Rely on consistent join keys (like standardized region/province names).
3. Be stored separately in `data/external/` or processed and merged into `data/processed/`.


## 9) Reproducibility

This section explains how someone can rebuild our processed dataset from raw files and re-run the notebook.

**Data locations**
- Raw: `../data/raw/heart_attack_china.csv`
- Processed (output): `../data/processed/heart_attack_china_enriched.csv`
- Docs: `../docs/data_dictionary.md`, `../docs/derived_risk_flags.md`

**Steps**
1. Clone the repo.
2. Open `notebooks/Phase1_Report.ipynb`.
3. Run the “Derived Risk Flags” code cell to create `data/processed/heart_attack_china_enriched.csv`.
4. Re-run remaining cells as needed.


In [31]:

# Imports
import pandas as pd
from pathlib import Path

# Paths (adjust if needed)
RAW = Path("../data/raw/heart_attack_china.csv")
OUT = Path("../data/processed/heart_attack_china_enriched.csv")
OUT.parent.mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(RAW, low_memory=False)

# Tidy column names and cells in keeping with tydy data
df.columns = [c.strip() for c in df.columns]
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Simple yes/not
yes_no_cols = [
    "Hypertension","Diabetes","Obesity","Alcohol_Consumption",
    "Family_History_CVD","Chronic_Kidney_Disease","Previous_Heart_Attack",
    "TCM_Use","Heart_Attack"
]
for col in yes_no_cols:
    if col in df.columns:
        df[col] = df[col].map({"Yes": 1, "No": 0})

# Make important numeric columns numeric to be sure
numeric_cols = ["Patient_ID", "Age", "Blood_Pressure", "CVD_Risk_Score"]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# SBP setup
bp_col = "Blood_Pressure" 

# Normalize and take the first number as SBP
bp_text = (
    df[bp_col]
    .astype(str)
    .str.strip()
    .str.replace("-", "/", regex=False)
)
sbp_only = bp_text.str.split("/", n=1, expand=True)[0]
df["SBP"] = pd.to_numeric(sbp_only, errors="coerce")


# SBP category, approx
# Readings buckets <120 Normal, 120–129 Elevated, 130–139 Stage 1, 140–179 Stage 2, >=180 Crisis
if "SBP" in df.columns:
    df["SBP_Category"] = pd.cut(
        df["SBP"],
        bins=[-float("inf"), 120, 130, 140, 180, float("inf")],
        labels=["Normal", "Elevated", "Stage 1", "Stage 2", "Crisis"]
    )

# Health flags that match columns
# Hypertension from yes/no 
hypert_from_yesno = (df["Hypertension"] == 1) if "Hypertension" in df.columns else pd.Series(False, index=df.index)
sbp_high = (df["SBP"] >= 130) if "SBP" in df.columns else pd.Series(False, index=df.index)
df["has_hypertension"] = hypert_from_yesno.fillna(False) | sbp_high.fillna(False)

# Diabetes yes/no to 1=yes, 0=no
df["has_diabetes"] = ((df["Diabetes"] == 1).fillna(False)) if "Diabetes" in df.columns else pd.Series(False, index=df.index)

# Dyslipidemia when cholesterol level is high
if "Cholesterol_Level" in df.columns:
    df["has_dyslipidemia"] = (df["Cholesterol_Level"] == "High").fillna(False)
else:
    df["has_dyslipidemia"] = False

# Ordinal code map
l_m_h_map = {"Low": 0, "Medium": 1, "High": 2}
for col in ["Air_Pollution_Exposure", "Physical_Activity", "Stress_Level", "Income_Level"]:
    if col in df.columns:
        df[col] = df[col].map(l_m_h_map)

chol_map = {"Low": 0, "Normal": 1, "High": 2}
if "Cholesterol_Level" in df.columns:
    df["Cholesterol_Level"] = df["Cholesterol_Level"].map(chol_map)

diet_map = {"Poor": 0, "Moderate": 1, "Healthy": 2}
if "Diet_Score" in df.columns:
    df["Diet_Score"] = df["Diet_Score"].map(diet_map)

ru_map = {"Rural": 0, "Urban": 1}
if "Rural_or_Urban" in df.columns:
    df["Rural_or_Urban"] = df["Rural_or_Urban"].map(ru_map)

# Missing value handling 
if "Education_Level" in df.columns:
    df["Education_Level"] = df["Education_Level"].fillna("Unknown")

# Check
print(df.head(10))
print("\nDTypes:\n", df.dtypes)
print("\nNA counts (top 10):\n", df.isna().sum().sort_values(ascending=False).head(10))

# Save processed file
df.to_csv(OUT, index=False)
print("\nWrote:", OUT.resolve())

   

   Patient_ID  Age  Gender Smoking_Status  Hypertension  Diabetes  Obesity  \
0           1   55    Male     Non-Smoker             0         0        1   
1           2   66  Female         Smoker             1         0        0   
2           3   69  Female         Smoker             0         0        0   
3           4   45  Female         Smoker             0         1        0   
4           5   39  Female         Smoker             0         0        0   
5           6   76    Male         Smoker             0         0        0   
6           7   37    Male         Smoker             0         0        0   
7           8   88    Male     Non-Smoker             1         0        1   
8           9   54  Female         Smoker             0         0        1   
9          10   47  Female         Smoker             0         0        1   

   Cholesterol_Level  Air_Pollution_Exposure  Physical_Activity  Diet_Score  \
0                  1                       2                  

In [33]:
# Verify the enriched data is saved 
import pandas as pd

# Read in the enriched data from output directory
df_enriched = pd.read_csv("../data/processed/heart_attack_china_enriched.csv")

# View the first 5 rows
df_enriched.head()


Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack,SBP,SBP_Category,has_hypertension,has_diabetes,has_dyslipidemia
0,1,55,Male,Non-Smoker,0,0,1,1,2,2,1,0,1,0,Good,0,Eastern,Beijing,Low,1,Unemployed,Primary,0.0,104,1,0,78,0,104,Normal,False,False,False
1,2,66,Female,Smoker,1,0,0,0,1,2,2,1,0,1,Poor,1,Eastern,Qinghai,High,0,Unemployed,Secondary,,142,0,0,49,0,142,Stage 2,True,False,False
2,3,69,Female,Smoker,0,0,0,0,1,2,1,0,0,0,Poor,0,Eastern,Henan,Low,0,Unemployed,Primary,2.0,176,0,0,31,0,176,Stage 2,True,False,False
3,4,45,Female,Smoker,0,1,0,1,1,0,2,1,1,0,Poor,0,Central,Qinghai,Medium,1,Employed,Primary,0.0,178,0,1,23,0,178,Stage 2,True,True,False
4,5,39,Female,Smoker,0,0,0,1,1,1,2,0,0,0,Moderate,1,Western,Guangdong,Low,0,Retired,Higher,,146,1,0,79,0,146,Stage 2,True,False,False
