# DSCI 511 – Term Project (Phase 1): Scoping a Dataset

## 1) Proposed Dataset of Interest

We want to construct a dataset that enables a thorough analysis of cardiovascular health and lifestyle scores across nations. We expect to aggregate information from many government and private healthcare datasets into a single dataset covering demographics (e.g., age, gender, air pollution exposure), health vitals (e.g., cholesterol, blood pressure, diabetes), and activities (e.g., smoking, diet, physical activity). The data will most likely need to be collected via several access methods (e.g., direct download, open and protected APIs) and in several formats (e.g., CSV, relational database, unstructured data). We plan to present the final dataset as a CSV to minimize ingestion friction and maximize simplicity and readability.

## 2) Team Background & Roles
Our team consists of four members with complementary strengths. Below are our self-identified skills and the skills each of us has targeted for growth:

**• Roy Phelps** rp994@drexel.edu  
- Current Skills: Python, Jupyter Notebooks, basic data cleaning, visualization  
- Targeted Growth Skills: Data cleaning with better Python code
- Contribution: Code development, documentation, Git/GitHub organization  

**• Shad Scarboro** srs359@drexel.edu  
- Current Skills: Python, software testing, data sourcing, research, presentation formatting  
- Targeted Growth Skills: Jupyter usage, data cleaning
- Contribution: Lead on data acquisition planning and writeup support

**• Leland Weeks** lhw22@drexel.edu  
- Current Skills: Product Management, Data Analytics, Business Intelligence  
- Targeted Growth Skills: Big Data, Data Mining, Predictive Analytics  
- Contribution: Lead on data selection and development

**• Evan Wessel** ew594@drexel.edu   
- Current Skills: Writing, data summaries, editing, research  
- Targeted Growth Skills: Python scripting, ETL, documentation  
- Contribution: Drafting sections of the proposal and summarizing findings

As a team, we will collaborate across tasks, with each member taking primary responsibility for the contribution areas listed above to ensure efficiency and clarity.


## 3) Potential Users and Applications
**Users:**

Studying cardiovascular health is critical for understanding disease risk and developing effective treatments, so the potential users of the dataset include hospitals, cardiology departments, public health analysts, professors, and students.

**Applications:**

Applications for this dataset include developing new predictive models for risk assessment and improving existing ones. The dataset can support early diagnosis and treatment methods. It can also help health care workers and patients understand risk factors and tailor recommendations for lifestyle changes.

## 4) Sample Data Sources

We plan to collect data from several sources, using several different methods. 

| Source | Short Description | Type | Detail | 
|---|---|---|---|
| Kaggle.com `heart_attack_china.csv` | Cardiovascular health | CSV download | Our main dataset, which we will expand and enrich with other sources |
| OpenAQ.org | Air quality data | API | Enrich our dataset with air quality data |
| WHO Global Health Observatory | Blood pressure | CSV via API | Enrich our dataset with blood pressure data |
| OpenStreetMap | GIS data | API | Geographic data for provinces, used to facilitate calls to other APIs |


### A) Data Source #1 - Cardiovascular Health

Below is a 10-row preview of simple structured health-related data and a reference to the data dictionary located at:

`../docs/data_dictionary.md`

In [7]:
# ─── One-time setup per session: set the working directory to the notebooks folder ───
# Replace the path below with YOUR local path to the repo's notebooks folder.
# Use forward slashes to avoid backslash-escape issues on Windows.

# %cd "C:/Users/yourname/.../DSCI-511-Project/notebooks"

# After running the %cd above once, all project-relative paths work like:
# pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)



In [9]:
# --- OPTIONAL Kaggle Fetch  ---------------------------
# Why this is commented:
# - Our default workflow loads the CSV from the repo at ../data/raw/heart_attack_china.csv
#   so the notebook runs offline with no extra installs or keys.
# - Use this cell ONLY if you do not have the raw CSV locally.
#
# Requirements if you decide to use it:
# - Internet access
# - Install kagglehub (uncomment the pip line below)
# - Some datasets may require a Kaggle account and acceptance of terms
#
# How to use:
# 1) Remove the leading "# " from each line in this cell (uncomment everything).
# 2) (If needed) install:  %pip install kagglehub -q
# 3) Run once to download and save the CSV to ../data/raw/heart_attack_china.csv
# 4) Then run the standard local-read cell (below) for offline use thereafter.
# -------------------------------------------------------------------------------

# %pip install kagglehub -q

# import kagglehub
# from kagglehub import KaggleDatasetAdapter
# from pathlib import Path
#
# # Set the path to the file you'd like to load (inside the Kaggle dataset)
# file_path = "heart_attack_china.csv"
#
# # Load the latest version from Kaggle into a pandas DataFrame
# df = kagglehub.dataset_load(
#   KaggleDatasetAdapter.PANDAS,
#   "ankushpanday2/heart-attack-risk-dataset-of-china",
#   file_path,
#   # Provide any additional arguments like 
#   # sql_query or pandas_kwargs. See the 
#   # documentation for more information:
#   # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
# )
#
# # Preview a small sample
# display(df.head(10))
#
# # Write the file to the local repo so future runs work offline
# output_path = Path("../data/raw/heart_attack_china.csv")
# output_path.parent.mkdir(parents=True, exist_ok=True)
# df.to_csv(output_path, index=False)
# print(f"Saved to {output_path.resolve()}")


In [11]:
import warnings
import pandas as pd

# Ignore warnings in case they pop-up
warnings.filterwarnings("ignore", category=FutureWarning)

# Import dataset from folder
df = pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)

# Set all columns
pd.set_option("display.max_columns", None)

# Display the first 10 (0-9)
df.head(10)



Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack
0,1,55,Male,Non-Smoker,No,No,Yes,Normal,High,High,Moderate,Low,Yes,No,Good,Rural,Eastern,Beijing,Low,Yes,Unemployed,Primary,Low,104,Yes,No,78,No
1,2,66,Female,Smoker,Yes,No,No,Low,Medium,High,Healthy,Medium,No,Yes,Poor,Urban,Eastern,Qinghai,High,No,Unemployed,Secondary,Middle,142,No,No,49,No
2,3,69,Female,Smoker,No,No,No,Low,Medium,High,Moderate,Low,No,No,Poor,Rural,Eastern,Henan,Low,No,Unemployed,Primary,High,176,No,No,31,No
3,4,45,Female,Smoker,No,Yes,No,Normal,Medium,Low,Healthy,Medium,Yes,No,Poor,Rural,Central,Qinghai,Medium,Yes,Employed,Primary,Low,178,No,Yes,23,No
4,5,39,Female,Smoker,No,No,No,Normal,Medium,Medium,Healthy,Low,No,No,Moderate,Urban,Western,Guangdong,Low,No,Retired,Higher,Middle,146,Yes,No,79,No
5,6,76,Male,Smoker,No,No,No,Low,Low,Low,Poor,Medium,No,Yes,Moderate,Urban,Eastern,Sichuan,Low,Yes,Employed,Higher,Middle,92,No,No,49,No
6,7,37,Male,Smoker,No,No,No,Normal,Medium,Low,Poor,Medium,No,Yes,Poor,Urban,Eastern,Shanghai,Low,Yes,Employed,Higher,High,144,Yes,No,81,No
7,8,88,Male,Non-Smoker,Yes,No,Yes,Low,High,Low,Moderate,High,No,No,Moderate,Rural,Western,Shandong,High,No,Retired,,Low,162,No,No,27,No
8,9,54,Female,Smoker,No,No,Yes,Normal,Medium,Medium,Poor,Medium,No,No,Poor,Urban,Northern,Gansu,Medium,Yes,Unemployed,Secondary,Low,93,No,No,62,No
9,10,47,Female,Smoker,No,No,Yes,Low,High,Low,Moderate,Medium,Yes,Yes,Moderate,Rural,Eastern,Beijing,Low,No,Employed,,High,125,No,No,67,Yes


### B) Data Source #2 - Air Quality

We can retrieve air quality data from OpenAQ by latitude, longitude, and radius. The example below queries monitoring locations near Beijing and is adapted from https://docs.openaq.org/examples/examples.

**Note about the API Key:** If needed, we can provide the *.env* file with the key.

In [18]:
import json
import requests
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
API_KEY = os.environ.get("API_KEY")

# Beijing coordinates
latitude = "39.9042"
longitude = "116.4074"
radius = "25000" # 25 km

max_records = "2"

url = (
    "https://api.openaq.org/v3/locations"
    f"?coordinates={latitude},{longitude}&radius={radius}&limit={max_records}"
)
headers = {"X-API-Key": API_KEY}

response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on a missing/invalid key or a bad request
data = response.json()

# Pretty-print the full JSON response
output = json.dumps(data, indent=4)
print(output)

{
    "meta": {
        "name": "openaq-api",
        "website": "/",
        "page": 1,
        "limit": 2,
        "found": ">2"
    },
    "results": [
        {
            "id": 21,
            "name": "Beijing US Embassy",
            "locality": "Beijing",
            "timezone": "Asia/Shanghai",
            "country": {
                "id": 10,
                "code": "CN",
                "name": "China"
            },
            "owner": {
                "id": 4,
                "name": "Unknown Governmental Organization"
            },
            "provider": {
                "id": 231,
                "name": "Beijing US Embassy"
            },
            "isMobile": false,
            "isMonitor": true,
            "instruments": [
                {
                    "id": 2,
                    "name": "Government Monitor"
                }
            ],
            "sensors": [
                {
                    "id": 40,
                    "name": "pm25 \u00

### C) Data Source #3 - WHO Blood Pressure Data

We plan to get additional health data from the World Health Organization (WHO) through its Global Health Observatory OData API, documented at https://www.who.int/data/gho/info/gho-odata-api. The example code below pulls the Mean Blood Pressure indicator (`BP_04`) for China and saves it as a CSV.


In [21]:
import requests
import pandas as pd
from pathlib import Path

# WHO Global Health Observatory API
# Indicator: Mean blood pressure (BP_04)
url = "https://ghoapi.azureedge.net/api/BP_04"
params = {"$filter": "SpatialDim eq 'CHN'"}

response = requests.get(url, params=params)
data = response.json()

# Extract records
records = []
for item in data.get('value', []):
    records.append({
        'indicator': 'Mean Blood Pressure',
        'year': item.get('TimeDim'),
        'value': item.get('NumericValue'),
        'sex': item.get('Dim1', 'Both'),
        'country': 'China'
    })

df = pd.DataFrame(records)

print(df.head(10))

# write the file to the local disk
output_path = Path("../data/raw/who_health_china.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

             indicator  year      value       sex country
0  Mean Blood Pressure  1992  20.596123  SEX_FMLE   China
1  Mean Blood Pressure  2013  24.963709  SEX_BTSX   China
2  Mean Blood Pressure  2003  27.878979   SEX_MLE   China
3  Mean Blood Pressure  2005  28.290728   SEX_MLE   China
4  Mean Blood Pressure  1993  21.049009  SEX_FMLE   China
5  Mean Blood Pressure  2016  21.016523  SEX_FMLE   China
6  Mean Blood Pressure  1997  22.289787  SEX_FMLE   China
7  Mean Blood Pressure  2000  23.030458  SEX_FMLE   China
8  Mean Blood Pressure  1999  24.729555  SEX_BTSX   China
9  Mean Blood Pressure  2013  22.352140  SEX_FMLE   China
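
The `sex` column in the preview above holds WHO dimension codes rather than readable labels. The following is a minimal sketch of how those codes could be decoded and the table reshaped to one row per year. The sample rows are copied from the preview; the helper name `decode_and_pivot` and the label strings are our own assumptions and are not part of the WHO API.

```python
import pandas as pd

# Codes observed in the preview above; the readable labels are our own wording.
SEX_LABELS = {"SEX_FMLE": "Female", "SEX_MLE": "Male", "SEX_BTSX": "Both"}

def decode_and_pivot(df: pd.DataFrame) -> pd.DataFrame:
    """Replace WHO sex codes with labels and pivot to one row per year."""
    out = df.copy()
    # Unknown codes are kept unchanged instead of becoming NaN
    out["sex"] = out["sex"].map(SEX_LABELS).fillna(out["sex"])
    return out.pivot_table(index="year", columns="sex", values="value", aggfunc="mean")

# Four rows copied from the preview output
sample = pd.DataFrame({
    "year": [1992, 2013, 2003, 2013],
    "value": [20.596123, 24.963709, 27.878979, 22.352140],
    "sex": ["SEX_FMLE", "SEX_BTSX", "SEX_MLE", "SEX_FMLE"],
})
wide = decode_and_pivot(sample)
print(wide)
```

A wide layout like this makes it easier to join on `year` later, at the cost of leaving NaN wherever a year has no value for a given sex.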


### D) Data Source #4 - GIS Data
GIS data will be used to facilitate calls to other APIs (such as air quality) that require latitude and longitude. The initial dataset records each person's province and whether they live in an urban or rural area. Rural individuals can be assigned to the geodesic center of their province, and urban individuals to the geodesic center of the largest city in that province. The largest cities will need to be obtained from another data source.
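
One possible way to reduce a set of boundary or sample points to a single center is to average them as 3D unit vectors on the sphere and convert the mean direction back to latitude and longitude. The sketch below assumes points are given as `(lat, lon)` pairs in degrees; the function name `spherical_centroid` and the sample box are hypothetical, and this is not necessarily the method we will settle on.

```python
import math

def spherical_centroid(points):
    """Return (lat, lon) in degrees of the mean direction of the given points."""
    if not points:
        raise ValueError("need at least one point")
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        # Convert each point to a unit vector and accumulate
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Convert the mean vector back to angles
    lon_c = math.atan2(y, x)
    lat_c = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat_c), math.degrees(lon_c)

# Made-up box roughly around the Beijing coordinates used in the air quality example
box = [(39.8, 116.3), (39.8, 116.5), (40.0, 116.3), (40.0, 116.5)]
print(spherical_centroid(box))
```

Averaging unit vectors avoids the distortion of averaging raw latitudes and longitudes, but the result is undefined when the points cancel out (for example, two antipodal points), so real inputs should be checked for that case.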



In [24]:
%pip install overpy 
import overpy

osm_api = overpy.Overpass()

# NOTE: this query matches every OSM node whose name tag is exactly "Beijing",
# anywhere in the world. It does not return the boundary of the municipality,
# so the results must be filtered (e.g., by place type or country) before use.
result = osm_api.query("node['name'='Beijing'];out;")

print("Latitude and longitude of each OSM node named 'Beijing' (worldwide).")
print("  These must be filtered before calculating a geodesic center.")
for node in result.nodes:
    print(f"{node.lat}, {node.lon}")

Collecting overpy
  Downloading overpy-0.7-py3-none-any.whl.metadata (3.5 kB)
Downloading overpy-0.7-py3-none-any.whl (14 kB)
Installing collected packages: overpy
Successfully installed overpy-0.7
Note: you may need to restart the kernel to use updated packages.
Latitude and longitude of each OSM node named 'Beijing' (worldwide).
  These must be filtered before calculating a geodesic center.
55.6770342, 12.5703318
35.8563904, 14.5592089
-29.7925353, 30.8330165
65.0134919, 25.4658178
46.1950796, 21.3096955
48.5827259, 7.7577184
26.2151029, 50.5919246
41.5524282, -8.4006084
32.6596604, -115.4520935
41.1768890, -73.2256891
-35.6178962, -61.3615011
51.3710441, -0.4920919
52.1242971, -0.4718589
39.9534937, -75.2615602
52.5390348, -2.8075648
40.7686619, -73.9886693
40.1796567, 44.5210421
42.4471768, -71.2263295
53.6977181, -2.6893596
-32.9387814, -60.6724758
53.0127873, 18.6113733
53.4300430, -2.8017304
-17.3644778, -66.1781471
-17.3642844, -66.1792336
37.3508844, -121.9999400
4.5994660, -74.0786228
40.7070

## 5) Normalized Clinical & Lifestyle Flags (Phase 1)

**Goal:**  
We will convert existing text-based columns into consistent boolean or ordinal fields. We are *not* creating new medical facts; we are only normalizing what is already present in the raw data so that our EDA and later modeling steps are cleaner and reproducible.

**Output (after normalization):**  
`../data/processed/heart_attack_china_enriched.csv`

**Documentation:**  
`../docs/normalized_flags.md`

---

#### Flags From Data Source #1

These come from columns that already exist in the dataset:

- `has_hypertension` → `Hypertension == "Yes"`
- `has_diabetes` → `Diabetes == "Yes"`
- `has_dyslipidemia` → `Cholesterol_Level == "High"` 
- `is_smoker` → `Smoking_Status == "Smoker"`
- `is_obese` → `Obesity == "Yes"`
- `tcm_use` → `TCM_Use == "Yes"`
- `is_rural` → `Rural_or_Urban == "Rural"`
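
The rules in this list can be sketched as a single lookup table that is applied column by column. This is a minimal illustration, not the final implementation: the names `FLAG_RULES` and `add_flags` are hypothetical, and the two sample rows are taken from the 10-row preview of Data Source #1.

```python
import pandas as pd

# Flag name -> (raw column, raw value that means True)
FLAG_RULES = {
    "has_hypertension": ("Hypertension", "Yes"),
    "has_diabetes": ("Diabetes", "Yes"),
    "has_dyslipidemia": ("Cholesterol_Level", "High"),
    "is_smoker": ("Smoking_Status", "Smoker"),
    "is_obese": ("Obesity", "Yes"),
    "tcm_use": ("TCM_Use", "Yes"),
    "is_rural": ("Rural_or_Urban", "Rural"),
}

def add_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one boolean column per rule.

    Missing raw values compare unequal and therefore become False.
    """
    out = df.copy()
    for flag, (col, true_value) in FLAG_RULES.items():
        if col in out.columns:
            out[flag] = out[col].eq(true_value)
    return out

# First two rows of the preview (only the columns the rules need)
sample = pd.DataFrame({
    "Smoking_Status": ["Non-Smoker", "Smoker"],
    "Hypertension": ["No", "Yes"],
    "Diabetes": ["No", "No"],
    "Obesity": ["Yes", "No"],
    "Cholesterol_Level": ["Normal", "Low"],
    "TCM_Use": ["Yes", "No"],
    "Rural_or_Urban": ["Rural", "Urban"],
})
flagged = add_flags(sample)
print(flagged[list(FLAG_RULES)])
```

Note that the cell in the Reproducibility section additionally sets `has_hypertension` when `SBP >= 130`; this sketch follows the simpler rule listed here.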

---

#### Ordinal Normalizations From Data Source #1

We will convert the following text fields into ordered categories. The numeric scales will be documented in `normalized_flags.md`:

- `Air_Pollution_Exposure`: Low(0) < Medium(1) < High(2)  
- `Physical_Activity`: Low(0) < Medium(1) < High(2)  
- `Diet_Score`: Poor(0) < Moderate(1) < Healthy(2)  
- `Healthcare_Access`: Poor(0) < Moderate(1) < Good(2)  
- `Hospital_Availability`: Low(0) < Medium(1) < High(2)  
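
As a sketch of how these ordinal encodings might be produced, the example below uses ordered pandas categoricals. The level names are taken from the raw values visible in the 10-row preview; the `_ord` suffix, the `ORDINAL_LEVELS` table, and the function name `encode_ordinals` are our own assumptions for illustration.

```python
import pandas as pd

# Ordered levels per column, lowest first
ORDINAL_LEVELS = {
    "Air_Pollution_Exposure": ["Low", "Medium", "High"],
    "Physical_Activity": ["Low", "Medium", "High"],
    "Diet_Score": ["Poor", "Moderate", "Healthy"],
    "Healthcare_Access": ["Poor", "Moderate", "Good"],
    "Hospital_Availability": ["Low", "Medium", "High"],
}

def encode_ordinals(df: pd.DataFrame) -> pd.DataFrame:
    """Add a nullable-integer <column>_ord code for each known ordinal column."""
    out = df.copy()
    for col, levels in ORDINAL_LEVELS.items():
        if col in out.columns:
            cat = pd.Categorical(out[col], categories=levels, ordered=True)
            # Categorical codes are -1 for missing or unexpected values
            codes = pd.Series(cat.codes, index=out.index, dtype="float")
            out[col + "_ord"] = codes.where(codes >= 0).astype("Int64")
    return out

sample = pd.DataFrame({
    "Air_Pollution_Exposure": ["High", "Medium", None],
    "Diet_Score": ["Moderate", "Healthy", "Poor"],
})
encoded = encode_ordinals(sample)
print(encoded)
```

Writing the codes to new `_ord` columns keeps the original text available for inspection, and any unexpected category surfaces as a missing code instead of being silently mapped.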

---

#### Quality Checks From Data Source #1

Before writing to the processed file, we will: 
  
- Standardize textual categories 
  
- Handle missing values:  
  - **Categorical:** fill with `"Unknown"` or mode (we will note which approach we use)  
  - **Numeric:** fill with median and document the distribution

- Run simple checks:  
  - Class balance for `Heart_Attack`  
  - Value ranges (e.g., `Blood_Pressure`)  
  - Category counts
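
The checks above could look roughly like the following. This is a sketch under stated assumptions: it uses the `"Unknown"` option for categorical columns and the median for numeric ones, the helper names `fill_missing` and `quality_report` are hypothetical, and the sample values are made up for illustration.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, categorical_fill: str = "Unknown") -> pd.DataFrame:
    """Fill numeric NaNs with the column median and other NaNs with a label."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(categorical_fill)
    return out

def quality_report(df: pd.DataFrame, target: str = "Heart_Attack",
                   range_col: str = "Blood_Pressure") -> dict:
    """Collect class balance, a value range, and per-column NA counts."""
    return {
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
        "range": (df[range_col].min(), df[range_col].max()),
        "na_counts": df.isna().sum().to_dict(),
    }

# Illustrative values only
sample = pd.DataFrame({
    "Education_Level": ["Primary", None, "Secondary", None],
    "Blood_Pressure": [104, 142, None, 162],
    "Heart_Attack": ["No", "No", "No", "Yes"],
})
filled = fill_missing(sample)
report = quality_report(filled)
print(report)
```

Running the report both before and after filling shows how much of each column was imputed, which is worth recording alongside the chosen strategy.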

---

#### Note on `CVD_Risk_Score`

We will keep this column in the dataset for EDA. However, if it was generated using any outcome-related information, we will exclude it from baseline modeling in Phase 2 to avoid data leakage.
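
A simple way to keep this decision explicit in code is to separate the suspected column from the baseline feature set and to take a first look at how closely it tracks the outcome. The sketch below is illustrative only: `split_features` and `score_by_outcome` are hypothetical helper names, and the three sample rows are taken from the 10-row preview.

```python
import pandas as pd

def split_features(df: pd.DataFrame, target: str = "Heart_Attack",
                   exclude: tuple = ("CVD_Risk_Score",)):
    """Return (X, y), dropping the target and any suspected leaky columns from X."""
    drop_cols = [c for c in (target, *exclude) if c in df.columns]
    return df.drop(columns=drop_cols), df[target]

def score_by_outcome(df: pd.DataFrame, score: str = "CVD_Risk_Score",
                     target: str = "Heart_Attack") -> pd.Series:
    """Mean score per outcome class, as a quick first look at the relationship."""
    return df.groupby(target)[score].mean()

# Rows 0, 1, and 9 of the preview (Age, CVD_Risk_Score, Heart_Attack only)
sample = pd.DataFrame({
    "Age": [55, 66, 47],
    "CVD_Risk_Score": [78, 49, 67],
    "Heart_Attack": ["No", "No", "Yes"],
})
X, y = split_features(sample)
means = score_by_outcome(sample)
print(list(X.columns))
print(means)
```

A gap between the class means does not by itself prove leakage; it only indicates that the score's origin should be checked before the column is used as a feature.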



In [23]:
# Placeholder: derived risk flags will be implemented after initial Phase 1 write-up.
# Logic is documented in ../docs/normalized_flags.md

# Example structure:
# import pandas as pd
# df = pd.read_csv("../data/raw/heart_attack_china.csv", low_memory=False)
# ... flag creation here ...
# df.to_csv("../data/processed/heart_attack_china_enriched.csv", index=False)


## 6) Provenance & Access

The dataset `heart_attack_china.csv` is stored locally in:

`data/raw/heart_attack_china.csv`

For this phase, we are treating it as a publicly usable dataset for educational purposes.

**Source & Licensing**  
- The dataset originally comes from **Kaggle**  
  Dataset title: *Heart Attack Risk Dataset of China*  
  License: **MIT License**  
- For Phase 1, we are using the **locally downloaded CSV version** rather than loading it dynamically via an API.
- We included a **commented-out code cell** earlier to show how the dataset could be loaded with `kagglehub` if API access is ever needed in future phases.

**Future Access (GitHub)**  
When we publish the repository to GitHub, others will be able to access the dataset here:  
`data/raw/heart_attack_china.csv`

If we add additional sources in Phase 2 (e.g., WHO, NHANES, air quality, socioeconomic indicators), we will document their origins, access methods, and licenses in this section as well.


## 7) Limitations & Improvements

At this stage, we recognize the following limitations based on the current dataset and Phase 1 scope:

- **No time dimension**: The dataset does not include dates for patient records or events.
- **Geographic standardization**: The region/province fields exist but may require normalization if we plan external joins or enrichment.
- **Missingness or imbalance**: Some attributes may contain null values or skewed distributions (to be profiled further in Phase 2).
- **No official data dictionary from the source**: We created our own in `../docs/data_dictionary.md` based on column names and inspection.

### Planned Improvements (Phase 2 or later in Phase 1):

- Add a `record_date` field (either synthetic or real if found in updated versions).
- Use the upcoming `normalized_flags.md` file to capture features such as hypertension, diabetes, lifestyle risk, etc.
- Validate and standardize geographic categories for potential joins with external datasets.
- Explore optional enrichment from public health and demographic sources (e.g., WHO, NHANES, socioeconomic indicators).



## 8) Enrichment Plan (Optional / Planned)

Because the dataset includes geographic fields (e.g., province/region), we may incorporate external data
in a later phase to deepen the analysis. Possible enrichment sources include:

- **Regional health indicators**  
  (e.g., WHO or national public health databases)
- **Environmental factors**  
  (e.g., air quality, PM2.5 levels, population density)
- **Socioeconomic data**  
  (e.g., GDP per capita, insurance coverage, income levels)
- **Clinical benchmarks**  
  (e.g., prevalence of cholesterol issues, hypertension, diabetes)

If external data is added, we will ensure that:

1. All sources are cited in the Phase 2 write-up (or later Phase 1 updates).  
2. Join keys (such as standardized region/province names) are consistent.  
3. Raw enrichment files are stored in `data/external/` and any merged/cleaned versions go to `data/processed/`.
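
To illustrate point 2, the sketch below normalizes province names on both sides before joining and reports any provinces that found no match. Everything here is an assumption for illustration: the suffix list, the helper names `normalize_province` and `join_with_report`, the external column names, and the placeholder values are not drawn from any real source.

```python
import pandas as pd

# Hypothetical suffixes that an external source might append to province names
SUFFIXES = (" province", " municipality", " autonomous region")

def normalize_province(name) -> str:
    """Lower-case, collapse whitespace, and drop one known suffix."""
    key = " ".join(str(name).split()).lower()
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return key

def join_with_report(patients: pd.DataFrame, external: pd.DataFrame,
                     left_col: str = "Province", right_col: str = "province"):
    """Left-join on the normalized key and list provinces with no match."""
    left = patients.assign(_key=patients[left_col].map(normalize_province))
    right = external.assign(_key=external[right_col].map(normalize_province))
    right = right.drop(columns=[right_col])
    merged = left.merge(right, on="_key", how="left", indicator=True)
    unmatched = sorted(merged.loc[merged["_merge"] == "left_only", left_col].unique())
    return merged.drop(columns=["_key", "_merge"]), unmatched

patients = pd.DataFrame({"Province": ["Beijing", "Qinghai", "Henan"]})
external = pd.DataFrame({
    "province": ["Beijing Municipality", " qinghai  province "],
    "placeholder_value": [1.0, 2.0],  # made-up numbers, not real measurements
})
merged, unmatched = join_with_report(patients, external)
print(merged)
print("Unmatched provinces:", unmatched)
```

Reporting unmatched keys after every join makes naming mismatches visible early instead of letting them appear later as unexplained missing values.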



## 9) Reproducibility

This section explains how someone can re-create our processed dataset from the raw file and re-run the notebook.

**Data Locations**
- **Raw data:** `../data/raw/heart_attack_china.csv`
- **Processed output (planned):** `../data/processed/heart_attack_china_enriched.csv`
- **Documentation:**  
  - `../docs/data_dictionary.md`  
  - `../docs/normalized_flags.md` 

**Steps to Reproduce**
1. **Clone** the repository.
2. Open the notebook: `notebooks/Phase1_Report.ipynb`.
3. Run the section that creates the risk/health flags (based on `normalized_flags.md`).
4. Export or save the resulting DataFrame to  
   `../data/processed/heart_attack_china_enriched.csv` (or the updated naming we finalize during Phase 1).
5. Re-run any remaining cells as needed for analysis or validation.


In [29]:
# Imports
import pandas as pd
from pathlib import Path

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 0)  # optional: prevents line wrapping

# Paths (adjust if needed)
RAW = Path("../data/raw/heart_attack_china.csv")
OUT = Path("../data/processed/heart_attack_china_enriched.csv")
OUT.parent.mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(RAW, low_memory=False)

# Tidy column names and cell values (strip whitespace), in keeping with tidy data
df.columns = [c.strip() for c in df.columns]
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Map simple yes/no columns to 1/0
yes_no_cols = [
    "Hypertension","Diabetes","Obesity","Alcohol_Consumption",
    "Family_History_CVD","Chronic_Kidney_Disease","Previous_Heart_Attack",
    "TCM_Use","Heart_Attack"
]
for col in yes_no_cols:
    if col in df.columns:
        df[col] = df[col].map({"Yes": 1, "No": 0})

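# Make important numeric columns numeric to be sure.
# Blood_Pressure is left as-is here: coercing it first would turn any
# "systolic/diastolic" strings into NaN before SBP is extracted below.
numeric_cols = ["Patient_ID", "Age", "CVD_Risk_Score"]
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")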

# SBP setup
bp_col = "Blood_Pressure" 

# Normalize and take the first number as SBP
bp_text = (
    df[bp_col]
    .astype(str)
    .str.strip()
    .str.replace("-", "/", regex=False)
)
sbp_only = bp_text.str.split("/", n=1, expand=True)[0]
df["SBP"] = pd.to_numeric(sbp_only, errors="coerce")


# SBP category, approx
# Reading buckets: <120 Normal, 120–129 Elevated, 130–139 Stage 1, 140–179 Stage 2, >=180 Crisis
# right=False makes each bin include its lower edge, so 120 falls in "Elevated", not "Normal"
if "SBP" in df.columns:
    df["SBP_Category"] = pd.cut(
        df["SBP"],
        bins=[-float("inf"), 120, 130, 140, 180, float("inf")],
        labels=["Normal", "Elevated", "Stage 1", "Stage 2", "Crisis"],
        right=False,
    )

# Health flags that match columns
# Hypertension from yes/no 
hypert_from_yesno = (df["Hypertension"] == 1) if "Hypertension" in df.columns else pd.Series(False, index=df.index)
sbp_high = (df["SBP"] >= 130) if "SBP" in df.columns else pd.Series(False, index=df.index)
df["has_hypertension"] = hypert_from_yesno.fillna(False) | sbp_high.fillna(False)

# Diabetes flag: True where the 1/0-encoded Diabetes column equals 1
df["has_diabetes"] = ((df["Diabetes"] == 1).fillna(False)) if "Diabetes" in df.columns else pd.Series(False, index=df.index)

# Dyslipidemia when cholesterol level is high
if "Cholesterol_Level" in df.columns:
    df["has_dyslipidemia"] = (df["Cholesterol_Level"] == "High").fillna(False)
else:
    df["has_dyslipidemia"] = False

# Ordinal code map
l_m_h_map = {"Low": 0, "Medium": 1, "High": 2}
for col in ["Air_Pollution_Exposure", "Physical_Activity", "Stress_Level"]:
    if col in df.columns:
        df[col] = df[col].map(l_m_h_map)

chol_map = {"Low": 0, "Normal": 1, "High": 2}
if "Cholesterol_Level" in df.columns:
    df["Cholesterol_Level"] = df["Cholesterol_Level"].map(chol_map)

diet_map = {"Poor": 0, "Moderate": 1, "Healthy": 2}
if "Diet_Score" in df.columns:
    df["Diet_Score"] = df["Diet_Score"].map(diet_map)

ru_map = {"Rural": 0, "Urban": 1}
if "Rural_or_Urban" in df.columns:
    df["Rural_or_Urban"] = df["Rural_or_Urban"].map(ru_map)

# Missing value handling 
if "Education_Level" in df.columns:
    df["Education_Level"] = df["Education_Level"].fillna("Unknown")

# Check
print(df.head(10))
print("\nDTypes:\n", df.dtypes)
print("\nNA counts (top 10):\n", df.isna().sum().sort_values(ascending=False).head(10))

# Save processed file
df.to_csv(OUT, index=False)
print(f"\nWrote processed file to: {OUT}")


   

   Patient_ID  Age  Gender Smoking_Status  Hypertension  Diabetes  Obesity  \
0           1   55    Male     Non-Smoker             0         0        1   
1           2   66  Female         Smoker             1         0        0   
2           3   69  Female         Smoker             0         0        0   
3           4   45  Female         Smoker             0         1        0   
4           5   39  Female         Smoker             0         0        0   
5           6   76    Male         Smoker             0         0        0   
6           7   37    Male         Smoker             0         0        0   
7           8   88    Male     Non-Smoker             1         0        1   
8           9   54  Female         Smoker             0         0        1   
9          10   47  Female         Smoker             0         0        1   

   Cholesterol_Level  Air_Pollution_Exposure  Physical_Activity  Diet_Score  \
0                  1                       2                  

In [33]:
# Verify the enriched data is saved 
import pandas as pd

# Read in the enriched data from output directory
df_enriched = pd.read_csv("../data/processed/heart_attack_china_enriched.csv")

# View the first 5 rows
df_enriched.head()


Unnamed: 0,Patient_ID,Age,Gender,Smoking_Status,Hypertension,Diabetes,Obesity,Cholesterol_Level,Air_Pollution_Exposure,Physical_Activity,Diet_Score,Stress_Level,Alcohol_Consumption,Family_History_CVD,Healthcare_Access,Rural_or_Urban,Region,Province,Hospital_Availability,TCM_Use,Employment_Status,Education_Level,Income_Level,Blood_Pressure,Chronic_Kidney_Disease,Previous_Heart_Attack,CVD_Risk_Score,Heart_Attack,SBP,SBP_Category,has_hypertension,has_diabetes,has_dyslipidemia
0,1,55,Male,Non-Smoker,0,0,1,1,2,2,1,0,1,0,Good,0,Eastern,Beijing,Low,1,Unemployed,Primary,Low,104,1,0,78,0,104,Normal,False,False,False
1,2,66,Female,Smoker,1,0,0,0,1,2,2,1,0,1,Poor,1,Eastern,Qinghai,High,0,Unemployed,Secondary,Middle,142,0,0,49,0,142,Stage 2,True,False,False
2,3,69,Female,Smoker,0,0,0,0,1,2,1,0,0,0,Poor,0,Eastern,Henan,Low,0,Unemployed,Primary,High,176,0,0,31,0,176,Stage 2,True,False,False
3,4,45,Female,Smoker,0,1,0,1,1,0,2,1,1,0,Poor,0,Central,Qinghai,Medium,1,Employed,Primary,Low,178,0,1,23,0,178,Stage 2,True,True,False
4,5,39,Female,Smoker,0,0,0,1,1,1,2,0,0,0,Moderate,1,Western,Guangdong,Low,0,Retired,Higher,Middle,146,1,0,79,0,146,Stage 2,True,False,False
