# **About the Data**

**Dataset Source & Link:**  Kaggle, Indicators of Heart Diseases  
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data

**Time Coverage:** 2022 CDC Annual Health Survey

**Shape (rows x columns):** 40882 x 40

**Units:**
- BMI (kg/m^2)
- HeightInMeters (m)
- WeightInKilograms (kg)
- SleepHours (Hours)
- PhysicalHealthDays, MentalHealthDays (number of days, 0-30)
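
Because BMI is weight in kilograms divided by height in metres squared, the three columns can be cross-checked against each other. The sketch below uses the first three rows of the preview output in this notebook as sample values. The 0.5 tolerance is an assumption, chosen to allow for rounding in the stored height and weight columns.

```python
import pandas as pd

# Sample values copied from the first three preview rows of this notebook.
sample = pd.DataFrame({
    "HeightInMeters": [1.60, 1.78, 1.85],
    "WeightInKilograms": [71.67, 95.25, 108.86],
    "BMI": [27.99, 30.13, 31.66],
})

# BMI = weight (kg) / height (m) squared
sample["bmi_recomputed"] = sample["WeightInKilograms"] / sample["HeightInMeters"] ** 2

# Tolerance is an assumption: heights are stored rounded to two decimals.
sample["bmi_consistent"] = (sample["bmi_recomputed"] - sample["BMI"]).abs() < 0.5
print(sample.round(2))
```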

**Column Directory:**
- Demographics ('State', 'Sex', etc.)
- General Health ('PhysicalHealthDays', 'MentalHealthDays', etc.)
- Medical History ('HadHeartAttack', 'HadDiabetes', etc.)
- Health Behaviour ('SmokerStatus', 'AlcoholDrinkers', etc.)
- Vaccination ('FluVaxLast12', 'PneumoVaxEver', etc.)
- Dental Records ('RemovedTeeth')

**Missingness Snapshot:**
- Most categorical columns (Demographics, Yes/No Medical History, Behaviours) ≈ 0% missing
- A few numerical columns have some missing data:
    * 'WeightInKilograms' ≈ 4% missing
    * 'HeightInMeters' ≈ 4% missing
    * 'SleepHours' ≈ 2-3% missing
    * 'BMI' < 1% missing

Overall, the dataset is mostly complete, with only minor gaps in the numeric health measures.
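
A missingness snapshot like the one above can be computed per column with `isna().mean()`. The sketch below runs on a small made-up frame (the `toy` values are illustrative, not taken from the dataset). On the real data the same expression would be applied to the loaded frame.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only; not rows from the dataset.
toy = pd.DataFrame({
    "SleepHours": [7.0, np.nan, 6.0, 8.0],
    "BMI": [27.99, 30.13, np.nan, np.nan],
    "Sex": ["Female", "Male", "Male", "Female"],
})

# Fraction of missing cells per column, expressed as a percentage.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```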

**Known Quirks:**
- 'AgeCategory' is grouped into categories instead of raw ages
- Self-reported data (height, weight, etc.) may be inaccurate
- A small percentage of data is missing


In [None]:
# PROFESSOR'S SOLUTION
# !curl -L -o heart_disease.zip \
#   https://www.kaggle.com/api/v1/datasets/download/kamilpytlak/personal-key-indicators-of-heart-disease


# !unzip -o heart_disease.zip -d ./heart_disease_data

# #!ls heart_disease_data/2022/heart_2022_no_nans

import pandas as pd
hp = pd.read_csv("heart_2022_no_nans.csv")


hp.info()
hp.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246022 non-null  object 
 1   Sex                        246022 non-null  object 
 2   GeneralHealth              246022 non-null  object 
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object 
 6   PhysicalActivities         246022 non-null  object 
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object 
 9   HadHeartAttack             246022 non-null  object 
 10  HadAngina                  246022 non-null  object 
 11  HadStroke                  246022 non-null  object 
 12  HadAsthma                  246022 non-null  object 
 13  HadSkinCancer              24

Unnamed: 0,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
0,Alabama,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,No,...,1.6,71.67,27.99,No,No,Yes,Yes,"Yes, received Tdap",No,No
1,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,6.0,None of them,No,...,1.78,95.25,30.13,No,No,Yes,Yes,"Yes, received tetanus shot but not sure what type",No,No
2,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,"6 or more, but not all",No,...,1.85,108.86,31.66,Yes,No,No,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
3,Alabama,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,None of them,No,...,1.7,90.72,31.32,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,Yes
4,Alabama,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,Yes,5.0,1 to 5,No,...,1.55,79.38,33.07,No,No,Yes,Yes,"No, did not receive any tetanus shot in the pa...",No,No


In [None]:
hp.columns

Index(['State', 'Sex', 'GeneralHealth', 'PhysicalHealthDays',
       'MentalHealthDays', 'LastCheckupTime', 'PhysicalActivities',
       'SleepHours', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina',
       'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD',
       'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
       'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty',
       'DifficultyConcentrating', 'DifficultyWalking',
       'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus',
       'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory',
       'HeightInMeters', 'WeightInKilograms', 'BMI', 'AlcoholDrinkers',
       'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap',
       'HighRiskLastYear', 'CovidPos'],
      dtype='object')

In [None]:
hp.columns = [c.strip().lower().replace(" ", "_") for c in hp.columns]

# normalize Yes/No columns to 1/0
yes_no_map = {"Yes": 1, "No": 0}
for col in hp.select_dtypes(include="object").columns:
    if set(hp[col].unique()) <= set(yes_no_map.keys()):
        hp[col] = hp[col].map(yes_no_map)

# check balance of target column (assumed to be "hadheartattack" after renaming)
target_col = "hadheartattack"
if target_col in hp.columns:
    print("Target distribution:")
    print(hp[target_col].value_counts())



hp.describe()



Target distribution:
hadheartattack
No     38522
Yes     2359
Name: count, dtype: int64


Unnamed: 0,physicalhealthdays,mentalhealthdays,sleephours,heightinmeters,weightinkilograms,bmi
count,40881.0,40881.0,40881.0,40881.0,40881.0,40881.0
mean,4.27832,4.302365,7.047112,1.705395,82.573911,28.308831
std,8.54168,8.231234,1.459023,0.10695,20.825863,6.369795
min,0.0,0.0,1.0,0.91,29.48,12.48
25%,0.0,0.0,6.0,1.63,68.04,24.0
50%,0.0,0.0,7.0,1.7,79.83,27.32
75%,3.0,5.0,8.0,1.78,92.99,31.38
max,30.0,30.0,24.0,2.36,292.57,97.65
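
The Yes/No normalization only converts columns whose values are all "Yes" or "No". A minimal sketch on a made-up two-column frame shows the effect: a strict Yes/No column becomes 0/1, while a column with longer answers (the labels here are copied from the preview output) is left unchanged. The sketch loops over all columns rather than using `select_dtypes`, which is a simplification for the toy case.

```python
import pandas as pd

# Toy frame: one strict Yes/No column, one multi-level column.
toy = pd.DataFrame({
    "hadstroke": ["Yes", "No", "No"],
    "tetanuslast10tdap": [
        "Yes, received Tdap",
        "Yes, received tetanus shot but not sure what type",
        "Yes, received Tdap",
    ],
})

yes_no_map = {"Yes": 1, "No": 0}
for col in toy.columns:
    # Convert only when every distinct value is a key of the map.
    if set(toy[col].unique()) <= set(yes_no_map):
        toy[col] = toy[col].map(yes_no_map)

print(toy)
```

A column containing NaN alongside Yes/No would fail the subset test and stay unconverted, so this check is best run after missing values have been handled.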


In [None]:
#B.1 Vectorized boolean mask
if "physicalhealthdays" in hp.columns:
  hp["poor_physical_health"] = (hp["physicalhealthdays"] >= 15)
if "mentalhealthdays" in hp.columns:
  hp["poor_mental_health"] = (hp["mentalhealthdays"] >= 15)
if "bmi" in hp.columns:
  hp["obese_bmi"] = (hp["bmi"] >= 30)
if "sleephours" in hp.columns:
  hp["poor_sleep"] = (hp["sleephours"] <= 6)

In [None]:
#B.2 Single column transformation using map or Series.apply
if "smokerstatus" in hp.columns and hp["smokerstatus"].dtype == "object":
  # collapse the two "current smoker" levels into one label
  smap = {
      "Current smoker - now smokes every day" : "current smoker",
      "Current smoker - now smokes some days" : "current smoker",
      "Former smoker" : "former",
      "Never smoked" : "never"
  }
  # unmapped values fall back to the original label
  hp["smoker_simple"] = hp["smokerstatus"].map(smap).fillna(hp["smokerstatus"])
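
The `map(...).fillna(...)` pattern keeps any label that is missing from the dictionary instead of turning it into NaN. A small sketch on made-up values illustrates this. The "Unknown" entry is hypothetical and is included only to show the fallback.

```python
import pandas as pd

# Subset of the mapping from the cell above.
smap = {
    "Current smoker - now smokes some days": "current smoker",
    "Former smoker": "former",
    "Never smoked": "never",
}

# "Unknown" is a hypothetical label that is not in the mapping.
status = pd.Series([
    "Former smoker",
    "Never smoked",
    "Current smoker - now smokes some days",
    "Unknown",
])

# map() yields NaN for unmapped labels; fillna() restores the original text.
simple = status.map(smap).fillna(status)
print(simple.tolist())
```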


In [None]:
#B.3 multi column logic
def composite_risk(row):
  score = 0
  for col in ["obese_bmi", "poor_physical_health", "poor_mental_health", "poor_sleep"]:
    if col in row and bool(row[col]):
      score += 1
  for col in ['hadheartattack', 'hadangina', 'hadstroke']:
    if col in row and row[col] in (1,1.0, True):
      score += 1
  return score

hp['composite_risk'] = hp.apply(composite_risk, axis=1)
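
Row-wise `apply` calls a Python function once per row, which can be slow on a few hundred thousand rows. A possible vectorized alternative is sketched below on a made-up frame. It assumes the medical-history columns have already been converted to 0/1, since string "Yes"/"No" values would make `astype(int)` fail.

```python
import pandas as pd

# Toy frame with illustrative values; history columns are assumed to be 0/1.
toy = pd.DataFrame({
    "obese_bmi": [False, True, True],
    "poor_physical_health": [False, False, True],
    "poor_mental_health": [False, False, False],
    "poor_sleep": [False, True, True],
    "hadheartattack": [0, 0, 1],
    "hadangina": [0, 0, 0],
    "hadstroke": [0, 1, 0],
})

flag_cols = ["obese_bmi", "poor_physical_health", "poor_mental_health",
             "poor_sleep", "hadheartattack", "hadangina", "hadstroke"]

# Keep only the flags that exist, then sum them across columns in one step.
present = [c for c in flag_cols if c in toy.columns]
toy["composite_risk"] = toy[present].astype(int).sum(axis=1)
print(toy["composite_risk"].tolist())
```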

In [None]:
#B.4 categorical bucketing
def sleep_bucket(x):
  if pd.isna(x): return "missing"
  if x < 6: return "short"
  if x <= 8: return "normal"
  return "long"

if "sleephours" in hp.columns:
  hp["sleep_bucket"] = hp["sleephours"].map(sleep_bucket)

def bmi_bucket(x):
  if pd.isna(x): return "missing"
  if x < 18.5: return "underweight"
  if x < 25: return "normal"
  if x < 30: return "overweight"
  return "obese"

if "bmi" in hp.columns:
  hp["bmi_bucket"] = hp["bmi"].map(bmi_bucket)
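
The BMI thresholds can also be expressed with `pd.cut`, which returns an ordered categorical. The sketch below is one way to reproduce `bmi_bucket` on made-up values. `right=False` makes each bin include its lower edge, matching the `<` comparisons in the function above.

```python
import numpy as np
import pandas as pd

# Illustrative BMI values chosen to sit on and around the thresholds.
bmi = pd.Series([17.0, 18.5, 24.99, 25.0, 29.9, 30.0, np.nan])

labels = ["underweight", "normal", "overweight", "obese"]
# Bins are [-inf, 18.5), [18.5, 25), [25, 30), [30, inf).
buckets = pd.cut(bmi, bins=[-np.inf, 18.5, 25, 30, np.inf],
                 labels=labels, right=False)

# NaN inputs stay NaN after cut; add an explicit "missing" category for them.
buckets = buckets.cat.add_categories("missing").fillna("missing")
print(buckets.tolist())
```

The sleep buckets use a mix of `<` and `<=`, so they do not translate to a single `pd.cut` call with the same edges as directly.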

In [None]:
#B.5 Missing data handling
hp_imputed = hp.copy()
for c in hp_imputed.columns:
  if pd.api.types.is_numeric_dtype(hp_imputed[c]):
    hp_imputed[c] = hp_imputed[c].fillna(hp_imputed[c].median())
  elif hp_imputed[c].dtype == "object":
    hp_imputed[c] = hp_imputed[c].fillna("unknown")

hp_imputed.head()

Unnamed: 0,state,sex,generalhealth,physicalhealthdays,mentalhealthdays,lastcheckuptime,physicalactivities,sleephours,removedteeth,hadheartattack,...,highrisklastyear,covidpos,poor_physical_health,poor_mental_health,obese_bmi,poor_sleep,smoker_simple,composite_risk,sleep_bucket,bmi_bucket
0,Alabama,Female,Very good,4.0,0.0,Within past year (anytime less than 12 months ...,1,9.0,None of them,0,...,0,No,False,False,False,False,former,0,long,overweight
1,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,1,6.0,None of them,0,...,0,No,False,False,True,True,former,2,normal,obese
2,Alabama,Male,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0,8.0,"6 or more, but not all",0,...,0,Yes,False,False,True,False,former,1,normal,obese
3,Alabama,Female,Fair,5.0,0.0,Within past year (anytime less than 12 months ...,1,9.0,None of them,0,...,0,Yes,False,False,True,False,never,1,long,obese
4,Alabama,Female,Good,3.0,15.0,Within past year (anytime less than 12 months ...,1,5.0,1 to 5,0,...,0,No,False,True,True,True,never,3,short,obese
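
Imputation silently changes values, so it can help to record how many cells were filled. The sketch below repeats the median and "unknown" fill on a made-up frame and counts the gaps beforehand. All values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "sleephours": [6.0, np.nan, 8.0, 7.0],
    "smokerstatus": ["Former smoker", None, "Never smoked", "Never smoked"],
})

# Count missing cells per column before filling anything.
n_missing_before = toy.isna().sum()

imputed = toy.copy()
for c in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[c]):
        # Numeric columns: fill with the column median (7.0 here).
        imputed[c] = imputed[c].fillna(imputed[c].median())
    else:
        # Non-numeric columns: fill with a placeholder label.
        imputed[c] = imputed[c].fillna("unknown")

print(n_missing_before)
print(imputed)
```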


In [None]:
#C.1 value_counts() with interpretation
y = "cvd_any" if "cvd_any" in hp_imputed.columns else target_col
if y and y in hp_imputed.columns:
  vc = hp_imputed[y].value_counts(dropna=False)
  print(f"{y} counts: \n{vc}")
  print(f"\nPrevalence: {(hp_imputed[y].mean()*100):.2f}%")

hadheartattack counts: 
hadheartattack
0    232587
1     13435
Name: count, dtype: int64

Prevalence: 5.46%
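
For a 0/1 column the mean is the share of ones, so the prevalence can be reproduced directly from the counts printed above:

```python
# Counts copied from the value_counts() output above.
n_no, n_yes = 232587, 13435

# Prevalence = positives / total, which equals the mean of a 0/1 column.
prevalence_pct = n_yes / (n_no + n_yes) * 100
print(f"Prevalence: {prevalence_pct:.2f}%")
```

With so few positive cases, accuracy alone would be a weak metric for any classifier trained on this target.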


In [None]:
#C.2.a GroupBy
if {"agecategory",y}.issubset(hp_imputed.columns):
  g_age = (hp_imputed.groupby("agecategory", dropna=False)
          .agg(n=(y,"size"), prevalence=(y,"mean"))
          .assign(prevalence_pct=lambda t: (t["prevalence"]*100).round(2))
          .sort_values(["prevalence","n"], ascending=[False,False])
          .reset_index())
  display(g_age)

Unnamed: 0,agecategory,n,prevalence,prevalence_pct
0,Age 80 or older,17816,0.13617,13.62
1,Age 75 to 79,18136,0.113862,11.39
2,Age 70 to 74,25739,0.093555,9.36
3,Age 65 to 69,28557,0.075463,7.55
4,Age 60 to 64,26720,0.058945,5.89
5,Age 55 to 59,22224,0.050036,5.0
6,Age 50 to 54,19913,0.035304,3.53
7,Age 45 to 49,16753,0.02507,2.51
8,Age 40 to 44,16973,0.013433,1.34
9,Age 35 to 39,15614,0.009991,1.0


In [None]:
#C.2.b Groupby
need = {"sleep_bucket","sex", y}
if need.issubset(hp_imputed.columns):
  g_sleep_sex = (hp_imputed.groupby(["sleep_bucket","sex"], dropna=False)
                  .agg(n=(y,"size"), prevalence=(y,"mean"))
                  .assign(prevalence_pct=lambda t: (t["prevalence"]*100).round(2))
                  .sort_values(["prevalence","n"], ascending=[False,False])
                  .reset_index())
  display(g_sleep_sex)

Unnamed: 0,sleep_bucket,sex,n,prevalence,prevalence_pct
0,long,Male,8907,0.1226,12.26
1,short,Male,12162,0.099161,9.92
2,short,Female,13654,0.068258,6.83
3,normal,Male,97142,0.063876,6.39
4,long,Female,10944,0.05784,5.78
5,normal,Female,103213,0.032622,3.26
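
The long-format result is easier to compare across sexes once it is pivoted to one row per sleep bucket. The sketch below rebuilds the table from the percentages printed above and adds a male-minus-female difference column. The column name `male_minus_female` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the grouped output above.
g = pd.DataFrame({
    "sleep_bucket": ["long", "short", "short", "normal", "long", "normal"],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "prevalence_pct": [12.26, 9.92, 6.83, 6.39, 5.78, 3.26],
})

# One row per sleep bucket, one column per sex.
wide = g.pivot(index="sleep_bucket", columns="sex", values="prevalence_pct")

# Difference in percentage points between male and female prevalence.
wide["male_minus_female"] = wide["Male"] - wide["Female"]
print(wide.round(2))
```

These are unadjusted group means. Age differs across sleep buckets and is strongly related to prevalence in the age table, so the gaps should be read as descriptive only.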
