# Heart Attack Risk Prediction

The dataset includes a wide range of features related to heart health and lifestyle. It covers individual details such as age, sex, blood pressure, and BMI, as well as lifestyle choices such as smoking, alcohol consumption, and sleep hours per day. The goal of this dataset and of our application is to predict a person's heart attack risk. The features and a brief description of each are as follows:
- Patient ID: Unique identifier for each patient
- Age: Age of the patient
- Sex: Gender of the patient (Male/Female)
- Cholesterol: Cholesterol levels of the patient
- Blood Pressure: Blood pressure of the patient (systolic/diastolic)
- Heart Rate: Heart rate of the patient
- Diabetes: Whether the patient has diabetes (1: Yes, 0: No)
- Family History: Family history of heart-related problems (1: Yes, 0: No)
- Smoking: Smoking status of the patient (1: Smoker, 0: Non-smoker)
- Obesity: Obesity status of the patient (1: Obese, 0: Not obese)
- Alcohol Consumption: Alcohol consumption of the patient (stored as a binary 0/1 flag in this file)
- Exercise Hours Per Week: Number of exercise hours per week
- Diet: Dietary habits of the patient (Healthy/Average/Unhealthy)
- Previous Heart Problems: Previous heart problems of the patient (1: Yes, 0: No)
- Medication Use: Medication usage by the patient (1: Yes, 0: No)
- Stress Level: Stress level reported by the patient (1-10)
- Sedentary Hours Per Day: Hours of sedentary activity per day
- Income: Income level of the patient
- BMI: Body Mass Index (BMI) of the patient
- Triglycerides: Triglyceride levels of the patient
- Physical Activity Days Per Week: Days of physical activity per week
- Sleep Hours Per Day: Hours of sleep per day
- Country: Country of the patient
- Continent: Continent where the patient resides
- Hemisphere: Hemisphere where the patient resides
- Heart Attack Risk: Presence of heart attack risk (1: Yes, 0: No)

The dataset was retrieved from https://www.kaggle.com/competitions/heart-attack-risk-analysis/data?select=train.csv

## Import libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
print("Numpy Version:", np.__version__)
print("Pandas Version:", pd.__version__)
print("Matplotlib Version:", mpl.__version__)
print("Seaborn Version:", sns.__version__)

Numpy Version: 1.26.0
Pandas Version: 2.1.1
Matplotlib Version: 3.8.0
Seaborn Version: 0.13.0


## Import Data

In [4]:
# The data is from the competition at https://www.kaggle.com/competitions/heart-attack-risk-analysis/data?select=train.csv

data = pd.read_csv('./Dataset/Heart Attack Risk Analysis/train.csv')

### Get to know the data

In [14]:
data.head()

| | Patient ID | Age | Sex | Cholesterol | Blood Pressure | Heart Rate | Diabetes | Family History | Smoking | Obesity | ... | Sedentary Hours Per Day | Income | BMI | Triglycerides | Physical Activity Days Per Week | Sleep Hours Per Day | Country | Continent | Hemisphere | Heart Attack Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RDG0550 | 33 | Male | 200 | 129/90 | 48 | 0 | 1 | 1 | 1 | ... | 0.138443 | 184066 | 30.449815 | 63 | 6 | 7 | Argentina | South America | Southern Hemisphere | 1 |
| 1 | NMA3851 | 56 | Female | 262 | 159/105 | 46 | 1 | 0 | 1 | 0 | ... | 0.369552 | 211755 | 34.973685 | 333 | 7 | 8 | Nigeria | Africa | Northern Hemisphere | 1 |
| 2 | TUI5807 | 19 | Female | 140 | 161/109 | 54 | 0 | 1 | 0 | 0 | ... | 8.646334 | 252203 | 30.554246 | 537 | 2 | 10 | Thailand | Asia | Northern Hemisphere | 0 |
| 3 | YYT5016 | 50 | Female | 163 | 120/62 | 53 | 0 | 1 | 1 | 1 | ... | 1.107884 | 121954 | 35.390265 | 591 | 0 | 9 | Spain | Europe | Southern Hemisphere | 1 |
| 4 | ZAC5937 | 89 | Female | 144 | 153/110 | 92 | 1 | 0 | 1 | 0 | ... | 1.33757 | 180121 | 39.575483 | 145 | 2 | 5 | Germany | Europe | Northern Hemisphere | 1 |
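
The preview shows that `Blood Pressure` is stored as a single `systolic/diastolic` string (for example `129/90`), which is why pandas reads it as an `object` column. A minimal sketch of one way to split it into two numeric columns is given below. The sample frame is copied from the first three rows of the preview, and the column names `Systolic BP` and `Diastolic BP` are hypothetical names chosen for illustration, not part of the dataset.

```python
import pandas as pd

# Small sample copied from the first rows of the preview above.
# In the notebook, the full `data` DataFrame would be used instead.
sample = pd.DataFrame({
    "Patient ID": ["RDG0550", "NMA3851", "TUI5807"],
    "Blood Pressure": ["129/90", "159/105", "161/109"],
})

# Split "systolic/diastolic" on the slash; expand=True returns a
# DataFrame whose columns 0 and 1 hold the two parts as strings.
bp = sample["Blood Pressure"].str.split("/", expand=True).astype(int)

# Hypothetical column names, chosen for illustration only.
sample["Systolic BP"] = bp[0]
sample["Diastolic BP"] = bp[1]

print(sample)
```

This assumes every value follows the `systolic/diastolic` pattern. If some rows were malformed, `astype(int)` would raise an error, so a real pipeline would probably validate the strings first.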


In [6]:
# The data has 7010 rows and 26 columns (label included)

data.shape

(7010, 26)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7010 entries, 0 to 7009
Data columns (total 26 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Patient ID                       7010 non-null   object 
 1   Age                              7010 non-null   int64  
 2   Sex                              7010 non-null   object 
 3   Cholesterol                      7010 non-null   int64  
 4   Blood Pressure                   7010 non-null   object 
 5   Heart Rate                       7010 non-null   int64  
 6   Diabetes                         7010 non-null   int64  
 7   Family History                   7010 non-null   int64  
 8   Smoking                          7010 non-null   int64  
 9   Obesity                          7010 non-null   int64  
 10  Alcohol Consumption              7010 non-null   int64  
 11  Exercise Hours Per Week          7010 non-null   float64
 12  Diet                

In [8]:
# There are no missing (NaN) values in the data

data.isna().sum().sum()

0
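
`data.isna().sum().sum()` collapses the whole frame into one total, which is enough here because the result is 0. If the total were non-zero, a per-column breakdown would show where the gaps are. The sketch below illustrates this on a toy frame with one deliberately missing value. The toy values are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (not from the dataset).
toy = pd.DataFrame({
    "Age": [33, 56, 19],
    "BMI": [30.4, np.nan, 30.6],
})

# Total number of missing cells, as in the notebook cell above.
total_missing = int(toy.isna().sum().sum())

# Per-column counts; keep only the columns that have missing values.
per_column = toy.isna().sum()
columns_with_gaps = per_column[per_column > 0]

print(total_missing)
print(columns_with_gaps)
```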

#### Number of unique values in each column

In [18]:
print(f'Unique values in each column \n{"="*50}')

# nunique() ignores NaN by default; the data has none, so the counts
# match data[c].unique().shape[0]
for c in data.columns:
    print(f'{c}: {data[c].nunique()}')

Unique values in each column 
Patient ID: 7010
Age: 73
Sex: 2
Cholesterol: 281
Blood Pressure: 3590
Heart Rate: 71
Diabetes: 2
Family History: 2
Smoking: 2
Obesity: 2
Alcohol Consumption: 2
Exercise Hours Per Week: 7010
Diet: 3
Previous Heart Problems: 2
Medication Use: 2
Stress Level: 10
Sedentary Hours Per Day: 7010
Income: 6921
BMI: 7010
Triglycerides: 771
Physical Activity Days Per Week: 8
Sleep Hours Per Day: 7
Country: 20
Continent: 6
Hemisphere: 2
Heart Attack Risk: 2
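
The counts above suggest a rough split: columns with only a few distinct values (such as `Sex`, `Diet`, or the 0/1 flags) look categorical, while columns like `BMI` or `Income` look continuous. The sketch below shows one way to make that split programmatically. The helper `split_by_cardinality` and its `max_categories` threshold are hypothetical, introduced only for illustration. With a threshold of 10, for instance, `Country` (20 unique values) would land in the high-cardinality group even though it is categorical, so the result is a starting point rather than a final classification.

```python
import pandas as pd

def split_by_cardinality(df, max_categories=10):
    """Return (low, high) lists of column names by unique-value count.

    Hypothetical helper: columns with at most `max_categories`
    distinct values go in `low`, all others in `high`.
    """
    counts = df.nunique()
    low = counts[counts <= max_categories].index.tolist()
    high = counts[counts > max_categories].index.tolist()
    return low, high

# Small sample built from the preview rows; the threshold is lowered
# to 2 so that the five-row sample still shows both groups.
sample = pd.DataFrame({
    "Age": [33, 56, 19, 50, 89],
    "Sex": ["Male", "Female", "Female", "Female", "Female"],
    "Heart Attack Risk": [1, 1, 0, 1, 1],
})

low, high = split_by_cardinality(sample, max_categories=2)
print("Low cardinality:", low)
print("High cardinality:", high)
```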
