# Heart Disease Predictor

### Life cycle of a Machine Learning project
- Understand Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-processing
- Model Training
- Choose best model
- Tune model

## 1) Problem statement
- This project examines how heart disease is affected by variables such as sex, general health, physical activity, sleep, mental health, physical health, state of residence, and many more.
## 2) Data Collection
- Dataset Source: [Click Here]
- The data I will be using is from 2022 and contains 445,132 rows and 40 columns.

## 2.1 Import Data and Required packages

In [38]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [39]:
df = pd.read_csv('data/heart_2022_with_nans.csv')
print(df.shape)
df.head()

(445132, 40)


| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | NaN | No | ... | NaN | NaN | NaN | No | No | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
| 1 | Alabama | Female | Excellent | 0.0 | 0.0 | NaN | No | 6.0 | NaN | No | ... | 1.6 | 68.04 | 26.57 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
| 2 | Alabama | Female | Very good | 2.0 | 3.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | NaN | No | ... | 1.57 | 63.5 | 25.61 | No | No | No | No | NaN | No | Yes |
| 3 | Alabama | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 | NaN | No | ... | 1.65 | 63.5 | 23.3 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
| 4 | Alabama | Female | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | NaN | No | ... | 1.57 | 53.98 | 21.77 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |


## 3 Data Checks to perform
- Check missing values
- Check for duplicates
- Check number of unique values in each column
- Check data types
- Check statistics of dataset
- Check the categories present in each categorical column

## 3.1 Check missing values

In [40]:
for column in df.columns:
    percentage = df[column].isnull().mean()
    print(column + ": " + str(round(percentage*100,2)))

State: 0.0
Sex: 0.0
GeneralHealth: 0.27
PhysicalHealthDays: 2.45
MentalHealthDays: 2.04
LastCheckupTime: 1.87
PhysicalActivities: 0.25
SleepHours: 1.23
RemovedTeeth: 2.55
HadHeartAttack: 0.69
HadAngina: 0.99
HadStroke: 0.35
HadAsthma: 0.4
HadSkinCancer: 0.71
HadCOPD: 0.5
HadDepressiveDisorder: 0.63
HadKidneyDisease: 0.43
HadArthritis: 0.59
HadDiabetes: 0.24
DeafOrHardOfHearing: 4.64
BlindOrVisionDifficulty: 4.84
DifficultyConcentrating: 5.45
DifficultyWalking: 5.39
DifficultyDressingBathing: 5.37
DifficultyErrands: 5.76
SmokerStatus: 7.97
ECigaretteUsage: 8.01
ChestScan: 12.59
RaceEthnicityCategory: 3.16
AgeCategory: 2.04
HeightInMeters: 6.44
WeightInKilograms: 9.45
BMI: 10.96
AlcoholDrinkers: 10.46
HIVTesting: 14.86
FluVaxLast12: 10.59
PneumoVaxEver: 17.31
TetanusLast10Tdap: 18.54
HighRiskLastYear: 11.37
CovidPos: 11.4
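
The same percentages can be computed without an explicit loop. The sketch below is a minimal vectorized version, sorted so that the most incomplete columns appear first. The `toy` frame and its values are made up for illustration and only borrow three column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "SleepHours": [8.0, np.nan, 6.0, 7.0],
    "BMI": [np.nan, np.nan, 25.61, 23.3],
})

# Share of missing values per column, in percent, largest first.
missing_pct = toy.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)
```

On the real data, replacing `toy` with `df` should give the same numbers as the loop, just as a sorted Series.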


In [41]:
#drop NaNs
df = df.dropna()

In [42]:
df.shape

(246022, 40)
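
Calling `dropna()` with no arguments removes every row that has at least one missing value, which here takes the data from 445,132 to 246,022 rows. The sketch below quantifies that loss from the two shapes reported above and contrasts it with `dropna(subset=...)`, which only requires selected columns to be present. The `toy` frame is hypothetical, and restricting the drop to the target column is shown as an option, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Row counts taken from the df.shape outputs in this notebook.
rows_before, rows_after = 445_132, 246_022
removed = rows_before - rows_after
print(f"removed {removed} rows ({removed / rows_before:.2%})")

# Hypothetical toy frame to compare the two dropna variants.
toy = pd.DataFrame({
    "HadHeartAttack": ["No", np.nan, "Yes", "No"],
    "BMI": [26.57, 25.61, np.nan, 23.3],
})

strict = toy.dropna()                                 # any NaN removes the row
target_only = toy.dropna(subset=["HadHeartAttack"])   # only the target must be present
print(len(strict), len(target_only))
```

Keeping rows with missing feature values would require handling those values later, for example in the pre-processing step.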

## 3.2 Check Duplicates

In [43]:
df.duplicated().sum()

np.int64(9)

In [44]:
# drop duplicates
df.drop_duplicates(inplace=True)
df.duplicated().sum()

np.int64(0)

In [45]:
df.shape

(246013, 40)
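
Before dropping duplicates it can help to look at them. As a small sketch on made-up rows: `duplicated(keep=False)` flags every copy of a repeated row, while the default `keep="first"` flags only the extra copies, which is the number `drop_duplicates()` removes.

```python
import pandas as pd

# Hypothetical rows; indexes 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alabama"],
    "Sex": ["Female", "Female", "Male", "Female"],
    "SleepHours": [8.0, 8.0, 7.0, 8.0],
})

# keep=False marks all copies of a repeated row so they can be inspected.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# The default keep="first" counts only the copies that would be dropped.
n_extra = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_extra, deduped.shape)
```

Resetting the index afterwards is optional; it only makes the row labels contiguous again.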

## 3.3 Check number of unique values of each column

In [46]:
column = 'State'

# Print the unique values
print(df[column].unique())

df.nunique()


['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' 'Guam' 'Puerto Rico'
 'Virgin Islands']


State                          54
Sex                             2
GeneralHealth                   5
PhysicalHealthDays             31
MentalHealthDays               31
LastCheckupTime                 4
PhysicalActivities              2
SleepHours                     23
RemovedTeeth                    4
HadHeartAttack                  2
HadAngina                       2
HadStroke                       2
HadAsthma                       2
HadSkinCancer                   2
HadCOPD                         2
HadDepressiveDisorder           2
HadKidneyDisease                2
HadArthritis                    2
HadDiabetes                     4
DeafOrHardOfHearing             2
BlindOrVisionDifficulty         2
DifficultyConcentrating         2
DifficultyWalking               2
DifficultyDressingBathing       2
DifficultyErrands               2
SmokerStatus                    4
ECigaretteUsage                 4
ChestScan                       2
RaceEthnicityCategory           5
...
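
The cell above prints the unique values for `State` only. To check the categories of several columns at once, one option is a small dictionary comprehension, sketched here on a made-up frame. The column list `cols_to_check` is a hypothetical choice for illustration.

```python
import pandas as pd

# Toy frame with invented values, reusing two column names from the dataset.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "GeneralHealth": ["Very good", "Excellent", "Fair"],
    "SleepHours": [8.0, 6.0, 5.0],
})

cols_to_check = ["Sex", "GeneralHealth"]  # hypothetical selection

# Map each selected column to the sorted list of its distinct values.
categories = {col: sorted(toy[col].dropna().unique()) for col in cols_to_check}
for col, values in categories.items():
    print(f"{col} ({len(values)}): {values}")
```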

## 3.4 Check data types

In [47]:
df.dtypes

State                         object
Sex                           object
GeneralHealth                 object
PhysicalHealthDays           float64
MentalHealthDays             float64
LastCheckupTime               object
PhysicalActivities            object
SleepHours                   float64
RemovedTeeth                  object
HadHeartAttack                object
HadAngina                     object
HadStroke                     object
HadAsthma                     object
HadSkinCancer                 object
HadCOPD                       object
HadDepressiveDisorder         object
HadKidneyDisease              object
HadArthritis                  object
HadDiabetes                   object
DeafOrHardOfHearing           object
BlindOrVisionDifficulty       object
DifficultyConcentrating       object
DifficultyWalking             object
DifficultyDressingBathing     object
DifficultyErrands             object
SmokerStatus                  object
ECigaretteUsage               object
...

## 3.5 Check statistics of dataset

In [48]:
df.describe()

| | PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | WeightInKilograms | BMI |
|---|---|---|---|---|---|---|
| count | 246013.0 | 246013.0 | 246013.0 | 246013.0 | 246013.0 | 246013.0 |
| mean | 4.119055 | 4.167292 | 7.021312 | 1.70515 | 83.615522 | 28.668258 |
| std | 8.405803 | 8.102796 | 1.440698 | 0.106654 | 21.323232 | 6.514005 |
| min | 0.0 | 0.0 | 1.0 | 0.91 | 28.12 | 12.02 |
| 25% | 0.0 | 0.0 | 6.0 | 1.63 | 68.04 | 24.27 |
| 50% | 0.0 | 0.0 | 7.0 | 1.7 | 81.65 | 27.46 |
| 75% | 3.0 | 4.0 | 8.0 | 1.78 | 95.25 | 31.89 |
| max | 30.0 | 30.0 | 24.0 | 2.41 | 292.57 | 97.65 |
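
By default `describe()` summarizes only the numeric columns, so most of this dataset is left out of the table above. A possible complement, sketched on invented values, is to call `describe()` on the non-numeric columns, which reports the count, the number of distinct values, the most frequent value and its frequency.

```python
import pandas as pd

# Toy frame with made-up values, reusing column names from the dataset.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Female"],
    "GeneralHealth": ["Very good", "Excellent", "Very good", "Fair"],
    "BMI": [26.57, 25.61, 23.3, 21.77],
})

# Summarize only the non-numeric columns: count, unique, top, freq.
cat_summary = toy.select_dtypes(exclude="number").describe()
print(cat_summary)
```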


## 3.6 Define numerical and categorical columns

In [50]:
numerical_col = [col for col in df.columns if df[col].dtype != 'O']
categorical_col = [col for col in df.columns if df[col].dtype == 'O']

#print columns
print('We have {} numerical columns : {}'.format(len(numerical_col), numerical_col))
print('We have {} categorical columns : {}'.format(len(categorical_col), categorical_col))

We have 6 numerical columns : ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms', 'BMI']
We have 34 categorical columns : ['State', 'Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos']
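
The same split can be made with `select_dtypes`, which selects on "numeric or not" instead of comparing each dtype to `'O'`. This is a sketch on a made-up frame; it may be the more robust choice if text columns are stored with a dedicated string dtype instead of `object`.

```python
import pandas as pd

# Toy frame with invented values, reusing column names from the dataset.
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "SleepHours": [8.0, 6.0],
    "BMI": [26.57, 25.61],
    "HadHeartAttack": ["No", "Yes"],
})

# Numeric columns versus everything else, in original column order.
numerical_col = toy.select_dtypes(include="number").columns.tolist()
categorical_col = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_col, categorical_col)
```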


## 4 Exploring Data Visualization