# Analysis of Kaggle's Heart Failure dataset

Import the libraries required for data wrangling, statistical analysis and plotting. Plotnine was chosen for its similarity to ggplot2 in R.

In [51]:
import numpy as np
import pandas as pd
import plotnine as pn

## Data loading, inspection and curation
First, load the heart failure dataset supplied by Kaggle and perform some basic introspection on the overall shape of the data and the type of features it contains.

In [52]:
df = pd.read_csv('heart.csv')
print(df.shape)
print(df.dtypes)
df.head()

(918, 12)
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object


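|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |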


This tells us that the dataset has 918 rows and 12 columns. Several columns have the object datatype, but I would prefer the category datatype, so for those columns I'll check whether the number of distinct values is small. HeartDisease could be transformed to a boolean, and perhaps ExerciseAngina and FastingBS as well.

In [53]:
# Print the distinct values of each candidate column
for col in ["Sex", "ChestPainType", "FastingBS", "RestingECG", "ExerciseAngina"]:
    print(col, df[col].unique())

Sex ['M' 'F']
ChestPainType ['ATA' 'NAP' 'ASY' 'TA']
FastingBS [0 1]
RestingECG ['Normal' 'ST' 'LVH']
ExerciseAngina ['N' 'Y']
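The column list above is typed out by hand, which makes it easy to overlook a column (ST_Slope, for instance, is also of object datatype but is not in the list). As a minimal sketch, the candidates could instead be picked by datatype and summarised with `nunique()`. The `sample` DataFrame below is only a stand-in built from the five rows printed by `df.head()`; in the notebook itself the same two lines would be applied to `df`.

```python
import pandas as pd

# Stand-in data: the first five rows shown by df.head() above.
# In the notebook, use the DataFrame loaded from heart.csv instead.
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    "RestingECG": ["Normal", "Normal", "ST", "Normal", "Normal"],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
})

# Pick every non-numeric column by datatype instead of by name
non_numeric = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Count the distinct values per column; small counts suggest category or boolean
distinct_counts = sample[non_numeric].nunique()
print(distinct_counts)
```

Note that this only catches string columns. Integer-coded flags such as FastingBS and HeartDisease are numeric, so they still have to be checked separately.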


The suspicions above are confirmed: each of these features has only a few distinct values, and some can be turned into booleans. Below, the columns are assigned their new data types, followed by a check to ensure that the conversion was done correctly and every feature retains all of its values.

In [54]:
# Convert low-cardinality string columns to category
for col in ["Sex", "ChestPainType", "RestingECG"]:
    df[col] = df[col].astype('category')
# Convert 0/1 integer columns to boolean
for col in ["FastingBS", "HeartDisease"]:
    df[col] = df[col].astype('bool')
# Map 'N'/'Y' explicitly to booleans; map() sidesteps the implicit
# downcasting that replace() relies on
df["ExerciseAngina"] = df["ExerciseAngina"].map({'N': False, 'Y': True}).astype('bool')
# Check the new dtypes and that every column kept all of its distinct values
print(df.dtypes)
for col in ["Sex", "ChestPainType", "FastingBS", "RestingECG", "ExerciseAngina", "HeartDisease"]:
    print(col, df[col].unique())
df.head()

Age                  int64
Sex               category
ChestPainType     category
RestingBP            int64
Cholesterol          int64
FastingBS             bool
RestingECG        category
MaxHR                int64
ExerciseAngina        bool
Oldpeak            float64
ST_Slope            object
HeartDisease          bool
dtype: object
Sex ['M', 'F']
Categories (2, object): ['F', 'M']
ChestPainType ['ATA', 'NAP', 'ASY', 'TA']
Categories (4, object): ['ASY', 'ATA', 'NAP', 'TA']
FastingBS [False  True]
RestingECG ['Normal', 'ST', 'LVH']
Categories (3, object): ['LVH', 'Normal', 'ST']
ExerciseAngina [False  True]
HeartDisease [False  True]


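|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | False | Normal | 172 | False | 0.0 | Up | False |
| 1 | 49 | F | NAP | 160 | 180 | False | Normal | 156 | False | 1.0 | Flat | True |
| 2 | 37 | M | ATA | 130 | 283 | False | ST | 98 | False | 0.0 | Up | False |
| 3 | 48 | F | ASY | 138 | 214 | False | Normal | 108 | True | 1.5 | Flat | True |
| 4 | 54 | M | NAP | 150 | 195 | False | Normal | 122 | False | 0.0 | Up | False |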
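Comparing the printed unique values by eye works for a handful of columns, but the check could also be made programmatic by keeping a copy of the raw data and asserting that the converted columns still line up with it. The sketch below illustrates the idea on a stand-in DataFrame built from the five rows shown above; the names `raw` and `converted` are hypothetical and not part of the notebook. It also converts ST_Slope to category, which the notebook leaves as object, so treat that line as an optional extension.

```python
import pandas as pd

# Stand-in data: the first five rows of the raw dataset as printed earlier.
raw = pd.DataFrame({
    "Sex": ["M", "F", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    "FastingBS": [0, 0, 0, 0, 0],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0, 1, 0],
})

converted = raw.copy()

# ST_Slope is an addition here; the notebook itself keeps it as object
category_cols = ["Sex", "ChestPainType", "ST_Slope"]
for col in category_cols:
    converted[col] = converted[col].astype("category")

for col in ["FastingBS", "HeartDisease"]:
    converted[col] = converted[col].astype("bool")

# Unmapped values become NaN with map(), so check before casting to bool
mapped = converted["ExerciseAngina"].map({"N": False, "Y": True})
assert mapped.notna().all(), "unexpected value in ExerciseAngina"
converted["ExerciseAngina"] = mapped.astype("bool")

# Category columns must hold exactly the same labels, row by row
for col in category_cols:
    assert (converted[col].astype(str) == raw[col].astype(str)).all()

# Boolean columns must be True exactly where the raw value was 1 or 'Y'
assert (converted["FastingBS"] == (raw["FastingBS"] == 1)).all()
assert (converted["HeartDisease"] == (raw["HeartDisease"] == 1)).all()
assert (converted["ExerciseAngina"] == (raw["ExerciseAngina"] == "Y")).all()

print(converted.dtypes)
```

The explicit `notna()` check matters because casting a missing value to `bool` silently yields `True`, which would hide a failed mapping.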
