# **Heart Disease Identification with Decision Trees**

The code below is taken from Pablo M Gomez's submission on [kaggle.com](https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model).

You are encouraged to follow the link above and read the full code. In this lab, you will carry out the steps needed to explore the data and prepare it for scikit-learn algorithms.

**About the data set**

The Cleveland database is the only one that ML researchers have used to date to predict the presence of heart disease in a patient. In the original database, the goal field that records this is integer valued from 0 (no presence) to 4.


**Import libraries**

In [14]:
#loading dataset
import pandas as pd
import numpy as np

#visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# data splitting
from sklearn.model_selection import train_test_split

# data modeling
from sklearn.tree import DecisionTreeClassifier

# Acquire data

In [15]:
# Read in the data using pandas' read_csv method
dt = pd.read_csv("SupervisedLearning/HeartDiseaseIdentification/heart.csv")

#TODO: Write code to inspect the first five rows of the dataframe
dt.head(5)

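| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |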


# Inspect data

In [16]:
#TODO: Write code to inspect the shape of the data frame
dt.shape

(303, 14)

In [17]:
#TODO: Write code to display information about the data frame
dt.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [18]:
#TODO: Write code to display statistics about the data frame
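# DataFrame has no 'statistics' attribute; describe() returns count, mean, std, min, quartiles and max
dt.describe()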

# Clean data

**Correcting**

Let's change the column names to be a bit clearer

In [6]:
dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']
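
The positional assignment above only works when the list matches the existing column order exactly. As a minimal sketch of an alternative, `DataFrame.rename` with an explicit old-to-new mapping does not depend on order. The `toy` frame and the `new_names` dictionary are illustrative assumptions, not part of the lab data.

```python
import pandas as pd

# Toy frame (assumption) reusing three of the original column names
toy = pd.DataFrame({"age": [63, 37], "cp": [3, 2], "trestbps": [145, 130]})

# Explicit old -> new mapping; columns not listed keep their names
new_names = {"cp": "chest_pain_type", "trestbps": "resting_blood_pressure"}
toy = toy.rename(columns=new_names)

print(list(toy.columns))
```

A mapping like this also makes a mistyped or missing column easier to spot, because only the listed names change.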

**Converting**

Let's change the values of the categorical variables, to improve the interpretation later on

In [13]:
# Series.replace avoids chained indexing (dt[col][mask] = ...), which pandas
# warns about and which may not modify dt at all.

# Convert the 0/1 codes in 'sex' to the labels 'female' and 'male'
dt['sex'] = dt['sex'].replace({0: 'female', 1: 'male'})

# Convert the chest_pain_type codes 1-4 to labels
# (any other code, such as the 0 visible in the first rows above, is left unchanged)
dt['chest_pain_type'] = dt['chest_pain_type'].replace({
    1: 'typical angina', 2: 'atypical angina',
    3: 'non-anginal pain', 4: 'asymptomatic'})

#TODO: Write code to convert fasting_blood_sugar features
#Hint: 0 should become 'lower than 120mg/mL', and
#1 should become 'greater than 120mg/mL'
dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].replace(
    {0: 'lower than 120mg/mL', 1: 'greater than 120mg/mL'})

#TODO: Write code to convert rest_ecg features
#Hint: 0 should become 'normal',
#1 should become 'ST-T wave abnormality', and
#2 should become 'left ventricular hypertrophy'
dt['rest_ecg'] = dt['rest_ecg'].replace(
    {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})

#TODO: Write code to convert exercise_induced_angina features
#Hint: 0 should become 'no', and
#1 should become 'yes'
dt['exercise_induced_angina'] = dt['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})

#TODO: Write code to convert st_slope features
#Hint: 0 should become 'upsloping',
#1 should become 'flat', and
#2 should become 'downsloping'
dt['st_slope'] = dt['st_slope'].replace({0: 'upsloping', 1: 'flat', 2: 'downsloping'})

#TODO: Write code to convert thalassemia features
#Hint: 0 should become 'normal',
#1 should become 'fixed defect', and
#2 should become 'reversible defect'
dt['thalassemia'] = dt['thalassemia'].replace(
    {0: 'normal', 1: 'fixed defect', 2: 'reversible defect'})
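
When turning integer codes into labels, it helps to know what happens to codes that have no label. The sketch below compares `Series.map` and `Series.replace` on a small made-up series; `codes` and `partial` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up codes (assumption); only 1 and 2 have a label in the mapping
codes = pd.Series([0, 1, 2, 3])
partial = {1: "typical angina", 2: "atypical angina"}

# map: codes missing from the mapping become NaN
mapped = codes.map(partial)

# replace: codes missing from the mapping are kept unchanged
replaced = codes.replace(partial)

print(mapped.tolist())
print(replaced.tolist())
```

Counting `mapped.isna().sum()` after a `map` call is a quick way to find codes that the mapping does not cover.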



Check the data types

In [11]:
dt.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

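Some of those aren't quite right: the categorical columns should not be treated as numbers. The code below casts them to the `object` dtype so that `pd.get_dummies` encodes them as categorical variables later on.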

In [13]:
dt['sex'] = dt['sex'].astype('object')
dt['chest_pain_type'] = dt['chest_pain_type'].astype('object')
dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].astype('object')
dt['rest_ecg'] = dt['rest_ecg'].astype('object')
dt['exercise_induced_angina'] = dt['exercise_induced_angina'].astype('object')
dt['st_slope'] = dt['st_slope'].astype('object')
dt['thalassemia'] = dt['thalassemia'].astype('object')


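Note: this cell raises `KeyError: 'chest_pain_type'` when it runs before the column-renaming cell. That is also why the recorded `dt.dtypes` outputs in this section still list the original column names (`cp`, `trestbps`, and so on). Run the cells in order from the top.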

In [12]:
#TODO: Write code to check the data types again to see the change
dt.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

**Creating**

For the categorical variables, we need to create dummy variables and drop the first category of each.

For example, rather than having 'male' and 'female', we'll have 'male' with values of 0 or 1 (1 being male, and 0 therefore being female).
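
The effect of dropping the first category can be sketched on a tiny frame. The `toy` data below is an assumption made up for illustration; only `pd.get_dummies` with `drop_first=True` comes from the lab itself.

```python
import pandas as pd

# Toy frame (assumption): one numeric column and one categorical column
toy = pd.DataFrame({"age": [63, 37, 41], "sex": ["male", "male", "female"]})

# Categories are ordered alphabetically, so 'female' is the first one and is
# dropped; the remaining 'sex_male' column marks male rows
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded)
```

Numeric columns such as `age` pass through unchanged, and each categorical column with k categories yields k - 1 dummy columns.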

In [11]:
dt = pd.get_dummies(dt, drop_first = True)

Inspect the data frame

In [12]:
dt.head()

# Earn Your Wings

Use a decision tree classifier on the cleaned data set to predict the `target` column. Report the accuracy score on data held out from training. Add comments in your code to explain each step of your implementation.
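
As a starting point, here is a minimal sketch of the general workflow on synthetic data rather than on the heart data itself. The `demo` frame, the feature names `f0` to `f4`, the 80/20 split, and `max_depth=3` are all assumptions chosen for illustration; swap in the cleaned `dt` frame and choose your own settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption) for the cleaned frame: 5 features + binary 'target'
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(5)])
demo["target"] = y_demo

# Separate the features from the label
X = demo.drop(columns="target")
y = demo["target"]

# Hold out 20% of the rows for testing, keeping the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Fit a shallow tree on the training rows only
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Accuracy = fraction of held-out rows predicted correctly
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Fixing `random_state` makes the split and the tree reproducible, which helps when comparing different tree depths.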