# **Heart Disease Identification with Decision Trees**

The code below is taken from Pablo M Gomez's submission on [kaggle.com](https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model).

You are encouraged to follow the link above and read the full code. In this lab, you will carry out the steps needed to explore the data and prepare it for sklearn algorithms.

**About the data set**

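The Cleveland database is the only one that ML researchers have used to date to predict the presence of heart disease in a patient. In the original database, the diagnosis field is integer valued from 0 (no presence) to 4; the rows shown in this lab contain only 0 and 1 in the `target` column.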


**Import libraries**

In [19]:
#loading dataset
import pandas as pd
import numpy as np

#visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# data splitting
from sklearn.model_selection import train_test_split

# data modeling
from sklearn.tree import DecisionTreeClassifier

# Acquire data

In [20]:
# Read in the data using pandas' read_csv function
dt = pd.read_csv("SupervisedLearning/HeartDiseaseIdentification/heart.csv")

#TODO: Write code to inspect the first five rows of the dataframe
dt.head()


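| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |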


# Inspect data

In [21]:
#TODO: Write code to inspect the shape of the data frame
dt.shape


(303, 14)

In [22]:
#TODO: Write code to display information about the data frame
dt.info()

In [23]:
#TODO: Write code to display statistics about the data frame
dt.describe()
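The inspection tools asked for in this section report different things. As a minimal sketch on a small made-up frame (the values are illustrative and not taken from `heart.csv`; `toy` and `stats` are names introduced here), `shape` gives the dimensions, `info()` lists column types and non-null counts, and `describe()` returns summary statistics:

```python
import pandas as pd

# Toy frame with made-up values; column names borrowed from the lab's data set
toy = pd.DataFrame({
    "age": [50, 60, 40, 70],
    "chol": [200, 240, 180, 260],
    "target": [1, 1, 0, 0],
})

# Dimensions as a (rows, columns) tuple
print(toy.shape)

# Column names, dtypes and non-null counts (info() prints directly)
toy.info()

# Count, mean, std, min, quartiles and max for each numeric column
stats = toy.describe()
print(stats.loc[["mean", "min", "max"]])
```

Note that `describe` and `info` are methods, so they need parentheses; without them pandas only shows the method object itself.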

# Clean data

**Correcting**

Let's change the column names to be a bit clearer.

In [24]:
dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']

**Converting**

Let's replace the numeric codes of the categorical variables with descriptive labels, to make the results easier to interpret later on.

In [25]:
# Map the sex codes 0 and 1 to 'female' and 'male'
dt['sex'] = dt['sex'].replace({0: 'female', 1: 'male'})

# Map the chest_pain_type codes 1, 2, 3 and 4 to labels
# (codes not listed in a mapping, such as 0 here, are left unchanged)
dt['chest_pain_type'] = dt['chest_pain_type'].replace({
    1: 'typical angina', 2: 'atypical angina',
    3: 'non-anginal pain', 4: 'asymptomatic'})

#TODO: Write code to convert fasting_blood_sugar features
#Hint: 'lower than 120mg/ml' should be 0, and 
#'greater than 120mg/ml' should be 1
dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].replace({
    0: 'lower than 120mg/ml', 1: 'greater than 120mg/ml'})

#TODO: Write code to convert rest_ecg features
#Hint: 'normal' should be 0, and 
#'ST-T wave abnormality' should be 1
#'left ventricular hypertrophy' should be 2
dt['rest_ecg'] = dt['rest_ecg'].replace({
    0: 'normal', 1: 'ST-T wave abnormality',
    2: 'left ventricular hypertrophy'})

#TODO: Write code to convert exercise_induced_angina features
#Hint: 'no' should be 0, and 
#'yes' should be 1
dt['exercise_induced_angina'] = dt['exercise_induced_angina'].replace({
    0: 'no', 1: 'yes'})

#TODO: Write code to convert st_slope features
#Hint: 'upsloping' should be 0, and 
#'flat' should be 1
#'downsloping' should be 2
dt['st_slope'] = dt['st_slope'].replace({
    0: 'upsloping', 1: 'flat', 2: 'downsloping'})

#TODO: Write code to convert thalassemia features
#Hint: 'normal' should be 0, and 
#'fixed defect' should be 1
#'reversable defect' should be 2
dt['thalassemia'] = dt['thalassemia'].replace({
    0: 'normal', 1: 'fixed defect', 2: 'reversable defect'})
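The recoding above only changes the codes it names, so it can help to check that every code in a column is covered. A minimal sketch on a toy column follows (made-up values; `toy`, `labels`, `unmapped` and `recoded` are names introduced here for illustration):

```python
import pandas as pd

# Toy column of integer codes (made-up values)
toy = pd.DataFrame({"chest_pain_type": [0, 1, 2, 3, 1]})

# Illustrative mapping that, like the chest_pain_type mapping above, starts at 1
labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Codes present in the data that the mapping does not cover
unmapped = sorted(set(toy["chest_pain_type"].tolist()) - set(labels))
print(unmapped)

# replace() leaves uncovered codes unchanged, so the column ends up
# holding a mix of integers and strings
recoded = toy["chest_pain_type"].replace(labels)
print(recoded.tolist())
```

An empty `unmapped` list would mean that every code receives a label.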

Check the data types

In [26]:
dt.dtypes

age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
target                       int64
dtype: object

The recoded columns already show as `object`. The code below casts them to `object` explicitly, so that pandas treats them as categorical variables in the next step.

In [27]:
dt['sex'] = dt['sex'].astype('object')
dt['chest_pain_type'] = dt['chest_pain_type'].astype('object')
dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].astype('object')
dt['rest_ecg'] = dt['rest_ecg'].astype('object')
dt['exercise_induced_angina'] = dt['exercise_induced_angina'].astype('object')
dt['st_slope'] = dt['st_slope'].astype('object')
dt['thalassemia'] = dt['thalassemia'].astype('object')
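As an aside, `object` is the generic dtype pandas uses for labels, and pandas also offers a dedicated `category` dtype. The sketch below, on made-up data, is an alternative rather than what this lab uses; `toy` and `label_cols` are names introduced here for illustration:

```python
import pandas as pd

# Toy frame with made-up values; the label columns are stored as object
toy = pd.DataFrame({
    "age": [50, 60, 40],
    "sex": pd.Series(["male", "female", "male"], dtype="object"),
    "st_slope": pd.Series(["flat", "upsloping", "flat"], dtype="object"),
})

# Columns that hold labels (object dtype) rather than numbers
label_cols = toy.select_dtypes(include="object").columns.tolist()
print(label_cols)

# Cast each label column to the dedicated category dtype
for col in label_cols:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)

# A category column records its distinct labels in sorted order
print(toy["sex"].cat.categories.tolist())
```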


In [28]:
#TODO: Write code to check the data types again to see the change
dt

| | age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate_achieved | exercise_induced_angina | st_depression | st_slope | num_major_vessels | thalassemia | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | male | non-anginal pain | 145 | 233 | greater than 120mg/ml | normal | 150 | no | 2.3 | upsloping | 0 | fixed defect | 1 |
| 1 | 37 | male | atypical angina | 130 | 250 | lower than 120mg/ml | ST-T wave abnormality | 187 | no | 3.5 | upsloping | 0 | reversable defect | 1 |
| 2 | 41 | female | typical angina | 130 | 204 | lower than 120mg/ml | normal | 172 | no | 1.4 | downsloping | 0 | reversable defect | 1 |
| 3 | 56 | male | typical angina | 120 | 236 | lower than 120mg/ml | ST-T wave abnormality | 178 | no | 0.8 | downsloping | 0 | reversable defect | 1 |
| 4 | 57 | female | 0 | 120 | 354 | lower than 120mg/ml | ST-T wave abnormality | 163 | yes | 0.6 | downsloping | 0 | reversable defect | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | female | 0 | 140 | 241 | lower than 120mg/ml | ST-T wave abnormality | 123 | yes | 0.2 | flat | 0 | 3 | 0 |
| 299 | 45 | male | non-anginal pain | 110 | 264 | lower than 120mg/ml | ST-T wave abnormality | 132 | no | 1.2 | flat | 0 | 3 | 0 |
| 300 | 68 | male | 0 | 144 | 193 | greater than 120mg/ml | ST-T wave abnormality | 141 | no | 3.4 | flat | 2 | 3 | 0 |
| 301 | 57 | male | 0 | 130 | 131 | lower than 120mg/ml | ST-T wave abnormality | 115 | yes | 1.2 | flat | 1 | 3 | 0 |


**Creating**

For the categorical variables, we need to create dummy variables and drop the first category of each.

For example, rather than having 'male' and 'female', we'll have 'male' with values of 0 or 1 (1 being male, and 0 therefore being female).

In [29]:
dt = pd.get_dummies(dt, drop_first = True)
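To see what `drop_first` does on a small scale, here is a minimal sketch on a made-up frame (`toy` and `dummies` are names introduced here; the `dtype=int` argument is an addition that keeps the indicator columns as 0/1, since newer pandas versions otherwise produce True/False):

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values)
toy = pd.DataFrame({
    "age": [50, 60, 40],
    "sex": ["male", "female", "male"],
})

# One indicator column per category, minus the first category ('female'),
# which becomes the implicit baseline
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies["sex_male"].tolist())
```

Numeric columns such as `age` pass through unchanged, and a row with `sex_male` equal to 0 is therefore female.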

Inspect the data frame

In [30]:
dt

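| | age | resting_blood_pressure | cholesterol | max_heart_rate_achieved | st_depression | num_major_vessels | target | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | fasting_blood_sugar_lower than 120mg/ml | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | exercise_induced_angina_yes | st_slope_flat | st_slope_upsloping | thalassemia_fixed defect | thalassemia_normal | thalassemia_reversable defect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 145 | 233 | 150 | 2.3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 37 | 130 | 250 | 187 | 3.5 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 41 | 130 | 204 | 172 | 1.4 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 56 | 120 | 236 | 178 | 0.8 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 57 | 120 | 354 | 163 | 0.6 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 140 | 241 | 123 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 299 | 45 | 110 | 264 | 132 | 1.2 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 300 | 68 | 144 | 193 | 141 | 3.4 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 301 | 57 | 130 | 131 | 115 | 1.2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |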


# Earn Your Wings

Use a decision tree classifier on the cleaned data set to predict `target` for the given data. Report the accuracy score. Add comments to your code to explain each step of your implementation.
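The general shape of such a workflow can be sketched on synthetic data. The example below is only an illustration under stated assumptions: the features and target are randomly generated stand-ins, and the 80/20 split, `max_depth=3` and `random_state=0` are arbitrary choices, not values prescribed by this lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned data set: 200 rows, 5 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a shallow tree on the training rows only
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Accuracy is the fraction of held-out rows predicted correctly
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

With the lab's data frame, `X` would correspond to `dt.drop(columns='target')` and `y` to `dt['target']`.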