**HEART DISEASE**


Heart disease is a class of diseases that involve the heart or blood vessels. Cardiovascular diseases (CVD) are the leading cause of death globally, in all but a few areas of the world. Deaths from CVD at a given age are more common and have been increasing in much of the developing world, while rates have declined in most of the developed world since the 1970s. This heart disease classifier uses a historical dataset of patients to predict whether a patient has heart disease from the features available in the dataset.



About the Dataset

This database contains 76 attributes, but all published experiments refer to using a subset of 13 of them. The "target" field refers to the presence of heart disease in the patient. It is integer valued: 0 means no heart disease and 1 means heart disease is present. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (value 1) from absence (value 0).

The data covers 303 patients, one row per patient, and records the features below along with whether the patient has heart disease. The features are:

    Age: Age of the patient in years.

    Sex:

    0: Female
    1: Male

    Chest Pain Type:

    0: Typical Angina
    1: Atypical Angina
    2: Non-Anginal Pain
    3: Asymptomatic

    Resting Blood Pressure: Person's resting blood pressure.
    Cholesterol: Serum Cholesterol in mg/dl
    Fasting Blood Sugar:

    0: Less Than 120 mg/dl
    1: Greater Than 120 mg/dl

    Resting Electrocardiographic Measurement:

    0: Normal
    1: ST-T Wave Abnormality
    2: Left Ventricular Hypertrophy

    Max Heart Rate Achieved: Maximum Heart Rate Achieved
    Exercise Induced Angina:

    1: Yes
    0: No

    ST Depression: ST depression induced by exercise relative to rest.
    Slope: Slope of the peak exercise ST segment:

    0: Upsloping
    1: Flat
    2: Downsloping

    Thalassemia: A blood disorder called 'Thalassemia':

    0: Normal
    1: Fixed Defect
    2: Reversible Defect

    Number of Major Vessels: Number of major vessels colored by fluoroscopy.
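
The coded features above can be turned back into readable labels when inspecting rows. The following is a minimal sketch, assuming the column names `sex` and `chest_pain_type` (the names assigned later in this notebook) and a tiny made-up frame instead of the real data:

```python
import pandas as pd

# Code-to-label mappings copied from the feature list above.
SEX_LABELS = {0: "Female", 1: "Male"}
CHEST_PAIN_LABELS = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non-Anginal Pain",
    3: "Asymptomatic",
}

# Made-up rows standing in for the real dataset (illustrative values only).
sample = pd.DataFrame({"sex": [1, 0, 1], "chest_pain_type": [3, 2, 0]})

# Series.map replaces each integer code with its label from the dictionary.
decoded = sample.assign(
    sex=sample["sex"].map(SEX_LABELS),
    chest_pain_type=sample["chest_pain_type"].map(CHEST_PAIN_LABELS),
)
print(decoded)
```

Decoding is only for display; the models further down still need the numeric codes.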


In [None]:
import pandas as pd
import numpy as np

#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#for data splitting
from sklearn.model_selection import train_test_split

#for the model prediction
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


In [None]:
# We are reading our data
df = pd.read_csv("../input/heart-disease-uci/heart.csv")

# First 5 rows of our data
df.head()

In [None]:
#Change the column names for better understanding
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']


In [None]:
#Finding the shape of the dataframe
df.shape

In [None]:
# Describing the Dataframe
df.describe()

In [None]:
#Finding information of dataframe
df.info()

In [None]:
#Finding the missing values. In this dataframe there is no missing value.
df.isnull().sum()

Let's see the percentage of patients with heart disease and the percentage without it.

In [None]:
#Finding the percentage of patients with and without heart disease

countNoDisease = len(df[df.target == 0])
countHaveDisease = len(df[df.target == 1])
print("Percentage of Patients Without Heart Disease: {:.2f}%".format((countNoDisease / (len(df.target))*100)))
print("Percentage of Patients With Heart Disease: {:.2f}%".format((countHaveDisease / (len(df.target))*100)))
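
The same percentages can also be obtained in one step with `value_counts(normalize=True)`. This is a small sketch on a made-up `target` series, not the real column:

```python
import pandas as pd

# Made-up target values standing in for df.target (illustrative only).
target = pd.Series([1, 0, 1, 1, 0], name="target")

# normalize=True returns fractions instead of counts; scale them to percentages.
percentages = target.value_counts(normalize=True).mul(100).round(2)
print(percentages)
```

This avoids counting each class by hand and extends to targets with more than two classes.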


Since target = 1 means the patient has heart disease, we will use matplotlib to plot how often heart disease occurs at each age.

In [None]:
#Finding heart disease frequency with the age parameter

pd.crosstab(df.age,df.target).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()
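
With one bar pair per individual age the plot can get crowded. One option is to group ages into bins before building the crosstab. The sketch below assumes made-up rows and arbitrary bin edges chosen only for illustration:

```python
import pandas as pd

# Made-up ages and targets standing in for the real data.
toy = pd.DataFrame({
    "age": [29, 41, 45, 54, 58, 63, 67, 77],
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# pd.cut assigns each age to a right-closed interval, e.g. (40, 50] -> "41-50".
bins = [20, 40, 50, 60, 70, 80]
labels = ["21-40", "41-50", "51-60", "61-70", "71-80"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Same crosstab as above, but per age group instead of per age.
age_table = pd.crosstab(toy["age_group"], toy["target"])
print(age_table)
```

Calling `.plot(kind="bar")` on `age_table` would then give one bar pair per age group.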

In [None]:
#Finding all the unique values of the 'chest_pain_type' parameter
df.chest_pain_type.unique()

Now we will use plotly to check the count of each 'chest_pain_type'. Unlike seaborn or matplotlib plots, plotly plots are interactive: move the cursor over the plot below to see the values.

In [None]:
#Finding the count of each 'chest_pain_type'
c = df["chest_pain_type"].value_counts()
labels = c.index
fig = px.bar(c, title = "Chest pain type", text = c)
fig.show()

The graph below uses matplotlib, so moving the cursor over the plot will not show the values as it does in plotly.

In [None]:
#Relation of heart disease with chest_pain_type. Here chest_pain_type 2 shows the highest frequency of heart disease
pd.crosstab(df.chest_pain_type,df.target).plot(kind="bar",figsize=(15,6))
plt.title('Heart Disease Frequency for chest_pain_type')
plt.xlabel('Chest_pain_type (0, 1, 2, 3)')
plt.xticks(rotation=0)
plt.legend(["No Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()

In [None]:
#Relation of heart disease with sex
pd.crosstab(df.sex,df.target).plot(kind="bar",figsize=(15,6))
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["No Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()

Here plotly's scatter plot is used to explore the relation between max_heart_rate_achieved and age. Among patients with heart disease, younger patients tend to reach higher maximum heart rates than older patients.

In [None]:
#Using scatter plot to find relation of max_heart_rate achieved with age. Here yellow color is used for target = 1 and purple for target = 0
fig = px.scatter(df, x="age", y="max_heart_rate_achieved", color = "target") 
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='MediumPurple')),
                  selector=dict(mode='markers'))
fig.show()

In [None]:
#Finding frequency of heart disease with Fasting blood sugar. 
pd.crosstab(df.fasting_blood_sugar,df.target).plot(kind="bar",figsize=(15,8))
plt.title('Heart Disease Frequency According To Fasting Blood Sugar')
plt.xlabel('FBS - (Fasting Blood Sugar > 120 mg/dl) (1 = true; 0 = false)')
plt.xticks(rotation = 0)
plt.legend(["No Disease", "Have Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()

In [None]:
#Counting the patients in each fasting blood sugar category
fb = df["fasting_blood_sugar"].value_counts()
labels = fb.index
fig = px.bar(fb, title = "fasting blood sugar", text = fb)
fig.show()



Move the cursor over the plot to see the exact values, such as age and cholesterol, for each target.

In [None]:
#Using scatter plot to find relation between Cholesterol and Age with yellow being target = 1 and blue being target = 0
fig = px.scatter(df, x="age", y="cholesterol", color = "target") 
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color= 'DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()

As we can see, cholesterol values mostly lie between 200 and 400, and the maximum heart rate achieved tops out at about 200. Most patients with heart disease have a maximum heart rate between 140 and 180.

In [None]:
#Relation between cholesterol and maximum heart rate achieved, colored by target (0 or 1)
plt.figure(figsize=(8,6))
sns.scatterplot(x='cholesterol',y='max_heart_rate_achieved',data=df,hue='target')
plt.show()


The plot below shows cholesterol against resting blood pressure, split into separate subplots for male and female patients and colored by the target parameter.

In [None]:
#Faceting splits the plot into multiple subplots based on the values of a particular column. Here we use facet_col on 'sex', so we end up with 2 subplots, one per value of 'sex'
fig = px.scatter(df, x = 'resting_blood_pressure', y = 'cholesterol', title='Cholesterol vs Blood Pressure', 
                 facet_col = 'sex', # the name of the column in the dataframe whose values are used for creating subplots
                 color = 'target')
fig.show()


In [None]:
#Relation of num_major_vessels with target = 1
fig = px.bar(df, x = "num_major_vessels", y = "target")
fig.show()

In [None]:
#'chest_pain_type', 'thalassemia' and 'st_slope' are categorical variables, so we'll turn them into dummy variables.
a = pd.get_dummies(df['chest_pain_type'], prefix = "chest_pain_type")
b = pd.get_dummies(df['thalassemia'], prefix = "thalassemia")
c = pd.get_dummies(df['st_slope'], prefix = "st_slope")
frames = [df, a, b, c]
df = pd.concat(frames, axis = 1)
df.head()

In [None]:
#Dropping the categorical variable
df = df.drop(columns = ['chest_pain_type', 'thalassemia', 'st_slope'])
df.head()
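
The encode, concatenate and drop steps above can also be collapsed into a single `pd.get_dummies` call by passing `columns=`. A minimal sketch on a made-up frame (the values and the reduced set of columns are only for illustration):

```python
import pandas as pd

# Made-up rows with two of the categorical columns used above.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chest_pain_type": [3, 2, 1],
    "st_slope": [0, 0, 2],
})

# columns= encodes the listed columns, prefixes the dummies with the column
# name, and drops the originals in one step. dtype=int yields 0/1 integers.
encoded = pd.get_dummies(toy, columns=["chest_pain_type", "st_slope"], dtype=int)
print(encoded.columns.tolist())
```

Only the category values that actually occur get a dummy column, so this toy frame has no `st_slope_1`.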

In [None]:
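#Allotting target as 'y'
y = df.target.values
#Cast to float so the dummy columns (boolean in recent pandas versions) can be scaled
x_data = df.drop(columns = ['target']).astype(float)
#Min-max normalize each column to the [0, 1] range
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())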

#Finding the shape of 'x'
x.shape
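
To make the min-max normalization above concrete, here is a small worked sketch on made-up numbers. Each column is rescaled separately as (value - column minimum) / (column maximum - column minimum):

```python
import pandas as pd

# Made-up values, chosen so the arithmetic is easy to check by hand.
toy = pd.DataFrame({
    "age": [30.0, 50.0, 70.0],
    "cholesterol": [200.0, 250.0, 400.0],
})

# DataFrame.min() and DataFrame.max() work column by column, so every
# column ends up in the [0, 1] range regardless of its original units.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
# age:         (50 - 30) / (70 - 30)     = 0.5
# cholesterol: (250 - 200) / (400 - 200) = 0.25
```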



In [None]:
#We will split our data. 80% of our data will be train data and 20% of it will be test data
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2,random_state=0)
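
With a small dataset, a plain random split can leave the test set with a different share of positive cases than the training set. A possible variation, not used in this notebook, is to pass `stratify=y` so both parts keep the class ratio. The sketch below uses made-up data with a 45/55 class balance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 45 negatives and 55 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 45 + [1] * 55)

# stratify=y keeps the 45/55 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_tr), len(X_te), y_te.mean())
```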



In [None]:
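#Transpose the matrices (the fit and score calls below transpose them back, so the models still see one row per patient)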
x_train = x_train.T
y_train = y_train.T
x_test = x_test.T
y_test = y_test.T

#Using decision tree algorithm for prediction
dtc = DecisionTreeClassifier()
dtc.fit(x_train.T, y_train.T)

acc = dtc.score(x_test.T, y_test.T)*100
print("Decision Tree Test Accuracy {:.2f}%".format(acc))
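
Accuracy alone does not show which kind of mistake the tree makes. A confusion matrix breaks the test predictions down by true and predicted class. This is a sketch on synthetic data from `make_classification`, used as a stand-in for the heart data, with `random_state=0` added as an assumption to make the run repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (the real features are not used here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(y_te, tree.predict(X_te), labels=[0, 1])
print(cm)
```

The diagonal holds the correct predictions, so the diagonal sum divided by the total equals the accuracy reported by `score`.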



Let's try to improve our test accuracy by using a random forest classifier.

In [None]:
#Using Random Forest Classifier for prediction since decision tree classifier has low accuracy
rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(x_train.T, y_train.T)

acc = rf.score(x_test.T,y_test.T)*100

print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(acc))
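
A single test split of about 60 patients gives a noisy accuracy estimate. One way to get a steadier number is k-fold cross-validation, and the fitted forest also exposes feature importances. The sketch below uses synthetic data as a stand-in for the heart data and 100 trees instead of 1000 to keep it quick; both choices are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data (the real features are not used here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=1)

# 5-fold cross-validation: train on 4 folds, score on the 5th, repeat 5 times.
scores = cross_val_score(forest, X, y, cv=5)
print("Mean CV accuracy: {:.2f}%".format(scores.mean() * 100))

# Fit once on all rows to read the impurity-based feature importances.
forest.fit(X, y)
importances = forest.feature_importances_
print(importances.round(3))
```

On the real data, pairing `importances` with the column names of `x` would show which features the forest relies on most.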
