# **Predicting Heart Disease** *by Suparna Kompalli*

Cardiovascular disease is the number one cause of death globally, taking over 17 million lives in a year. Heart failure is a common consequence of cardiovascular disease. Early detection is critical in treating heart disease. This project uses Python to create a classifier from scratch that will predict heart failure in an individual.

The first thing we want to do is import all of the libraries we need to perform our analysis.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Next, we read in our data.

In [None]:
data = pd.read_csv("/kaggle/input/heart-failure-prediction/heart.csv")

Now it's time to do some basic data exploration:

In [None]:
data.info()

In [None]:
data.describe()
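Summary statistics can hide placeholder values, so it may help to count nulls and exact zeros per column as well. The sketch below uses a small hypothetical frame (`toy`, with made-up rows that are not taken from heart.csv); whether zeros act as placeholders in the real dataset is an assumption to check, not something this notebook establishes.

```python
import pandas as pd

# Hypothetical toy rows (not taken from heart.csv), used only to show the checks.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 54],
    "Cholesterol": [289, 0, 283, 0],
    "MaxHR": [172, 156, None, 122],
})

# Missing entries per column.
null_counts = toy.isnull().sum()

# Exact zeros per column; a zero can be a placeholder for "not recorded".
zero_counts = (toy == 0).sum()

print(null_counts)
print(zero_counts)
```

The same two lines can be run on `data` to see whether any column needs cleaning before modeling.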

Here we are breaking up our data into numerical and categorical data.

In [None]:
data_num = data[['Age','RestingBP','Cholesterol','MaxHR','Oldpeak',"HeartDisease"]]
data_cat = data[['Sex','RestingECG','ExerciseAngina','FastingBS','ST_Slope']]

Now we plot histograms to see the distribution of the numerical data in our dataset.

In [None]:
for i in data_num.columns:
    plt.hist(data_num[i])
    plt.title(i)
    plt.show()

In [None]:
pd.pivot_table(data, index = 'HeartDisease',values = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak'])

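Comparing the group averages above:
- Oldpeak (ST): the ST segment is basically a stretch of the heartbeat where the ECG trace is flat; people with heart disease have a higher value here.
- MaxHR: lower for people with heart disease.
- Cholesterol: higher in those without heart disease.
- Age and RestingBP: around the same in both groups.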
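The per-group averages can also be computed with `groupby`, which some readers find easier to follow than `pivot_table`. This is a minimal sketch on a hypothetical frame (`toy`, with made-up values), showing that both calls give the same means.

```python
import pandas as pd

# Hypothetical toy rows (not taken from heart.csv).
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 1, 1],
    "MaxHR": [170, 150, 120, 130],
    "Oldpeak": [0.0, 0.5, 1.5, 2.0],
})

# pivot_table averages the value columns by default.
by_pivot = pd.pivot_table(toy, index="HeartDisease", values=["MaxHR", "Oldpeak"])

# The same averages via groupby.
by_group = toy.groupby("HeartDisease")[["MaxHR", "Oldpeak"]].mean()

print(by_pivot)
print(by_group)
```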

In [None]:
for i in data_cat.columns:
    counts = data_cat[i].value_counts()
    # Recent seaborn versions require x and y to be passed as keywords
    sns.barplot(x=counts.index, y=counts.values).set_title(i)
    plt.show()

Now it's time to do some feature engineering:

In [None]:
data['RestingECG_adv'] = data.RestingECG.apply(lambda x: str(x)[0])
print(data.RestingECG_adv.value_counts())
pd.pivot_table(data,index='HeartDisease',columns='RestingECG_adv', values = 'Age', aggfunc='count')

In [None]:
print(pd.pivot_table(data, index = 'HeartDisease', columns = 'RestingECG', values = 'Age' ,aggfunc ='count'))
print()
print(pd.pivot_table(data, index = 'HeartDisease', columns = 'Sex', values = 'Age' ,aggfunc ='count'))
print()
print(pd.pivot_table(data, index = 'HeartDisease', columns = 'ST_Slope', values = 'Age' ,aggfunc ='count'))
print()
print(pd.pivot_table(data, index = 'HeartDisease', columns = 'ExerciseAngina', values = 'Age' ,aggfunc ='count'))
print()
print(pd.pivot_table(data, index = 'HeartDisease', columns = 'FastingBS', values = 'Age' ,aggfunc ='count'))
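Counting rows per category pair is what `pd.crosstab` is designed for, and its `normalize` argument turns the counts into proportions, which are easier to compare when categories differ in size. A minimal sketch on a hypothetical frame (`toy`, made-up rows):

```python
import pandas as pd

# Hypothetical toy rows (not taken from heart.csv).
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 0, 1, 1, 1, 1, 1],
    "ExerciseAngina": ["N", "N", "Y", "Y", "Y", "Y", "N", "Y"],
})

# Raw counts: rows are HeartDisease, columns are ExerciseAngina.
counts = pd.crosstab(toy["HeartDisease"], toy["ExerciseAngina"])

# Share of each outcome within every ExerciseAngina category.
shares = pd.crosstab(toy["HeartDisease"], toy["ExerciseAngina"], normalize="columns")

print(counts)
print(shares)
```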

Creating a classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from statistics import mode  # most frequent label among the k neighbors

In [None]:
#80% train 20% test
num_rows = len(data[['Age']])
train_rows = round(num_rows*0.8)
test_rows = num_rows - train_rows
train_rows, test_rows

In [None]:
# Splitting the data frame into train and test sets
# Narrowing data_num to the most defining features
data_num = data_num[['Cholesterol','MaxHR','Oldpeak',"HeartDisease"]]
data_num_rearranged = data_num.sample(frac=1)  # shuffle every row
# Copies keep later column assignments from touching data_num_rearranged
features_train = data_num_rearranged.iloc[:train_rows].copy()
features_test = data_num_rearranged.iloc[train_rows:].copy()
train_labels = features_train.columns
train_labels
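Because the shuffle is random, every run produces a different split and therefore a slightly different accuracy. If repeatable results are wanted, one option is to pass a seed to `sample`. The sketch below uses a hypothetical frame (`toy`), and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy rows (not taken from heart.csv).
toy = pd.DataFrame({"MaxHR": range(100, 110), "HeartDisease": [0, 1] * 5})

# frac=1 shuffles all rows; random_state fixes the shuffle order.
shuffled = toy.sample(frac=1, random_state=0)

# 80% train, 20% test.
n_train = round(len(shuffled) * 0.8)
train = shuffled.iloc[:n_train]
test = shuffled.iloc[n_train:]

print(len(train), len(test))
```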

In [None]:
#HeartDisease: output class [1: heart disease, 0: Normal]
# Calculate the Euclidean distance between two rows
def euclidean_dist(row1, row2):
    distance = 0.0
    # zip pairs values by position and stops at the shorter row
    for a, b in zip(row1, row2):
        distance += (a - b)**2
    return distance**0.5
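For larger tables, the same distance can be computed without a Python loop. The sketch below is one possible NumPy version; the names `euclidean_dist_np`, `rows`, and `point` are illustrative and not part of the notebook.

```python
import numpy as np

def euclidean_dist_np(a, b):
    """Euclidean distance between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Classic 3-4-5 right triangle as a sanity check.
print(euclidean_dist_np([0, 0], [3, 4]))  # 5.0

# Distances from one point to every row at once.
rows = np.array([[3.0, 4.0], [6.0, 8.0]])
point = np.array([0.0, 0.0])
all_dists = np.linalg.norm(rows - point, axis=1)
print(all_dists)
```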
    

In [None]:
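def most_common(lst):
    return mode(lst)


def find_knn(test_row, df, k):
    # Use only the feature columns so the label never enters the distance
    feature_cols = [c for c in df.columns if c not in ("HeartDisease", "dist")]
    test_features = test_row[:len(feature_cols)]  # test_row ends with its label
    distances = [euclidean_dist(test_features, row)
                 for row in df[feature_cols].to_numpy()]
    # assign returns a new frame, so the caller's df is left unchanged
    nearest = df.assign(dist=distances).sort_values(by="dist").iloc[:k]
    return most_common(nearest["HeartDisease"])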
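Euclidean distance adds up raw squared differences, so a feature measured in the hundreds (such as Cholesterol) contributes far more than one measured in single digits (such as Oldpeak). One common remedy is to standardize each feature with the training set's mean and standard deviation before measuring distances. This is a sketch on hypothetical values, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical training features (not taken from heart.csv).
train = pd.DataFrame({
    "Cholesterol": [200.0, 240.0, 280.0],
    "Oldpeak": [0.0, 1.0, 2.0],
})

# Statistics come from the training rows only.
means = train.mean()
stds = train.std()  # sample standard deviation (ddof=1)

# Each column now has mean 0 and standard deviation 1.
train_scaled = (train - means) / stds

# A new row is scaled with the same training statistics.
new_row = pd.Series({"Cholesterol": 260.0, "Oldpeak": 1.5})
new_scaled = (new_row - means) / stds

print(train_scaled)
print(new_scaled)
```

After scaling, a step of 40 in Cholesterol and a step of 1 in Oldpeak count equally in this toy example.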
        

Time to test the classifier.

In [None]:
def test_classifier(k): 
    accuracy = []
    for i in range(len(features_test)):
        test_row = features_test.iloc[i].tolist()
        test_val = test_row[-1]
        prediction_val = find_knn(test_row, features_train, k)
        if (test_val == prediction_val):
            accuracy.append(1.0)
        else:
            accuracy.append(0.0)
    prop_correct =  sum(accuracy)/len(accuracy)
    return prop_correct

In [None]:
print(test_classifier(5))

As we can see, our classifier is about 65% accurate. From here, we can modify the features chosen and even the *k-value* to see if we can improve the classifier to suit our needs.

Let's try changing the k-value:

In [None]:
print(test_classifier(7))

We can see that changing our k-value actually increased the accuracy of our classifier. The k-value is only one lever, though: the choice of features also changes the accuracy, and because the rows are shuffled randomly, the exact figures vary from run to run.
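Rather than trying one k at a time, we can loop over several values and compare. The sketch below is self-contained and uses a compact NumPy version of k-nearest neighbors on hypothetical one-dimensional data; `knn_predict`, `accuracy`, and all the values are illustrative and separate from the functions defined earlier.

```python
import numpy as np
from statistics import mode

def knn_predict(train_X, train_y, point, k):
    # Distance from the query point to every training row.
    dists = np.linalg.norm(train_X - point, axis=1)
    # Indices of the k closest rows (stable sort keeps ties in row order).
    nearest = np.argsort(dists, kind="stable")[:k]
    # Majority label among those neighbors.
    return mode(train_y[nearest].tolist())

def accuracy(train_X, train_y, test_X, test_y, k):
    hits = [int(knn_predict(train_X, train_y, p, k) == y)
            for p, y in zip(test_X, test_y)]
    return sum(hits) / len(hits)

# Hypothetical toy data: the last training row is a mislabeled-looking outlier.
train_X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [2.5]])
train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
test_X = np.array([[0.5], [2.6], [11.5]])
test_y = np.array([0, 0, 1])

for k in (1, 3, 5):
    print(k, accuracy(train_X, train_y, test_X, test_y, k))
```

In this toy case, k=1 is misled by the single outlier while k=3 and k=5 outvote it, which is one reason a larger k can help on noisy data.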

In conclusion, this classifier can be extended to predict heart disease in patients. Such a tool can be critical for early detection and prevention of heart disease, since many people do not know they have it until the later stages. Similar prediction tools can be used to detect more than just heart disease. Fully utilizing data science in the medical field brings a world of possibilities too profound not to explore.