# Heart Disease Prediction

## Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.inspection import permutation_importance

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load data
df = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()

## Clean the Data

In [None]:
df.isnull().sum()

In [None]:
# Drop rows with NaN values
df = df.dropna()

In [None]:
# Drop variables that will not be used in the model
df = df.drop(['id', 'work_type', 'ever_married'], axis=1)

In [None]:
# View summary statistics for all columns (numerical and categorical)
df.describe(include = 'all')

To include categorical data in the model, we use dummy variables to turn each categorical field into a set of binary true/false columns. This changes the data from 9 columns to 15 columns.

In [None]:
# Create dummy variables for the non-numerical columns
df = pd.get_dummies(df)

df.describe(include='all')
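
To make the column expansion concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names mirror the dataset. It assumes the default behaviour of `pd.get_dummies`, which encodes every non-numeric column and leaves numeric columns unchanged.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "age": [67, 45, 80],
    "gender": ["Male", "Female", "Female"],
    "smoking_status": ["smokes", "never smoked", "smokes"],
})

# Each distinct category value becomes its own indicator column;
# the numeric "age" column passes through unchanged
encoded = pd.get_dummies(toy)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Three input columns become five here: one numeric column plus two indicator columns for each of the two categorical fields.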

To get a better picture of the data, this loop creates a distribution plot for each feature. It shows the frequency histogram of our numerical variables and how our categorical data is split between the two values of each dummy column.

In [None]:
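# Plot the distribution of every column.
# sns.histplot replaces sns.distplot, which has been deprecated since seaborn 0.11.
plt.figure(figsize=(20, 25))
plotnumber = 1

for column in df:
    if plotnumber <= 16:  # the 4x4 grid holds up to 16 plots, enough for all 15 columns
        ax = plt.subplot(4, 4, plotnumber)
        # Cast to float so boolean dummy columns are plotted as 0/1 values
        sns.histplot(df[column].astype(float), kde=True)
        plt.xlabel(column, fontsize=20)
        plt.ylabel('Count', fontsize=20)
    plotnumber += 1
plt.show()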

Before we build our model, it is important to check the correlation coefficients of our features. A coefficient close to 0 indicates a weak linear relationship, while a coefficient close to 1 or -1 indicates a strong one. When two features are strongly correlated with each other, it becomes harder for the model to estimate the separate relationship between each independent variable and the dependent variable.

In [None]:
# Correlation matrix

plt.figure(figsize=(16, 8))

corr = df.corr()
# Mask the upper triangle so each pair of features is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, cmap='YlGnBu', mask=mask, annot=True, fmt='.2g', linewidths=1)
plt.show()
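
A heatmap with many cells can be hard to scan, so it may help to list the strongly correlated pairs directly. The sketch below is one possible way to do that; the helper name `strong_pairs`, the 0.8 threshold, and the demo data are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 2)) for (a, b), r in strong.items()]

# Hypothetical data: "b" is the exact negative of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": -a, "c": rng.normal(size=200)})

found = strong_pairs(demo)
print(found)
```

Calling `strong_pairs(df)` on the notebook's data frame would return the same kind of list for the real features, under the same threshold assumption.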

To train the model and test its accuracy, we split the data. 80% of the records are used to train the model to predict whether an individual has heart disease. The remaining 20% of records are used to test how well the model performs.

To put our numeric features on a common scale, we standardize the data set (zero mean, unit variance) before training the model.

In [None]:
# Split data into inputs and target result
input = df.drop(["heart_disease"], axis=1)
target = df["heart_disease"]

In [None]:
# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
input_scaled = scaler.fit_transform(input)

In [None]:
# Split the data into train and test sets with an 80/20 ratio
X_train, X_test, y_train, y_test = train_test_split(input_scaled, target, test_size=0.2, random_state=0)
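
A common variation is to split first and fit the scaler on the training rows only, so that no statistics from the test rows influence preprocessing. The sketch below shows that order on synthetic data; the names `X_demo`, `y_demo`, and `demo_scaler` and all values are hypothetical, and this is an alternative to the cells above rather than a description of them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
```

For a random forest the difference is likely small, since tree splits do not depend on feature scale, but the split-then-scale order carries over safely to models that are scale-sensitive.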

## Train Classifier

Because our target is whether an individual has heart disease or not, we train a classifier to predict which category each record belongs to. After training the model, we can see which of the features we included have the most significant impact. The importances sum to 1, so a feature with a value closer to 1 carries more weight. This can be used to simplify the model if a feature appears irrelevant.

In [None]:
# Fit a random forest classifier on train set and predict on test set
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

target_predict = clf.predict(X_test)

In [None]:
# Show feature importance, sorted from most to least important
clf_summary = pd.DataFrame(input.columns.values, columns=['Features'])
clf_summary['weight'] = clf.feature_importances_
clf_summary.sort_values('weight', ascending=False)
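
The notebook imports `permutation_importance` but does not call it. As a hedged sketch of how it could complement the impurity-based weights above, the example below compares both measures on synthetic data. The feature names, the `make_classification` settings, and `n_repeats=5` are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(5)]
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_a, y_a)

# Impurity-based importances are non-negative and sum to 1 across features
impurity = pd.Series(forest.feature_importances_, index=names)

# Permutation importance: mean drop in held-out score when one column is shuffled
perm = permutation_importance(forest, X_b, y_b, n_repeats=5, random_state=0)
permuted = pd.Series(perm.importances_mean, index=names)

summary = pd.DataFrame({"impurity": impurity, "permutation": permuted})
print(summary.sort_values("impurity", ascending=False))
```

Permutation importance is computed on held-out data, so a feature that scores near zero there is a candidate for removal even if its impurity-based weight is not the smallest.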

## Model Accuracy

Now we test the accuracy of the model on the test data. The classification report shows that the current model is 95% accurate.

In [None]:
# Output classification metrics
print(classification_report(y_test, target_predict))
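
Accuracy on its own can look strong when one class is much rarer than the other, so it is worth comparing against a trivial baseline. The sketch below uses made-up labels (95 negatives, 5 positives, not taken from the dataset) to show how a model that always predicts the majority class scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
positive_recall = recall_score(y_true, y_majority)
print(baseline_accuracy, positive_recall)  # 0.95 0.0
```

This is why the per-class precision and recall rows of the classification report deserve as much attention as the overall accuracy figure.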

Below we can see where the model miscategorized the data. This model was wrong on 4 records: 3 of the records were wrongly labeled as no heart disease, and only 1 record was wrongly predicted to have heart disease.

In [None]:
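# Cross-tabulate actual labels (rows) against predicted labels (columns)
conf_matrix = pd.crosstab(y_test, target_predict, rownames=['Actual'], colnames=['Predicted'])
conf_matrix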
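
To read the error counts programmatically rather than by eye, the four cells of a binary confusion matrix can be unpacked with scikit-learn. The labels and predictions below are hypothetical and serve only to show the cell order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual labels and predictions
actual = [0, 0, 0, 0, 1, 1, 0, 1]
predicted = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
errors = fp + fn
print(tn, fp, fn, tp, errors)  # 4 1 2 1 3
```

False negatives (`fn`) are records with the condition that the model missed, and false positives (`fp`) are records wrongly flagged as having it.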