**I believe that this is where the data comes from since it has all the same features: http://rstudio-pubs-static.s3.amazonaws.com/24341_184a58191486470cab97acdbbfe78ed5.html**

**I also believe that the same dataset was posted here on Kaggle 3 years ago. If it is, that is a bit worrying, because this one would then be a copy of another dataset, and in the Discussions of the other dataset people were saying that the target values were actually swapped.
 https://www.kaggle.com/ronitf/heart-disease-uci**




You can use this link to see the definitions of all the features as well, just scroll down a tiny bit and they should all be there. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()

In [None]:
df.info()

There are no null values in the dataset. Let's see a statistical summary of the data.

In [None]:
df.drop(['sex', 'output', 'fbs', 'restecg', 'exng'], axis = 1).describe()

Lastly, let's check whether the values of the target output are balanced.

In [None]:
plt.figure(figsize = (10, 6))
df_o = df.copy()
df_o['output'] = df_o['output'].map({0:'Lower chance of Heart Attack', 1:'Higher chance of Heart Attack'})
sns.countplot(data = df_o, x = 'output')

Seems like the target feature is fairly balanced so we won't need to worry about imbalanced data.
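
To put a number on "fairly balanced", one option is to look at class shares instead of bar heights. This is only a sketch: the toy labels below are made up (they are not the real `output` column) and the helper name `class_balance` is my own.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, largest class first."""
    return labels.value_counts(normalize=True)

# Hypothetical toy labels for illustration only (not the real dataset)
toy_output = pd.Series([1, 1, 1, 0, 0], name="output")
shares = class_balance(toy_output)
print(shares)
```

On the real data this would be called as `class_balance(df['output'])`.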

**First let's take a look at the distribution of age and gender of everyone in the dataset so we can figure out what we're dealing with**

In [None]:
plt.figure(figsize = (14, 8),  dpi = 200)
ax = sns.countplot(data = df, x = 'age', )
ax.set(ylim = (0, 20))
plt.yticks(np.arange(0, 20));

Seems like most ages are in the range of 51 - 59 years old. 
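
A quick way to check a claim like this without reading individual bars is to count rows per 10-year band. The sketch below assumes ages fall between 20 and 89 and uses made-up ages; `count_by_decade` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def count_by_decade(ages: pd.Series) -> pd.Series:
    """Count rows per 10-year age band, e.g. [50, 60)."""
    bins = list(range(20, 91, 10))  # assumed age range: 20 to 89
    bands = pd.cut(ages, bins=bins, right=False)
    return bands.value_counts().sort_index()

# Hypothetical toy ages for illustration only
toy_ages = pd.Series([34, 45, 52, 54, 57, 58, 61, 66])
decade_counts = count_by_decade(toy_ages)
print(decade_counts)
```

On the real data this would be `count_by_decade(df['age'])`.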

In [None]:
plt.figure(figsize = (12, 7), dpi = 100)
sns.countplot(data = df, x = 'sex', palette='spring')

It's not clear from the data which gender is which but there is an imbalance of about 110 meaning that the data may be more biased towards one gender. 

**What type of chest pain causes an exercise induced angina? How correlated are these features with the Chance of a Heart Attack?**

In [None]:
df_c = df[['cp', 'exng', 'output']].copy()
df_c['cp'] = df_c['cp'].map({0:'typical angina', 1:'atypical angina', 2:'non-anginal pain', 3:'asymptomatic'})
df_c['exng'] = df_c['exng'].map({0:'No', 1:'Yes'})
df_c.columns = ['Type of Chest Pain', 'Exercise Induced Angina', 'Chance of Heart Attack']
df_c['Chance of Heart Attack'] = df_c['Chance of Heart Attack'].map({0:'Lower', 1:'Higher'})
df_c.head()

In [None]:
plt.figure(figsize=(12, 7), dpi = 150)
sns.countplot(data = df_c, x = 'Type of Chest Pain', hue = 'Exercise Induced Angina', palette = 'Accent');

So there is clearly a strong association between the type of chest pain and whether angina is induced by exercise. Someone with typical angina will most likely also experience angina during exercise, but someone with any other type of chest pain most likely will not.

In [None]:
fig, ax = plt.subplots(ncols = 2, dpi = 100, figsize = (13, 6))
sns.countplot(data = df_c, x = 'Type of Chest Pain', hue = 'Chance of Heart Attack', ax = ax[0], palette = 'autumn');
ax[0].set_xticklabels(ax[0].get_xticklabels(), fontsize = 8)
sns.countplot(data = df_c, x = 'Exercise Induced Angina', hue = 'Chance of Heart Attack', ax = ax[1], palette = 'autumn');

... I think this is starting to support the theory that the data for the chance of a heart attack is actually swapped. According to the first graph, people who experience typical angina have a lower chance of a heart attack than those who do not experience typical angina. Not only that, but according to the second graph, people who do not get angina from exercising are more likely to have a heart attack. Both of these conclusions are the opposite of what they should be.
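
The same comparison can be read off as numbers instead of bar heights by taking the share of `output == 1` rows within each group. The sketch below uses a small made-up frame (not the real data), and `positive_rate_by_group` is a hypothetical helper name.

```python
import pandas as pd

def positive_rate_by_group(frame: pd.DataFrame, group_col: str, target_col: str) -> pd.Series:
    """Share of rows with target == 1 inside each group of group_col."""
    return frame.groupby(group_col)[target_col].mean()

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "exng":   [0, 0, 0, 1, 1, 1],
    "output": [1, 1, 0, 0, 0, 1],
})
rates = positive_rate_by_group(toy, "exng", "output")
print(rates)
```

On the real data, `positive_rate_by_group(df, 'exng', 'output')` and `positive_rate_by_group(df, 'cp', 'output')` would give the rates behind the two plots above.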

**Does age affect cholesterol and blood pressure levels? How do these features relate to the chance of a heart attack?**

In [None]:
df_age = df[['age', 'trtbps', 'chol', 'output']].copy()
df_age.columns = ['Age', 'Blood Pressure', 'Cholesterol in mg/dl', 'Chance of Heart Attack']
df_age['Chance of Heart Attack'] = df_age['Chance of Heart Attack'].map({0:'Lower', 1:'Higher'})
df_age = df_age.sort_values(by = 'Chance of Heart Attack', ascending=False)
df_age.head()

In [None]:
sns.jointplot(data = df_age, x = 'Age', y = 'Cholesterol in mg/dl', hue = 'Chance of Heart Attack', palette='dark', height = 10, s = 100, alpha = 0.5)
sns.jointplot(data = df_age, x = 'Age', y = 'Blood Pressure', hue = 'Chance of Heart Attack', palette='dark', height = 10, s = 100, alpha = 0.5)
sns.jointplot(data = df_age, x = 'Cholesterol in mg/dl', y = 'Blood Pressure', hue = 'Chance of Heart Attack', palette='dark', height = 10, s = 100, alpha = 0.5)


It seems that these features have little correlation with each other.

From the first graph we can see that age does not seem to have much of a relation with cholesterol, and neither variable has much effect on whether the person has a higher or lower chance of a heart attack.

The same result applies to the second graph. Age and blood pressure have little correlation and don't seem to really affect the chance of a heart attack.

The third graph shows us that cholesterol and blood pressure also have little relation with each other.

I do want to mention that this is fairly surprising. It's common knowledge that high blood pressure and high cholesterol levels increase the chance of a heart attack, and yet the graphs don't seem to support this.

Something interesting to note here, though, is the patient with a cholesterol level of about 580 mg/dl! Normal levels of cholesterol are 200 mg/dl or less, so 580 is a significantly high amount of cholesterol. We'll keep this patient because cholesterol levels of 580 mg/dl are possible.
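
If you want to flag values like this programmatically rather than by eye, one common convention is the 1.5 x IQR rule. This is a sketch of that rule, not something the analysis above relies on; the values are made up and `iqr_outliers` is a hypothetical helper.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical toy cholesterol readings in mg/dl (not the real column)
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 580])
outliers = iqr_outliers(toy_chol)
print(outliers)
```

Flagging a value this way only says it is unusual for the sample; whether to keep it is still a judgment call, as with the patient above.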

**Let's move on to seeing how the maximum heart rate achieved and electrocardiographic results affect the chances of someone having a heart attack.**

In [None]:
df_heart = df[['thalachh', 'restecg', 'output']].copy()
df_heart.columns = ['Maximum Heart Rate Achieved', 'Resting Electrocardiographic Results', 'Chance of Heart Attack']
df_heart['Chance of Heart Attack'] = df_heart['Chance of Heart Attack'].map({0:'Lower', 1:'Higher'})
df_heart = df_heart.sort_values(by = 'Chance of Heart Attack', ascending=False)
df_heart['Resting Electrocardiographic Results'] = df_heart['Resting Electrocardiographic Results'].map({0:'Normal', 1:'ST-T Wave Abnormality', 2:'Probable or Definite Left Ventricular Hypertrophy'})
df_heart.head()

First, let's just see the relation between 'Maximum Heart Rate Achieved' and 'Resting Electrocardiographic Results'.

In [None]:
plt.figure(figsize=(11, 7), dpi = 150)
sns.barplot(data = df_heart, x = 'Resting Electrocardiographic Results', y = 'Maximum Heart Rate Achieved', palette = 'Dark2')

There doesn't seem to be much of a correlation between the electrocardiographic results and the maximum heart rate. Let's move on to comparing the 2 features to the chances of having a heart attack.

In [None]:
plt.figure(figsize=(12, 8), dpi = 160)
sns.boxplot(data = df_heart, x = 'Resting Electrocardiographic Results', y = 'Maximum Heart Rate Achieved', hue = 'Chance of Heart Attack', palette = 'rainbow')

Resting Electrocardiographic Results does not appear to have much of a relation to the chance of heart attack. However, the maximum heart rate achieved has a significant correlation with the chance of a heart attack. As the maximum heart rate increases, there is a higher chance that the patient will have a heart attack.

One question, though: why does the data for 'Probable or Definite Left Ventricular Hypertrophy' have such small boxplots? Let's take a closer look at the data for these rows.

In [None]:
df_heart[df_heart['Resting Electrocardiographic Results'] == 'Probable or Definite Left Ventricular Hypertrophy']

It seems that there are only 4 rows in the entire dataset that fall under 'Probable or Definite Left Ventricular Hypertrophy' in the electrocardiographic results. Without enough data for left ventricular hypertrophy, we cannot make any conclusions from the boxplot with this value.
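
A small check like the one below can catch thin categories before plotting them. It is only a sketch: the threshold of 10 rows is an arbitrary assumption, the labels are made up, and `small_groups` is a hypothetical helper.

```python
import pandas as pd

def small_groups(categories: pd.Series, min_rows: int = 10) -> pd.Series:
    """Return the categories that have fewer than min_rows rows, with their counts."""
    counts = categories.value_counts()
    return counts[counts < min_rows]

# Hypothetical toy labels for illustration only
toy_ecg = pd.Series(["Normal"] * 12 + ["ST-T"] * 11 + ["LVH"] * 4)
too_small = small_groups(toy_ecg)
print(too_small)
```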

**Let's now check how the features 'Slope' and 'oldpeak' affect the chance of a heart attack. Since these 2 features are not defined in the description, I will be using the definition from the link I gave at the top.**

Oldpeak: ST depression induced by exercise relative to rest.

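Slope: Slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping). In this CSV the values run from 0 to 2, so the code below maps 0, 1, and 2 to these three categories in the same order.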

In [None]:
df_exer = df[['oldpeak', 'slp', 'output']].copy()
df_exer.columns = ['ST depression induced by exercise relative to rest', 'Slope of the peak exercise ST segment', 'Chance of Heart Attack']
df_exer['Chance of Heart Attack'] = df_exer['Chance of Heart Attack'].map({0:'Lower', 1:'Higher'})
df_exer = df_exer.sort_values(by = 'Chance of Heart Attack', ascending=False)
df_exer['Slope of the peak exercise ST segment'] = df_exer['Slope of the peak exercise ST segment'].map({0:'Upsloping', 1:'Flat', 2:'Downsloping'})
df_exer.head()

We'll start, again, by comparing the 2 features and then moving on to seeing if they have a strong correlation with 'Chance of Heart Attack'.

In [None]:
plt.figure(figsize = (9, 6), dpi = 100)
sns.barplot(data = df_exer, x = 'Slope of the peak exercise ST segment', y = 'ST depression induced by exercise relative to rest', order=['Downsloping', 'Flat', 'Upsloping'])

There does seem to be a significant correlation between the slope of the peak and the ST depression induced by exercise. As we can see, Downsloping has a low ST depression level, Flat has a medium ST depression level, and Upsloping has a high ST depression level.

In [None]:
plt.figure(figsize = (11, 7), dpi = 150)
sns.violinplot(data = df_exer, x = 'Slope of the peak exercise ST segment', y = 'ST depression induced by exercise relative to rest', hue = 'Chance of Heart Attack', palette = 'flare', order=['Downsloping', 'Flat', 'Upsloping'])

Starting with the flat slope, we can see that the chances of a heart attack are lower when the induced ST depression is higher.

For the down slope, it seems that the ST depression level does not affect the chances of a heart attack. One thing to notice, though, is the high number of outliers in downsloping: we can see how thin the violin plot stretches as ST depression gets higher.

Lastly, for the up slope there is a clear correlation: the higher the ST depression level, the lower the chance of a heart attack. The up slope also has a mostly normal distribution, and both median ST depression levels are higher than most of those for the other slopes.

In conclusion, it seems that ST depression levels are a fairly good predictor of the chances of a heart attack (higher ST depression levels mean less of a chance of a heart attack). The slope is not as accurate a predictor, but we can see that downsloping is associated with a fairly high increase in heart attacks (we can see this by looking at how stretched the downslope violin for the higher chance of a heart attack is).

**So out of all the features we've analyzed, it looks like ST Depression, Maximum Heart Rate, Exercise Induced Angina, and Chest pain type have a strong correlation with 'Chances of a Heart Attack'. Are there any other features that have a high correlation as well? If so, let's graph them.**

In [None]:
plt.figure(figsize = (10, 6), dpi = 150)
sns.heatmap(df.corr().drop('output', axis = 1).loc[['output'], :].transpose(), annot = True)

So we've already found that the features oldpeak, thalachh, cp, and exng are highly correlated with the output. The only one that we still need to look at is caa (number of major vessels (0-3) colored by fluoroscopy).
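
Reading the strongest features off a heatmap is easier if they are sorted by absolute correlation, since a strong negative correlation matters as much as a strong positive one. A minimal sketch, using a made-up frame and a hypothetical helper `rank_by_abs_corr`:

```python
import pandas as pd

def rank_by_abs_corr(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each feature with target, strongest first (sign kept)."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "a":      [1, 2, 3, 4],
    "b":      [1, 1, 0, 0],
    "c":      [0, 1, 0, 1],
    "output": [0, 0, 1, 1],
})
ranked = rank_by_abs_corr(toy, "output")
print(ranked)
```

On the real data this would be `rank_by_abs_corr(df, 'output')`.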

In [None]:
df_ca = df[['caa', 'output']].copy()
df_ca.columns = ['Number of Major Vessels (0-3) Colored by Flourosopy', 'Chance of Heart Attack']
df_ca['Chance of Heart Attack'] = df_ca['Chance of Heart Attack'].map({0:'Lower', 1:'Higher'})
df_ca = df_ca.sort_values(by = 'Chance of Heart Attack', ascending=False)
df_ca.head()

In [None]:
df_ca['Number of Major Vessels (0-3) Colored by Flourosopy'].value_counts()

Note how some of the values are 4. From what I gathered from the other Kaggle dataset, 4 means that the value is missing, so we will drop these rows from our dataset and leave them out of the graph.
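
An alternative to filtering on the sentinel directly is to convert it to a proper missing value first, which makes the "missing" meaning explicit. This is a sketch under the assumption that 4 really does encode a missing value; the frame is made up and `drop_sentinel_rows` is a hypothetical helper.

```python
import pandas as pd

def drop_sentinel_rows(frame: pd.DataFrame, column: str, sentinel) -> pd.DataFrame:
    """Turn sentinel values in column into NaN, then drop those rows."""
    cleaned = frame.copy()
    cleaned[column] = cleaned[column].where(cleaned[column] != sentinel)
    return cleaned.dropna(subset=[column])

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({"caa": [0, 1, 4, 2, 4], "output": [1, 0, 1, 0, 1]})
cleaned = drop_sentinel_rows(toy, "caa", 4)
print(cleaned)
```

Note that the column becomes floating point after the NaN step, so cast it back with `astype(int)` if integer codes are needed.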

In [None]:
df = df[df['caa'] != 4]
# drop the same rows from the plotting frame
df_ca = df_ca[df_ca['Number of Major Vessels (0-3) Colored by Flourosopy'] != 4]
plt.figure(figsize = (12, 6), dpi = 100)
sns.countplot(data = df_ca, x = 'Number of Major Vessels (0-3) Colored by Flourosopy', hue = 'Chance of Heart Attack')

It appears that having 0 major vessels colored by fluoroscopy is a sign that a patient may have a higher chance of a heart attack. The more major vessels that are colored, however, the more likely a patient is to have a lower chance of a heart attack.

**Alright that's enough EDA for now. Let's move on to classifying the chance of a heart attack.**

Note that I'm going to fill in the parameters with values I've already found with GridSearchCV. Running the grid search here would take too long, so the search code is left commented out in each cell.
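
For readers who have not used it, here is a minimal sketch of how a grid search over a pipeline works. It runs on synthetic data from `make_classification` (an assumption, so it stays fast and self-contained) with a deliberately tiny grid; the `demo_` names are hypothetical and this is not the search that produced the parameters used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_pipe = Pipeline([
    ("sc", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "clf__" prefix routes each parameter to the pipeline step named "clf"
demo_grid = GridSearchCV(demo_pipe, {"clf__C": [0.1, 1, 10]}, cv=3)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)           # best parameter combination found
print(round(demo_grid.best_score_, 3))  # its mean cross-validated accuracy
```

The cost grows with the product of all the value lists times the number of folds, which is why the full grids in the cells below are slow.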

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, Pipeline

X_l = df.drop('output', axis = 1)
y_l = df['output'].map({0:'Less chance of Heart Attack', 1:'More chance of a Heart Attack'})



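dums = ['cp', 'restecg', 'slp', 'caa', 'thall']
for i in dums:
    # prefix keeps the dummy column names unique and string-typed (e.g. 'cp_1'), as newer scikit-learn requires
    col = pd.get_dummies(df[i], prefix=i, drop_first=True, dtype=int)
    X_l = pd.concat([X_l, col], axis = 1)
    X_l = X_l.drop(i, axis = 1)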


X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X_l, y_l, test_size=0.25, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(C = 0.1, l1_ratio = 0.1, max_iter = 1000, penalty = 'elasticnet', solver = 'saga'))])
pipe.fit(X_train_l, y_train_l)
#param_grid = {'clf__C':[0.1, 1, 10, 100, 1000], 'clf__penalty':['l1', 'l2', 'elasticnet'], 'clf__class_weight': [None, 'balanced'], 'clf__solver':['saga'], 'clf__max_iter':[1000], 'clf__l1_ratio':[0, 0.1, 0.5, 0.9, 1] }
#grid = GridSearchCV(pipe, param_grid)
#grid.fit(X_train_l, y_train_l)
#grid.best_params_

In [None]:
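# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize = (10, 6), dpi = 100)
ConfusionMatrixDisplay.from_estimator(pipe, X_test_l, y_test_l, ax = ax)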

In [None]:
y_pred = pipe.predict(X_test_l)
print(classification_report(y_test_l, y_pred))

So Logistic Regression performs fairly well, with an 88% accuracy rate. It does seem to produce more false positives than false negatives, though.
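
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small sketch. The counts are invented for illustration (they are not this model's results), and `summarize_confusion` is a hypothetical helper.

```python
def summarize_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall for the positive class from the four cell counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are right
        "precision": tp / (tp + fp),     # hurt by false positives
        "recall": tp / (tp + fn),        # hurt by false negatives
    }

# Invented counts for illustration only (not taken from the models here)
demo_metrics = summarize_confusion(tp=45, fp=8, fn=4, tn=43)
print(demo_metrics)
```

Two models with the same accuracy can differ in which of precision or recall takes the hit, which is why the false positive and false negative counts are compared for each model.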

In [None]:
from sklearn.svm import SVC
pipe = Pipeline([('sc', StandardScaler()), ('sv', SVC())])
pipe.fit(X_train_l, y_train_l)
#param_grid = {'sv__C':[0.001, 0.1, 1.0, 10, 100], 'sv__kernel':['poly', 'rbf', 'sigmoid'], 'sv__degree':[2,3,4,5], 'sv__gamma':['scale', 'auto'], 'sv__class_weight':[None, 'balanced']}
#grid = GridSearchCV(pipe, param_grid)
#grid.fit(X_train_l, y_train_l)
#grid.best_params_

In [None]:
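from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
fig, ax = plt.subplots(figsize = (10, 6), dpi = 100)
ConfusionMatrixDisplay.from_estimator(pipe, X_test_l, y_test_l, ax = ax);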

In [None]:
y_pred = pipe.predict(X_test_l)
print(classification_report(y_test_l, y_pred))

The SVM has an 89% accuracy, which is better than the accuracy of Logistic Regression but only by 1%. Its numbers of false negatives and false positives are also much closer to equal.

In [None]:
X_tree = df.drop('output', axis = 1)
y_tree = df['output'].map({0:'Less chance of Heart Attack', 1:'More chance of a Heart Attack'})
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X_tree, y_tree, test_size=0.25, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
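# max_features = 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in newer scikit-learn
rfc = RandomForestClassifier(bootstrap = True, class_weight= 'balanced_subsample', max_features = 'sqrt', min_samples_leaf = 3, min_samples_split =  4 )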
#param_grid = {'n_estimators':[100, 300, 500, 1000], 'criterion':['gini', 'entropy'], 'min_samples_split':[2,3,4], 'min_samples_leaf':[1,2,3], 'max_features':['auto', 'sqrt', 'log2'], 'bootstrap':[True, False], 'class_weight':['balanced', 'balanced_subsample', None]}
#grid = GridSearchCV(rfc, param_grid)
#grid.fit(X_train_t, y_train_t)
#grid.best_params_
rfc.fit(X_train_t, y_train_t)

In [None]:
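from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
fig, ax = plt.subplots(figsize = (10, 6), dpi = 100)
ConfusionMatrixDisplay.from_estimator(rfc, X_test_t, y_test_t, ax = ax);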

In [None]:
y_pred = rfc.predict(X_test_t)
print(classification_report(y_test_t, y_pred))

So Random Forest ends up with an 85% accuracy, which is worse than all of the other models. This seems to be mainly due to having more false negatives than the other models.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
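# loss is left at its default, the logistic loss (named 'deviance' in older scikit-learn and 'log_loss' in newer versions)
gbc = GradientBoostingClassifier(criterion = 'friedman_mse', learning_rate = 0.01, max_depth = 2, max_features = 'sqrt', min_samples_leaf = 1, min_samples_split = 3)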
#param_grid = {'loss':['deviance', 'exponential'], 'learning_rate':[0.01, 0.1, 1, 10, 100], 'n_estimators':[100, 300, 500], 'criterion':['friedman_mse', 'mse'], 'max_depth':[2,3,4,5], 'max_features':['auto', 'sqrt', 'log2'], 'min_samples_split':[2,3,4], 'min_samples_leaf':[1,2,3]}
#grid = GridSearchCV(gbc, param_grid)
#grid.fit(X_train_t, y_train_t)
#grid.best_params_
gbc.fit(X_train_t, y_train_t)

In [None]:
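from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
fig, ax = plt.subplots(figsize = (10, 6), dpi = 100)
ConfusionMatrixDisplay.from_estimator(gbc, X_test_t, y_test_t, ax = ax);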

In [None]:
y_pred = gbc.predict(X_test_t)
print(classification_report(y_test_t, y_pred))

So while most of the models end up with 1 more in the false negative cell, Gradient Boosting ends up with 1 more in the false positive cell, giving an 87% accuracy.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
adc = AdaBoostClassifier(algorithm = 'SAMME', learning_rate = 0.1, n_estimators = 100)
#param_grid = {'n_estimators':[50, 100, 300, 500], 'learning_rate':[0.1, 1, 10, 100, 1000], 'algorithm':['SAMME', 'SAMME.R']}
#grid = GridSearchCV(adc, param_grid)
#grid.fit(X_train_t, y_train_t)
#grid.best_params_
adc.fit(X_train_t, y_train_t)

In [None]:
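from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix (removed in scikit-learn 1.2)
fig, ax = plt.subplots(figsize = (10, 6), dpi = 100)
ConfusionMatrixDisplay.from_estimator(adc, X_test_t, y_test_t, ax = ax);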

In [None]:
y_pred = adc.predict(X_test_t)
print(classification_report(y_test_t, y_pred))

AdaBoost seems to do as well as most of the others with an 87% accuracy and, like Gradient Boosting, has 1 more false negative than false positive.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

In [None]:
from sklearn.preprocessing import MinMaxScaler
mc = MinMaxScaler()
X_n = df.drop('output', axis = 1)
y_n = df['output']

dums = ['cp', 'restecg', 'slp', 'caa', 'thall']
for i in dums:
    # prefix keeps the dummy column names unique and string-typed (e.g. 'cp_1')
    col = pd.get_dummies(df[i], prefix=i, drop_first=True, dtype=int)
    X_n = pd.concat([X_n, col], axis = 1)
    X_n = X_n.drop(i, axis = 1)


X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_n, y_n, test_size=0.25, random_state=42)

X_train_n = mc.fit_transform(X_train_n)
X_test_n = mc.transform(X_test_n)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

In [None]:
model = Sequential()

model.add(Dense(80, activation = 'relu'))
model.add(Dropout(rate = 0.5))

model.add(Dense(40, activation = 'relu'))
model.add(Dropout(rate = 0.5))

model.add(Dense(20, activation = 'relu'))
model.add(Dropout(rate = 0.5))

model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor = 'val_loss', mode = 'min', patience = 25, verbose=1)
model.fit(x = X_train_n, y = y_train_n, epochs = 600, validation_data=(X_test_n, y_test_n), callbacks=[early_stop])

In [None]:
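# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(X_test_n) > 0.5).astype(int).ravel()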
print(classification_report(y_test_n, y_pred))

The neural network ends up with a 91% accuracy, which is the best out of all the models! It also has a lower false positive and false negative rate than most of the models.

So in the end, the neural network is the model that does best, with a 91% accuracy. Thank you for your time!