# Heart Disease Prediction

### Goal, Limits and Dataset Content

**Goal**

The goal of this notebook is to analyze the heart disease data obtained from [UCI](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) and to show which features have the most effect on the occurrence of heart disease.

**Limits**

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.

**Content**

- age
- sex
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0,1,2)
- maximum heart rate achieved
- exercise induced angina
- oldpeak = ST depression induced by exercise relative to rest
- the slope of the peak exercise ST segment
- number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
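
In the CSV file loaded below, these features appear under short column names. The mapping here is only a reference sketch: `age`, `sex`, `trestbps`, `chol`, `thalach`, `oldpeak` and `target` are used later in this notebook, while the remaining names (`cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`) are assumed from the commonly distributed version of the file and should be checked against `data.columns`.

```python
# Short column names mapped to the descriptions listed above.
# Names not used elsewhere in this notebook are assumptions; verify with data.columns.
feature_descriptions = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0,1,2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels colored by fluoroscopy",
    "thal": "thal (3 = normal; 6 = fixed defect; 7 = reversible defect)",
}

for name, description in feature_descriptions.items():
    print(f"{name:>9}: {description}")
```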

## Importing Packages

We need a few packages to read the data (provided as a CSV file), to manipulate it, and to visualize it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## Overview

In [None]:
heart = pd.read_csv("../input/heart-disease-uci/heart.csv")
data = heart.copy()
data.head()

In [None]:
data.info()

In [None]:
data.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%']) \
    .bar(subset = ['std'])

Next, I print the value counts and the number of unique values for each column, which makes it easier to decide how to visualize each feature.

In [None]:
for col in data.columns:
    print("------------------------------------")
    print("{}\n{}".format(col,data[col].value_counts()))
    print("Unique value counts: ",len(data[col].unique()))

In [None]:
fig = plt.figure(figsize=(12,12))
i = 1
for col in data.columns:
    if len(data[col].unique()) <= 5:
        plt.subplot(3,3,i)
        data[col].value_counts().plot.bar()
        plt.title(col)
        i = i+1
plt.show()
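
The bar plots above use the number of unique values to pick out the categorical-looking columns. A minimal sketch of the same idea with `DataFrame.nunique()`, run on a small made-up frame (`toy` and `max_levels` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Made-up sample with two low-cardinality columns and one continuous-looking column
toy = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1, 0],
    "cp": [0, 1, 2, 3, 0, 1],
    "age": [63, 37, 41, 56, 57, 44],
})

max_levels = 5  # same threshold as the cell above

# Number of distinct values per column
n_unique = toy.nunique()

# Columns with few distinct values are treated as categorical, the rest as continuous
low_cardinality = n_unique[n_unique <= max_levels].index.tolist()
high_cardinality = n_unique[n_unique > max_levels].index.tolist()

print(low_cardinality, high_cardinality)
```

Using `<=` for one group and `>` for the other keeps the two groups disjoint, so a column with exactly `max_levels` distinct values lands in only one of them.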

### Features and target

In [None]:
cols = []

for col in data.columns:
    if len(data[col].unique()) >= 5:
        cols.append(col)
        
        
fig = plt.figure(figsize=(18,12))
i = 1
for col in cols:
    plt.subplot(2,3,i)
    sns.histplot(data=data, x=col, hue="target", kde=True)
    i = i+1
plt.show()

### Features, features, target

Seaborn's pairplot shows not only the feature/target relations but also the relations between every pair of features at once. The resulting figure is huge but useful. I will use only the columns with at least 5 unique values, which I collected in the `cols` list above.

In [None]:
sns.pairplot(pd.concat([data[cols], data["target"]], axis=1), hue="target")
plt.show()

## Heart Disease Analysis by Gender

I suggest [this notebook](https://www.kaggle.com/asimislam/tutorial-python-subplots) if you need help with subplots. The code cell below is taken from it; luckily, it uses the same dataset.

In [None]:
heart_NUM = ['age', 'trestbps', 'thalach', 'oldpeak']

#  plot Numerical Data
a = 4  # number of rows
b = 3  # number of columns
c = 1  # initialize plot counter

fig = plt.figure(figsize=(14,22))

for i in heart_NUM:
    plt.subplot(a,b,c)
    plt.xlabel(i)
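    # histplot replaces the deprecated distplot; stat="density" keeps the density scale
    sns.histplot(data[i], kde=True, stat="density")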
    c = c+1

    plt.subplot(a,b,c)
    plt.xlabel(i)
    plt.boxplot(x=data[i])
    c = c+1

    plt.subplot(a,b,c)
    plt.xlabel(i)
    sns.scatterplot(data=data, x=i, y='chol', hue='sex')
    c = c+1

plt.show()

In [None]:
f = data[data["sex"] == 0] #female
m = data[data["sex"] == 1] #male

f_p = f[f["target"] == 1] #female with heart disease
f_np = f[f["target"] == 0] 

m_p = m[m["target"] == 1] #male with heart disease
m_np = m[m["target"] == 0]

In [None]:
fig = plt.figure(figsize=(12,6))
plt.subplot(121)
plt.pie(x=[len(f),len(m)], labels=["Female","Male"], colors=['#009ACD', '#ADD8E6'], autopct='%1.1f%%', startangle=0, pctdistance=1.1,labeldistance=1.25, explode=(0.03,0))
plt.title("Gender distribution of whole dataset")
plt.legend(frameon=False, bbox_to_anchor=(1,0.8))

plt.subplot(122)
plt.pie(x=[len(f_p),len(m_p)], labels=["Female","Male"], colors=['#009ACD', '#ADD8E6'], autopct='%1.1f%%', startangle=0, pctdistance=1.1,labeldistance=1.25, explode=(0.03,0))
plt.title("Gender distribution of patients")
plt.legend(frameon=False, bbox_to_anchor=(1,0.8))
plt.show()

56.4% of the people with heart disease are male, but keep in mind that the dataset has more male entries than female ones. So, what share of the male and of the female observations has heart disease?

In [None]:
fig = plt.figure(figsize=(12,6))
plt.subplot(121)
plt.pie(x=[len(f_p),len(f_np)], labels=["Having Heart Disease","Not having"], colors=['#b566ff', '#e6ccff'], autopct='%1.1f%%', startangle=0, pctdistance=1.1,labeldistance=1.25, explode=(0.05,0))
plt.title("Female")

plt.subplot(122)
plt.pie(x=[len(m_p),len(m_np)], labels=["Having Heart Disease", "Not having"], colors=['#80ff80', '#ccffcc'], autopct='%1.1f%%', startangle=0, pctdistance=1.1,labeldistance=1.25, explode=(0.04,0))
plt.title("Male")

plt.show()
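
The shares shown in the pie charts can also be read as numbers. A minimal sketch with `pd.crosstab`, using a small made-up frame (`toy` is hypothetical; on the real data the same call would take `data["sex"]` and `data["target"]`):

```python
import pandas as pd

# Made-up sample: 4 female (sex = 0) and 6 male (sex = 1) observations
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# normalize="index" makes every row (one per sex) sum to 1,
# so each cell is the share of that sex with the given target value
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

Normalizing within each sex removes the effect of one group simply having more entries than the other.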

## Correlation Matrix

In [None]:
corr = data.corr()
plt.figure(figsize=(15,15))
sns.heatmap(corr, annot=True, linewidths=.5, cmap="YlGnBu")
plt.show()
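
A heatmap of this size can be hard to scan for the features most related to the target. One possible way to summarize it is to sort the features by the absolute value of their correlation with `target`. The sketch below uses synthetic data (`toy`, `strong`, `weak` and `noise` are made-up names); on the real data, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that follows the label closely, one weakly, one not at all
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),
    "weak": target + rng.normal(0, 3.0, size=n),
    "noise": rng.normal(0, 1.0, size=n),
    "target": target,
})

# Take the target column of the correlation matrix, drop the target itself,
# and sort by absolute correlation so negative relations rank as well
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Keep in mind that Pearson correlation only captures linear association, so a high rank here is a hint about a feature's relevance rather than proof of its effect.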

# Model

I want to use 3 classification methods and compare their scores. The models:
- SVC
- Random Forest Classifier
- Gradient Boosting Classifier

Since the labels in the dataset are imbalanced, StratifiedKFold will be used: it preserves the class proportions in every fold, which makes the evaluation more reliable.

Also, because the feature values have different ranges, I will scale the data.

To compare the model results, accuracy_score and confusion_matrix will help us.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
X = data.drop("target", axis=1).values
y = data["target"].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

models = [SVC(), RandomForestClassifier(), GradientBoostingClassifier()]

In [None]:
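skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# Keep the results of every fold; previously acc and cm were reset
# on each iteration, so only the last fold survived the loop.
fold_acc = []
fold_cm = []

for train_index, test_index in skf.split(X, y):
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
    acc = []
    cm = []
    for model in models:
        model.fit(X_train_fold, y_train_fold)
        pred = model.predict(X_test_fold)
        acc.append(accuracy_score(y_test_fold, pred))
        cm.append(confusion_matrix(y_test_fold, pred))
    fold_acc.append(acc)
    fold_cm.append(cm)

# One mean accuracy and one summed confusion matrix per model
acc = np.mean(fold_acc, axis=0)
cm = np.sum(fold_cm, axis=0)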
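
Because the scaler above is fitted on the full dataset before splitting, the test folds have a small influence on the scaling statistics. One possible alternative is to put the scaler and the model in a scikit-learn pipeline, so the scaler is refitted on the training part of each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data; `X_demo`, `y_demo`, `pipe` and `skf_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (13 features, binary label)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=17)

# The pipeline fits the scaler on the training folds only, then applies it to the test fold
pipe = make_pipeline(StandardScaler(), SVC())

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
scores = cross_val_score(pipe, X_demo, y_demo, cv=skf_demo, scoring="accuracy")

# One accuracy per fold, summarized by mean and standard deviation
print(scores.mean(), scores.std())
```

The same pattern should work with `RandomForestClassifier()` or `GradientBoostingClassifier()` in place of `SVC()`, although tree-based models are largely insensitive to feature scaling.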

In [None]:
score = {"model": ["SVC","RandomForestClassifier","GradientBoostingClassifier"], "accuracy": acc}
result = pd.DataFrame(score)
result

## Confusion Matrices

![](https://miro.medium.com/max/445/1*Z54JgbS4DUwWSknhDCvNTQ.png)

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa; both variants are found in the literature. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).

It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).

1. [Image](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62)
2. [Definition](https://en.wikipedia.org/wiki/Confusion_matrix)
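
To make the definition concrete, here is a minimal sketch on six made-up labels (`y_true`, `y_pred` and `cm_demo` are hypothetical names). scikit-learn's `confusion_matrix` uses the "rows = actual, columns = predicted" variant, so for a binary problem the flattened matrix reads TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = heart disease, 0 = no heart disease
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are the actual classes, columns are the predicted classes
cm_demo = confusion_matrix(y_true, y_pred)

# For two classes, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

# Accuracy is the share of the diagonal (correct predictions) in the whole matrix
accuracy = (tp + tn) / cm_demo.sum()

print(cm_demo)
print(tn, fp, fn, tp, accuracy)
```

The accuracy computed from the matrix matches `accuracy_score(y_true, y_pred)`, which is why the two metrics complement each other: the matrix shows where the errors behind a given accuracy come from.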

In [None]:
for i,model in enumerate(["SVC","RandomForestClassifier","GradientBoostingClassifier"]):
    # fmt="d" prints the counts as plain integers instead of scientific notation
    sns.heatmap(cm[i], annot=True, fmt="d")
    plt.title(model)
    plt.show()

If you find this notebook useful, don't forget to upvote. 👍
If you have suggestions, I'm waiting to read them. 🤓