# **Heart Attack Analysis & Prediction**

## About this dataset

- Age : Age of the patient
- Sex : Sex of the patient
- exang: exercise induced angina (1 = yes; 0 = no)
- ca: number of major vessels (0-3)
- cp : chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps : resting blood pressure (in mm Hg)
- chol : cholesterol in mg/dl fetched via BMI sensor

- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- rest_ecg : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach : maximum heart rate achieved

- target : 0 = less chance of heart attack; 1 = more chance of heart attack (the `output` column in the code below)


In [None]:
# importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# set the seaborn background style
sns.set_style('darkgrid')

In [None]:
# import warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# read the dataset into `heart_data`
heart_data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
heart_data.head()

## Data Inspection

In [None]:
heart_data.shape

In [None]:
heart_data.isnull().sum()

In [None]:
heart_data.describe()

In [None]:
heart_data.info()

In [None]:
heart_data.nunique()

In [None]:
heart_data.thall.value_counts()

In [None]:
heart_data.slp.value_counts()

In [None]:
# output percentages of 1s and 0s
heart_data.output.value_counts(normalize=True)*100

In [None]:
k = ['sex', 'cp', 'restecg']
for i in k:
    print(heart_data[i].value_counts())

**Three categorical columns, *'sex', 'cp', 'restecg'*, are stored as integer codes. We map them to descriptive strings so the analysis and plots are easier to read.**

In [None]:
# map the categorical columns

def gender(x):
    if x==1:
        return "male"
    else:
        return "female"

def chest_pain(x):
    if x==0:
        return "typical angina"
    elif x==1:
        return "atypical angina"
    elif x==2:
        return "non-anginal pain"
    else:
        return "asymptomatic"

def resting_ecg(x):
    if x==0:
        return "normal"
    elif x==1:
        return "ST-T wave abnormality"
    else:
        return "left ventricular hypertrophy(lvh)"


In [None]:
heart_data['sex'] = heart_data.sex.apply(gender)
heart_data['cp'] = heart_data.cp.apply(chest_pain)
heart_data['restecg'] = heart_data.restecg.apply(resting_ecg)
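
The same relabelling can also be sketched with lookup dictionaries and `Series.map`, which keeps each code-to-label table in one place. This is only an illustration on a toy frame: the `toy` data and the `*_LABELS` names are made up here, and the codes simply mirror the mapping functions above.

```python
import pandas as pd

# Toy frame standing in for heart_data (values are made up for illustration).
toy = pd.DataFrame({
    "sex": [1, 0, 1],
    "cp": [0, 2, 3],
    "restecg": [0, 1, 2],
})

# Hypothetical lookup tables; the codes mirror the mapping functions above.
SEX_LABELS = {0: "female", 1: "male"}
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}
RESTECG_LABELS = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "left ventricular hypertrophy(lvh)",
}

# Series.map replaces each code with its label.
labelled = toy.assign(
    sex=toy["sex"].map(SEX_LABELS),
    cp=toy["cp"].map(CP_LABELS),
    restecg=toy["restecg"].map(RESTECG_LABELS),
)
print(labelled)
```

One difference from the `if/else` functions: a code missing from a dictionary becomes `NaN` under `map` instead of falling into the final `else` branch, so unexpected values show up rather than being silently relabelled.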

In [None]:
heart_data.head()

## Data Visualization

In [None]:
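# age distribution for each output class
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
plt.figure(figsize=[20,6])
plt.subplot(1,2,1)
sns.histplot(heart_data[heart_data.output == 1]['age'], kde=True, stat='density', color='g')
plt.title('output = 1')
plt.subplot(1,2,2)
sns.histplot(heart_data[heart_data.output == 0]['age'], kde=True, stat='density', color='b')
plt.title('output = 0')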

plt.show()

In [None]:
# check 'outputs'

sns.countplot(data=heart_data, x='output')
plt.show()

In [None]:
# check outputs based on sex

# plt.figure(figsize=[10,6])
sns.countplot(data=heart_data, x='sex', hue='output')
plt.show()

In [None]:
# check Chest pain

plt.figure(figsize=[10,6])
sns.countplot(data=heart_data, x='cp', hue='output')
plt.show()

In [None]:
# check resting electrocardiographic

plt.figure(figsize=[10,6])
sns.countplot(data=heart_data, x='restecg', hue='output')
plt.show()

In [None]:
# pairplot on numerical columns

sns.pairplot(heart_data)
plt.show()

## Data Preprocessing

In [None]:
heart_data.head()

### Creating dummies for categorical columns

In [None]:
# create dummies for 'sex', 'cp', 'restecg'
# (the variable names avoid shadowing the gender() mapping function defined earlier)

sex_dummies = pd.get_dummies(heart_data.sex, drop_first=True)
cp_dummies = pd.get_dummies(heart_data.cp, drop_first=True)
restecg_dummies = pd.get_dummies(heart_data.restecg, drop_first=True)

# concatenate all new dummy dataframes
heart_data = pd.concat([heart_data, sex_dummies, cp_dummies, restecg_dummies], axis=1)
heart_data.head()

In [None]:
# drop original columns
heart_data.drop(['sex', 'cp', 'restecg'], axis=1, inplace=True)
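
The three steps above (build the dummies, concatenate them, drop the originals) can also be written as one `pd.get_dummies` call with the `columns` argument. A minimal sketch on a made-up frame named `toy`; note that this form prefixes the new columns with the source column name, so the resulting names differ from the ones produced above.

```python
import pandas as pd

# Toy frame standing in for heart_data after the string mapping (made-up values).
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "cp": ["typical angina", "asymptomatic", "non-anginal pain", "atypical angina"],
    "output": [1, 0, 1, 0],
})

# Encode the listed columns, drop the first level of each, and remove the originals.
encoded = pd.get_dummies(toy, columns=["sex", "cp"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With `drop_first=True` the alphabetically first level of each column (`female`, `asymptomatic`) becomes the implicit baseline.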

In [None]:
heart_data.head()

In [None]:
heart_data.shape

### Scaling numerical columns

In [None]:
num_cols = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

# importing scaling library
from sklearn.preprocessing import StandardScaler

# create scaler object
scaler = StandardScaler()

heart_data[num_cols] = scaler.fit_transform(heart_data[num_cols])

heart_data.head()
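
A variation worth considering: the cell above fits the scaler on the full frame, so rows that later land in the test set also contribute to the means and standard deviations. The sketch below shows the alternative order (split first, fit on the training rows only, then reuse those statistics on the test rows). It uses synthetic data, and the `*_demo` names are illustrative, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical columns (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance, while the test columns are only approximately centred, which is expected since they were not used to fit the scaler.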

## Splitting into Train and Test

In [None]:
# importing train test split library
from sklearn.model_selection import train_test_split

In [None]:
heart_data.columns

In [None]:
X = heart_data.drop('output', axis=1)
y = heart_data.output

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

## Building Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
loglm = LogisticRegression()

loglm.fit(X_train, y_train)

### Model Evaluation

In [None]:
y_train_pred = loglm.predict(X_train)

In [None]:
from sklearn import metrics

In [None]:
print(metrics.confusion_matrix(y_train, y_train_pred))
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.accuracy_score(y_train, y_train_pred))

### Prediction

In [None]:
y_test_pred = loglm.predict(X_test)

In [None]:
print(metrics.confusion_matrix(y_test, y_test_pred))
print(metrics.classification_report(y_test, y_test_pred))
print(metrics.accuracy_score(y_test, y_test_pred))

- **Accuracy for Logistic Regression model is 87%**
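
For reference, the accuracy, precision and recall printed by `classification_report` can be read directly off the confusion matrix. A small sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the model above):

```python
from sklearn import metrics

# Made-up labels for illustration only.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, confusion_matrix is laid out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)                     # share of true 1s that were predicted as 1

print(accuracy, precision, recall)
```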

## Building K-Nearest Neighbors Model

In [None]:
# importing library
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# start with k=1
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train, y_train)

### Model Prediction and Evaluation

In [None]:
y_test_pred_knn = knn.predict(X_test)

In [None]:
# create the confusion matrix
# and calculate accuracy, precision and recall

print(metrics.confusion_matrix(y_test, y_test_pred_knn))
print('\n')
print(metrics.classification_report(y_test, y_test_pred_knn))
print(metrics.accuracy_score(y_test, y_test_pred_knn))

### Choosing a K Value 

Let's go ahead and use the elbow method to pick a good K Value:

In [None]:
# fit a KNN model for each k from 1 to 39 and record its error rate on the test set
error_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
# plotting the error_rate

plt.figure(figsize=[10,6])
plt.plot(range(1,40), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error rate vs K')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

- We can see that the error rate is low at k=28.

#### Let's implement KNN again at k=28

In [None]:
# building knn and fit the model
knn = KNeighborsClassifier(n_neighbors=28)
knn.fit(X_train, y_train)

# prediction
pred = knn.predict(X_test)

# model evaluation
print(metrics.confusion_matrix(y_test, pred))
print('\n')
print(metrics.classification_report(y_test, pred))
print(metrics.accuracy_score(y_test, pred))

- **Accuracy for K-Nearest Neighbors model is 86%**

- **Accuracy of the models:**
    - Logistic Regression Model: 87%
    - K-Nearest Neighbors Model: 86%
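
The elbow search above scores every k on the test set, so the chosen k has already seen the data used for the final accuracy figure. One possible alternative, sketched here on synthetic data from `make_classification`, is to pick k by cross-validation on the training rows only and keep the test set for a single final check. The `*_demo` names and the 5-fold setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y (made-up data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    # mean 5-fold accuracy on the training rows, converted to an error rate
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5, scoring="accuracy"
    )
    cv_error.append(1.0 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]

# the test set is touched once, with the k chosen above
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(best_k, final_knn.score(X_te, y_te))
```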

## Random Forest Classifier Model

In [None]:
# import the random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfm = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=42)
rfm.fit(X_train, y_train)

In [None]:
# prediction
pred = rfm.predict(X_test)

# model evaluation
print(metrics.confusion_matrix(y_test, pred))
print('\n')
print(metrics.classification_report(y_test, pred))
print(metrics.accuracy_score(y_test, pred))

### Using Hyper-Parameter Tuning

In [None]:
# using GridSearchCV
from sklearn.model_selection import GridSearchCV

In [None]:
X_train.shape

In [None]:
params = {
    'max_depth': [1, 5, 10, 20, 30, 50],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'max_features': [2, 5, 8, 12, 16],
    'n_estimators': [10, 30, 50, 100, 200]
}

In [None]:
rfm_basic = RandomForestClassifier(random_state=42, oob_score=True)

In [None]:
grid_search = GridSearchCV(estimator=rfm_basic, param_grid=params,
                          cv=5, n_jobs=-1, verbose=1, scoring="accuracy")

In [None]:
%%time
grid_search.fit(X_train, y_train)

In [None]:
rfm_best = grid_search.best_estimator_
rfm_best

In [None]:
# prediction
pred = rfm_best.predict(X_test)

# model evaluation
print(metrics.confusion_matrix(y_test, pred))
print('\n')
print(metrics.classification_report(y_test, pred))
print(metrics.accuracy_score(y_test, pred))

- **Accuracy for Random Forest Classifier model is 86%**

- **Accuracy of the models:**
    - Logistic Regression Model: 87%
    - K-Nearest Neighbors Model: 86%
    - Random Forest Model: 85%
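
A note on the cost of the grid search above: the grid has 6 x 5 x 5 x 5 = 750 parameter combinations, which with `cv=5` means 3,750 model fits. If that is too slow, `RandomizedSearchCV` samples a fixed number of combinations from the same grid. The sketch below runs on synthetic data; `n_iter=10`, `cv=3` and the `*_demo` names are illustrative choices, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in with 16 features so that max_features=16 is valid (made-up data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# Same grid as the GridSearchCV cell above.
params_demo = {
    'max_depth': [1, 5, 10, 20, 30, 50],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'max_features': [2, 5, 8, 12, 16],
    'n_estimators': [10, 30, 50, 100, 200],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=params_demo,
    n_iter=10,          # evaluate 10 sampled combinations instead of all 750
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)           # mean cross-validated accuracy of the best combination
print(search.score(X_te, y_te))     # accuracy of the refitted best model on the held-out rows
```

The trade-off is that a random sample may miss the best combination in the grid, so the result is a cheaper approximation rather than an exhaustive search.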