## Heart Data Prediction

## Data Introduction

Data from https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?select=heart.csv

### About this dataset
age - Age of the patient

sex - Sex of the patient

cp - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

trtbps - Resting blood pressure (in mm Hg)

chol - Cholesterol in mg/dl fetched via BMI sensor

fbs - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

restecg - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

thalachh - Maximum heart rate achieved

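oldpeak - ST depression induced by exercise relative to rest (labeled "Previous peak" in the source dataset)

slp - Slope of the peak exercise ST segment

caa - Number of major vessels

thall - Thallium stress test result ~ values 0 to 3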

exng - Exercise induced angina ~ 1 = Yes, 0 = No

output - Target variable


### Task
Predict whether a person is prone to a heart attack.

## Preparation

### Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### Understanding data

Import data

In [None]:
df=pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
print('Data shape - ', df.shape)

Create lists of columns: categorical and continuous

In [None]:
categorical_columns=['sex', 'cp','fbs', 'restecg', 'caa', 'thall', 'exng', 'slp']
continuous_columns=['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

Check description of non-categorical columns

In [None]:
df[continuous_columns].describe().transpose()

Check if there are any null-values

In [None]:
df.isnull().sum()

## Exploratory Data Analysis

In [None]:
sns.pairplot(df)

In [None]:
# check whether the two output classes are similar in size
# (passing the column by keyword works across seaborn versions)
sns.countplot(x='output', data=df, palette='coolwarm')

In [None]:
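# all columns except the first ('age') and the last ('output')
cat=df.columns.values[1:-1]
print(cat)

fig, axes=plt.subplots(4,3,figsize=(16,16))
for col, ax in zip(cat, axes.flatten()):
    # histplot replaces the deprecated distplot(..., kde=False)
    sns.histplot(df[col], ax=ax)
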
In [None]:
for col in categorical_columns:
    plt.figure(figsize=(8,4))
    sns.countplot(x=col, data=df, hue='output', palette='coolwarm')
    plt.show()

In [None]:
# age
plt.figure(figsize=(12,4))
sns.countplot(x='age', data=df, palette='coolwarm')

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['age'], bins=10, kde=True, stat='density')

In [None]:
for con in continuous_columns:
    plt.figure(figsize=(8,4))
    sns.boxenplot(y=con, data=df, x='output', palette='coolwarm')
    plt.show()

In [None]:
plt.figure(figsize=(12,12))

ax = sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

In [None]:
df.corr()['output'].sort_values()[:-1]

In [None]:
df.corr()['output'].sort_values()[:-1].plot(kind='bar')

### Conclusions

The dataset contains records for 303 people.

There are no missing values.

The two target classes are similar in size.

There are many more people with 'sex'=1 than with 'sex'=0.

The most common chest pain type is 'typical angina' (cp=0).

Cholesterol and fasting blood sugar show little correlation with 'output' (between -0.1 and 0.1).

The parameters most correlated with 'output' are thalachh (0.42), cp (0.43), exng (-0.44), and oldpeak (-0.43).

In this dataset, most people are between 50 and 60 years old.

Based on the age distribution in this data, we cannot say that a heart attack is more likely for older people, contrary to intuition.

People with a higher maximum heart rate achieved have a higher probability of a heart attack.

People with a lower oldpeak have a higher chance of a heart attack.
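
The correlation and group comparisons behind these conclusions can be reproduced with two pandas calls. The sketch below uses a tiny hypothetical frame (the values are made up for illustration and are not taken from heart.csv); only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only, not rows from heart.csv.
toy = pd.DataFrame({
    "thalachh": [172, 160, 155, 120, 130, 110],
    "oldpeak":  [0.0, 0.4, 0.2, 2.5, 1.8, 3.0],
    "output":   [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each feature with the binary target, sorted ascending.
corr_with_output = toy.corr()["output"].drop("output").sort_values()

# Mean of each feature within each target class.
group_means = toy.groupby("output").mean()

print(corr_with_output)
print(group_means)
```

On the real data, the same calls with `df` in place of `toy` give the per-class means that the boxen plots show graphically.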

## Data Preprocessing

In [None]:
df.head()

Create a list of the categorical columns that need to be encoded. A column is added to the list if it has more than 2 unique values.

In [None]:
cat_for_dummies=df[categorical_columns].nunique()[df[categorical_columns].nunique()>2]
cat_for_dummies

Create dummy variables for the columns in cat_for_dummies: cp, restecg, caa, thall, slp

In [None]:
def encoding_data(df, cat_for_dummies):
    data=df.copy()

    for cat in cat_for_dummies.index:

        # prefix=cat names the new columns "<cat>_<value>"; dtype=int keeps them numeric
        dummies=pd.get_dummies(data[cat], prefix=cat, drop_first=True, dtype=int)

        data=pd.concat([data.drop([cat], axis=1), dummies], axis=1)
    
    return data


data=encoding_data(df, cat_for_dummies)
data
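
As a small illustration of the encoding rule described above, the sketch below applies it to a hypothetical toy frame (the rows are invented; only the column names `cp` and `sex` mirror the dataset). It uses the `columns=` argument of `pd.get_dummies` instead of the loop, which is an alternative formulation rather than the notebook's own function.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1], "sex": [1, 0, 1, 1, 0]})

# Columns with more than two unique values need one-hot encoding.
n_unique = toy.nunique()
to_encode = n_unique[n_unique > 2].index.tolist()

# drop_first=True drops the first level (cp_0), so a row of all zeros means cp == 0.
encoded = pd.get_dummies(toy, columns=to_encode, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Binary columns such as `sex` are left untouched because a single 0/1 column already carries all of their information.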

### Split into train and test sets and scale the data

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
X=data.drop('output', axis=1).values
y=data['output'].values
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

scaler=MinMaxScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
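
The scaler is fitted on the training split only, so the test split is transformed with the training minimum and maximum. A minimal sketch with made-up numbers (the arrays below are hypothetical) shows the formula `(x - min) / (max - min)` and that test values may therefore fall outside the [0, 1] range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, e.g. a blood pressure reading.
train = np.array([[100.0], [150.0], [200.0]])
test = np.array([[125.0], [220.0]])

# Fit on the training data only, then reuse its min and max for the test data.
demo_scaler = MinMaxScaler().fit(train)
scaled = demo_scaler.transform(test)

# The same computation written out by hand.
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled.ravel())
```

Here the second test value exceeds the training maximum, so its scaled value is above 1. That is expected and avoids leaking test statistics into preprocessing.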

## Modeling

### Sequential Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
model=Sequential()

model.add(Dense(22, activation="relu"))
model.add(Dense(11, activation="relu"))
model.add(Dense(5, activation="relu"))
# sigmoid keeps the output in (0, 1), as binary_crossentropy expects
model.add(Dense(1, activation="sigmoid"))

early_stop=EarlyStopping(monitor='val_loss', patience=1)

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), callbacks=[early_stop], epochs=250, verbose=0)

metrics=pd.DataFrame(model.history.history)
metrics[['loss', 'val_loss']].plot()
plt.show()

score=model.evaluate(X_test_scaled, y_test, verbose=0)
print("Sequential model score: \n\tloss: ", score[0], "\n\taccuracy: ", score[1], "\n")

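# predict_classes was removed from recent TensorFlow releases; threshold the probabilities instead
predictions=(model.predict(X_test_scaled) > 0.5).astype("int32")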

print(classification_report(y_test, predictions))
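
Binary cross-entropy is defined for predicted probabilities between 0 and 1, which is what a sigmoid output produces, and class labels are then obtained by thresholding those probabilities. The NumPy sketch below illustrates both steps on hypothetical raw outputs; it is a hand-written illustration, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Hypothetical raw outputs of the last layer and their true labels.
logits = np.array([-2.0, 0.5, 3.0])
y_true = np.array([0, 1, 1])

probs = sigmoid(logits)
labels = (probs > 0.5).astype(int)  # threshold at 0.5 to get class labels
loss = binary_crossentropy(y_true, probs)

print(probs, labels, loss)
```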

### Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

logmodel=LogisticRegression(max_iter=1000)
logmodel.fit(X_train_scaled, y_train)

predictions=logmodel.predict(X_test_scaled)

print(classification_report(y_test, predictions))

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

treemodel=DecisionTreeClassifier()
treemodel.fit(X_train_scaled, y_train)

predictions=treemodel.predict(X_test_scaled)

print(classification_report(y_test, predictions))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

forestmodel=RandomForestClassifier()
forestmodel.fit(X_train_scaled, y_train)

predictions=forestmodel.predict(X_test_scaled)

print(classification_report(y_test, predictions))

### Support Vector Machine

In [None]:
from sklearn.svm import SVC

svmmodel=SVC()
svmmodel.fit(X_train_scaled, y_train)

predictions=svmmodel.predict(X_test_scaled)

print(classification_report(y_test, predictions))

## Conclusions

Five models were analyzed: a Sequential model, Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine.
Logistic Regression achieved the best accuracy, 87%.
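
With only 303 rows, a single train/test split can favor one model by chance. One possible way to check the ranking is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` as a stand-in for the heart data, compares only two of the models, and wraps the scaler in a pipeline so it is refitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption): 303 rows, 13 features.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=101)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=101),
}

cv_scores = {}
for name, estimator in models.items():
    # The pipeline refits the scaler on the training part of every fold.
    pipeline = make_pipeline(MinMaxScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

To run this on the notebook's data, `X` and `y` from the preprocessing step would replace `X_demo` and `y_demo`.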

## Train the model on all data, then save the model and scaler

In [None]:
scaler=MinMaxScaler()
X_scaled=scaler.fit_transform(X)

logmodel=LogisticRegression(max_iter=1000)
logmodel.fit(X_scaled, y)


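import joblib
# both files are joblib pickles, so both use the .pkl extension (.h5 would suggest an HDF5/Keras file)
joblib.dump(scaler, 'heart_attack_scaler.pkl')
joblib.dump(logmodel, 'heart_attack_model.pkl')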
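
Saved objects can be restored with `joblib.load` and applied to new rows, provided the rows go through the same encoding and column order as the training data. The sketch below is a self-contained round trip on hypothetical random data, with made-up file names in a temporary directory, rather than the notebook's actual files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data standing in for the encoded heart features.
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = MinMaxScaler().fit(X_demo)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    # File names are placeholders for this demo.
    scaler_path = os.path.join(tmp, "demo_scaler.pkl")
    model_path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(demo_scaler, scaler_path)
    joblib.dump(demo_model, model_path)

    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows must be scaled with the loaded scaler before prediction.
new_rows = rng.normal(size=(3, 4))
proba = loaded_model.predict_proba(loaded_scaler.transform(new_rows))[:, 1]
print(proba)
```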