# Heart Disease Model

According to the CDC, heart disease is the leading cause of death in the United States. Wouldn't it be great if we could diagnose heart disease before it becomes
severe? My model predicts whether a patient has heart disease based on the patient's medical reports.

## Dataset Specifics
In the data, you are given several attributes: 

 1. age
 
 2. sex
 
 3. chest pain type (4 values)
 
 4. resting blood pressure
 
 5. serum cholesterol in mg/dl
 
 6. fasting blood sugar > 120 mg/dl
 
 7. resting electrocardiographic results (values 0, 1, 2)
 
 8. maximum heart rate achieved
 
 9. exercise induced angina
 
 10. oldpeak = ST depression induced by exercise relative to rest 
 
 11. the slope of the peak exercise ST segment
 
 12. number of major vessels (0-3) colored by fluoroscopy
 
 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

## Algorithm 
This is a binary classification problem: the output is either 0 (without heart disease) or 1 (with heart disease). I used two methods: a [neural network](https://en.wikipedia.org/wiki/Artificial_neural_network) built with **Keras**, and [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression#:~:text=Logistic%20regression%20is%20a%20statistical,a%20form%20of%20binary%20regression). My neural network uses [Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and [Dropout Layers](https://keras.io/api/layers/regularization_layers/dropout/) to reduce overfitting. Logistic Regression suits problems where the target variable is categorical (in this case, patients with and without heart disease).
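
As a rough illustration of how both models turn patient features into a 0/1 label, the sketch below applies a sigmoid to a weighted sum and thresholds the result at 0.5. The weights, bias, and feature values are made up for illustration; they are not taken from the trained models.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one patient; label 1 = heart disease."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical, already-scaled feature values and weights (illustration only).
example_features = [0.6, 1.0, 0.3]
example_weights = [1.2, -0.8, 2.0]
example_bias = -0.5

prob, label = predict(example_features, example_weights, example_bias)
print(f"probability={prob:.3f}, label={label}")
```

Logistic regression learns one such set of weights directly, while the neural network stacks hidden layers before a final sigmoid unit of the same form.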


| Type | Accuracy | Precision | Recall | F1-Score |
|--|--|--|--|--|
| Logistic Regression | 85% | 0 = 88%, 1 = 82% | 0 = 80%, 1 = 89% | 0 = 83%, 1 = 86% |
| Neural Network | 87% | 0 = 90%, 1 = 83% | 0 = 80%, 1 = 91% | 0 = 84%, 1 = 87% |

**[My GitHub](https://github.com/anyaiyer/heart-disease-predictor) for this project**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
heart = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
heart.head()

In [None]:
heart.info()

In [None]:
heart.describe()

## Exploratory Data Analysis

In [None]:
sns.set_theme()

Data visualization helps compare patient features and find the attributes most correlated with the presence of heart disease.
Plot types such as heatmaps, count plots, bar plots, and histograms reveal common patterns among patients with and without heart disease. My code
includes a few of these plots to compare and contrast patients.

More patients in this dataset have heart disease than do not.
More females than males have heart disease, and more females are included in this dataset.

In [None]:
sns.countplot(x='target',data=heart,hue='sex') 

Most patients are between 50 and 60 years old.

In [None]:
plt.figure(figsize=(12,6))
heart['age'].plot(kind='hist',bins=40)

In [None]:
heart.corr()

Attribute info:
- age
- sex
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0, 1, 2)
- maximum heart rate achieved
- exercise induced angina
- oldpeak = ST depression induced by exercise relative to rest
- the slope of the peak exercise ST segment
- number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(heart.corr(),cmap='viridis',annot=True)

The heatmap shows which features are most correlated with `target`.

Features most positively correlated with `target`:
- slope (slope of peak exercise ST segment) -> correlation of 0.35

- thalach (max heart rate achieved) -> correlation of 0.42

- restecg (resting electrocardiographic results) -> correlation of 0.14

- cp (chest pain type) -> correlation of 0.43 (the strongest positive correlation with target)
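
To make the numbers above concrete, here is a minimal sketch of the Pearson correlation coefficient, which is what `heart.corr()` computes by default. It is written in plain Python so the arithmetic is visible. The toy values are made up and are not rows from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical chest pain types and targets (illustration only).
toy_cp = [0, 1, 2, 3, 0, 2]
toy_target = [0, 1, 1, 1, 0, 0]

r = pearson(toy_cp, toy_target)
print(f"correlation between toy_cp and toy_target: {r:.2f}")
```

The coefficient ranges from -1 to 1. Values near 0 (such as 0.14 for restecg) indicate a weak linear relationship, and a negative value means the feature tends to decrease as `target` increases.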

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x='cp',y='target',data=heart)

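Each bar shows the mean of `target` for a chest pain type, that is, the share of patients of that type who have heart disease. Chest pain type 1 has the highest share.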

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='restecg',data=heart,hue='target')

Most patients with heart disease have a restecg of 1.

In [None]:
heart.corr()['target'][:-1].sort_values().plot(kind='bar')
plt.tight_layout()

Bar chart of each feature's correlation with the `target` column, sorted in ascending order.

In [None]:
plt.figure(figsize=(10,6))
heart['thalach'].plot(kind='hist',bins=40)

Most patients have a maximum heart rate (thalach) between 140 and 170.

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='slope',data=heart,hue='target')

Most patients with heart disease have a slope of 2.

## Data Preprocessing

In [None]:
heart.isnull().sum()

There are no null values.

In [None]:
heart.head()

The data is already clean, so there is no need to fill in missing values or convert any columns to numerical data.

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
heart.columns

In [None]:
X = heart.drop('target',axis=1).values
y = heart['target'].values

In [None]:
print(len(heart)) # data size is small

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

[Feature scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#:~:text=Transform%20features%20by%20scaling%20each,e.g.%20between%20zero%20and%20one) (here, min-max normalization) rescales each feature to a fixed range. This helps the model train,
because features with large numeric ranges no longer dominate features with small ones. `MinMaxScaler` transforms each feature
so that it lies within a given range (0 to 1 by default). The scaler is fit on the training data only and then applied to the test data, so no information from the test set leaks into training.
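
The scaling step can be sketched in plain Python as below. This is a simplified illustration of the min-max formula `x' = (x - min) / (max - min)`, not scikit-learn's actual implementation. The helper names and the toy (age, cholesterol) rows are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each column with x' = (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has max == min; map it to 0.0 to avoid dividing by zero.
            (x - lo) / (hi - lo) if hi != lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# Hypothetical (age, cholesterol) rows, for illustration only.
train_rows = [[40, 200], [50, 250], [60, 300]]
test_rows = [[55, 320]]

mins, maxs = fit_min_max(train_rows)                    # statistics come from training data only
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)      # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training rows, a test value outside the training range (the cholesterol of 320 here) scales to a value outside 0 to 1. The same happens with `scaler.transform(X_test)`.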


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)

## Create Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

In [None]:
model = Sequential()

model.add(Dense(40,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(20,activation='relu'))
model.add(Dropout(0.2))

# BINARY CLASSIFICATION so use sigmoid for the last layer
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

Using [EarlyStopping](https://en.wikipedia.org/wiki/Early_stopping) and [Dropout layers](https://keras.io/api/layers/regularization_layers/dropout/) helps prevent overfitting.
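
As a rough sketch of what a dropout layer does during training, the function below zeroes each activation with probability `rate` and scales the survivors by `1 / (1 - rate)` so the expected sum stays the same. It is a plain-Python illustration, not Keras code, and the `dropout` helper and the activation values are hypothetical.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: drop each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)                      # seeded so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up layer outputs
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

At prediction time dropout is switched off and the activations pass through unchanged. Randomly removing units during training discourages the network from relying on any single unit.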


In [None]:
early_stop = EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=25)

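To fit the model, we pass in `X_train` and `y_train`, the maximum number of epochs (full passes over the
training data), the validation data (here, the test set), the batch size (number of samples processed
before each update of the model parameters), and the early stopping callback. With `patience=25`, training stops once `val_loss` has not improved for 25 consecutive epochs. Because the test set doubles as validation data for early stopping, the test metrics reported later may be slightly optimistic.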
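
The patience rule can be sketched as follows. This is a simplified plain-Python model of early stopping, not the Keras callback itself. The function name and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the last epoch is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation losses, one per epoch.
toy_val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.63]

stopped = early_stopping_epoch(toy_val_losses, patience=3)
print(f"stopped at epoch {stopped}")
```

A larger patience, such as the 25 used here, tolerates longer plateaus before giving up. This matters on a small dataset, where the validation loss is noisy.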

In [None]:
model.fit(x=X_train,y=y_train,epochs=200,validation_data=(X_test,y_test),batch_size=30,callbacks=[early_stop])

## Model Evaluation

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

Eventually, the validation loss drops below the training loss. This is expected with dropout, which is active only during training, and it suggests that the model is not overfitting.

In [None]:
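# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(X_test) > 0.5).astype('int32')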

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))

In [None]:
sns.countplot(x='target',data=heart) # fairly balanced 

Recall on the positive class matters most here, because a missed case of heart disease (a false negative) is costlier than a false alarm. Accuracy is still a reasonable summary because the dataset is fairly balanced. Precision is less important than recall in this case.
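
To show how these metrics differ, the sketch below computes precision, recall, and accuracy for the positive class from raw labels. It is plain Python rather than the `classification_report` call used above. The labels are made up and are not the model's actual predictions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_accuracy(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# Hypothetical labels: every sick patient is caught, at the cost of two false alarms.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 0, 0]

precision, recall, accuracy = precision_recall_accuracy(toy_true, toy_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```

In this toy case recall is perfect while precision is lower. For a screening tool, that trade-off is usually preferable to the reverse.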

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model.save('heart-disease-predictor.h5')  # save model; reload later with load_model('heart-disease-predictor.h5')

In [None]:
model_loss # loss vs. val loss 

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression()

In [None]:
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(classification_report(y_test,predictions))
confusion_matrix(y_test,predictions)

In [None]:
acc = logmodel.score(X_test, y_test)*100

print("Test Accuracy {:.2f}%".format(acc))

In [None]:
import pickle

In [None]:
filename = "heart-disease-LR.pkl"  # save model with pickle

with open(filename, 'wb') as file:  
    pickle.dump(logmodel, file)