# Predicting Heart Disease
This notebook uses several Python-based machine learning and data science tools to predict whether a patient has heart disease. This is not a diagnostic tool.

The following approach will be used:
1. Problem Definition (see above)
2. Data Exploration
3. Evaluation 
4. What features contribute most
5. Modelling the data
6. Validation and Improvement

Original data from: https://archive.ics.uci.edu/ml/datasets/heart+disease

Data in CSV form from: https://www.kaggle.com/ronitf/heart-disease-uci

The data has 76 total columns, however only 14 of them are used in published experiments. I will also be using this 14 column subset.

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target

# Imports

In [None]:
#EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
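#plot_roc_curve was removed in scikit-learn 1.2; fall back to RocCurveDisplay there
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay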

## Load Data

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

# EDA

### EDA Checklist

1. What questions are we trying to solve?
2. What kind of data do we have?
3. What's missing?
4. What are the outliers?

In [None]:
# What do the classes we're trying to predict look like?
ax = df["target"].value_counts().plot(kind="bar", color=["blue", "orange"], title="Prevalence of Heart Disease in the data set")

df["target"].value_counts()

It looks like the two classes are (relatively) even. 0 = no heart disease, 1 = heart disease. Can we do better than roughly a coinflip?

What kind of data do we have?

In [None]:
df.info()

This implies that all the columns are numeric. This is technically true, but some columns are categories represented by numbers. 'cp' (chest pain type) and 'thal' (thallium stress test result) are categorical columns, and we'll need to encode them as such later on.

Any missing values?

In [None]:
df.isna().sum()

No missing values, which saves us a bit of imputing later on. 

In [None]:
df.describe()

How does the prevalence of heart disease differ by the sex of the patients in the study? Male = 1, Female = 0

In [None]:
pd.crosstab(df.target, df.sex)
pd.crosstab(df.target, df.sex).plot(kind="bar", color=["blue", "orange"], title="Prevalence of heart disease in Males vs. Females")
plt.ylabel("Number of patients")
plt.xlabel("0 - No Disease, 1 - Disease")
plt.legend(["Female", "Male"]);

Among female patients in the dataset, 72/(24+72) = 75% have heart disease.
Among male patients, 93/(114+93) ≈ 45% have heart disease.
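
The same percentages can be computed rather than worked out by hand. The sketch below rebuilds the crosstab from the counts quoted above (it assumes those counts match the real crosstab output) and divides each column by its total. On the notebook's own `df`, `pd.crosstab(df.target, df.sex, normalize="columns")` should give the same table directly.

```python
import pandas as pd

# Counts quoted in the text (rows: target, columns: sex).
# Assumption: these match the output of pd.crosstab(df.target, df.sex).
counts = pd.DataFrame({0: [24, 72], 1: [114, 93]}, index=[0, 1])
counts.index.name = "target"
counts.columns.name = "sex"

# Divide each column by its total to get the share of each target value per sex.
rates = counts / counts.sum(axis=0)
print(rates.round(2))
```

The row for `target == 1` holds the share of female (column 0) and male (column 1) patients with heart disease.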

## Age vs Max Heart Rate for Heart Disease

For the cell below, thalach is maximum heart rate achieved during the study

In [None]:
plt.figure(figsize=(10,6))

#Scatter plot with positive examples
plt.scatter(df.age[df.target==1],
           df.thalach[df.target==1],
           color="blue")

#Scatter plot with negative examples
plt.scatter(df.age[df.target==0],
           df.thalach[df.target==0],
           color="orange")

#One title and legend for the combined plot; the two series are disease status, not sex
plt.title("Maximum Heart Rate Achieved vs. Age of Patients")
plt.legend(["Disease", "No Disease"])
plt.xlabel("Age of patient")
plt.ylabel("Max heart rate achieved");

In this case, it's tough to say that heart disease correlates clearly with either variable. Maximum heart rate does appear to trend downward with age for both positive and negative cases.

## Does chest pain correlate to heart disease?

In [None]:
pd.crosstab(df.cp, df.target)
pd.crosstab(df.target, df.cp).plot(kind="bar", color=["blue", "orange", "Red", "Green"])
plt.legend(["Typical Angina", "Atypical Angina", "Non-Anginal", "Asymptomatic"])
plt.title("Occurrences of heart disease per chest pain type");

It looks like a lot of patients without heart disease suffer from angina, while the most common type of chest pain for those with heart disease is non-anginal pain.

### Correlation Matrix

In [None]:
corr_matrix = df.corr()
plt.subplots(figsize=(15,10))
ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f")

Based on the correlation matrix, only a few features show almost no correlation (positive or negative) with the target. Maybe we can remove chol and fbs from the training and test data later? This is something to look into, at the very least.
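
A quick way to shortlist weakly correlated features is to sort the absolute correlations with the target. The sketch below uses a small synthetic frame (the `signal` and `noise` columns are made up for illustration, not taken from the heart disease data); on the notebook's data, the same expression would be applied to `corr_matrix["target"]`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one informative and one uninformative feature.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "signal": target + rng.normal(0, 0.5, size=n),  # built to track the target
    "noise": rng.normal(0, 1, size=n),              # unrelated to the target
    "target": target,
})

# Absolute correlation of each feature with the target, weakest first.
target_corr = toy.corr()["target"].drop("target").abs().sort_values()
print(target_corr)
```

Features at the top of this list are candidates for removal, though a low linear correlation alone does not prove a feature is useless to a non-linear model.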

# Now let's actually do some modelling

In [None]:
X = df.drop("target", axis=1)
y = df["target"]

#split data into train and test sets (fixed random_state so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)




# One Hot Encoding to handle categorical data

In [None]:
one_hot = OneHotEncoder(handle_unknown='ignore')
cat_features = ["cp", "thal"]
transformer = ColumnTransformer([("one_hot", one_hot, cat_features)], remainder="passthrough")
X_train_onehot = pd.DataFrame(transformer.fit_transform(X_train))
#only transform the test set, so it reuses the categories learned from the training set
X_test_onehot = pd.DataFrame(transformer.transform(X_test))


I'm going to scale the data to a 0-1 scale. While technically not needed for all algorithms, it doesn't hurt and will allow other algorithms to be used if we so choose. Be sure to fit the scaler on the training split only and then apply it to the test split, to avoid leakage.

In [None]:
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_onehot))
#reuse the min/max learned from the training set; do not refit on the test set
X_test_scaled = pd.DataFrame(scaler.transform(X_test_onehot))
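
One way to keep test data out of the fitting step is to chain the encoder and scaler in a scikit-learn `Pipeline`, so that `fit_transform` is only ever called on the training split. This is a minimal sketch on synthetic data, not the notebook's actual preprocessing; the toy columns and the names `toy_X`, `toy_y` and `preprocess` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in with two categorical columns and one numeric column.
rng = np.random.default_rng(42)
n = 100
toy_X = pd.DataFrame({
    "cp": rng.integers(0, 4, size=n),
    "thal": rng.integers(0, 4, size=n),
    "age": rng.integers(29, 78, size=n),
})
toy_y = rng.integers(0, 2, size=n)
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# sparse_threshold=0 forces dense output, which MinMaxScaler requires.
preprocess = Pipeline([
    ("encode", ColumnTransformer(
        [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["cp", "thal"])],
        remainder="passthrough",
        sparse_threshold=0,
    )),
    ("scale", MinMaxScaler()),
])

Xtr_prepared = preprocess.fit_transform(Xtr)  # learn categories and min/max from train only
Xte_prepared = preprocess.transform(Xte)      # apply the same mapping to test
print(Xtr_prepared.shape, Xte_prepared.shape)
```

Because the test split is only transformed, both arrays are guaranteed to have the same columns, even if a category happens to be absent from the test rows.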

Testing on 3 different models
1. K Nearest Neighbors
2. Random Forest
3. Logistic Regression

Scaling was really done for the sake of K Nearest Neighbors, which is distance-based, so features with larger ranges would otherwise dominate the distance calculation.

In [None]:
models = {"Logistic Regression": LogisticRegression(),
         "KNN": KNeighborsClassifier(),
         "Random Forest": RandomForestClassifier()}

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and scores the given machine learning models on the given input data
    """
    
    #set random seed for reproducibility
    np.random.seed(42)
    #make dictionary to keep scores
    model_scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models, X_train_scaled, X_test_scaled, y_train, y_test)
model_scores

### Quick plot of these results

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();

Accuracy is not necessarily the best measure of a model, but it isn't a waste of time to look at it either. Initially, this shows RandomForest doing the "best" and Logistic Regression doing the "worst". Now with a baseline, I can tune the models.
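
To illustrate why accuracy alone can mislead, here is a small hand-made example (the labels are invented, not model output from this notebook) in which a classifier misses most positive cases yet still scores 70% accuracy. `accuracy_score` is not in the import cell above; the other three metrics are.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels: 4 patients with disease, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious classifier that flags only one patient as positive.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)     # share of all predictions that are correct
precision = precision_score(y_true, y_pred)   # share of predicted positives that are real
recall = recall_score(y_true, y_pred)         # share of real positives that were found
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a medical screening problem, the low recall here would matter far more than the respectable accuracy.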

# Hyperparameter tuning

In [None]:
#Tuning KNN
train_scores = []
test_scores = []
np.random.seed(42)
#Create a different list for n neighbors
neighbors = range(1,21)
knn = KNeighborsClassifier()

#loop over values for neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled, y_train))
    
    test_scores.append(knn.score(X_test_scaled, y_test))

plt.plot(neighbors, train_scores, label = "Train Score")
plt.plot(neighbors, test_scores, label = "Test Score")
plt.legend();
print(f"The max accuracy for KNN is {max(test_scores)*100:.2f}% ")

By changing the number of neighbors, I'm able to improve the accuracy of the KNN algorithm a bit, but not by much. Looping over values by hand like this is an inefficient way to tune hyperparameters. We'll come back to this in a bit. Let's take a look at the other two models now.
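
As a possible alternative to the manual loop, `GridSearchCV` can try each `n_neighbors` value with cross-validation on the training data, which also avoids picking the value by peeking at the test set. This is a sketch on synthetic data from `make_classification`; the names `X_toy`, `y_toy` and `knn_search` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the scaled training set.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# Try n_neighbors from 1 to 20, scoring each with 5-fold cross-validation.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
knn_search.fit(X_toy, y_toy)

print(knn_search.best_params_, round(knn_search.best_score_, 3))
```

On the notebook's data, the `fit` call would take `X_train_scaled` and `y_train`, leaving the test set for a single final evaluation.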

# Using RandomizedSearchCV to tune hyperparameters

In [None]:
# create a hyperparameter grid for logisticRegression

log_reg_grid = {"C":np.logspace(-4, 4, 20),
               "solver": ["liblinear"]}

#create grid for RandomForest

rf_grid = {"n_estimators":np.arange(10, 1000, 50),
          "max_depth":[None, 3, 5, 10],
          "min_samples_split":np.arange(2,20,2),
          "min_samples_leaf": np.arange(1,20,2)}

In [None]:
#Tune Logistic Regression

np.random.seed(42)

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                               param_distributions = log_reg_grid,
                               cv=5, 
                               n_iter=20,
                               verbose=True)
#fit hyperparameter search model for Logistic Regression
rs_log_reg.fit(X_train_scaled, y_train)

In [None]:
rs_log_reg.score(X_test_scaled, y_test)

It looks like just about the same score. Trying RandomForest now.

In [None]:
np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(), 
                          param_distributions=rf_grid, 
                          cv=5,
                          n_iter=20,
                          verbose=2)
rs_rf.fit(X_train_scaled, y_train)


In [None]:
#print the best parameters explicitly; only the last expression in a cell is displayed
print(rs_rf.best_params_)
rs_rf.score(X_test_scaled, y_test)

Again, about the same score as before the tuning. This doesn't mean that tuning never helps, just that it hasn't helped in this specific case.