# Predicting Heart Disease Using Scikit-Learn

This notebook uses a logistic regression scikit-learn pipeline to predict heart disease in a patient given a set of clinical measurements.

## Problem Definition

In a statement,
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

## Data

The original data comes from the Cleveland dataset (`processed.cleveland.data`) in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

A version of it is also available on Kaggle, but its `thal` field does not match the original: https://www.kaggle.com/ronitf/heart-disease-uci

## Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll deploy the pipeline.

## Features
  
1. age - age in years
2. sex - (1 = male; 0 = female)
3. cp - chest pain type
    * 1 - typical angina
    * 2 - atypical angina
    * 3 - non-anginal pain
    * 4 - asymptomatic
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
5. chol - serum cholesterol in mg/dl
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg - resting electrocardiographic results
    * 0 - normal
    * 1 - having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
    * 2 - showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - the slope of the peak exercise ST segment
    * 1 - upsloping
    * 2 - flat
    * 3 - downsloping
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal
    * 3 - normal
    * 6 - fixed defect
    * 7 - reversible defect
14. target - 0,1,2,3,4 (where > 0 indicates heart disease)


## Preparing the tools

We're going to use standard tools in the Data Scientist's toolbox.

In [89]:
# Import all the tools we need
# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("seaborn")
import seaborn as sns

# Data Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Model from Scikit-Learn
from sklearn.linear_model import LogisticRegression

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix

Google Cloud will require scikit-learn version 0.22.1, so let's double-check that we have the right version installed.

In [2]:
import sklearn
print(f"sklearn version: {sklearn.__version__}")

sklearn version: 0.22.1
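
Printing the version relies on someone reading the output. A minimal sketch of a stricter check follows; `versions_match` is a hypothetical helper (not part of the original notebook), and it assumes an exact match with the target version is what the deployment needs.

```python
def versions_match(installed: str, required: str = "0.22.1") -> bool:
    """Return True only when the installed version string equals the required one."""
    return installed.strip() == required.strip()


# Hypothetical usage inside the notebook:
#   import sklearn
#   assert versions_match(sklearn.__version__), "unexpected scikit-learn version"
print(versions_match("0.22.1"))
print(versions_match("1.3.0"))
```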


## Load Data

We've already downloaded the dataset and saved it in the data folder. If you would like to download it directly, the dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).

In [215]:
col_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope","ca", "thal", "target"]

df = pd.read_csv("../data/processed.cleveland.data", names = col_names, na_values = "?")
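
The `na_values = "?"` argument tells pandas to treat question marks as missing values. A small sketch of the effect, using two illustrative rows written by hand in the same comma-separated layout (they are assumptions for the example, not rows quoted from the file):

```python
import io

import pandas as pd

col_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Illustrative rows only; the "?" in the first row stands for a missing `ca` value
sample = io.StringIO(
    "60.0,1.0,4.0,130.0,250.0,0.0,2.0,140.0,1.0,1.5,2.0,?,7.0,2\n"
    "45.0,0.0,2.0,120.0,210.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0\n"
)

sample_df = pd.read_csv(sample, names=col_names, na_values="?")

# Count missing values per column and show only the columns that have any
missing_per_column = sample_df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Running the same `isna().sum()` check on `df` shows which columns the imputers in the pipeline will have to fill.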

## Data Transformations Required
A few data transformations need to take place before we can fit a model. Based on the data type descriptions on the UCI Machine Learning Repository, we will convert some of the fields to be categorical. Namely,  
* Chest Pain Type (cp)
* Resting Electrocardiographic Results (restecg)
* thal

We'll create a dictionary for each of these fields and replace the numeric values with their categorical counterparts. We also need to re-code our target so that it is binary.

In [216]:
def num_to_label_conversion(df):
    """Map the numeric codes in cp, restecg and thal to descriptive labels."""

    cp_dict = {1: "typical angina",
               2: "atypical angina",
               3: "non-anginal pain",
               4: "asymptomatic"}

    restecg_dict = {0: "normal",
                    1: "wave abnormality",
                    2: "ventricular hypertrophy"}

    thal_dict = {3: "normal",
                 6: "fixed defect",
                 7: "reversible defect"}

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    df["cp"] = df["cp"].replace(cp_dict)
    df["restecg"] = df["restecg"].replace(restecg_dict)
    df["thal"] = df["thal"].replace(thal_dict)

    return df

In [217]:
# Split data into X and y
X = df.drop("target", axis = 1)
y = (df["target"] > 0).astype(int)
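
The comparison `df["target"] > 0` is what collapses the five original target levels into a binary label. A minimal sketch on a toy series (the values are illustrative, not taken from the data set):

```python
import pandas as pd

# Toy target values covering every level described in the feature list (0-4)
raw_target = pd.Series([0, 1, 2, 3, 4, 0])

# Anything above 0 indicates heart disease, so it becomes 1; 0 stays 0
binary_target = (raw_target > 0).astype(int)

print(binary_target.tolist())  # [0, 1, 1, 1, 1, 0]
```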

In [232]:
# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2,
                                                    random_state = 123)
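
To see what `test_size = 0.2` means in row counts, here is a small sketch that splits a stand-in array of row indices. It assumes the processed Cleveland file has 303 rows; `len(df)` confirms the actual figure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the data set: one index per row (303 rows is an assumption)
row_ids = np.arange(303)

train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=123)

# scikit-learn rounds the test share up: ceil(0.2 * 303) = 61 rows
print(len(train_ids), len(test_ids))  # 242 61
```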

## Hyperparameter Tuning with GridSearchCV

Next, we'll tune the regularization strength `C` of a `LogisticRegression` model using `GridSearchCV`. We'll also use a scikit-learn `Pipeline` to chain the pre-processing steps and the model.
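
Before the full pipeline, a minimal sketch of how a `ColumnTransformer` routes groups of columns through separate transformations. The toy frame and the names `toy`, `toy_preprocessor` and `toy_features` are assumptions made for illustration; the real pipeline uses more columns and an extra labelling step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one categorical, one numeric (with a gap) and one binary column
toy = pd.DataFrame({
    "cp": [1, 2, 4, 4],
    "chol": [233.0, 286.0, np.nan, 250.0],
    "sex": [1, 0, 1, 1],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Three distinct cp values become three indicator columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
    # Fill the gap, then standardize
    ("num", Pipeline(steps=[("impute", SimpleImputer(strategy="most_frequent")),
                            ("scaler", StandardScaler())]), ["chol"]),
    # Binary columns are passed through unchanged
    ("bin", "passthrough", ["sex"]),
])

toy_features = toy_preprocessor.fit_transform(toy)
print(toy_features.shape)  # (4, 5): 3 one-hot + 1 scaled + 1 binary
```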

In [242]:
# Let's split up our features into three different groups that will undergo separate transformations:
cat_vars = ["cp","restecg","thal"]
num_vars = ["age", "trestbps", "chol", "thalach", "oldpeak", "slope","ca"]
bin_vars = ["sex", "fbs", "exang"]


cat_transformer = Pipeline(steps = [("cat_encoding", FunctionTransformer(num_to_label_conversion)),
                                    ("impute", SimpleImputer(strategy = "most_frequent")),
                                    ("ohe", OneHotEncoder())])
                                   

num_transformer = Pipeline(steps = [("impute", SimpleImputer(strategy = "most_frequent")),
                                    ("scaler", StandardScaler())])

bin_transformer = Pipeline(steps = [("impute", SimpleImputer(strategy = "most_frequent"))])

# The target was already binarized when y was created, so no target transformer is needed
preprocessor = ColumnTransformer(transformers = [('cat', cat_transformer, cat_vars),
                                                  ('num', num_transformer, num_vars),
                                                  ('bin', bin_transformer, bin_vars)],
                                  remainder = "drop")

log_model_pipeline = Pipeline(steps = [
    ("preprocessing", preprocessor),
    ("model", LogisticRegression())])

# Different hyperparameters for the LogisticRegression model
log_param_grid = {"model__C": np.logspace(-4, 4, 30)}

# Fit grid hyperparameter search model
gs_log_model = GridSearchCV(log_model_pipeline, log_param_grid, cv = 5)
gs_log_model.fit(X_train, y_train)

In [241]:
# Check best cross-validated score
gs_log_model.best_score_

0.8511904761904763

In [235]:
# Check the best hyperparameters
gs_log_model.best_params_

{'model__C': 0.20433597178569418}
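
The key `model__C` follows scikit-learn's `<step name>__<parameter>` convention for addressing a parameter of a pipeline step. A short sketch on a stand-in pipeline (`toy_pipeline` is an assumption for illustration) that also inspects the grid searched above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                               ("model", LogisticRegression())])

# "model__C" targets the C parameter of the step named "model"
toy_pipeline.set_params(model__C=0.5)
print(toy_pipeline.named_steps["model"].C)

# The same grid as in the search: 30 values spaced evenly on a log scale
c_grid = np.logspace(-4, 4, 30)
print(c_grid[0], c_grid[-1], len(c_grid))

# The reported best value is one of the grid points
print(np.isclose(c_grid, 0.20433597178569418).any())
```

Because the grid is logarithmic, neighbouring candidates differ by a constant factor rather than a constant step, which suits a regularization parameter whose useful range spans several orders of magnitude.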

In [236]:
# Evaluate the grid search LogisticRegression model
gs_log_model.score(X_test, y_test)

0.819672131147541

In [238]:
y_preds = gs_log_model.predict(X_test)
confusion_matrix(y_test, y_preds)

array([[29,  4],
       [ 7, 21]])
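
In scikit-learn's layout, rows of the confusion matrix are true labels and columns are predicted labels. A worked sketch that derives accuracy, precision and recall from the matrix above (the variable names are chosen for illustration):

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[29, 4],
               [7, 21]])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (21 + 29) / 61
precision = tp / (tp + fp)        # 21 / 25
recall = tp / (tp + fn)           # 21 / 28

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy matches the test score reported earlier (about 0.82), which is below the 95% target set in the Evaluation section. The recall shows that 7 of the 28 patients with heart disease in the test set were missed.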

## Export Model

In [12]:
import pickle

In [None]:
with open('gs_log_model_v1.pkl', 'wb') as model_file:
    pickle.dump(gs_log_model, model_file)
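
A quick way to gain confidence in an exported model is to load it back and compare predictions. The sketch below does this round trip in memory with a small stand-in model trained on synthetic data (all names and data here are assumptions for illustration); the same idea applies to `gs_log_model_v1.pkl` with `pickle.load`.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the label depends on the sign of the first feature
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file
restored_model = pickle.loads(pickle.dumps(toy_model))

same_predictions = np.array_equal(toy_model.predict(X_toy),
                                  restored_model.predict(X_toy))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that created them, which is why the version check earlier in the notebook matters.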