# Predicting heart disease using machine learning

This notebook looks into various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes.

We're going to take the following approach:
1. problem definition
2. data
3. evaluation
4. features
5. modelling
6. experimentation

# problem
In a statement:
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

# data
> The original data came from the Cleveland data in the UCI Machine Learning Repository.

# evaluation
> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.

# features
This is where we get information about each feature in the data.


**Create data dictionary**

* age: age in years
* sex: sex (1 = male; 0 = female)
* cp: chest pain type
    * 1: typical angina
    * 2: atypical angina
    * 3: non-anginal pain
    * 4: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern
* chol: serum cholesterol in mg/dl
    * serum = LDL + HDL + 0.2 * triglycerides
    * above 200 is cause for concern
* fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
    * 126 mg/dl signals diabetes
* restecg: resting electrocardiographic results
    * 0: normal
    * 1: having ST-T wave abnormality
        * can range from mild symptoms to severe problems
        * signals non-normal heart beat
    * 2: showing probable or definite left ventricular hypertrophy
* thalach: maximum heart rate achieved
* exang: exercise-induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
    * 1: upsloping: better heart rate with exercise
    * 2: flat: minimal change (typical healthy heart)
    * 3: downsloping: signs of an unhealthy heart
* ca: number of major vessels (0-3) colored by fluoroscopy
    * a colored vessel means the doctor can see the blood passing through
    * the more blood movement the better
* thal: thallium stress test result
    * 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect: no proper blood movement when exercising
* target: has heart disease or not (1 = yes; 0 = no)
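
A data dictionary is also handy in code form. The sketch below decodes the numeric `sex` and `cp` codes into readable labels with `Series.map`. The names `CP_LABELS`, `SEX_LABELS` and `sample` are hypothetical, and the rows are made up. Some copies of this dataset may use different codings (for example 0-3 for `cp`), so it is worth checking `df.cp.unique()` before reusing the mapping.

```python
import pandas as pd

# Hypothetical label maps copied from the data dictionary above
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
SEX_LABELS = {0: "female", 1: "male"}

# Tiny made-up sample, not real patient data
sample = pd.DataFrame({"sex": [1, 0, 1], "cp": [1, 3, 4]})

# Replace the numeric codes with their labels; unknown codes would become NaN
decoded = sample.assign(
    sex=sample["sex"].map(SEX_LABELS),
    cp=sample["cp"].map(CP_LABELS),
)
print(decoded)
```

Decoding is only for reading and plotting; the models further down still need the numeric columns.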

# Preparing the tools

We're going to use pandas, NumPy, Matplotlib and seaborn for data analysis, manipulation and plotting, and scikit-learn for modelling and evaluation.

In [None]:
# import all the tools we need

# regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#  model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV , GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
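from sklearn.metrics import RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement under the old name
plot_roc_curve = RocCurveDisplay.from_estimator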


# load data

In [None]:
df = pd.read_csv('heart-disease.csv')
df.head()

## data exploration (exploratory data analysis or EDA)

The goal here is to find out more about the data and become a subject matter expert on the dataset.

1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
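
Several of these questions can be answered with a few pandas calls. The sketch below runs on a made-up `toy` DataFrame (not the real data) and checks class balance, missing values and outliers. The 1.5 x IQR rule used for outliers is one common convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real DataFrame
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 120],
    "chol": [233, 250, np.nan, 236, 354, 192],
    "target": [1, 1, 1, 0, 0, 0],
})

# Share of each class in the target column
class_balance = toy["target"].value_counts(normalize=True)

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Flag ages outside 1.5 * IQR of the quartiles as potential outliers
q1, q3 = toy["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["age"] < q1 - 1.5 * iqr) | (toy["age"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(class_balance, missing_per_column, outliers, sep="\n\n")
```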

In [None]:
# let's find out how many samples of each class there are
df['target'].value_counts()

In [None]:
df['target'].value_counts().plot(kind = 'bar', color = ['salmon', 'lightblue']);

In [None]:
df.info()

In [None]:
df.isna().sum()

# Heart disease frequency according to sex

In [None]:
df.sex.value_counts()

In [None]:
# compare target column with sex column
pd.crosstab(df.target, df.sex)
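
Raw counts are hard to compare when the two groups differ in size. As a sketch on made-up data, `pd.crosstab` with `normalize='index'` turns each row into proportions, so the disease rate per sex can be read off directly. The `toy` frame and `rates` name are illustrative only.

```python
import pandas as pd

# Made-up data: 4 rows with sex = 0 and 6 rows with sex = 1
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="index" divides every row by its row total
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```

Here `rates.loc[0, 1]` is the share of rows with sex = 0 that have target = 1.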

In [None]:
# create a plot of crosstab
pd.crosstab(df.target, df.sex).plot(kind = 'bar', 
                                    figsize = (10, 6), 
                                    color =['salmon', 'lightblue'])
plt.title('heart disease frequency for sex')
plt.xlabel('0 = no disease , 1 = disease')
plt.ylabel('amount')
plt.legend(['female', 'male'])
plt.xticks(rotation =0)

# age vs. max heart rate for heart disease

In [None]:
# create another figure
plt.figure(figsize =(10, 6))

# scatter with positive cases
plt.scatter(df.age[df.target ==1],
            df.thalach[df.target == 1],
            c = 'salmon')

# scatter for negative cases
plt.scatter(df.age[df.target ==0],
            df.thalach[df.target == 0],
            c = 'blue')


plt.title('age vs. max heart rate for heart disease')
plt.xlabel('age')
plt.ylabel('max heart rate')
plt.legend(['disease', 'not disease'])

In [None]:
# check the distribution of age with a histogram
df.age.plot.hist()

# heart disease frequency per chest pain type

cp: chest pain type
1. typical angina: chest pain related to decreased blood supply to the heart
2. atypical angina: chest pain not related to the heart
3. non-anginal pain: typically esophageal spasms
4. asymptomatic: chest pain not showing signs of disease

In [None]:
df.cp.value_counts()

In [None]:
pd.crosstab(df.target, df.cp)

In [None]:
pd.crosstab(df.cp, df.target).plot(kind = 'bar',
                                   figsize=(10,6),
                                   color = ['salmon', 'pink'])

plt.title("heart disease frequency per chest pain type")
plt.xlabel('chest pain')
plt.ylabel('amount')
plt.legend(['not disease', 'disease'])

In [None]:
df.head()

# make a correlation matrix

In [None]:
df.corr()

In [None]:
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot = True,
                 linewidths= 0.5,
                 fmt = '.2f',
                 cmap = 'YlGnBu')

A positive value means a positive correlation: as one variable increases, the other tends to increase as well. For example, cp vs. target: as the cp value goes up, target = 1 becomes more common (see the chest pain plot above).

A negative value means a negative correlation: as one variable increases, the other tends to decrease. For example, exang vs. target: when exang is 1, target is more often 0.
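
To make the sign of a correlation concrete, here is a minimal sketch with made-up numbers: one variable that rises with `x` and one that falls with it. `np.corrcoef` returns the Pearson correlation matrix, and the off-diagonal entry is the coefficient between the two inputs.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1      # rises whenever x rises
y_down = -3 * x + 10  # falls whenever x rises

# Entry [0, 1] of the 2x2 matrix is the correlation between the two arrays
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"rising pair:  {r_up:+.2f}")
print(f"falling pair: {r_down:+.2f}")
```

Real features are rarely perfectly linear, so the values in the heatmap sit somewhere between -1 and +1.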

# 5. modelling

In [None]:
df.head()

In [None]:
# split data into x and y
x = df.drop('target', axis =1)
y = df.target

x.head()

In [None]:
y.head()

In [None]:
# split data into train and test
np.random.seed(42)

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2)



Now that we've got our data split into training and test sets, it's time to build a machine learning model.

we'll train it (find patterns) on train set

we'll test it (using patterns) on test set
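
As a side note on the split itself, `train_test_split` also accepts `random_state` and `stratify`. The sketch below uses made-up arrays (`X_toy`, `y_toy`) to show that a stratified 80/20 split keeps the class ratio the same in the test set. Whether stratifying helps on the real data is an assumption to verify, not something this notebook tested.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, perfectly balanced classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_train), len(X_test), y_test.sum())
```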

We are going to try 3 different machine learning models:
1. logistic regression
2. k-nearest neighbour classifier
3. random forest classifier

In [None]:
# put models in a dictionary
clf = { 'logistic regressor' : LogisticRegression(),
         'k-n neighbour' : KNeighborsClassifier(),
         'random forest' : RandomForestClassifier()}

# create a function to fit and score models
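def fit_and_score(clf, xtrain, xtest, ytrain, ytest):
    """
    Fits and evaluates the given machine learning models.
    clf : dict of {name: scikit-learn estimator}
    xtrain, xtest : training and test features
    ytrain, ytest : training and test labels
    Returns a dict of {name: accuracy on the test set}.
    """
    # set random seed for reproducible results
    np.random.seed(42)
    # make a dictionary to store model scores
    model_scores = {}
    # loop through models
    for name, model in clf.items():
        # fit the model to the training data
        model.fit(xtrain, ytrain)
        # evaluate the model and store its score in model_scores
        model_scores[name] = model.score(xtest, ytest)
    return model_scores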


In [None]:
model_scores = fit_and_score(clf = clf,
                             xtrain = xtrain,
                             xtest = xtest, 
                             ytrain = ytrain, ytest=ytest)
model_scores

In [None]:
model_compare = pd.DataFrame(model_scores, index =['accuracy'])
model_compare.T.plot.bar();

Now we've got a baseline model, and we know a model's first predictions aren't always what we should base our next steps on. What should we do?

Let's look at the following:
* hyperparameter tuning
* feature importance
* confusion matrix
* cross-validation
* precision
* recall
* f1 score
* classification report
* ROC curve
* area under the curve (AUC)

## hyperparameter tuning (by hand)

In [None]:
# let's tune KNN

train_scores = []
test_scores = []

# create a list of different values of n_neighbours
neighbours = range(1, 21)

# set up KNN instance
KNN = KNeighborsClassifier()

# loop through different n_neighbours
for i in neighbours:
    KNN.set_params(n_neighbors = i)

    # fit the algorithm
    KNN.fit(xtrain, ytrain)
    
    # update the training score list
    train_scores.append(KNN.score(xtrain, ytrain))
    
    # update the test score list
    test_scores.append(KNN.score(xtest, ytest))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
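# plot each score list against the same x values so the two curves line up
plt.plot(neighbours, train_scores, label='train scores')
plt.plot(neighbours, test_scores, label='test scores')
plt.xlabel('no. of neighbours')
plt.ylabel('model score')
plt.xticks(np.arange(1, 21, 1))
plt.legend()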

print(f'max knn score on the test data : {max(test_scores) * 100 :.2f}%')
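
The maximum score alone does not say which `n_neighbors` produced it. A minimal sketch, using made-up scores in `toy_test_scores` rather than the real results, looks up the matching value of k. Note that choosing k on the test set makes the reported score optimistic; cross-validation on the training data is the safer way to pick it.

```python
# Same range of candidate values as in the loop above
neighbours = range(1, 21)

# Made-up test scores, one per value of k (illustration only)
toy_test_scores = [
    0.62, 0.64, 0.66, 0.67, 0.69, 0.72, 0.70, 0.69, 0.69, 0.70,
    0.75, 0.74, 0.74, 0.73, 0.69, 0.72, 0.69, 0.69, 0.70, 0.66,
]

# Index of the highest score (first one wins on ties)
best_index = max(range(len(toy_test_scores)), key=toy_test_scores.__getitem__)
best_k = neighbours[best_index]

print(f"best k: {best_k}, score: {toy_test_scores[best_index]:.2f}")
```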

# hyperparameter tuning with RandomizedSearchCV

We're going to tune:
* LogisticRegression
* RandomForestClassifier

In [None]:
# create a hyperparameter grid for logistic regression
lr_grid = {'C': np.logspace(-4, 4, 20),
           'solver': ['liblinear']}

# create a hyperparameter grid for random forest
rf_grid = {'n_estimators': np.arange(10, 1000, 50),
           'max_depth': [None, 3, 5, 10],
           'min_samples_split': np.arange(2, 20, 2),
           'min_samples_leaf': np.arange(1, 20, 2)}

Now that we've got the hyperparameter grids set up, let's tune our models using RandomizedSearchCV.
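
It helps to know how large each grid is before searching it. The sketch below rebuilds the two grids from the cell above and counts their combinations; `grid_size` is a hypothetical helper. With `n_iter=20`, a randomized search covers the whole logistic regression grid but only a small fraction of the random forest grid.

```python
import numpy as np

lr_grid = {"C": np.logspace(-4, 4, 20),
           "solver": ["liblinear"]}

rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}


def grid_size(grid):
    """Hypothetical helper: number of distinct parameter combinations."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


n_iter = 20
lr_size = grid_size(lr_grid)
rf_size = grid_size(rf_grid)

print(f"logistic regression: {lr_size} combinations")
print(f"random forest: {rf_size} combinations, "
      f"{n_iter / rf_size:.2%} sampled with n_iter={n_iter}")
```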

In [None]:
# Tune LogisticRegression

np.random.seed(42)

# set up random hyperparameter search for LogisticRegression
rs_lr = RandomizedSearchCV(LogisticRegression(),
                           param_distributions=lr_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

# fit random hyperparameter search for LogisticRegression
rs_lr.fit(xtrain, ytrain)


In [None]:
rs_lr.best_params_

In [None]:
rs_lr.score(xtest, ytest)

In [None]:
# tune random forest classifier
np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv = 5, 
                           n_iter = 20, 
                           verbose=True)

rs_rf.fit(xtrain, ytrain)

In [None]:
rs_rf.score(xtest, ytest)

Since the logistic regression model has worked best so far, we'll try to improve it further with GridSearchCV.

# hyperparameter tuning with GridSearchCV

In [None]:
# create a hyperparameter grid for the LogisticRegression model
lr_grid = {'C': np.logspace(-4, 4, 30),
           'solver': ['liblinear']}

# set up grid hyperparameter search for LogisticRegression
lr_gs = GridSearchCV(LogisticRegression(),
                     param_grid= lr_grid,
                     cv = 5, 
                     verbose=True)

lr_gs.fit(xtrain, ytrain)

In [None]:
lr_gs.best_params_

In [None]:
lr_gs.score(xtest, ytest)
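
Unlike the randomized search, a grid search tries every combination. The sketch below runs the same kind of search on synthetic data from `make_classification` (an assumption made so the example is self-contained) and counts the work involved: 30 candidate values of `C` times 5 folds, plus one final refit on the full training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the heart disease data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

grid = {"C": np.logspace(-4, 4, 30), "solver": ["liblinear"]}
search = GridSearchCV(LogisticRegression(), param_grid=grid, cv=5)
search.fit(X_toy, y_toy)

# Every candidate is fitted once per fold
n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_

print(f"{n_candidates} candidates x {search.n_splits_} folds = {n_fits} fits")
print("best params:", search.best_params_)
```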

## evaluating our tuned machine learning model beyond accuracy

* roc curve and area under the curve 
* confusion matrix 
* classification report
* precision
* recall
* f1 score

... and it would be great if cross-validation was used where possible.

To make comparisons and evaluate our trained model, we first need to make predictions.

In [None]:
# let's make predictions with tuned model
y_preds = lr_gs.predict(xtest)

In [None]:
y_preds

In [None]:
ytest

In [None]:
# plot ROC curve and calculate the AUC metric
plot_roc_curve(lr_gs, xtest, ytest)

In [None]:
# confusion matrix (true labels first, predictions second)
print(confusion_matrix(ytest, y_preds))
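
The argument order matters because scikit-learn puts true labels on the rows and predicted labels on the columns. A small sketch with made-up labels shows the layout and how to unpack the four cells.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```

Swapping the two arguments transposes the matrix, which silently exchanges false positives and false negatives.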

In [None]:
sns.set(font_scale = 1.5)

def plot_conf_metrics(ytest, y_preds):
    """
    Plots a confusion matrix using seaborn's heatmap().
    Rows are true labels, columns are predicted labels.
    """
    fig, ax = plt.subplots(figsize = (3,3))
    ax = sns.heatmap(confusion_matrix(ytest, y_preds),
                     annot = True,
                     cbar = False)
    plt.xlabel('predicted labels')
    plt.ylabel('true labels')
    
plot_conf_metrics(ytest, y_preds)
    
    

Let's get a classification report as well as cross-validated precision, recall and F1 score.

In [None]:
print(classification_report(ytest, y_preds))
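
The numbers in the report come straight from confusion matrix counts. As a worked example with made-up counts (`tp`, `fp`, `fn` are illustrative, not results from this notebook), precision, recall and F1 for the positive class can be computed by hand:

```python
# Made-up counts for the positive class
tp = 3  # true positives: predicted disease, has disease
fp = 2  # false positives: predicted disease, no disease
fn = 1  # false negatives: predicted no disease, has disease

# Of everything predicted positive, how much was right?
precision = tp / (tp + fp)

# Of everything actually positive, how much was found?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```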

We're going to calculate accuracy, precision, recall and F1 using cross-validation, and to do this we'll use cross_val_score.

In [None]:
# check best parameters
lr_gs.best_params_

In [None]:
# create new classifier with best params
clf = LogisticRegression(C = 0.20433597178569418, solver = 'liblinear')

In [None]:
# cross validated accuracy
cv_acc = cross_val_score(clf, x, y, cv = 5, scoring ='accuracy')
cv_acc = np.mean(cv_acc)
cv_acc

In [None]:
# cross validated precision
cv_pre = cross_val_score(clf, x, y, scoring = 'precision')
cv_pre = np.mean(cv_pre)
cv_pre

In [None]:
# cross validated recall
cv_rec = cross_val_score(clf, x, y, scoring = 'recall')
cv_rec = np.mean(cv_rec)
cv_rec

In [None]:
# cross validated f1 score
cv_f1 = cross_val_score(clf, x, y, scoring = 'f1')
cv_f1 = np.mean(cv_f1)
cv_f1
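
The four cells above each rerun cross-validation for a single metric. As an alternative sketch, scikit-learn's `cross_validate` accepts a list of scorers and evaluates them all on the same folds. It is shown here on synthetic data, which is an assumption made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the heart disease data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1"]
model = LogisticRegression(solver="liblinear")

# One pass over 5 folds, scoring every metric on each fold
results = cross_validate(model, X_toy, y_toy, cv=5, scoring=metrics)

# Scores are stored under "test_<metric>", one value per fold
mean_scores = {name: results[f"test_{name}"].mean() for name in metrics}
print(mean_scores)
```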

In [None]:
# visualise cross-validated metrics
cv_matrics = pd.DataFrame({'accuracy': cv_acc,
                           'precision': cv_pre,
                           'recall': cv_rec,
                           'f1 score': cv_f1},
                          index = [0])

cv_matrics.T.plot.bar(title = 'cross-validated metrics',
                      legend = False);

## Feature importance
Feature importance is another way of asking, "which features contributed most to the outcome of the model, and how did they contribute?"

Finding feature importance is different for each machine learning model.

In [None]:
# fitting
clf.fit(xtrain, ytrain)

In [None]:
# check coef_
clf.coef_

In [None]:
# match the coefficients to the feature columns (x excludes the target column)
feature_dict = dict(zip(x.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# visualize feature importance
feature_df = pd.DataFrame(feature_dict, index =[0])
feature_df.T.plot.bar(title = 'feature importance', legend = False)
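
For logistic regression, each coefficient is the change in log-odds of the positive class per unit increase in that feature, so its sign gives the direction and its size gives the strength. The sketch below fits a model on synthetic data where the true relationships are known (the column names `helpful`, `harmful` and `noise` are made up) and ranks the coefficients by absolute value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic features; only the first two influence the label
X_toy = pd.DataFrame({
    "helpful": rng.normal(size=300),
    "harmful": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
signal = 2.0 * X_toy["helpful"] - 2.0 * X_toy["harmful"]
y_toy = (signal + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# Label each coefficient with its column name
coefs = pd.Series(model.coef_[0], index=X_toy.columns)

# Rank by strength, keeping the sign for direction
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# exp(coefficient) is the multiplicative change in odds per unit increase
odds_ratios = np.exp(coefs)

print(ranked)
```

Coefficients are only comparable across features when the features are on similar scales, which is true for this synthetic data but not guaranteed for the real columns.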


In [None]:
pd.crosstab(df.sex, df.target) 

sex has a negative coefficient: as the sex value increases (0 = female, 1 = male), the target value tends to decrease.

In [None]:
pd.crosstab(df.slope, df.target)

slope has a positive coefficient: as the slope value increases, the target value tends to increase as well.

# 6. Experimentation

If you haven't hit your evaluation metric yet, ask yourself:

* Could you collect more data?
* Could you try a better model?
* Could you improve the current model (beyond what we've done so far)?
* If your model is good enough (you have hit your evaluation metric), how would you export it and share it with others?



**Save model using pickle**

In [None]:
import pickle

# save the model to disk (the with block closes the file afterwards)
filename = 'heart-disease-final-model.sav'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

# load the model from disk and check it still scores the same
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(xtest, ytest)
print(result)
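
A quick way to trust a saved model is to confirm that the restored copy predicts exactly what the original did. The sketch below does this on synthetic data with a temporary file; the file name `toy-model.pkl` is hypothetical. Pickle files should only be loaded from sources you trust, and a pickled model is best loaded with the same scikit-learn version that saved it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart disease data
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("round trip ok:", same_predictions)
```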