# Classification Model made by ~ Saket Prakash

## Predicting Heart Disease using Machine Learning 

This notebook uses Python-based machine learning and data science libraries in an attempt to create a machine learning model capable of predicting whether or not a person has heart disease based on their medical attributes.

The approach followed in building the model is:

1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modelling



## 1. Problem Definition

The task is to predict whether a person has heart disease based on the data provided. This is a **Classification** problem because each result belongs to one of a fixed set of classes. The question is: can we predict whether or not a patient has heart disease? Since there are only two possible outcomes, this is a **Binary Classification** problem (heart disease = 1, no heart disease = 0).

## 2. Data

The original data is the Cleveland data from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/heart+Disease

The original data set contains 76 attributes (also called features), many of which are not necessary for this model.

A reduced version with 14 features (13 independent features & 1 dependent feature) is available on Kaggle, which is sufficient for creating a model. So, I've taken the data from Kaggle. https://www.kaggle.com/ronitf/heart-disease-uci

## 3. Evaluation

The success criterion is fixed before modelling begins: the model is considered worth pursuing only if it reaches at least 90% accuracy in predicting whether a patient has heart disease.
Therefore, our **evaluation target** = 90% accuracy

The evaluation metrics used here are:
* Confusion matrix
* ROC-AUC Curve
* classification_report
* accuracy_score
* precision_score
* f1_score
* recall_score
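
As a rough illustration of how these metrics relate to one another, the sketch below computes them for a small set of made-up labels. The lists `y_true` and `y_pred` are invented for this example and are not taken from the heart disease data.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels, invented only to illustrate the metrics
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(accuracy, precision, recall, f1)
```

With these toy labels there is one false positive and one false negative, so all four scores come out equal; on real data they usually differ.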

## 4. Features

In this section we explore the features of the data and what each one describes, which might be useful in predicting the target variable. This may involve some research on the internet, but the most common way to learn about the different parts of the data is the **data dictionary**.

# The Heart Disease Data Dictionary

1. age - age in years
2. sex - (1 = male; 0 = female)
3. cp - chest pain type
   * 0: Typical angina: chest pain related to decreased blood supply to the heart
   * 1: Atypical angina: chest pain not related to heart
   * 2: Non-anginal pain: typically esophageal spasms (non heart related)
   * 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
   * anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl
   * serum = LDL + HDL + 0.2 * triglycerides
   * above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
   * above 126 mg/dL signals diabetes
7. restecg - resting electrocardiographic results
   * 0: Nothing to note
   * 1: ST-T Wave abnormality
      * can range from mild symptoms to severe problems
      * signals non-normal heart beat
   * 2: Possible or definite left ventricular hypertrophy
      * enlargement of the heart's main pumping chamber
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
    * looks at stress of heart during exercise
    * unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
    * colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be defect but ok now
    * 7: reversible defect: no proper blood movement when exercising
14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)
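
One way to make the coded columns easier to read during exploration is to map the codes to the labels listed in the dictionary. The sketch below does this for `cp`; the `sample` DataFrame and the `cp_label` column name are made up for illustration.

```python
import pandas as pd

# Label lookup copied from the data dictionary above (cp column)
cp_labels = {0: "Typical angina", 1: "Atypical angina",
             2: "Non-anginal pain", 3: "Asymptomatic"}

# Hypothetical sample of cp codes, invented for illustration
sample = pd.DataFrame({"cp": [0, 2, 1, 3, 0]})

# Series.map looks up each code in the dictionary
sample["cp_label"] = sample["cp"].map(cp_labels)
print(sample)
```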


Note: We don't have any unique patient ID number.

## Importing the tools required to perform the EDA (Exploratory Data Analysis) on the dataset before moving on to the Modelling part

Tools imported:
1. Pandas - for data analysis
2. NumPy - for numerical operations
3. Matplotlib and Seaborn - for plotting graphs and visualization of data
4. Scikit Learn - for data modelling

In [None]:
# Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Importing tools from scikit-learn for data splitting, modelling and model evaluation

# Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Model Evaluation

from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
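from sklearn.metrics import roc_curve
# plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator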
from sklearn.metrics import confusion_matrix,classification_report


# Additional
import warnings
warnings.filterwarnings('ignore')

## Loading the data set

In [None]:
df= pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
# Shape of the dataframe
df.shape

**There are 303 rows and 14 columns**

## Exploratory Data Analysis

After loading the dataframe, it is time to explore the data. This includes comparing the feature columns with one another and with the target variable, referring to the data dictionary given in the Features section. EDA does not follow a fixed methodology, but there are some things that must be taken care of before we start modelling, which makes it an important step. A common workflow for EDA is:

1. Check if the data has any missing values and see how we can deal with it
2. Check if there are features which can be removed or added to improve the model

Now, let's start the EDA...


In [None]:
# Viewing the first 10 rows of the dataframe

df.head(10)

In [None]:
# Counting the number of Positive(1) and Negative(0) values in the target column of our dataframe

df['target'].value_counts()

In [None]:
# Positive and Negative target samples in percentages

df['target'].value_counts(normalize=True)

The Positive(1) and Negative(0) samples are roughly balanced: **1's** make up **54.46%** of the samples and **0's** make up **45.54%**.
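
The percentages can be checked by hand. The sketch below rebuilds a stand-in target column from class counts of 165 positives and 138 negatives, which are consistent with the 303 rows and the sex crosstab reported in this notebook; the Series itself is constructed only for illustration.

```python
import pandas as pd

# Stand-in target column built from the class counts (165 positive, 138 negative)
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns fractions, so multiply by 100 for percentages
shares = target.value_counts(normalize=True) * 100
print(shares.round(2))
```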

In [None]:
# Plotting the value counts of target 

df['target'].value_counts().plot(kind = 'bar',color=['darkblue','salmon'])
plt.xticks(rotation=0)
plt.ylabel('Count')
plt.xlabel('Target');


In [None]:
# Getting quick insights of our dataframe

df.info()

`df.info()` shows that there are no missing values, since every column reports all of its entries as **non-null**, and that all columns have int or float dtypes, i.e. all the data are **numeric** in nature.

In [None]:
# Alternate way to know whether the dataframe has missing values or not

df.isna().sum()

isna().sum() returns the number of missing values in each column. In our dataset we don't have any missing values. 
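
Since this dataset has no gaps, the effect of `isna().sum()` is easier to see on a small frame that does. The sketch below uses an invented DataFrame `toy` with two missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries, invented for illustration
toy = pd.DataFrame({"age": [63, np.nan, 41],
                    "chol": [233, 250, np.nan],
                    "target": [1, 1, 0]})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```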

In [None]:
# Alternate way to get quick insights of the dataframe

df.describe()

`describe()` returns summary statistics for each column: the count, mean, standard deviation, minimum and maximum, along with the 25th, 50th and 75th percentiles. This makes it a quick way to spot unusual ranges or outliers.

In [None]:
# Counting the number of male and female

df['sex'].value_counts()

**male = 1, female = 0**

## Heart Disease Frequency according to sex

In [None]:
# Comparing target column with sex column(using crosstab)

pd.crosstab(df['target'],df['sex'])

Comparing the target column with the sex column shows that **93** men have heart disease **(sex = 1, target = 1)** and **114** men do not **(sex = 1, target = 0)**. Among women, **72** have heart disease **(sex = 0, target = 1)** and **24** do not **(sex = 0, target = 0)**.
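
Because there are roughly twice as many men as women in the data, raw counts can be misleading. The sketch below turns the counts quoted above into within-sex proportions; the `counts` DataFrame and its column labels are rebuilt by hand for illustration.

```python
import pandas as pd

# Counts quoted above, arranged like the crosstab (rows: target, columns: sex)
counts = pd.DataFrame({"female (sex=0)": [24, 72],
                       "male (sex=1)": [114, 93]},
                      index=pd.Index([0, 1], name="target"))

# Divide each column by its total so every column sums to 1
rates = counts / counts.sum(axis=0)
print(rates.round(3))
```

On the real dataframe, `pd.crosstab(df['target'], df['sex'], normalize='columns')` should give the same proportions directly.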

In [None]:
# Plotting the crosstab
pd.crosstab(df['target'],
            df['sex']).plot(kind='bar',
                            figsize=(10,6),
                            color=['darkblue','salmon']);

plt.xlabel('0 - No Heart Disease 1 - Heart Disease')
plt.ylabel('Count')
plt.title('Heart Disease Frequency for Sex')
plt.legend(['Female','Male'])
plt.xticks(rotation=0);

# Age vs Max Heart Rate for Heart Disease

Now, we combine two independent variables, Age and Max Heart Rate (thalach column), and compare them with the target variable, i.e. 0 for no Heart Disease and 1 for Heart Disease.

## Plotting age vs max heart rate using scatter plots



In [None]:
plt.figure(figsize=(10,7))

# plotting scatter plot for positive Heart Disease

plt.scatter(x=df['age'][df['target']==1],
            y=df['thalach'][df['target']==1],
            c='r',
            label='Heart Disease')

# plotting scatter plot for negative Heart Disease

plt.scatter(x=df['age'][df['target']==0],
            y=df['thalach'][df['target']==0],
            c='g',
            label='No Heart Disease',
            alpha=0.8)

#Customizing the plot

plt.xlabel('Age')
plt.ylabel('Max Heart Rate')
plt.title('Age vs Max Heart Rate for Heart Disease')
plt.legend();

In [None]:
# Visualizing the age distribution of all patients with a histogram

fig,ax = plt.subplots()
ax.hist(df['age'],color='k',rwidth=0.97)
ax.set_xlabel('Age')
ax.set_ylabel('Count')
ax.set_title('Histogram plot of Age Distribution');

## Heart Disease frequency according to Chest Pain
cp - chest pain type
*  0: Typical angina: chest pain related to decreased blood supply to the heart
*  1: Atypical angina: chest pain not related to heart
*  2: Non-anginal pain: typically esophageal spasms (non heart related)
*  3: Asymptomatic: chest pain not showing signs of disease

In [None]:
# combining chest pain with target

pd.crosstab(df['cp'],df['target'])

In [None]:
# Plotting the crosstab

pd.crosstab(df['cp'],df['target']).plot(kind='bar',figsize=(10,6),color=['darkblue','salmon'])

plt.title('Frequency of Heart Disease according to Chest Pain')
plt.xlabel('Chest Pain Type')
plt.ylabel('Count')
plt.legend(['No Heart Disease','Heart Disease'])
plt.xticks(rotation=0);

## Creating the correlation matrix of the data frame

In [None]:
#Correlation matrix

corr_matrix = df.corr()

# Plotting the correlation matrix in seaborn heatmap

fig, ax = plt.subplots(figsize=(14,9))
ax = sns.heatmap(corr_matrix,annot=True,fmt='.2f',linewidths=0.5,cmap='winter')
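
Reading a full heatmap can be slow, so it may help to pull out just the column of correlations with the target and sort it. The sketch below does this on synthetic data; the frame `toy` and its column names (`strong`, `weak`, `noise`) are invented and only stand in for the real features.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature tracks the signal closely, one loosely, one not at all
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.3, size=n),
    "weak": signal + rng.normal(scale=3.0, size=n),
    "noise": rng.normal(size=n),
    "target": (signal > 0).astype(int),
})

# Absolute correlation of each feature with the target, largest first
ranked = (toy.corr()["target"]
          .drop("target")
          .abs()
          .sort_values(ascending=False))
print(ranked)
```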

**We've done enough EDA. Now it's time to create baseline models and check their accuracy and other metrics so that we can decide which model to pick.**

## 5. Modelling

Three models to be used:
1. KNeighborsClassifier
2. Logistic Regression
3. RandomForestClassifier

In [None]:
# Looking into our dataset

df.head()

In [None]:
# Splitting this dataframe into Feature variable(X) and Target variable(y)

X = df.drop('target',axis=1)    #dropping the target column

y = df['target']                #selecting only the target column 

In [None]:
# Viewing the feature variables
X.head()

In [None]:
y.head()

In [None]:
# Splitting data into test and train sets

np.random.seed(42) # random seed

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_test.head()

In [None]:
y_test.head()

In [None]:
X_train.shape , X_test.shape , y_train.shape  ,y_test.shape
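
A possible variant of this split, not what the notebook does, passes `random_state` directly and uses `stratify` so that both sets keep the same class proportions. The sketch below shows it on synthetic arrays; `X_demo` and `y_demo` are stand-ins, not the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape and class counts as the dataset
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

# random_state makes the split reproducible; stratify preserves class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positives in each set
```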

## Fitting the X train and y train data in the three models and checking the accuracy for the baseline models

* Model 1 - LogisticRegression
* Model 2 - KNeighborsClassifier
* Model 3 - RandomForestClassifier

In [None]:
# For LogisticRegression

clf_logreg = LogisticRegression(C=1,solver='liblinear')

# Fitting the X train and y train data in LogisticRegression

clf_logreg.fit(X_train,y_train);

# Accuracy score for LogisticRegression on test data

accuracy_logreg = clf_logreg.score(X_test,y_test)
accuracy_logreg

In [None]:
# For KNeighborsClassifier

clf_knn = KNeighborsClassifier()

# Fitting the X train and y train data in KNeighborsClassifier

clf_knn.fit(X_train,y_train)

# Accuracy score for KNeighborsClassifier on test data

accuracy_knn = clf_knn.score(X_test,y_test)
accuracy_knn


In [None]:
# For RandomForestClassifier

clf_rf = RandomForestClassifier()

# Fitting the X train and y train data in RandomForestClassifier
clf_rf.fit(X_train,y_train);

# Accuracy score for RandomForestClassifier on test data

accuracy_rf = clf_rf.score(X_test,y_test)
accuracy_rf

In [None]:
## Creating an accuracy data frame and plotting a bar plot
models = pd.Series(['LogisticRegression','KNeighborsClassifier','RandomForestClassifier'])
accuracy = pd.Series([accuracy_logreg, accuracy_knn, accuracy_rf])
accuracy_df = pd.DataFrame({"models":models,"accuracy":accuracy})

In [None]:
accuracy_df

In [None]:
## Baseline accuracy bar plot of models

plt.bar(accuracy_df['models'],height = accuracy_df['accuracy'] );
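
The three fit-and-score cells follow the same pattern, so they could also be written as a loop over a dictionary of models. The sketch below is one way to do that; the helper name `fit_and_score` is hypothetical, and synthetic data from `make_classification` stands in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training data and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic data standing in for the heart disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(C=1, solver="liblinear"),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}
baseline_scores = fit_and_score(models, X_tr, X_te, y_tr, y_te)
print(baseline_scores)
```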

# Model Tuning

## Hyperparameter Tuning of KNeighborsClassifier

In [None]:
## KNeighborsClassifier is tuned by changing the n_neighbors in a loop

# setting up a random seed

np.random.seed(42)

# selecting a range of neighbors

neighbors = np.arange(1,30,1)

# Setting up kn classifier

knn = KNeighborsClassifier()

# Creating lists to append test and training scores

test_scores = []
train_scores = []

## creating a loop for different neighbors

for i in neighbors:
    
    knn = KNeighborsClassifier(n_neighbors = i)
    
    # fitting the knn to train data set
    
    knn.fit(X_train,y_train)
    
    # appending train scores
    
    train_scores.append(knn.score(X_train,y_train))
    
    # appending test scores
    
    test_scores.append(knn.score(X_test,y_test))
    
    


In [None]:
train_scores

In [None]:
test_scores

In [None]:
## Plotting the testing and training scores vs n_neighbors 

plt.plot(neighbors,train_scores,c='r',label = 'Train scores')
plt.plot(neighbors,test_scores,c='b',label = 'Test scores')
plt.xlabel('Number of neighbors')
plt.ylabel('Model Scores')
plt.legend();
print(f'Maximum Test Score of KNeighborsClassifier: {max(test_scores)*100:.2f}%')
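
The maximum score alone does not say which `n_neighbors` produced it. The sketch below shows how the matching value could be recovered with `np.argmax`; the scores in `test_scores_demo` are invented and are not the notebook's results.

```python
import numpy as np

# Hypothetical test scores for k = 1..5, invented for illustration
neighbors_demo = np.arange(1, 6)
test_scores_demo = [0.62, 0.66, 0.70, 0.69, 0.70]

# argmax returns the first index of the maximum, so ties go to the smaller k
best_idx = int(np.argmax(test_scores_demo))
best_k = int(neighbors_demo[best_idx])
print(f"best k = {best_k}, score = {test_scores_demo[best_idx]:.2f}")
```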

## Tuning the hyperparameters of LogisticRegression and RandomForestClassifier

We can tune these models using **1. RandomizedSearchCV and 2. GridSearchCV**

First we will tune the hyperparameters of both models with RandomizedSearchCV and check which model gives the better score. The better model will then be tuned further with GridSearchCV.
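
To give a sense of why a randomized search is a reasonable first pass, the sketch below counts how many combinations an exhaustive search would have to try for the RandomForestClassifier search space used in this notebook, compared with the 20 candidates sampled by `n_iter=20`. The variable names ending in `_demo` are introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same RandomForestClassifier search space as used in this notebook
rs_rf_grid_demo = {"n_estimators": np.arange(100, 500, 50),
                   "max_depth": [None, 3, 5, 10, 12, 15],
                   "min_samples_leaf": np.arange(1, 20, 2),
                   "min_samples_split": np.arange(2, 20, 2)}

# ParameterGrid enumerates every combination an exhaustive search would visit
n_combinations = len(ParameterGrid(rs_rf_grid_demo))
n_sampled = 20  # n_iter passed to RandomizedSearchCV

print(f"exhaustive: {n_combinations} candidates, randomized: {n_sampled} candidates")
```

With 5-fold cross-validation each candidate costs five fits, so the gap in run time is large.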

### Creating the RandomizedSearch grid for LogisticRegression and RandomForestClassifier 





In [None]:
# Grid for LogisticRegression
rs_logreg_grid = { "C": np.logspace(-4,4,20),
             "solver":['liblinear']}

# Grid for RandomForestClassifier

rs_rf_grid = { "n_estimators": np.arange(100,500,50),
         "max_depth":[None,3,5,10,12,15],
         "min_samples_leaf": np.arange(1,20,2),
         "min_samples_split": np.arange(2,20,2)}

## For LogisticRegression

In [None]:
# setting up random seed

np.random.seed(42)

# Setting up random hyperparameter search using RandomizedSearch

rs_logreg = RandomizedSearchCV(LogisticRegression(),param_distributions=rs_logreg_grid,cv=5,n_iter=20,verbose=True)

# fitting training data to RandomizedSearchCV

rs_logreg.fit(X_train,y_train)

In [None]:
rs_logreg.best_params_

In [None]:
rs_logreg.score(X_test,y_test)

## For RandomForestClassifier

In [None]:
# setting up random seed

np.random.seed(42)

# Setting up hyperparameter search for RandomForestClassifier using RandomizedSearchCv

rs_rf = RandomizedSearchCV(RandomForestClassifier(),param_distributions = rs_rf_grid,cv=5,n_iter=20,verbose=True)

# Fitting the training data to RandomizedSearch model

rs_rf.fit(X_train,y_train)

In [None]:
rs_rf.best_params_

In [None]:
rs_rf.score(X_test,y_test)

## Hyperparameter Tuning with GridSearchCV

**Since our LogisticRegression model has the higher accuracy, we will use GridSearchCV on LogisticRegression**

In [None]:
# Creating a grid for GridSearchCV

grid_logreg = { 'C': np.logspace(-4,4,30),
               'solver':['liblinear']}

In [None]:
# Setting up hyperparameter search with GridSearchCV for LogisticRegression

gs_logreg = GridSearchCV(LogisticRegression(),param_grid = grid_logreg,cv=5,verbose=True)

# Fitting the training data into GridSearch Cross validation

gs_logreg.fit(X_train,y_train)

In [None]:
gs_logreg.best_params_

In [None]:
gs_logreg.score(X_test,y_test)
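
Besides `best_params_` and the test score, a fitted search also exposes `best_score_`, the mean cross-validated score of the best candidate on the training folds. The sketch below shows the two side by side on synthetic data; `search` and the `_demo` arrays are stand-ins, so the numbers will not match the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the heart disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search = GridSearchCV(LogisticRegression(),
                      param_grid={"C": np.logspace(-4, 4, 30),
                                  "solver": ["liblinear"]},
                      cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy of the best C
print(search.score(X_te, y_te))  # accuracy of the refit model on held-out data
```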

## Evaluating our tuned Model i.e. LogisticRegression

### Making Predictions with our tuned model



In [None]:
y_preds = gs_logreg.predict(X_test)

In [None]:
y_preds

In [None]:
y_test

### Now Comparing the Predicted values(y_preds) with the Actual values(y_test) using different evaluation metrics

In [None]:
# Plotting ROC-AUC Curve

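# RocCurveDisplay.from_estimator replaces plot_roc_curve (removed in scikit-learn 1.2)
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(gs_logreg,X_test,y_test);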
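
The area under the ROC curve can also be computed as a single number. The sketch below uses four made-up labels and scores to show `roc_curve` and `roc_auc_score`; with a fitted classifier the scores would typically come from `predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities, invented for illustration
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

# AUC: probability that a random positive is scored above a random negative
auc = roc_auc_score(y_true_demo, y_score_demo)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75.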

In [None]:
## Confusion matrix

confusion_matrix(y_test,y_preds)   # argument order is (y_true, y_pred)

In [None]:
## Plotting confusion matrix using seaborn heatmap
sns.set(font_scale=1.1)
sns.heatmap(confusion_matrix(y_test,y_preds),annot=True,cmap='winter')  # rows = actual, columns = predicted
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values');



## Metric Evaluation using Cross Validation

In [None]:
## getting the best parameters of our tuned model

gs_logreg.best_params_

In [None]:
# Setting up a classifier with the best parameters

clf=LogisticRegression(C= 0.20433597178569418, solver= 'liblinear')

In [None]:
# Cross validated accuracy

cross_val_score(clf,X,y,cv=5,scoring='accuracy')

In [None]:
# mean cross validated accuracy

cv_accuracy = cross_val_score(clf,X,y,cv=5,scoring='accuracy').mean()
cv_accuracy

In [None]:
# Cross validated Precision

cross_val_score(clf,X,y,cv=5,scoring='precision')

In [None]:
# Mean cross validated Precision

cv_precision = cross_val_score(clf,X,y,cv=5,scoring='precision').mean()
cv_precision

In [None]:
# Cross validated F1 score
cross_val_score(clf,X,y,cv=5,scoring='f1')

In [None]:
# Mean cross validated f1 score

cv_f1 = cross_val_score(clf,X,y,cv=5,scoring='f1').mean()
cv_f1

In [None]:
# Cross validated recall score

cross_val_score(clf,X,y,cv=5,scoring='recall')

In [None]:
# Mean Cross validated Recall score

cv_recall = cross_val_score(clf,X,y,cv=5,scoring='recall').mean()
cv_recall
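
The four metrics above can also be gathered in one pass with `cross_validate`, which accepts a list of scorers. The sketch below assumes synthetic data and a rounded `C` value; `clf_demo` and `mean_scores` are names introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the heart disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
clf_demo = LogisticRegression(C=0.2, solver="liblinear")

metric_names = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metric_names)

# Results are stored under "test_<metric>"; average over the 5 folds
mean_scores = pd.Series({name: cv_results[f"test_{name}"].mean()
                         for name in metric_names})
print(mean_scores)
```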

In [None]:
## Putting the metrics into a data frame

metrics = pd.DataFrame({"Accuracy Score": cv_accuracy,
                      "Precision Score":cv_precision,
                      "F1 Score": cv_f1,
                      "Recall Score":cv_recall},index=[0])

In [None]:
## Plotting the cross validated scores
# figsize is passed to pandas' plot, which creates its own figure
metrics.T.plot(kind='bar',legend=False,figsize=(5,3))
plt.xticks(rotation=0)
plt.title('Cross Validated Scores');

## Feature Importance 

**Feature importance tells us which features contribute the most to the model's predictions**

In [None]:
gs_logreg.best_params_



In [None]:
# Setting up the classifier i.e. LogisticRegression with best parameters

clf = LogisticRegression(C= 0.20433597178569418, solver= 'liblinear')

# Fitting the training data
clf.fit(X_train,y_train);

In [None]:
df.head()

In [None]:
# Match coef's of features to columns
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualize feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False);
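
For reading the coefficients, a sorted Series can be easier to scan than a bar chart. The sketch below ranks coefficients by absolute size on synthetic data; the feature names are invented, and coefficient magnitudes depend on each feature's scale, so they are only directly comparable when the features are on similar scales.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names
X_arr, y_demo = make_classification(n_samples=303, n_features=5, n_informative=3,
                                    n_redundant=0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

clf_demo = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# One coefficient per feature; the sign gives the direction of the effect
coefs = pd.Series(clf_demo.coef_[0], index=X_demo.columns)

# Reorder by absolute size, largest first, keeping the signs
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```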