# 1. Predicting Heart Disease Using a Machine Learning Model

This notebook uses various Python-based machine learning and data science libraries to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes.

# 2. Data is taken from the UCI Machine Learning Repository

The dataset contains the following attributes:

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.
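
To make the attribute list easier to use in code, it can be kept as a small data dictionary. The sketch below is only illustrative: the notebook itself references the columns `age`, `sex`, `cp`, `thalach` and `target`, while the remaining column names are an assumption based on the commonly distributed version of this dataset, and `missing_columns` is a hypothetical helper.

```python
# Data dictionary for the 13 attributes listed above.
# Only age, sex, cp and thalach are confirmed by the notebook code;
# the other column names are assumed and may differ in your CSV.
FEATURE_DESCRIPTIONS = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "0 = normal; 1 = fixed defect; 2 = reversible defect",
}


def missing_columns(columns, expected=FEATURE_DESCRIPTIONS):
    """Return the expected feature names absent from `columns`, sorted."""
    return sorted(set(expected) - set(columns))


# Example: check a list of column names against the dictionary
print(missing_columns(["age", "sex", "cp", "thalach", "target"]))
```

Once the data is loaded, calling `missing_columns(heart_data.columns)` would show quickly whether the file matches these assumed names.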


# 3. Evaluation

If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.
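
The success criterion can be written down as a small check. This is a minimal sketch: `meets_target` is a hypothetical helper and the labels are made up for illustration, not taken from the dataset.

```python
import numpy as np

TARGET_ACCURACY = 0.95  # threshold stated in the evaluation goal


def meets_target(y_true, y_pred, target=TARGET_ACCURACY):
    """Return (accuracy, passed) for predicted vs. true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Accuracy is the fraction of predictions that match the true label
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, accuracy >= target


# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # one mistake out of ten
accuracy, passed = meets_target(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, meets target: {passed}")
```

For classifiers, scikit-learn's `model.score(X_test, Y_test)` returns this same accuracy, so the scores computed later in the notebook can be compared directly against the 0.95 threshold.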

# 4. Features 

# 5. Modelling

# Solution

In [None]:
# Import all the tools we need

# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline 
#We want our plots to appear inside the notebook

#Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import precision_score,recall_score,f1_score
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay

In [None]:
# read the data
heart_data=pd.read_csv("heart-disease.csv")

In [None]:
heart_data.head()

In [None]:
heart_data.dtypes

In [None]:
heart_data.describe()

In [None]:
heart_data.info()

In [None]:
heart_data.shape

In [None]:
type(heart_data)

In [None]:
# analysis of Target values

heart_data['target'].value_counts().plot(kind="bar", color=["salmon","lightblue"]);

## Data Exploration(Exploratory Data Analysis or EDA)

The goal here is to find out more about the data and become a subject matter expert on the dataset you are working with.

1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat different data types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
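
Questions 2 to 4 map onto a few pandas calls. The sketch below runs on a toy frame made up for illustration (the notebook itself uses `heart_data`), and the 1.5 × IQR rule in the hypothetical `iqr_outliers` helper is just one common convention for flagging outliers, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration; the real notebook uses heart_data
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "chol": [233.0, 250.0, 204.0, np.nan, 354.0, 564.0],
    "target": [1, 1, 1, 0, 0, 1],
})

# 2. What kind of data do we have?
print(toy.dtypes)

# 3. What is missing from the data?
missing_per_column = toy.isna().sum()
print(missing_per_column)


# 4. Where are the outliers?
def iqr_outliers(series):
    """Return the values of `series` lying outside the 1.5 * IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


chol_outliers = iqr_outliers(toy["chol"])
print(chol_outliers)
```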



In [None]:
heart_data.sex.value_counts()

In [None]:
heart_data.isna().sum()

# There are no missing values, so no rows need to be dropped or filled

In [None]:
df=heart_data

In [None]:
# Number of patients at each age
df.age.value_counts()

In [None]:
df.sex.value_counts()

In [None]:
# compare target column with sex column
pd.crosstab(df.target,df.sex)

In [None]:
pd.crosstab(df.target,df.sex).plot(kind="bar",color=["salmon","lightblue"])
plt.title("Heart disease frequency by sex")
plt.xlabel("0=No Disease, 1=Disease")
plt.ylabel("Amount")
plt.legend(["Female","Male"]);

In [None]:
pd.crosstab(df.target,df.cp).plot(kind="bar")
plt.title("Heart disease frequency by chest pain type")
plt.xlabel("0=No Disease, 1=Disease")
plt.ylabel("Amount")
#plt.legend(["Female","Male"])

In [None]:
pd.crosstab(df.cp,df.target).plot(kind="bar",color=["Salmon","lightblue"])
plt.title("Heart disease frequency by chest pain type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease","Disease"]);

In [None]:
##  Age vs max_heart_rate for heart disease
# create another figure
plt.figure(figsize=(10,6))

#scatter with positive example
plt.scatter(df.age[df.target==1],df.thalach[df.target==1],c="Salmon")

#scatter with negative example
plt.scatter(df.age[df.target==0],
           df.thalach[df.target==0],c="lightblue");

#add some helpful info
plt.title("Heart disease as a function of age and max heart rate")
plt.xlabel("Age")
plt.ylabel("Max heart rate")
plt.legend(["Disease","No Disease"]);

In [None]:
# Check the distribution of the age column with a histogram
df.age.plot.hist();

In [None]:
# Make a correlation matrix 
df.corr()

In [None]:
# Let's make our correlation matrix a little prettier
corr_matrix=df.corr()
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,annot=True,linewidths=0.5,fmt=".2f",cmap="YlGnBu");
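# Workaround for matplotlib 3.1.1, which cut off the top and bottom heatmap rows;
# other matplotlib versions should not need this adjustment
bottom,top=ax.get_ylim()
ax.set_ylim(bottom+0.5,top-0.5);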


In [None]:
## Modelling
#split data 
X=df.drop("target",axis=1)
Y=df['target']

np.random.seed(43)
#split data into train and test set
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)

# Now we have our train and test data
# We are going to try 3 different machine learning models:

1. Logistic Regression Model 

2. K-Nearest Neighbours Classifier

3. Random Forest Classifier


In [None]:
# Put models in a dictionary
models={
    "Logistic Regression":LogisticRegression(),
    "KNN":KNeighborsClassifier(),
    "Random Forest":RandomForestClassifier()
}

#create a function to fit and score models
def fit_and_score(models,X_train,X_test,Y_train,Y_test):
    #set random seed
    np.random.seed(47)
    #make a dictionary to keep model scores
    model_scores={}
    #loop through models
    for name,model in models.items():
        #fit the model to the data
        model.fit(X_train,Y_train)
        #evaluate the model and append its score to model_scores
        model_scores[name]=model.score(X_test,Y_test)
    return model_scores


In [None]:
model_scores=fit_and_score(models=models,
                          X_train=X_train,
                          Y_train=Y_train,
                           X_test=X_test,
                          Y_test=Y_test)
model_scores

# Model Comparison

In [None]:
model_compare=pd.DataFrame(model_scores,index=["accuracy"])
model_compare.T.plot.bar()

In [None]:
#let's tune the KNN

train_scores=[]
test_scores=[]

#create a list of different values for n neighbours
neighbors=range(1,21)

#set up KNN instance
knn=KNeighborsClassifier()

#loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    #Fit the algorithm
    knn.fit(X_train,Y_train)
    
    #update the training scores list
    train_scores.append(knn.score(X_train,Y_train))
    
    #update the test scores list
    test_scores.append(knn.score(X_test,Y_test))
    
    

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors,train_scores,label="Train Score")
plt.plot(neighbors,test_scores,label="Test Score")
plt.xlabel("Number of Neighbors")
plt.ylabel("Model Score")
plt.legend()
print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

# Hyperparameter tuning with RandomizedSearchCV


In [None]:
# Create a hyperparameter grid for Logistic Regression
log_reg_grid={
    "C":np.logspace(-4,4,20),
    "solver":["liblinear"]
}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid={
    "n_estimators":np.arange(10,1000,50),
    "max_depth":[None,3,5,10],
    "min_samples_split":np.arange(2,20,2),
    "min_samples_leaf":np.arange(1,20,2)
}

In [None]:
# The Logistic Regression
np.random.seed(42)

# Setup random hyperparameter search for Logistic Regression
rs_log_reg=RandomizedSearchCV(LogisticRegression(),
                             param_distributions=log_reg_grid,
                             cv=5,
                             n_iter=20,
                             verbose=True)

#fit random hyperparameter search model for logistic regression
rs_log_reg.fit(X_train,Y_train) 

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,Y_test)


Now that we have tuned logistic regression, let's do the same for RandomForestClassifier.
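
Before running the search, it helps to see how large the random forest grid defined above is, and why a randomized search is a reasonable choice for it. The sketch below repeats `rf_grid` so that it runs on its own; the `cv` and `n_iter` values are the ones used in this notebook.

```python
import math

import numpy as np

# Same grid as rf_grid in the notebook, repeated so the sketch is self-contained
rf_grid = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

# Total number of distinct hyperparameter combinations in the grid
n_combinations = math.prod(len(values) for values in rf_grid.values())

cv, n_iter = 5, 20
exhaustive_fits = n_combinations * cv  # what a full grid search would train
randomized_fits = n_iter * cv          # what RandomizedSearchCV trains here

print(n_combinations)   # 7200
print(exhaustive_fits)  # 36000
print(randomized_fits)  # 100
```

With `n_iter=20`, the search samples 20 of the 7,200 combinations, so the best parameters it reports are the best among those sampled, not necessarily the best in the whole grid.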


In [None]:
#setup random seed 
np.random.seed(42)

#setup random hyperparameter search for RandomForestClassifier
rs_rf=RandomizedSearchCV(RandomForestClassifier(),
                        param_distributions=rf_grid,
                        cv=5,
                        n_iter=20,
                        verbose=True)
# now train the model with training data

rs_rf.fit(X_train,Y_train)

In [None]:
rs_rf.score(X_test,Y_test)

In [None]:
rs_rf.best_params_

In [None]:
model_scores

# Hyperparameter tuning with GridSearchCV
Since our logistic regression model provides the best score so far, we will try to improve it using GridSearchCV.
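
The grid used for this search is built with `np.logspace`, which spaces the candidate `C` values evenly on a log scale. A small sketch of what `np.logspace(-4, 4, 30)` produces and how many cross-validation fits an exhaustive search over it implies with `cv=5`:

```python
import numpy as np

# 30 candidate values for C, from 10**-4 to 10**4
C_values = np.logspace(-4, 4, 30)
print(C_values[0], C_values[-1])

# The exponents are evenly spaced: each step is 8/29 in log10 units
log_steps = np.diff(np.log10(C_values))
print(np.allclose(log_steps, 8 / 29))

# GridSearchCV tries every combination: 30 values of C x 1 solver x 5 folds
n_solvers, cv = 1, 5
n_fits = len(C_values) * n_solvers * cv
print(n_fits)  # 150
```

Under the default `refit=True`, GridSearchCV then fits the best combination once more on the full training set.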


In [None]:
# Different hyperparameters for our logistic regression model
log_reg_grid={"C":np.logspace(-4,4,30),
             "solver":["liblinear"]}
# Set up grid hyperparameter search for logistic regression
gs_log_reg=GridSearchCV(LogisticRegression(),
                       param_grid=log_reg_grid,
                       cv=5,
                        verbose=True)
#fit grid hyperparameter search model
gs_log_reg.fit(X_train,Y_train)

In [None]:
gs_log_reg.best_params_

In [None]:
gs_log_reg.score(X_test,Y_test)
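
Accuracy on a single test split is only one view of the model. The metrics imported at the top of the notebook (`confusion_matrix`, `classification_report`, `cross_val_score`) could be applied to the tuned classifier along the following lines. This is a sketch under stated assumptions: it runs on synthetic data from `make_classification` as a stand-in for the heart disease data, and `C=1.0` is a placeholder for whatever `gs_log_reg.best_params_` reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart disease data (13 features, binary target)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# C=1.0 is a placeholder; in the notebook it would come from gs_log_reg.best_params_
clf = LogisticRegression(C=1.0, solver="liblinear")
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_preds)
print(cm)

# Precision, recall and F1 score per class
print(classification_report(y_test, y_preds))

# 5-fold cross-validated accuracy, less dependent on one particular split
cv_accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(cv_accuracy.mean())
```

In the notebook, `X`, `Y` and the tuned `C` would replace the synthetic data and the placeholder value.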