## Heart Disease Prediction

### In this tutorial we will demonstrate how to train and evaluate a simple binary classifier.

#### The machine learning model will try to predict how likely a given person is to have heart disease.

#### To accomplish that, we will train a simple Logistic Regression model using the [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci) data from Kaggle Datasets.

#### The goal here is to gain a basic understanding of this dataset by training a simple model without any data preparation or model tuning. The model will be improved later using the Automated Machine Learning functionality of the Azure Machine Learning service.

#### The trained model will later be operationalized with the Azure Machine Learning service so that it can be consumed as a web service.

#### <font color='red'> Before you begin: please download the dataset from Kaggle and save it in the "data" folder as "heart.csv". You will need to log in to Kaggle to download the dataset. </font>

#### We begin by importing the necessary packages.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pickle

#### We then load the dataset into a Pandas data frame, visualize the first 10 rows, and print the total number of rows and columns. We notice that this dataset has 303 rows and 14 columns. Our response variable is the column named "target".

In [2]:
df_heart = pd.read_csv("./data/heart.csv")

In [3]:
df_heart.head(10)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [4]:
df_heart.shape

(303, 14)

#### Here we describe all columns to get a sense of the data distribution and of possible missing values. We notice that all columns are numeric, that there are no missing values (every column has a count of 303), and that the target variable is binary, taking either 0 or 1 as its value.

In [5]:
df_heart.describe(include = "all")

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0
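
#### The same checks can also be made explicitly instead of being read off the summary table. The sketch below is only an illustration: it uses a small hypothetical data frame (df_demo, with one deliberately missing value) rather than the real dataset, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_heart; one value is deliberately missing.
df_demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, 236.0],
    "target": [1, 1, 0, 0],
})

# Count missing values per column; all zeros would mean no missing data.
missing_per_column = df_demo.isna().sum()

# Check that every column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df_demo.dtypes)

# List the distinct values of the response column to confirm that it is binary.
target_values = [int(v) for v in sorted(df_demo["target"].unique())]

print(missing_per_column)
print(all_numeric, target_values)
```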


#### We then split the original data frame into a data frame containing only the predictor variables (df_heart_X) and a NumPy array (df_heart_y) containing only the response variable. Note that the .values attribute converts the Pandas series into a NumPy array.

In [6]:
df_heart_X = df_heart.drop(["target"], axis=1)
df_heart_y = df_heart["target"].values
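
#### To make the resulting types concrete, the sketch below repeats the same split on a small hypothetical data frame (df_demo is an assumption, not the real data) and prints the type of each piece.

```python
import pandas as pd

# Hypothetical three-row frame with the same layout idea as df_heart.
df_demo = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0], "target": [1, 1, 0]})

X_demo = df_demo.drop(["target"], axis=1)  # predictors: still a DataFrame
y_series = df_demo["target"]               # selecting one column gives a Series
y_array = df_demo["target"].values         # .values gives the underlying NumPy array

print(type(X_demo).__name__, type(y_series).__name__, type(y_array).__name__)
# prints: DataFrame Series ndarray
```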

#### Next, we randomly split this dataset into training and test sets, with 80% of the data in the training set and 20% in the test set.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df_heart_X, df_heart_y, test_size = 0.2, random_state=123)
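
#### As a rough check of what this split produces, the sketch below applies the same call to synthetic data with 303 rows. The arrays are placeholders, and the class counts (165 ones, 138 zeros) are an assumption derived from the reported target mean of 0.544554. The second call shows the optional stratify argument, which this tutorial does not use; it keeps the class proportions similar in both sets.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 303
X_demo = np.arange(n_rows).reshape(-1, 1)     # placeholder predictors
y_demo = np.array([1] * 165 + [0] * 138)      # assumed class counts (see lead-in)

# With test_size=0.2 the test set size is rounded up: ceil(0.2 * 303) = 61.
expected_test_rows = math.ceil(0.2 * n_rows)

# Same arguments as in the tutorial.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Optional variation (not used in the tutorial): stratify on the response.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print(len(X_tr), len(X_te))           # training and test set sizes
print(y_demo.mean(), y_te_s.mean())   # overall vs. stratified test proportion of 1s
```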

In [8]:
X_train.head(10)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
102,63,0,1,140,195,0,1,179,0,0.0,2,2,2
261,52,1,0,112,230,0,1,160,0,0.0,2,1,2
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
288,57,1,0,110,335,0,1,143,1,3.0,1,1,3
78,52,1,1,128,205,1,1,184,0,0.0,2,0,2
124,39,0,2,94,199,0,1,179,0,0.0,2,0,2
200,44,1,0,110,197,0,0,177,0,0.0,2,1,2
197,67,1,0,125,254,1,1,163,0,0.2,1,2,3
24,40,1,3,140,199,0,1,178,1,1.4,2,0,3
174,60,1,0,130,206,0,0,132,1,2.4,1,2,3


In [9]:
y_test[0:9]

array([1, 0, 0, 0, 1, 0, 1, 1, 1], dtype=int64)

#### Next, we instantiate a Logistic Regression model, train it, score the trained model on the test dataset, and print the resulting accuracy.

In [10]:
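# Train a Logistic Regression model with default settings on the training set.
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict the test set labels; accuracy is the percentage of predictions that match y_test.
y_test_pred = lr.predict(X_test)
print("Test Accuracy {:.2f}%".format(sum(y_test_pred == y_test) / len(y_test) * 100))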

Test Accuracy 77.05%
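
#### The accuracy above is the share of test rows whose predicted label matches the true label; with a 61-row test set (20% of 303, rounded up), 77.05% corresponds to 47 correct predictions. The sketch below illustrates the same calculation on a tiny example and compares it with scikit-learn's accuracy_score and confusion_matrix. The true labels are the nine y_test values shown earlier, while the predictions (y_pred_demo) are made up for illustration and are not the model's actual output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1])  # first nine y_test values
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1])  # hypothetical predictions

# Manual accuracy, computed the same way as in the tutorial cell.
manual_accuracy = sum(y_pred_demo == y_true_demo) / len(y_true_demo) * 100

# Equivalent library call (returns a fraction between 0 and 1).
library_accuracy = accuracy_score(y_true_demo, y_pred_demo) * 100

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

print("Manual {:.2f}%, library {:.2f}%".format(manual_accuracy, library_accuracy))
print(cm)
```

#### In this toy example 7 of the 9 predictions match, and the confusion matrix separates the two errors into one false positive and one false negative, which a single accuracy figure does not show.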


#### Finally, we save the model to disk. We will use this file later to deploy the trained model as a web service using the Azure Machine Learning service.

In [11]:
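# Use a context manager so the file is closed after writing; the ./model folder must already exist.
with open('./model/lr_model.pickle', 'wb') as f:
    pickle.dump(lr, f)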
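
#### Before relying on the saved file, it can be useful to confirm that the model survives a round trip through pickle. The sketch below is a minimal illustration under stated assumptions: it trains a throwaway model on a tiny synthetic dataset and writes to a temporary directory instead of ./model, so all names other than the scikit-learn and pickle calls are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset, used only to obtain a fitted model.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "lr_model.pickle")

    # Save the fitted model, then load it back from the same file.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same labels and class probabilities.
same_labels = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
same_probabilities = np.allclose(
    model.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print(same_labels, same_probabilities)
```

#### Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.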