# Logistic Regression Implementation to predict the possibility of Heart Disease

This is an implementation of the Logistic regression model, which is a simple classification model and a subset of Generealized Linear Models. The model attempts to fit an s shaped curve to the relationship between a variable and it's probability of belonging to a target variable(s shaped since it saturates at the ends- probaility can be a maximum of 1 and a minimum of 0). The best curve for a particular dataset is arrived at by choosing the curve with the highest "Likelihood Estimation", and this process is called Maximum Likelihood Estimation 

In [2]:
# Loading required packages
import pandas as pd
import numpy as np
import tensorflow_data_validation as tfdv #makes it very easy to visualize data stats for easy use
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import time

## Reading the data and visualizing it 

In [3]:
##Reading a dataset of patient with heart disease to build a predictor which can predict heart disease in people

df = pd.read_csv('D:\Github_projects\Classifier_RandomForest/HeartDisease.csv')
print(df.dtypes)
df.head()

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


## EDA using TensorFlow Data Validation



TensorFlow provides an excellent tool for in the form on TensorFlow Data Validation to understand the dataset better. One can explore both the data schema as well as the summary statistics for a dataset(both train and test) which are presented very cleanly in a visual form with just a few lines of code


### Before starting the EDA, let's split the data in training and testing datasets

In [4]:
X = df.iloc[:, :-1]
y = df.iloc[:,-1]

X_test, X_train, y_test, y_train = train_test_split(X, y, train_size = 0.75, random_state = 100)

In [9]:
#Combining the separated samples to creats train and test df

train_df = pd.concat([X_train,y_train],axis = 1)
test_df = pd.concat([X_test,y_test],axis = 1)

In [10]:
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
tfdv.visualize_statistics(train_stats)

1.We can see that there are no missing values in the above data, hence no imputation is required.

2.There are a lot of zeros in fasting BS, heart disease and Old peak. It makes sense for these fields to have 0 values in them.

3.Most of the continuous data seems to be normally distributed

In [11]:
# Checking out the dataset schema
train_schema = tfdv.infer_schema(statistics = train_stats)
tfdv.display_schema(train_schema)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'Age',INT,required,,-
'Sex',STRING,required,,'Sex'
'ChestPainType',STRING,required,,'ChestPainType'
'RestingBP',INT,required,,-
'Cholesterol',INT,required,,-
'FastingBS',INT,required,,-
'RestingECG',STRING,required,,'RestingECG'
'MaxHR',INT,required,,-
'ExerciseAngina',STRING,required,,'ExerciseAngina'
'Oldpeak',FLOAT,required,,-


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'Sex',"'F', 'M'"
'ChestPainType',"'ASY', 'ATA', 'NAP', 'TA'"
'RestingECG',"'LVH', 'Normal', 'ST'"
'ExerciseAngina',"'N', 'Y'"
'ST_Slope',"'Down', 'Flat', 'Up'"


## Data Transformations

We'll need do have 2 types of transformations atleast:

1.Since we're dealing with a gradient descent based algorithms, it makes sense to scale the variables since these algorithms are sensitive to range of data(This will help to solution to converge faster)

2.The categorical variables need to be encoded to a continuous values range for the module to be able to process it for logistic regression

### Scaling the continuous variables

In [12]:
list(X_train)

['Age',
 'Sex',
 'ChestPainType',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'RestingECG',
 'MaxHR',
 'ExerciseAngina',
 'Oldpeak',
 'ST_Slope']

In [22]:
#let's use a StandardScaler to scale these values
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#creating a list of contiuous and categorical columns
continuous_var = ['Age',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'MaxHR',
 'Oldpeak']
category_var = ['Sex',
 'ChestPainType',
 'RestingECG',
 'ExerciseAngina',
 'ST_Slope']

continuous_df = pd.concat([X_train,X_test], axis = 0)
continuous_df = continuous_df[continuous_var]

category_df = pd.concat([X_train,X_test], axis = 0)
category_df = category_df[category_var]

#scaling the continuous variables
continuous_df = pd.DataFrame(scaler.fit_transform(continuous_df), columns = continuous_df.columns)

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak
0,0.900464,-1.210356,-1.818435,-0.551341,-0.660578,-0.363384
1,0.688318,-0.669935,-1.818435,1.813758,0.164684,1.043759
2,-0.054192,-0.129513,0.431746,1.813758,1.422226,-0.832432
3,-0.478484,0.410909,-0.107932,-0.551341,1.382928,-0.832432
4,1.749048,-0.940145,-1.818435,-0.551341,-1.760927,-0.832432
...,...,...,...,...,...,...
913,1.536902,2.572596,0.687864,1.813758,0.518368,0.668521
914,0.794391,0.951331,0.404305,1.813758,0.007491,0.105664
915,0.051881,-1.210356,0.367716,-0.551341,-0.424789,1.794236
916,-0.796702,0.951331,0.294540,-0.551341,0.400473,2.544713


### Encoding categorical features

In [23]:
#let's create dummy columns for each categorical variable in the dataset
category_df = pd.get_dummies(category_df, columns = category_df.columns)
category_df

Unnamed: 0,Sex_F,Sex_M,ChestPainType_ASY,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
428,0,1,1,0,0,0,0,1,0,0,1,0,1,0
424,0,1,0,0,1,0,0,1,0,0,1,0,0,1
799,0,1,0,0,1,0,1,0,0,1,0,0,0,1
173,0,1,0,0,1,0,0,1,0,1,0,0,0,1
391,0,1,1,0,0,0,0,0,1,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
855,0,1,0,0,1,0,1,0,0,0,1,0,1,0
871,0,1,0,0,1,0,0,1,0,0,1,0,1,0
835,0,1,1,0,0,0,0,1,0,0,1,0,1,0
792,0,1,0,0,1,0,0,1,0,1,0,0,1,0


## Building the model