## ***** In Progress *****

# Heart Disease Prediction
<b>A Logistic Regression classifier built to identify people at risk of developing heart disease in the next 10 years.</b>

### Author: Shachi Kaul


### Data Source 
Data for this use case can be found [here](https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset).
    
    
### Dataset Description

- The dataset consists of 15 predictor variables covering the aspects below:
    - demographic: male, age, education
    - behavioral: currentSmoker, cigsPerDay
    - medical: BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose
- Predicted variable: TenYearCHD
    This variable indicates whether a person develops coronary heart disease within 10 years, with 1 as yes and 0 as no.

The aim is to identify an individual at risk of having coronary heart disease in the next 10 years.

# NOTE: This notebook focuses on how to leverage Dask when building an ML model rather than on model quality.

In [1]:
from dask import dataframe
import joblib
import dask
from dask_ml.linear_model import LogisticRegression
#from dask_ml.model_selection import GridSearchCV, train_test_split, HyperbandSearchCV
import dask_ml.model_selection as dms
from dask.distributed import Client, progress
from dask_ml.metrics import accuracy_score
# Used in the Model Evaluation cells further down.
from sklearn.metrics import confusion_matrix, classification_report

#### - Some sklearn algorithms are written to execute in parallel using Joblib via the n_jobs argument, such as GridSearchCV used in this notebook.
#### - Dask can act as a Joblib backend and scale that work out to a cluster of machines.
#### - This notebook showcases one Dask feature: parallelizing a grid search across a cluster.

## Get started with the Dask Dashboard to track progress and other info

In [2]:
# create local cluster
client = Client(processes=False)
client

Client
  Scheduler: inproc://10.220.104.115/20832/1
  Dashboard: http://10.220.104.115:8787/status
Cluster
  Workers: 1
  Cores: 2
  Memory: 8.59 GB


In [3]:
df = dataframe.read_csv("heart_disease.csv")

In [4]:
df.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


#### Treat Missing Values

In [5]:
df.isna().sum().compute()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

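Missing cells add up to about 15% of the row count (645 missing cells over 4,240 rows), so at most about 15% of rows contain a gap. Dropping those rows is acceptable for now.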

In [6]:
(df.isna().sum().sum() / len(df) *100).compute()

15.212264150943398
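
The ratio computed above divides the number of missing cells by the number of rows, so it is an upper bound on the share of rows that `dropna()` removes, not the share itself. A minimal sketch of the difference, using pandas on a small made-up frame (the values and the name `toy` are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 4 rows, 3 missing cells, 2 affected rows.
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, np.nan],
    "totChol": [195.0, np.nan, 245.0, 225.0],
})

# Same formula as the notebook cell: missing cells per 100 rows.
missing_cells = int(toy.isna().sum().sum())
cells_per_row_pct = missing_cells / len(toy) * 100

# Share of rows with at least one missing value, i.e. what dropna() removes.
rows_affected_pct = toy.isna().any(axis=1).mean() * 100

print(cells_per_row_pct, rows_affected_pct)
```

The two figures diverge whenever a single row has more than one missing value. On a Dask dataframe the equivalent expression would be `df.isna().any(axis=1).mean().compute()`, assuming the installed Dask version supports `any(axis=1)`.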

In [7]:
df = df.dropna()

In [8]:
df.isna().sum().sum().compute()

0

### Model Building

In [16]:
df.columns

Index(['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')

In [19]:
df

Dask DataFrame Structure (npartitions=1):
male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
int64,int64,float64,int64,float64,float64,int64,int64,int64,float64,float64,float64,float64,float64,float64,int64
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [20]:
data = df[['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose']].values
target = df['TenYearCHD']

In [21]:
type(data)

dask.array.core.Array

In [22]:
type(target)

dask.dataframe.core.Series

x = df.iloc[:,:-1]
y = df.iloc[:,[-1]]

In [25]:
target

Dask Series Structure:
npartitions=1
    int64
      ...
Name: TenYearCHD, dtype: int64
Dask Name: getitem, 3 tasks

In [23]:
xtrain, xtest, ytrain, ytest = dms.train_test_split(data, target, test_size=0.2)

TypeError: Got mixture of dask DataFrames and Arrays. Specify 'convert_mixed_types=True'

In [14]:

lr = LogisticRegression(fit_intercept=False)
lr.fit(xtrain,ytrain)

TypeError: This estimator does not support dask dataframes. This might be resolved with one of

    1. ddf.to_dask_array(lengths=True)
    2. ddf.to_dask_array()  # may cause other issues because of unknown chunk sizes

In [15]:
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0], "max_iter": [50,100,150,200]}

In [16]:
import sklearn.model_selection

#### Fit the data the usual way (without the cluster)

In [17]:
grid_search = sklearn.model_selection.GridSearchCV(LogisticRegression(),param_grid=param_grid, cv=3, n_jobs=-1)
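
The search above tries every combination in `param_grid` with 3-fold cross-validation. As a quick check of how much work that is, the sketch below counts the candidates with scikit-learn's `ParameterGrid`. The names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as in the cells above.
param_grid = {
    "C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
    "max_iter": [50, 100, 150, 200],
}
cv = 3

n_candidates = len(ParameterGrid(param_grid))  # 8 values of C * 4 of max_iter
n_fits = n_candidates * cv                     # one fit per candidate per fold

print(n_candidates, n_fits)  # 32 96
```

Because `refit` defaults to `True`, `GridSearchCV` performs one more fit on the full training set after the search. These independent fits are the work that gets parallelized.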

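In [None]:
# ProgressBar comes from dask.diagnostics and only reports on the local (non-distributed) schedulers.
from dask.diagnostics import ProgressBar
with ProgressBar():
    pass  # body was left empty in the original notebook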

In [18]:
%%time
grid_search.fit(xtrain,ytrain)

NotImplementedError: 'DataFrame.iloc' only supports selecting columns. It must be used like 'df.iloc[:, column_indexer]'.

#### Fit the data using the cluster
- Create a Dask client
- Use Dask as the Joblib backend
- Joblib selects the backend through a context manager
- Wrap the sklearn code in joblib.parallel_backend('dask')
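
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` and scikit-learn's own `LogisticRegression` rather than the heart-disease data and the dask-ml estimator, and the small grid is chosen only to keep the run short.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart-disease dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Create a Dask client; processes=False keeps the workers in this process.
client = Client(processes=False)

search = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
    n_jobs=-1,
)

# Inside the context manager, joblib hands its tasks to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
client.close()
```

Pointing `Client` at a remote scheduler address instead of starting a local one is what would move the same fits onto a multi-machine cluster.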

In [None]:
%%time
# Route the joblib-parallelized work inside this block to the Dask cluster.
with joblib.parallel_backend('dask'):
    grid_search.fit(xtrain, ytrain)

In [17]:
grid_search.best_score_

0.8602187286397812

In [22]:
grid_search.cv_results_



{'mean_fit_time': array([0.0154167 , 0.01041643, 0.0126462 , 0.01563605, 0.03251982,
        0.03031031, 0.02413448, 0.02069314, 0.03110806, 0.03380791,
        0.0245405 , 0.06569314, 0.04067413, 0.03313351, 0.05947185,
        0.03513726, 0.03505707, 0.03489757, 0.03534532, 0.03214065,
        0.03918592, 0.04235951, 0.03984634, 0.03919593, 0.03904637,
        0.0371863 , 0.04228806, 0.03720005, 0.04461869, 0.04649337,
        0.04471914, 0.10466901]),
 'std_fit_time': array([2.98961645e-04, 7.36552728e-03, 4.23206416e-03, 1.27701218e-02,
        5.37072438e-03, 7.58742268e-03, 4.87956250e-03, 1.13181183e-05,
        8.43451996e-03, 1.56026778e-03, 3.35070732e-03, 3.26927783e-02,
        1.82574293e-03, 2.72735188e-03, 2.44200718e-02, 5.59229080e-03,
        4.53716729e-03, 2.37779792e-03, 7.02319034e-03, 8.11861947e-03,
        1.87645324e-03, 8.68121931e-03, 7.66165318e-03, 3.06999561e-03,
        4.19017344e-03, 8.70053642e-03, 8.61674038e-03, 4.46860430e-03,
        1.13807184e-0

In [31]:
ypred = grid_search.predict(xtest)

In [25]:
grid_search.score(xtest, ytest)

0.8360655737704918

In [29]:
grid_search.predict_proba(xtest)

array([[0.67138056, 0.32861944],
       [0.83166562, 0.16833438],
       [0.85282195, 0.14717805],
       ...,
       [0.88321456, 0.11678544],
       [0.82635471, 0.17364529],
       [0.92852216, 0.07147784]])

## Model Evaluation

In [32]:
accuracy_score(ytest,ypred)

0.8360655737704918

In [34]:
confusion_matrix(ytest, ypred)

array([[604,   5],
       [115,   8]], dtype=int64)

In [36]:
print(classification_report(y_true=ytest, y_pred=ypred))

             precision    recall  f1-score   support

          0       0.84      0.99      0.91       609
          1       0.62      0.07      0.12       123

avg / total       0.80      0.84      0.78       732
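
The accuracy and the class-1 row of the report follow directly from the confusion matrix above. A short worked check in plain Python (the variable names are introduced here for illustration) shows how the 0.84 accuracy coexists with a class-1 recall of only 0.07:

```python
# Confusion matrix from the notebook output: rows = actual, columns = predicted.
tn, fp = 604, 5
fn, tp = 115, 8

total = tn + fp + fn + tp
accuracy = (tn + tp) / total            # share of all predictions that are right
precision_pos = tp / (tp + fp)          # of predicted CHD cases, how many are real
recall_pos = tp / (tp + fn)             # of real CHD cases, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

Only 8 of the 123 actual positive cases are detected, so the high accuracy mostly reflects the majority class.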

In [38]:
dataframe.crosstab(ytest, ypred, rownames=['Actual CHD'], colnames=['Predicted CHD'])

AttributeError: module 'dask.dataframe' has no attribute 'crosstab'
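
As the error shows, `dask.dataframe` does not provide `crosstab`. A minimal workaround sketch, assuming the labels are small enough to hold in memory (plausible for a test split of a few hundred rows), is to bring them into pandas and use `pandas.crosstab`. The label values below are made up for illustration.

```python
import pandas as pd

# Hypothetical labels standing in for ytest and ypred after .compute().
actual = pd.Series([0, 0, 1, 1, 0, 1], name="Actual")
predicted = pd.Series([0, 0, 0, 1, 0, 0], name="Predicted")

# Rows are actual classes, columns are predicted classes.
table = pd.crosstab(actual, predicted)
print(table)
```

With Dask collections, calling `.compute()` on the labels first yields in-memory objects that `pandas.crosstab` accepts.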

In [None]:
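# HyperbandSearchCV lives in dask_ml.model_selection and needs an estimator that implements partial_fit.
# clf, params, X_train and y_train are placeholders not yet defined in this notebook.
search = dms.HyperbandSearchCV(clf, params, max_iter=81, random_state=0)

search.fit(X_train, y_train, classes=[0, 1]);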