## What to cover?


* an end-to-end scikit-learn workflow
* getting data ready (to be used with machine learning models)
* choosing a machine learning model
* fitting a model to the data (learning patterns)
* making predictions with a model (using patterns)
* evaluating model predictions
* improving model predictions
* saving and loading models

## An end-to-end scikit-learn workflow

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [20]:
# getting data ready (pandas is already imported as pd above)
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [21]:
# create X (features matrix)
X = heart_disease.drop('target', axis=1)

# create y (labels)
y = heart_disease['target']

In [22]:
# choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
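
The defaults listed above can be overridden in the constructor or changed later with `set_params`. A minimal sketch, where the values 50 and 5 are arbitrary illustrations rather than recommendations, and `clf_small` is a hypothetical name:

```python
from sklearn.ensemble import RandomForestClassifier

# Override a default at construction time (50 is an arbitrary example value)
clf_small = RandomForestClassifier(n_estimators=50)

# Or change a hyperparameter on an existing estimator before fitting
clf_small.set_params(max_depth=5)

params = clf_small.get_params()
print(params["n_estimators"], params["max_depth"])
```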

In [23]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
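
Without a fixed seed, `train_test_split` shuffles differently on every run, so scores change between runs. The sketch below uses toy arrays (hypothetical stand-ins for `X` and `y`, not the heart-disease data) to show how `random_state` makes the split repeatable and how `stratify` keeps the class ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (hypothetical data, not the heart-disease set)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Same seed, same split: the two test sets are identical
print(np.array_equal(split_a[1], split_b[1]))
```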

In [24]:
# fit the model to the training data
clf.fit(X_train, y_train);

In [25]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3
146,44,0,2,118,242,0,1,149,0,0.3,1,1,2
278,58,0,1,136,319,1,0,152,0,0.0,2,2,2
235,51,1,0,140,299,0,1,173,1,1.6,2,0,3
196,46,1,2,150,231,0,1,147,0,3.6,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
293,67,1,2,152,212,0,0,150,0,0.8,1,0,3
231,57,1,0,165,289,1,0,124,0,1.0,1,3,3
142,42,0,2,120,209,0,1,173,0,0.0,1,0,2
180,55,1,0,132,353,0,1,132,1,1.2,1,1,3


In [26]:
# make predictions on the test data
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0], dtype=int64)

In [27]:
y_test

132    1
78     1
229    0
51     1
16     1
      ..
123    1
100    1
84     1
294    0
97     1
Name: target, Length: 61, dtype: int64

In [28]:
# evaluate the model on the training data and test data
clf.score(X_train, y_train)

1.0

In [29]:
clf.score(X_test, y_test)

0.7868852459016393

In [30]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [31]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.72      0.81      0.76        26
           1       0.84      0.77      0.81        35

    accuracy                           0.79        61
   macro avg       0.78      0.79      0.78        61
weighted avg       0.79      0.79      0.79        61



In [32]:
confusion_matrix(y_test, y_preds)

array([[21,  5],
       [ 8, 27]], dtype=int64)

In [33]:
accuracy_score(y_test, y_preds)

0.7868852459016393
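
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps show what each metric measures. A small sketch using the matrix printed above, where rows are true classes and columns are predicted classes (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true classes, columns are predicted classes
cm = np.array([[21, 5],
               [8, 27]])

# For a 2x2 matrix, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of everything predicted 1, the share that was truly 1
recall_1 = tp / (tp + fn)     # of everything truly 1, the share that was predicted 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:    {accuracy:.2f}")
print(f"precision 1: {precision_1:.2f}")
print(f"recall 1:    {recall_1:.2f}")
print(f"f1-score 1:  {f1_1:.2f}")
```

The rounded values agree with the class 1 row of the report (0.84, 0.77, 0.81) and with the overall accuracy of 0.79.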

In [34]:
# Improving the model

# try different values of n_estimators
np.random.seed(22)
for i in range(10,100,10):
    print(f'Trying model with {i} estimators')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%')
    print('')

Trying model with 10 estimators
Model accuracy on test set: 75.41%

Trying model with 20 estimators
Model accuracy on test set: 77.05%

Trying model with 30 estimators
Model accuracy on test set: 77.05%

Trying model with 40 estimators
Model accuracy on test set: 73.77%

Trying model with 50 estimators
Model accuracy on test set: 77.05%

Trying model with 60 estimators
Model accuracy on test set: 73.77%

Trying model with 70 estimators
Model accuracy on test set: 77.05%

Trying model with 80 estimators
Model accuracy on test set: 77.05%

Trying model with 90 estimators
Model accuracy on test set: 81.97%
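
The test set here has only 61 rows, so a single prediction moves accuracy by about 1.64 points, and the differences between the runs above may be mostly noise. One common way to get a steadier comparison is cross-validation. The sketch below is an illustration on synthetic data from `make_classification` (a hypothetical stand-in; swap in `X` and `y` from the notebook), not the method used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical; replace with the notebook's X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

cv_means = {}
for n in (10, 50, 90):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validation: each fold serves as the held-out set once
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```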



In [35]:
# save the model to disk
import pickle
with open('random_forst_model_1.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [36]:
with open('random_forst_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
# clf is the last model fitted in the loop above (90 estimators), so this reproduces its score
loaded_model.score(X_test, y_test)

0.819672131147541
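
As an alternative to `pickle`, models can also be persisted with `joblib`, which scikit-learn's documentation mentions for this purpose. A self-contained sketch on a toy dataset (the data and the file name `demo_model.joblib` are hypothetical), checking that the restored model predicts the same as the original:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny toy dataset (hypothetical) so the example runs on its own
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_demo = np.array([0, 0, 1, 1] * 5)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory, then read the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Both `pickle` and `joblib` can execute arbitrary code on load, so only load model files from sources you trust.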

## In Depth

## 1. Getting our data ready

Three main things we need to do:
1. Split the data into features and labels (usually 'X' and 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
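
The three steps above can be sketched on a tiny made-up table. The heart-disease data shown in this notebook is already fully numeric, so the column names and values below are hypothetical and exist only to give steps 2 and 3 something to do:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value and one text column
df = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "chest_pain": ["typical", "atypical", "typical", "none"],
    "target": [1, 1, 0, 0],
})

# 1. Split into features and labels
X_demo = df.drop("target", axis=1)
y_demo = df["target"]

# 2. Fill the missing age with the mean of the observed ages
imputer = SimpleImputer(strategy="mean")
X_demo["age"] = imputer.fit_transform(X_demo[["age"]]).ravel()

# 3. One-hot encode the text column into numeric indicator columns
X_encoded = pd.get_dummies(X_demo, columns=["chest_pain"], dtype=int)

print(X_encoded)
```

In a real project the imputer would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.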

In [37]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [38]:
X = heart_disease.drop('target', axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [39]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [40]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [43]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [44]:
len(heart_disease)

303

In [45]:
242 + 61

303
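
The 242/61 split follows from `test_size=0.2` applied to 303 rows. The arithmetic below assumes that a fractional test size is rounded up and the training set takes the remainder, which matches the shapes printed above:

```python
import math

n_samples = 303
test_size = 0.2

# Assumption: a fractional test_size is rounded up,
# and the training set takes whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```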