# Introduction

What we will cover:

0. End to end Scikit-Learn workflow
1. Preparing the data
2. Choosing the right algorithm (estimator)
3. Fit the algorithm and use it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Saving and loading the model
7. Put all the above togethor in a practice exercise

## 0. Scikit-learn workflow

In [3]:
# Prepare the data

import pandas as pd

heart_disease_df = pd.read_csv("heart-disease.csv")
heart_disease_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [10]:
# Create X (features matrix)

X = heart_disease_df.drop("target", axis=1)

# Create Y (label matrix)

y = heart_disease_df["target"]

In [11]:
# Choose the right model and hyperparameters

"""
we want a model that predicts whether a patient may have heart_disease or not

therefore this is a classification problem.

"""

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

# we will keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [12]:
# Fit the model to the training data

from sklearn.model_selection import train_test_split

# split the data

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2)

# find the patterns in the training data set

clf.fit(X_train, y_train)

RandomForestClassifier()

In [15]:
# make a prediction

y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0], dtype=int64)

In [14]:
y_test

4      1
82     1
19     1
177    0
280    0
      ..
134    1
13     1
228    0
266    0
96     1
Name: target, Length: 61, dtype: int64

In [16]:
# evaluate the model based on the training data and test data

clf.score(X_train, y_train) # accuracy

# 1.0 is the best possible score a model can get

1.0

In [19]:
# our model got 100% bcz it used the training set to learn.
# therefore when we evaluate it using the same set, it passes with flying colors

In [17]:
clf.score(X_test, y_test)

0.8360655737704918

In [None]:
# now when we evaluate the model using data it has never seen before (test set)
# it gets 83% accuracy 

In [18]:
# more ways of evaluating the model

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [20]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.81      0.74      0.77        23
           1       0.85      0.89      0.87        38

    accuracy                           0.84        61
   macro avg       0.83      0.82      0.82        61
weighted avg       0.83      0.84      0.83        61



In [21]:
confusion_matrix(y_test, y_preds)

array([[17,  6],
       [ 4, 34]], dtype=int64)

In [22]:
accuracy_score(y_test, y_preds)

0.8360655737704918

In [28]:
# improve the model

# we can try change some hyperparameters e.g n_estimators
import numpy as np

np.random.seed(7)

for num in range(10, 100, 10):
    print(f"Trying to fit model with {num} estimators")
    # train
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    
    # evaluate
    accuracy = model.score(X_test, y_test) * 100
    print(f"{accuracy:.2f}%")
    print("_"*20 + "\n")

Trying to fit model with 10 estimators
83.61%
____________________

Trying to fit model with 20 estimators
85.25%
____________________

Trying to fit model with 30 estimators
83.61%
____________________

Trying to fit model with 40 estimators
78.69%
____________________

Trying to fit model with 50 estimators
81.97%
____________________

Trying to fit model with 60 estimators
80.33%
____________________

Trying to fit model with 70 estimators
80.33%
____________________

Trying to fit model with 80 estimators
81.97%
____________________

Trying to fit model with 90 estimators
85.25%
____________________



In [29]:
# save model

import pickle

pickle.dump(model, open("random_forest_model_1.pkl", mode="wb"))

In [30]:
# load model

loaded_model = pickle.load(open("random_forest_model_1.pkl", mode="rb"))

loaded_model.score(X_test, y_test)

0.8524590163934426