## Scikit-learn

Scikit-learn is a Python library that provides a wide range of machine learning tools, including built-in models and ways to evaluate them, all behind a consistent, well-designed API.

Working with scikit-learn (imported as `sklearn`) follows a simple, general workflow:
1. Get your data ready (using pandas)
2. Pick a model that suits your problem
3. Fit the model to your data
4. Evaluate your model to see how it performs
5. Try some experiments to improve your model
6. Save your model so you can use it later
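
The six steps above can be sketched end to end in a few lines. This is only a minimal illustration: it assumes a synthetic dataset from `make_classification` instead of a real CSV file, and the hyperparameter values are arbitrary.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get the data ready (synthetic stand-in for a real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Pick a model and 3. fit it to the training data
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# 4. Evaluate on data the model has not seen
test_accuracy = clf.score(X_test, y_test)

# 5. Experiment: try a different hyperparameter value and compare
bigger = RandomForestClassifier(n_estimators=200, random_state=0)
bigger.fit(X_train, y_train)
bigger_accuracy = bigger.score(X_test, y_test)

# 6. Save the model (here to bytes in memory) and load it back
restored = pickle.loads(pickle.dumps(clf))

print(f"50 trees: {test_accuracy:.3f}, 200 trees: {bigger_accuracy:.3f}")
```

The rest of this notebook walks through each of these steps on the heart disease dataset.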

## Getting Data Ready

In [1]:
import pandas as pd
import numpy as np
heart_disease = pd.read_csv("resources/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [2]:
# We are going to split our dataframe into data and target
# Our data (features)
x = heart_disease.drop('target', axis=1)
# What we want to predict/model, our target value
y = heart_disease['target']
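
The `Unnamed: 0` column in the output above usually appears when a CSV file was saved together with its DataFrame index, and as written it stays in `x` as a feature. The sketch below shows one possible way to keep it out of the features by passing `index_col=0` to `pd.read_csv`. It uses a tiny in-memory CSV with made-up rows rather than the real file, so treat it as an illustration, not as a description of how `heart-disease.csv` was created.

```python
import io

import pandas as pd

# Hypothetical CSV text that mimics a file saved with its index column
csv_text = ",age,chol,target\n0,63,233,1\n1,37,250,0\n"

# Default read: the unnamed index column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Reading the first column as the index keeps it out of the features
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

features = clean.drop(columns=["target"])  # everything except the label
target = clean["target"]                   # the label we want to predict
print(list(features.columns))
```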

In [3]:
"""
Here we split our data into a training set and a test set.
The split does not prevent overfitting by itself, but it lets us detect it:
we fit the model on the training set and then check how well it
performs on the test set, which the model has never seen
"""
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Example use case (requires x & y); by default 25% of the rows go to the test set
x_train, x_test, y_train, y_test = train_test_split(x, y)
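
Called without extra arguments, `train_test_split` shuffles the rows and produces a different split on every run. The sketch below shows three commonly used parameters on a small made-up array: `test_size` sets the test fraction, `random_state` makes the split reproducible, and `stratify` keeps the class proportions the same in both parts. The data and the chosen values are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, two balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% of the rows go to the test set
    random_state=42,  # same split on every run
    stratify=y,       # preserve the 50/50 class balance in both parts
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # number of test samples per class
```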

## Choosing our model

In [4]:
""" Since this is a classification problem we'll begin with a Random Forest Classifier
(i.e., each entry belongs to one of two classes: has heart disease or no heart disease)"""
from sklearn.ensemble import RandomForestClassifier
# We instantiate a RandomForestClassifier with the default hyperparameters
rfc = RandomForestClassifier()
# Let's look at the hyperparameters, which can be tuned to improve our machine learning model
rfc.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

## Fitting the model to our data

In [5]:
# We fit the model by passing our training data and the correct answers (ground truth)
rfc.fit(x_train, y_train)

RandomForestClassifier()

In [6]:
# Let's see how well it performs on our test data; for classifiers, score() returns the accuracy
rfc.score(x_test, y_test)

0.8552631578947368

In [7]:
# We can also score it on the training data
rfc.score(x_train, y_train)

1.0

As we can see, the model performs perfectly on the training data because it has already seen it, but on the test data its accuracy is around 86%. This is why we did the train/test split: to avoid being overconfident in our model by looking only at how it performs on the training data.
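
The gap between training and test accuracy is easy to reproduce. The following sketch assumes a synthetic, deliberately noisy dataset (not the heart disease data) and compares the two scores for a default random forest; the exact numbers depend on the data and on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y randomly reassigns the class of 20% of the samples
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)  # data the model was fitted on
test_acc = forest.score(X_test, y_test)     # data the model has never seen

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"gap: {train_acc - test_acc:.3f}")
```

A large gap like this is the usual sign of overfitting: the model has memorized noise in the training set that does not carry over to new data.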

In [8]:
# Let's take a look at more complete performance metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

In [9]:
# scikit-learn metrics expect the true labels first, then the predictions
print(classification_report(y_test, rfc.predict(x_test)))

              precision    recall  f1-score   support

           0       0.86      0.84      0.85        38
           1       0.85      0.87      0.86        38

    accuracy                           0.86        76
   macro avg       0.86      0.86      0.86        76
weighted avg       0.86      0.86      0.86        76



In [10]:
# True labels first: rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, rfc.predict(x_test)))

[[32  6]
 [ 5 33]]


In [11]:
print(accuracy_score(y_test, rfc.predict(x_test)))

0.8552631578947368
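
All of the numbers in these reports can be derived from the four cells of a binary confusion matrix. The sketch below works through the arithmetic on a small set of made-up labels (not the model's predictions) and checks the results against scikit-learn's own functions, which take the true labels first and the predictions second.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")

# The hand-computed values agree with scikit-learn's metric functions
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
```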


## Improving the Model

In [12]:
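np.random.seed(42)  # makes the forests below reproducible from run to run
for i in range(10, 110, 10):
    print(f"Trying model with {i} estimators")
    model = RandomForestClassifier(n_estimators=i)
    model.fit(x_train, y_train)
    # accuracy_score expects the true labels first, then the predictions
    print(f"n_estimators {i} -> Acc.: {accuracy_score(y_test, model.predict(x_test)):.4f}")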

Trying model with 10 estimators
n_estimators 10 -> Acc.: 0.8026
Trying model with 20 estimators
n_estimators 20 -> Acc.: 0.8421
Trying model with 30 estimators
n_estimators 30 -> Acc.: 0.8289
Trying model with 40 estimators
n_estimators 40 -> Acc.: 0.8421
Trying model with 50 estimators
n_estimators 50 -> Acc.: 0.7895
Trying model with 60 estimators
n_estimators 60 -> Acc.: 0.8553
Trying model with 70 estimators
n_estimators 70 -> Acc.: 0.8289
Trying model with 80 estimators
n_estimators 80 -> Acc.: 0.8289
Trying model with 90 estimators
n_estimators 90 -> Acc.: 0.8553
Trying model with 100 estimators
n_estimators 100 -> Acc.: 0.8289
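
The loop above picks the number of estimators by looking at the test set, so the best score it reports is likely to be somewhat optimistic. One common alternative, sketched here on synthetic data as an assumption and not as the approach used in this notebook, is to compare settings with cross-validation on the training data and to touch the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare candidate settings using 5-fold cross-validation on the training set
cv_means = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_means[n] = scores.mean()
    print(f"n_estimators {n}: mean CV accuracy {cv_means[n]:.4f}")

# Refit the best setting on the full training set, then evaluate once
best_n = max(cv_means, key=cv_means.get)
final_model = RandomForestClassifier(n_estimators=best_n, random_state=0)
final_model.fit(X_train, y_train)
final_test_accuracy = final_model.score(X_test, y_test)
print(f"best n_estimators: {best_n}, test accuracy: {final_test_accuracy:.4f}")
```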


## Saving our model

In [13]:
# We will use pickle to save our model as a binary file
import pickle

In [14]:
# The with-statement closes the file even if dump() raises an error
with open("heart-disease-rfc-model.pkl", "wb") as f:
    pickle.dump(rfc, f)

In [15]:
# Loading our model back so we can use it
with open("heart-disease-rfc-model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.8552631578947368
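
As an alternative to `pickle`, `joblib` (installed alongside scikit-learn) is often used to persist models that hold large NumPy arrays. The sketch below assumes a small synthetic model and a hypothetical file name inside a temporary directory, and checks that the restored model makes the same predictions as the original.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, used only to demonstrate the round trip
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "example-model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)      # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Both `pickle` and `joblib` files can execute arbitrary code when loaded, so only load model files from sources you trust, and preferably with the same scikit-learn version that created them.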