# Introduction to Scikit-Learn (sklearn)

___

## Why Scikit-Learn?

1. Built on NumPy, SciPy, and Matplotlib (and Python)
2. Has many built-in machine learning models
3. Has methods to evaluate your machine learning models
4. Very well-designed API

## Refresher: Machine Learning

* A computer doesn't know anything; you need to teach it.
* Normal programming vs. machine learning:

|   |Developer provides|Computer produces|
|:---|:---|:---|
|Normal|Input & Algorithm|Output|
|Machine Learning|Input & Output|Algorithm (Model)|

* Algorithms produced by a computer are hard for humans to understand.
* The more complex a model is, the harder it is for humans to understand the algorithm behind it, which raises concerns.
* Machine learning is basically the idea of a computer writing its own algorithm.
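
As a toy illustration of the "Normal vs Machine Learning" comparison above, the sketch below writes one rule by hand and lets a model learn the same rule from input/output pairs. The age threshold, the sample data, and the names `is_adult_rule` and `learned_model` are made up for this example.

```python
from sklearn.tree import DecisionTreeClassifier

# Normal programming: the developer writes the algorithm by hand.
def is_adult_rule(age):
    return int(age >= 18)

# Machine learning: the developer supplies inputs and outputs,
# and the computer works out the algorithm (the model).
ages = [[5], [12], [17], [18], [25], [40], [63]]  # inputs (hypothetical)
labels = [0, 0, 0, 1, 1, 1, 1]                    # outputs (hypothetical)

learned_model = DecisionTreeClassifier(random_state=0)
learned_model.fit(ages, labels)

# Ask both approaches about two new ages
rule_answers = [is_adult_rule(age) for age in (10, 30)]
model_answers = learned_model.predict([[10], [30]]).tolist()
print(rule_answers, model_answers)
```

Both approaches should agree on these two ages, but only the first one was written by a person.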

## What is Covered

0. An overview of scikit-learn's workflow
1. Getting the data ready
2. Choose the right estimator for a problem
3. Fit a model and use it to make predictions on data
4. Evaluating a model
5. Improving a model
6. Save and load a trained model
7. Putting it all together

___

## 0. An overview of scikit-learn's workflow

**Note**: This workflow assumes that the data is ready to be used with machine learning models (is numerical, has no missing values).

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Getting the data ready

In [8]:
# Import dataset
heart_disease = pd.read_csv("data/heart-disease.csv")

# View the data
heart_disease.head()

|   |age|sex|cp|trestbps|chol|fbs|restecg|thalach|exang|oldpeak|slope|ca|thal|target|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|63|1|3|145|233|1|0|150|0|2.3|0|0|1|1|
|1|37|1|2|130|250|0|1|187|0|3.5|0|0|2|1|
|2|41|0|1|130|204|0|0|172|0|1.4|2|0|2|1|
|3|56|1|1|120|236|0|1|178|0|0.8|2|0|2|1|
|4|57|0|0|120|354|0|1|163|1|0.6|2|0|2|1|
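
The workflow assumes the data is numerical and has no missing values. One possible way to check that assumption is sketched below on a small, made-up DataFrame; `demo_df` and its values are hypothetical and only stand in for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataset, with one missing value
# and one non-numerical column added on purpose
demo_df = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233.0, np.nan, 204.0],
    "sex": ["male", "male", "female"],
})

# Count missing values in each column
missing_per_column = demo_df.isna().sum()

# List the columns that are not numerical
non_numeric_columns = demo_df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

If every count is zero and the list is empty, the data matches the assumption in the note at the start of this section.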


In this example, we'll make the model predict the `target` column.

In [9]:
# Create x (The Input)
x = heart_disease.drop("target", axis=1)

# Create y (The Output)
y = heart_disease["target"]

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)

# View data shapes
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
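
By default, `train_test_split` holds out 25% of the rows for testing, which is where the 227/76 split above comes from. The sketch below uses the `test_size` and `random_state` parameters to change the proportion and make the split repeatable. The data is random and only mimics the shape of the heart disease features; the `demo_` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the features above (303 rows, 13 columns)
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(303, 13))
demo_targets = rng.integers(0, 2, size=303)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
demo_f_train, demo_f_test, demo_t_train, demo_t_test = train_test_split(
    demo_features, demo_targets, test_size=0.2, random_state=42
)

print(demo_f_train.shape, demo_f_test.shape)
```

Without `random_state`, each run shuffles the rows differently, so scores can change from run to run.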

### Choosing the right estimator

In Scikit-Learn, machine learning models are referred to as **estimators**. In this example, since we're working on a classification problem, we'll use the **RandomForestClassifier** estimator.

You can decide on the right estimator using the [sklearn map](https://scikit-learn.org/stable/machine_learning_map.html).

In [10]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
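
Because estimators share the same `fit()` / `score()` interface, trying a different one usually means changing a single line. The sketch below, on synthetic data from `make_classification`, runs two classifiers through identical code. `LogisticRegression` is only an illustrative second choice here, not a recommendation for the heart disease data, and the `demo_` names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data (not the heart disease dataset)
demo_x, demo_y = make_classification(n_samples=300, n_features=13, random_state=0)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, random_state=0
)

candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Every estimator is trained and scored with the same two calls
results = {}
for name, estimator in candidates.items():
    estimator.fit(demo_x_train, demo_y_train)
    results[name] = estimator.score(demo_x_test, demo_y_test)

print(results)
```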

### Fitting the model and using it to make a prediction on the data

A model will **attempt to** learn the patterns in a dataset when you call its `fit()` method and pass it the data. Once a model has learned the patterns, you can use it to make predictions with the `predict()` method, which outputs the model's **guessed values**.  

Think of `predict()` as asking the model what it thinks the answers are. You can then compare its answers to the actual answers in `y_test` to evaluate how accurate it is.

In [11]:
# Make the model attempt to learn the patterns between x_train and y_train
model.fit(x_train, y_train)

# Ask the model to make a prediction on the test data x_test (data it hasn't seen before)
y_preds = model.predict(x_test)

# View the model's guessed answers
y_preds

array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

#### How to make a prediction on a single record

In [12]:
# View the example record without the answer (target)
x_test.loc[146]

age          44.0
sex           0.0
cp            2.0
trestbps    118.0
chol        242.0
fbs           0.0
restecg       1.0
thalach     149.0
exang         0.0
oldpeak       0.3
slope         1.0
ca            1.0
thal          2.0
Name: 146, dtype: float64

In [13]:
# View the example record with the answer (target)
heart_disease.loc[146]

age          44.0
sex           0.0
cp            2.0
trestbps    118.0
chol        242.0
fbs           0.0
restecg       1.0
thalach     149.0
exang         0.0
oldpeak       0.3
slope         1.0
ca            1.0
thal          2.0
target        1.0
Name: 146, dtype: float64

  
As you can see, the target for this record equals 1.0.  

Now we can ask the model to predict the target value for this record and see whether it answers correctly.

In [14]:
# predict() expects a 2D array of shape (1, n_features), so reshape the single record
formatted_record = np.array(x_test.loc[146]).reshape(1, -1)

# View formatted record
formatted_record

array([[ 44. ,   0. ,   2. , 118. , 242. ,   0. ,   1. , 149. ,   0. ,
          0.3,   1. ,   1. ,   2. ]])

In [15]:
# Ask the model to predict the target value
model.predict(formatted_record)



array([1])

As you can see, it predicted the answer correctly.  

The "1" you see here isn't the same "1" as in the `heart_disease` record. This "1" was guessed by the model, so it could have been wrong, whereas the "1" in the `heart_disease` record is the actual answer.
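
An alternative to reshaping into a NumPy array is to select the record with double brackets, which returns a one-row DataFrame that keeps the column names. Recent scikit-learn versions may warn about missing feature names when a model fitted on a DataFrame is given a plain array, and this form avoids that. The sketch below uses synthetic data; the column names and the row label `7` are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data wrapped in a DataFrame so the model sees column names
demo_values, demo_labels = make_classification(n_samples=100, n_features=4, random_state=0)
demo_frame = pd.DataFrame(demo_values, columns=["a", "b", "c", "d"])

demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(demo_frame, demo_labels)

# Double brackets return a DataFrame of shape (1, 4) instead of a Series
single_record = demo_frame.loc[[7]]
single_prediction = demo_model.predict(single_record)

print(single_record.shape, single_prediction)
```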

### Evaluating the model

A trained model can be evaluated by calling `score()` and passing it input it has never seen before, i.e. `x_test`, along with the actual output, i.e. `y_test`.  

For a classifier, the `score()` method does the following:

1. Takes the input (first parameter)
2. Guesses the output (makes a prediction)
3. Compares the guessed output to the actual output (second parameter)
4. Returns the accuracy as a fraction between 0 and 1, where 1 means 100% accuracy and 0 means 0% accuracy
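
The steps above can be reproduced by hand. The sketch below, on synthetic data, compares a manual accuracy calculation with `score()` and with `accuracy_score` from `sklearn.metrics`; for a classifier, all three should agree. The `demo_` names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data (not the heart disease dataset)
demo_x, demo_y = make_classification(n_samples=300, n_features=13, random_state=0)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, random_state=0
)

demo_model = RandomForestClassifier(random_state=0).fit(demo_x_train, demo_y_train)

# Steps 1-2: take the input and guess the output
demo_guesses = demo_model.predict(demo_x_test)

# Steps 3-4: compare guesses to the actual answers and take the fraction that match
manual_accuracy = np.mean(demo_guesses == demo_y_test)

print(manual_accuracy)
print(demo_model.score(demo_x_test, demo_y_test))
print(accuracy_score(demo_y_test, demo_guesses))
```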

In [16]:
# Scoring the model on data it hasn't seen before
model.score(x_test, y_test)

0.8026315789473685

In [17]:
# Example scoring the model on data it was trained on (seen before)
model.score(x_train, y_train)

1.0

As you can see, in the second example, since the model has seen the data before, it scored 100%, predicting all of the values correctly.  

However, in the first example, using data it has never seen before, it scored 80.26%, which means it predicted 80.26% of the answers correctly.

### Improving the model (using different hyperparameters)

A **hyperparameter** is a setting you choose before training that controls how the learning process goes. Each estimator has its own set of hyperparameters, and it is considered best practice to test the model with different values for them.  

In this step, we'll train the model with different values of `n_estimators` (the number of trees in the forest), which is a hyperparameter you can play with, print its accuracy each time, and then choose the model with the best accuracy.

In [29]:
for i in range(10, 100, 10):
    print(f"Testing model with {i} estimators...")

    model = RandomForestClassifier(n_estimators=i) # Create the model with the number of estimators set to i
    model.fit(x_train, y_train)                    # Train the model on the training data
    score = model.score(x_test, y_test)            # Test the model on unseen data and get an accuracy percentage
    formatted_score = f"{(score*100):.2f}%"        # Format accuracy percentage to look like percentage

    print(f"Model accuracy on test set is: {formatted_score}")

Testing model with 10 estimators...
Model accuracy on test set is: 72.37%
Testing model with 20 estimators...
Model accuracy on test set is: 78.95%
Testing model with 30 estimators...
Model accuracy on test set is: 76.32%
Testing model with 40 estimators...
Model accuracy on test set is: 81.58%
Testing model with 50 estimators...
Model accuracy on test set is: 77.63%
Testing model with 60 estimators...
Model accuracy on test set is: 76.32%
Testing model with 70 estimators...
Model accuracy on test set is: 77.63%
Testing model with 80 estimators...
Model accuracy on test set is: 78.95%
Testing model with 90 estimators...
Model accuracy on test set is: 78.95%


As you can see, the model with the best score is the one trained with **40 estimators**, so we'll use that.
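
Instead of reading the best score off the printed output, the scores can be collected and the best setting picked programmatically. This is a minimal sketch on synthetic data; the fixed `random_state` values and the `demo_` names are assumptions added so the run is repeatable, and they are not part of the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (not the heart disease dataset)
demo_x, demo_y = make_classification(n_samples=300, n_features=13, random_state=0)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, random_state=0
)

# Map each n_estimators value to the accuracy it achieved
demo_scores = {}
for n in range(10, 100, 10):
    candidate = RandomForestClassifier(n_estimators=n, random_state=42)
    candidate.fit(demo_x_train, demo_y_train)
    demo_scores[n] = candidate.score(demo_x_test, demo_y_test)

# Pick the setting with the highest accuracy (the first one wins on ties)
best_n = max(demo_scores, key=demo_scores.get)
print(f"Best n_estimators: {best_n} ({demo_scores[best_n] * 100:.2f}%)")
```

Because a random forest is trained with randomness, scores without a fixed `random_state` vary between runs, so small differences between settings may not be meaningful. The winning score is also likely to be slightly optimistic, since the same test set was used to choose the setting.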

In [30]:
# Train the model on 40 estimators
model = RandomForestClassifier(n_estimators=40)
model.fit(x_train, y_train)

|Parameter|Value|
|:---|:---|
|n_estimators|40|
|criterion|'gini'|
|max_depth|None|
|min_samples_split|2|
|min_samples_leaf|1|
|min_weight_fraction_leaf|0.0|
|max_features|'sqrt'|
|max_leaf_nodes|None|
|min_impurity_decrease|0.0|
|bootstrap|True|

### Save the model for later use

You can export the model using the `pickle` module.

In [31]:
import pickle

# Write the trained model to disk; the with block closes the file afterwards
with open("models/model_01.pkl", "wb") as file:
    pickle.dump(model, file)

### Extra: Loading the model

In [32]:
# Load the model
with open("models/model_01.pkl", "rb") as file:
    loaded_model = pickle.load(file)

# Testing it with our previous single record
loaded_model.predict(formatted_record)



array([1])
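
To check that a saved model survives the round trip, you can compare the predictions of the original and the loaded model. The sketch below does this on synthetic data and writes to a temporary directory so it leaves no files behind; the file name `demo_model.pkl` and the `demo_` names are made up for this example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic data
demo_x, demo_y = make_classification(n_samples=100, n_features=4, random_state=0)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(demo_x, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"

    # Save the model; the with block closes the file when done
    with open(model_path, "wb") as file:
        pickle.dump(demo_model, file)

    # Load it back into a new object
    with open(model_path, "rb") as file:
        restored_model = pickle.load(file)

# The restored model should give exactly the same predictions
same_predictions = bool((restored_model.predict(demo_x) == demo_model.predict(demo_x)).all())
print(same_predictions)
```

Only load pickle files you trust, since unpickling can execute arbitrary code.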