## What is Scikit-Learn (sklearn)?
[Scikit-Learn](https://scikit-learn.org/stable/index.html), also referred to as `sklearn`, is an open-source Python machine learning library.
It's built on top of NumPy (a Python library for numerical computing), SciPy (a Python library for scientific computing), and Matplotlib (a Python library for data visualization).

## Why Scikit-Learn?
Scikit-Learn is a machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines.
Although the field of machine learning is vast, the main goal is finding patterns within data and then using those patterns to make predictions. Most problems fall into a few categories.
If you're trying to create a machine learning model to predict whether an email is spam or not spam, you're working on a classification problem (predicting whether something is one thing or another).
If you're trying to create a machine learning model to predict the price of houses given their characteristics, you're working on a regression problem (predicting a number).
Once you know what kind of problem you're working on, the steps you'll take are similar for each: splitting the data into different sets (one for your machine learning algorithms to learn on and another to test them on), choosing a machine learning model, and then evaluating whether or not your model has learned anything.
Scikit-Learn offers Python implementations for all of these kinds of tasks, saving you from having to build them from scratch.
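
As a minimal sketch of the two problem types described above, the following fits one classifier and one regressor with the same `fit()`/`score()` calls. The data is synthetic (generated with `make_classification` and `make_regression`) and the variable names are chosen for illustration only; it is not the heart disease dataset used later in this notebook.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: the target is a discrete label (here 0 or 1)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy for classifiers

# Regression: the target is a continuous number
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # R^2 for regressors

print(f"Classifier accuracy: {clf_accuracy:.2f}")
print(f"Regressor R^2: {reg_r2:.2f}")
```

Both estimators follow the same split, fit, and evaluate pattern; only the kind of target and the meaning of the score differ.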

# Introduction to Scikit-Learn (sklearn)

This notebook shows a brief workflow you might use with `scikit-learn` to build a machine learning model that classifies whether or not a patient has heart disease.

We will cover:

0. An end-to-end Scikit-Learn Workflow
1. Getting the Data Ready
2. Choose the Right Estimator/Algorithm for our problems
3. Fit the Model/Algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and Load the Trained Model
7. Putting it all Together (Pipeline)

### 0. An end-to-end Scikit-Learn workflow
Before we get in-depth, let's quickly check out what an end-to-end Scikit-Learn workflow might look like.
Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

**Note:** Since Scikit-Learn is such a vast library, capable of tackling many problems, the workflow we're using is only one example of how you can use it.

In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### 1. Getting the Data Ready

In [2]:
heart_disease = pd.read_csv('Dataset/heart-disease.csv')
heart_disease.head(5)

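| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |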


In [3]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]
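
To make the feature/target separation concrete, here is a small sketch on a toy DataFrame. The three rows and the reduced set of columns are hypothetical and exist only for illustration; they are not read from `heart-disease.csv`.

```python
import pandas as pd

# Hypothetical toy data; values and columns are for illustration only
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "target": [1, 1, 0],
})

# drop() returns a new DataFrame; the original `toy` is left unchanged
X_toy = toy.drop("target", axis=1)

# Selecting a single column returns a Series
y_toy = toy["target"]

print(list(X_toy.columns))
print(y_toy.tolist())
```

Because `drop()` does not modify the original DataFrame by default, the target column remains available for creating `y` afterwards.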

In [4]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# View the data shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
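
The shapes above follow from the default behaviour of `train_test_split`, which holds out 25% of the rows for testing when `test_size` is not given. The sketch below reproduces those shapes on random stand-in data with the same dimensions (303 rows, 13 features), then shows the optional `test_size`, `random_state`, and `stratify` parameters. The arrays `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same dimensions as the heart disease features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)

# Explicit split: 20% test set, reproducible shuffling, class balance preserved
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr2.shape, X_te2.shape)
```

Setting `random_state` makes the split repeatable between runs, which helps when comparing models against the same test set.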

### 2. Choose the Right Estimator/Algorithm for our problems
You can do this using the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).
In Scikit-Learn, machine learning models are referred to as estimators.
In this case, since we're working on a classification problem, we've chosen the `RandomForestClassifier` estimator, which is part of the `ensemble` module.

In [5]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

### 3. Fit the Model/Algorithm and use it to make predictions on our data

In [6]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
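
The representation printed above lists the estimator's hyperparameters. As a brief sketch, the same values can be read as a dictionary with `get_params()` and changed with `set_params()`. The name `model_demo` is used here only to avoid overwriting the notebook's `model`.

```python
from sklearn.ensemble import RandomForestClassifier

model_demo = RandomForestClassifier()

# get_params() returns the hyperparameters as a dictionary
params = model_demo.get_params()
print(sorted(params.keys()))
print(params["n_estimators"])

# set_params() updates hyperparameters on an existing estimator
model_demo.set_params(n_estimators=50)
print(model_demo.get_params()["n_estimators"])
```

The exact set of hyperparameters and their defaults depends on the installed scikit-learn version, so the printed keys may differ from the output shown above.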

#### Once a model has learned patterns in data, you can use it to make predictions with the `predict()` method.

In [7]:
# Make predictions on the test set
y_preds = model.predict(X_test)

In [8]:
# This will be in the same format as y_test
y_preds

array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [14]:
y_test

114    1
219    0
122    1
125    1
301    0
      ..
59     1
207    0
135    1
151    1
43     1
Name: target, Length: 76, dtype: int64
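
Since the predictions and the true labels share the same format, they can be compared element by element. The sketch below does this on synthetic data (an assumption for illustration, not the heart disease data) and also shows `predict_proba()`, which returns the class probabilities behind each predicted label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features
X_s, y_s = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

preds = clf_demo.predict(X_te)        # one label per test sample
probs = clf_demo.predict_proba(X_te)  # shape: (n_samples, n_classes)

# Fraction of predictions that match the true labels
manual_accuracy = np.mean(preds == y_te)

# The predicted label is the class with the highest probability
labels_from_probs = clf_demo.classes_[np.argmax(probs, axis=1)]

print(probs[:3])
print(f"Manual accuracy: {manual_accuracy:.3f}")
```

The manually computed accuracy is the same quantity that the `score()` method reports for a classifier.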

### 4. Evaluating a model
A trained model/estimator can be evaluated by calling the `score()` method and passing it a collection of data. For classifiers such as `RandomForestClassifier`, `score()` returns the mean accuracy.

In [11]:
# On the training set
model.score(X_train, y_train)

1.0

In [13]:
# On the test set
model.score(X_test, y_test)

0.8289473684210527

In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.74      0.86      0.79        29
           1       0.90      0.81      0.85        47

    accuracy                           0.83        76
   macro avg       0.82      0.84      0.82        76
weighted avg       0.84      0.83      0.83        76



In [16]:
confusion_matrix(y_test, y_preds)

array([[25,  4],
       [ 9, 38]], dtype=int64)

In [17]:
accuracy_score(y_test, y_preds)

0.8289473684210527
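
The numbers in the classification report can be derived from the confusion matrix shown above. As a worked sketch, the following recomputes accuracy, precision, recall, and F1 from those four counts, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the output above (rows: true label, columns: predicted label)
cm = np.array([[25, 4],
               [9, 38]])

# For binary labels, ravel() yields: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # 63 / 76

# Class 1 (positive class)
precision_1 = tp / (tp + fp)     # 38 / 42
recall_1 = tp / (tp + fn)        # 38 / 47
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (negative class)
precision_0 = tn / (tn + fn)     # 25 / 34
recall_0 = tn / (tn + fp)        # 25 / 29
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy: {accuracy:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

Rounded to two decimal places, these values match the rows of the classification report, and the accuracy matches the value returned by `accuracy_score`.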

### 5. Improve a model (hyperparameter tuning)

A model's first evaluation metrics aren't always its last. One way to improve a model's predictions is with hyperparameter tuning.

In [20]:
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.7631578947368421

Trying model with 20 estimators...
Model accuracy on test set: 0.8026315789473685

Trying model with 30 estimators...
Model accuracy on test set: 0.8421052631578947

Trying model with 40 estimators...
Model accuracy on test set: 0.8421052631578947

Trying model with 50 estimators...
Model accuracy on test set: 0.7763157894736842

Trying model with 60 estimators...
Model accuracy on test set: 0.8552631578947368

Trying model with 70 estimators...
Model accuracy on test set: 0.8552631578947368

Trying model with 80 estimators...
Model accuracy on test set: 0.8026315789473685

Trying model with 90 estimators...
Model accuracy on test set: 0.8289473684210527



**Note:** It's best practice to test different hyperparameters with a validation set or cross-validation, rather than against the test set.
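
One way to follow this advice is to carve a separate validation set out of the training data, so the test set is only used once at the end. The sketch below assumes synthetic data and a 60/20/20 split; the names and the candidate values for `n_estimators` are chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_s, y_s = make_classification(n_samples=300, n_features=13, random_state=0)

# First hold out 20% as the final test set
X_trainval, X_test_s, y_trainval, y_test_s = train_test_split(
    X_s, y_s, test_size=0.2, random_state=0
)
# Then hold out 25% of the remainder (20% of the total) for validation
X_train_s, X_val_s, y_train_s, y_val_s = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Pick the hyperparameter value that scores best on the validation set
best_n, best_val = None, -1.0
for n in (10, 50, 100):
    candidate = RandomForestClassifier(n_estimators=n, random_state=0)
    candidate.fit(X_train_s, y_train_s)
    val_score = candidate.score(X_val_s, y_val_s)
    if val_score > best_val:
        best_n, best_val = n, val_score

# Refit on training + validation data, then evaluate once on the test set
final = RandomForestClassifier(n_estimators=best_n, random_state=0)
final.fit(X_trainval, y_trainval)
test_score = final.score(X_test_s, y_test_s)

print(f"Best n_estimators: {best_n} (validation accuracy {best_val:.3f})")
print(f"Test accuracy: {test_score:.3f}")
```

Because the test set plays no part in choosing `n_estimators`, the final test accuracy is a fairer estimate of how the model performs on unseen data.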

In [21]:
from sklearn.model_selection import cross_val_score

# Try different numbers of estimators, comparing the test set score with the cross-validation score
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test)}")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.7631578947368421
Cross-validation score: 78.53551912568305%

Trying model with 20 estimators...
Model accuracy on test set: 0.8026315789473685
Cross-validation score: 79.84699453551912%

Trying model with 30 estimators...
Model accuracy on test set: 0.8026315789473685
Cross-validation score: 80.50819672131148%

Trying model with 40 estimators...
Model accuracy on test set: 0.8289473684210527
Cross-validation score: 82.15300546448088%

Trying model with 50 estimators...
Model accuracy on test set: 0.8552631578947368
Cross-validation score: 81.1639344262295%

Trying model with 60 estimators...
Model accuracy on test set: 0.8289473684210527
Cross-validation score: 83.47540983606557%

Trying model with 70 estimators...
Model accuracy on test set: 0.8421052631578947
Cross-validation score: 81.83060109289617%

Trying model with 80 estimators...
Model accuracy on test set: 0.8421052631578947
Cross-validation score: 82.8142076502
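
The manual loop above can also be expressed with `GridSearchCV`, which runs cross-validation for every combination in a parameter grid and records the best one. This is a sketch of an alternative, not the approach used in this notebook; the data is synthetic and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_s, y_s = make_classification(n_samples=300, n_features=13, random_state=0)

# Candidate hyperparameter values to try
param_grid = {"n_estimators": [10, 50, 100]}

# 5-fold cross-validation for each candidate, scored by accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_s, y_s)

print(search.best_params_)
print(f"Best mean cross-validation accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the best parameters found.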

### 6. Save and Load the Trained Model
#### A trained model can be exported and saved so it can be imported and used later. One way to save a model is using Python's `pickle` module.

In [22]:
import pickle

# Save trained model to file (the with block closes the file afterwards)
with open("random_forest_model.pkl", "wb") as f:
    pickle.dump(model, f)

In [23]:
# Load the saved model from file
with open("random_forest_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)


In [24]:
loaded_model.score(X_test, y_test)

0.8289473684210527
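
The loaded model scores the same as the model before saving because pickling preserves the fitted state. As a self-contained sketch, the following round-trips a small classifier through a temporary file and checks that the predictions are identical. The synthetic data and the file name `demo_model.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small fitted model
X_s, y_s = make_classification(n_samples=200, n_features=13, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_s, y_s)

# Save to and load from a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(clf_demo, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should make exactly the same predictions
same_predictions = np.array_equal(clf_demo.predict(X_s), restored.predict(X_s))
print(same_predictions)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load models with the same scikit-learn version that saved them where possible.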