# Scikit-learn
- Scikit-learn is a free machine learning library for the Python programming language.
- Installation: https://scikit-learn.org/stable/install.html

* Simple and efficient tools for predictive data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

An end-to-end Scikit-Learn workflow:

* Getting the data ready
* Choosing the right machine learning estimator/algorithm/model for your problem
* Fitting your chosen machine learning model to data and using it to make a prediction
* Evaluating a machine learning model
* Improving predictions through experimentation (hyperparameter tuning)
* Saving and loading a pretrained model
* Putting it all together in a pipeline
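
The last step in the list can be sketched with scikit-learn's `Pipeline`, which chains preprocessing and an estimator into one object. This is only a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), and the step names and the median imputer are illustrative choices, not part of the workflow below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the heart-disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Chain preprocessing and the model so both are fit on the training data only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # illustrative preprocessing step
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X_tr, y_tr)

pipeline_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy on held-out data: {pipeline_accuracy:.2f}")
```

Because the pipeline behaves like a single estimator, it can be passed to `cross_val_score` or `GridSearchCV` in place of a bare model.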

# An end-to-end Scikit-Learn workflow

In [1]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

## Random Forest Classifier Workflow for Classifying Heart Disease

### 1. Get the data ready

In [3]:
# import data 
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

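|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |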


In [4]:
# info
heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [5]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

In [6]:
X.head()

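|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |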


In [11]:
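# Check whether the data set is balanced (ratio of positive to negative samples)
class_counts = y.value_counts()
class_counts[1] / class_counts[0]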

1.1956521739130435

In [8]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [12]:
y_test.value_counts()

1    33
0    28
Name: target, dtype: int64
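
The `stratify=y` argument used in the split keeps the class proportions roughly equal in the training and test sets. A minimal sketch of that effect, assuming hypothetical labels (165 positives and 138 negatives, chosen to mirror a 303-row data set with the ratio shown above) and a placeholder feature column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 165 positives and 138 negatives (303 rows in total)
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

# Share of positive samples overall and in each split
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall={overall_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

Without `stratify`, a small test set can end up with a noticeably different class balance purely by chance, which makes scores harder to compare.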

### 2. Choose the model and hyperparameters

In [13]:
# We'll use a Random Forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)

In [14]:
# Apart from n_estimators, we'll leave the hyperparameters as default to begin with...
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the data
* Fitting the model on the data involves passing it the data so that the ML algorithm can learn the patterns.

* If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

* If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.
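
The difference between the two cases shows up in what `fit()` receives. A minimal sketch, assuming synthetic data from `make_blobs` and using `KMeans` as one example of an unsupervised estimator (neither appears in this notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-group data (assumption: stand-in for a real data set)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: fit() receives the features and the labels
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_demo, y_demo)

# Unsupervised: fit() receives the features only and groups similar samples
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0)
km_demo.fit(X_demo)

print("Classes learned from labels:", clf_demo.classes_)
print("Cluster assignments (first 10):", km_demo.labels_[:10])
```

The cluster numbers assigned by `KMeans` are arbitrary identifiers, so they need not line up with the original label values.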

In [15]:
clf.fit(X_train, y_train)

RandomForestClassifier(n_estimators=10)

### 4. Use the model to make a prediction
Once the model instance is trained, you can use its predict() method to predict a target value from a set of features. In other words, you pass the model some unlabelled data and it returns the predicted labels.

Note that the data you predict on must have the same feature columns, in the same order, as the data you trained on.
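
What "the same feature columns" means in practice can be sketched as follows. The data here is synthetic and the variable names are illustrative; the point is only that a trained scikit-learn estimator rejects input with a different number of features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 feature columns (assumption: illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Same number of feature columns as the training data: one prediction per row
preds = clf_demo.predict(X_demo[:3])
print(preds.shape)

# One column missing: scikit-learn raises a ValueError
try:
    clf_demo.predict(X_demo[:3, :4])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("Prediction failed:", type(err).__name__)
```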

In [12]:
# To predict a label, the data needs the same feature columns as X_train
X_test.head()

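|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 40 | 1 | 3 | 140 | 199 | 0 | 1 | 178 | 1 | 1.4 | 2 | 0 | 3 |
| 174 | 60 | 1 | 0 | 130 | 206 | 0 | 0 | 132 | 1 | 2.4 | 1 | 2 | 3 |
| 30 | 41 | 0 | 1 | 105 | 198 | 0 | 1 | 168 | 0 | 0.0 | 2 | 1 | 2 |
| 63 | 41 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 |
| 180 | 55 | 1 | 0 | 132 | 353 | 0 | 1 | 132 | 1 | 1.2 | 1 | 1 | 3 |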


In [17]:
y_preds = clf.predict(X_test)

### 5. Evaluate the model
Each model or estimator has a built-in score() method. It evaluates the trained model on the features and labels you pass in. For a classifier, it returns the mean accuracy: the fraction of samples whose predicted label matches the true label.

#### Score Method

Accuracy is the default metric for the score() method of Scikit-Learn's classifiers, and it's probably the metric you'll see used most often for classification problems.
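
As a quick check of that claim, the sketch below compares `score()` with `accuracy_score` on the same predictions. It assumes synthetic data and illustrative variable names rather than the heart-disease set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: illustration only)
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)

# Three ways of computing the same number
score_value = clf_demo.score(X_te, y_te)
manual_accuracy = accuracy_score(y_te, clf_demo.predict(X_te))
fraction_correct = np.mean(clf_demo.predict(X_te) == y_te)

print(score_value, manual_accuracy, fraction_correct)
```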

In [14]:
# Evaluate the model on the training set
clf.score(X_train, y_train)

0.9793388429752066

In [18]:
# Evaluate the model on the test set
clf.score(X_test, y_test)

0.8688524590163934

#### classification_report
* Precision: what proportion of the positive predictions were actually correct?
* Recall: what proportion of the actual positive cases were correctly predicted?
* F1 score: the harmonic mean of precision and recall.
* Support: the number of actual occurrences of the class in the specified dataset.

In [19]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.92      0.79      0.85        28
           1       0.84      0.94      0.89        33

    accuracy                           0.87        61
   macro avg       0.88      0.86      0.87        61
weighted avg       0.87      0.87      0.87        61



#### Confusion Matrix
* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1

In [20]:
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[22,  6],
       [ 2, 31]], dtype=int64)
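
The precision, recall, and F1 values reported for class 1 can be recomputed by hand from these four counts. A small sketch, with the matrix values copied from the output above (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Counts copied from the confusion matrix above
conf_mat_demo = np.array([[22, 6],
                          [2, 31]])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_mat_demo.ravel()

precision = tp / (tp + fp)  # correct positive predictions / all positive predictions
recall = tp / (tp + fn)     # correct positive predictions / all actual positives
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / conf_mat_demo.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.84 recall=0.94 f1=0.89 accuracy=0.87
```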

### 6. Experiment to Improve the Model
* The first model you build is often referred to as a <b>baseline.</b>

* The next step in the workflow is to try and improve upon your baseline model.

* Experiment with different hyperparameters

* Compare hyperparameter settings with cross-validation rather than a single train/test split, so the choice doesn't depend on one particular split

* Different models you use will have different hyperparameters you can tune. For the case of our model, the RandomForestClassifier(), we'll start trying different values for n_estimators.
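
One way to compare values of n_estimators is to cross-validate on the training data only and keep the test set for a final check. This is a sketch under assumptions: the data is synthetic, the three candidate values are arbitrary, and `random_state` is fixed so the runs are repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the heart-disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Mean 5-fold cross-validated accuracy for each candidate value
cv_means = {}
for n in (10, 50, 100):
    model_demo = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model_demo, X_tr, y_tr, cv=5)
    cv_means[n] = scores.mean()
    print(f"n_estimators={n}: mean CV accuracy {cv_means[n]:.3f}")

# Pick the value with the best cross-validated score
best_n = max(cv_means, key=cv_means.get)
print("Best n_estimators by cross-validation:", best_n)
```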

In [55]:
np.random.seed(42)
for i in range(10, 151, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 81.9672131147541%

Trying model with 20 estimators...
Model accuracy on test set: 81.9672131147541%

Trying model with 30 estimators...
Model accuracy on test set: 86.88524590163934%

Trying model with 40 estimators...
Model accuracy on test set: 83.60655737704919%

Trying model with 50 estimators...
Model accuracy on test set: 83.60655737704919%

Trying model with 60 estimators...
Model accuracy on test set: 86.88524590163934%

Trying model with 70 estimators...
Model accuracy on test set: 83.60655737704919%

Trying model with 80 estimators...
Model accuracy on test set: 88.52459016393442%

Trying model with 90 estimators...
Model accuracy on test set: 86.88524590163934%

Trying model with 100 estimators...
Model accuracy on test set: 85.24590163934425%

Trying model with 110 estimators...
Model accuracy on test set: 86.88524590163934%

Trying model with 120 estimators...
Model accuracy on test set: 86.88524590163934%

Try

In [21]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 81.9672131147541%
Cross-validation score: 78.53551912568305%

Trying model with 20 estimators...
Model accuracy on test set: 85.24590163934425%
Cross-validation score: 79.84699453551912%

Trying model with 30 estimators...
Model accuracy on test set: 88.52459016393442%
Cross-validation score: 80.50819672131148%

Trying model with 40 estimators...
Model accuracy on test set: 85.24590163934425%
Cross-validation score: 82.15300546448088%

Trying model with 50 estimators...
Model accuracy on test set: 86.88524590163934%
Cross-validation score: 81.1639344262295%

Trying model with 60 estimators...
Model accuracy on test set: 85.24590163934425%
Cross-validation score: 83.47540983606557%

Trying model with 70 estimators...
Model accuracy on test set: 86.88524590163934%
Cross-validation score: 81.83060109289617%

Trying model with 80 estimators...
Model accuracy on test set: 85.24590163934425%
Cross-validation score: 82.81420765027

#### GridSearchCV
GridSearchCV tries every combination of the hyperparameter values you define, scores each combination with cross-validation, and keeps the best one.

In [22]:
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over
# (max_depth must be a positive integer or None; a value of 0 makes every fit that uses it fail)
param_grid = {'n_estimators': list(range(10, 101, 10)), 'max_depth': [5, 10]}

# Setup the grid search
grid = GridSearchCV(RandomForestClassifier(),
                    param_grid,
                    cv=5,
                    scoring='recall'
                   )

# Fit the grid search to the data
grid.fit(X, y)

# Find the best parameters
grid.best_params_

{'max_depth': 5, 'n_estimators': 50}

In [52]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf

RandomForestClassifier(max_depth=5, n_estimators=50)

In [53]:
# Fit the best model
clf = clf.fit(X_train, y_train)

In [54]:
# Find the best model scores
clf.score(X_test, y_test)

0.8688524590163934
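
The cost of a grid search is the number of parameter combinations multiplied by the number of cross-validation folds. The sketch below counts the combinations with `ParameterGrid` and, as one possible design choice, runs the search on the training split only so the test set stays out of model selection. The data is synthetic, and the small grid and fold count are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# Synthetic stand-in data (assumption: illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# A deliberately small grid: 2 x 2 = 4 combinations
param_grid_demo = {"n_estimators": [10, 50], "max_depth": [None, 5]}
cv_folds = 3
n_candidates = len(ParameterGrid(param_grid_demo))
total_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")

grid_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid_demo,
    cv=cv_folds,
    scoring="recall",
)
grid_demo.fit(X_tr, y_tr)  # search on the training split only

print("Best parameters:", grid_demo.best_params_)
# score() uses the same scoring as the search, here recall
test_recall = grid_demo.score(X_te, y_te)
print(f"Recall on the held-out test set: {test_recall:.2f}")
```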

### 7. Save a model for someone else to use
* When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

* This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

* Saving a model also allows you to reuse it later without having to go through retraining it. Which is helpful, especially when your training times start to increase.

* You can save a scikit-learn model using Python's built-in pickle module or the joblib library.

In [30]:
import pickle

# Save the tuned model (clf) to file; the with-block closes the file afterwards
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [31]:
# Load a saved model and make a prediction
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8688524590163934

In [23]:
import joblib

# joblib is an alternative to pickle that handles large NumPy arrays efficiently
filename = 'mymodel'
joblib.dump(clf, filename)

['mymodel']

In [24]:
mymodel = joblib.load(filename)

In [25]:
mymodel.score(X_test, y_test)

0.8688524590163934
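
A saved model should reproduce the original model's predictions after loading. The sketch below checks that round trip with joblib, assuming synthetic data and a temporary directory so no files are left behind (the file name is illustrative).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (assumption: illustration only)
X_demo, y_demo = make_classification(n_samples=100, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.joblib")
    joblib.dump(clf_demo, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts
same_predictions = np.array_equal(clf_demo.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```

Both pickle and joblib files can execute code when loaded, so only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.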