In [6]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline

from timeit import timeit

In [7]:
X, y = load_iris(return_X_y=True)

We already learned that the pre-processing applied to the data before training can significantly affect how well a model performs.

This leads us to think of a model not only as the sklearn estimator object, but as the whole pipeline: the pre-processing steps plus the estimator.
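To make this idea concrete, here is a minimal sketch of a pipeline being used as if it were a single model. The names `demo_pipe`, `demo_preds` and `fitted_scaler` are hypothetical and chosen only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical example pipeline: a scaler followed by the classifier.
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier(weights='distance')),
])

# fit() fits the scaler, transforms X, then fits the classifier on the result.
demo_pipe.fit(X, y)

# predict() applies the already-fitted scaler before the classifier predicts.
demo_preds = demo_pipe.predict(X[:5])

# Each step stays accessible by the name it was given.
fitted_scaler = demo_pipe.named_steps['scaler']
print(demo_preds)
print(fitted_scaler.mean_)
```

The point of the sketch is that `fit` and `predict` are called on the pipeline exactly as they would be on a bare estimator, while the scaling happens inside.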

In [10]:
isolated_model = KNeighborsClassifier(n_jobs=-1, weights='distance')

In [27]:
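from sklearn.base import clone

# Give each pipeline its own copy of the classifier. Sharing one instance
# would let every fit() overwrite the model used by the other pipelines.
pipe = Pipeline([
    ('model', clone(isolated_model))
])

pipe_minmax = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', clone(isolated_model))
])

pipe_standard = Pipeline([
    ('scaler', StandardScaler()),
    ('model', clone(isolated_model))
])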

In [28]:
n_trials = 1000

def run_pipe():
    pipe.fit(X, y)

def run_pipe_minmax():
    pipe_minmax.fit(X, y)

def run_pipe_standard():
    pipe_standard.fit(X, y)

In [29]:
timeit(run_pipe, number=n_trials)

0.7447044880000249

In [30]:
timeit(run_pipe_minmax, number=n_trials)

1.0371052380000947

In [31]:
timeit(run_pipe_standard, number=n_trials)

1.079402636000168
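`timeit` returns the total time in seconds for all `number` runs, not the time of a single run. As a small worked example, the totals reported above can be turned into an average cost per fit and a ratio against the scaler-free pipeline. The dictionary `totals` is a hypothetical helper that just copies the three outputs.

```python
n_trials = 1000

# Totals in seconds, copied from the three outputs above.
totals = {
    'pipe': 0.7447044880000249,
    'pipe_minmax': 1.0371052380000947,
    'pipe_standard': 1.079402636000168,
}

baseline = totals['pipe']
for name, total in totals.items():
    per_fit_ms = total / n_trials * 1e3  # average milliseconds per fit
    ratio = total / baseline             # relative to the scaler-free pipeline
    print(f'{name:14s} {per_fit_ms:.3f} ms per fit, {ratio:.2f}x baseline')
```

On these numbers a single fit costs on the order of one millisecond, and adding a scaler makes it roughly 40 to 45 percent slower. Timings like these vary between machines and runs, so the exact figures should be read as indicative only.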

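Note that pipelines are not the most efficient workflow for experimentation, since every call to fit re-executes all the steps, pre-processing included. This increases execution time: in the timings above (total seconds for 1000 fits), adding a scaler raised the total from about 0.74 s to between 1.04 s and 1.08 s.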
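One way to see where the extra time goes is to compare the pipeline with a manual workflow that scales the data once and then refits only the classifier. This is a sketch under assumptions: the names `scaled_pipe`, `bare_model`, `t_pipe` and `t_manual` are hypothetical, `n_jobs` is left at its default, and the number of trials is reduced to keep it quick.

```python
from timeit import timeit

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
n_trials = 200  # fewer trials than above, only to keep the sketch fast

# Pipeline route: the scaler is refitted on every call to fit().
scaled_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier(weights='distance')),
])

# Manual route: scale once up front, then refit only the classifier.
X_scaled = StandardScaler().fit_transform(X)
bare_model = KNeighborsClassifier(weights='distance')

t_pipe = timeit(lambda: scaled_pipe.fit(X, y), number=n_trials)
t_manual = timeit(lambda: bare_model.fit(X_scaled, y), number=n_trials)
print(f'pipeline: {t_pipe:.3f} s, pre-scaled: {t_manual:.3f} s')

# Both routes see the same scaled data, so their predictions agree.
same = bool((scaled_pipe.predict(X) == bare_model.predict(X_scaled)).all())
print('identical predictions:', same)
```

Scaling once outside the pipeline is only safe when the same data is reused unchanged. Under cross-validation the scaler should be fitted on each training split, which is something the pipeline handles and the manual route does not.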

However, pipelines are great for building a single object that contains all the information about how a machine learning model is trained.
In the next notebook we'll see a useful application of this: finding the best configurations.
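The claim that a pipeline records how a model is trained can be checked by inspecting its parameters. The sketch below uses a hypothetical pipeline named `described_pipe`; `get_params` and `set_params` are part of the standard scikit-learn estimator interface, and nested parameters are addressed as `<step name>__<parameter name>`.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pipeline used only to illustrate parameter inspection.
described_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsClassifier(weights='distance')),
])

# get_params() flattens the configuration of every step into one dict.
params = described_pipe.get_params()
print(params['model__weights'])
print(params['model__n_neighbors'])
print(params['scaler__feature_range'])

# The same keys can be used to change the configuration in place.
described_pipe.set_params(model__n_neighbors=3)
print(described_pipe.get_params()['model__n_neighbors'])
```

Because every setting of every step is reachable through one flat set of keys, a whole training recipe can be described, compared and modified as a single object.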