# <span style="color:darkblue"> Lecture 19: Machine Learning Pipelines </span>

<font size = "5">


One of the most common mistakes in ML analyses is data leakage:

https://www.youtube.com/watch?v=y8qaI5mpJeA

This happens when information from the validation/test sets <br>
leaks into the training process.
- This can happen when we preprocess X using the whole dataset
- Machine learning pipelines automate the pre-processing stages and <br>
help avoid data leakage
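
A minimal sketch of what leakage looks like in practice, assuming the same breast cancer data used later in this lecture. The names `leaky_scaler`, `clean_scaler`, and `mean_gap` are illustrative and not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: the scaler sees the test rows when it estimates means and SDs
leaky_scaler = StandardScaler().fit(X)

# Leak-free: the scaler only sees the training rows
clean_scaler = StandardScaler().fit(X_train)

# The estimated means differ, so the leaky version carries test-set information
mean_gap = np.abs(leaky_scaler.mean_ - clean_scaler.mean_).max()
print(mean_gap)
```

The gap may be small for a given dataset, but any nonzero value means the preprocessing step was influenced by rows that should have been held out.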


# <span style="color:darkblue"> I. Setup Working Environment </span>


In [82]:
# Import the package for the University of California Irvine API
from ucimlrepo import fetch_ucirepo 

# Import SK-Learn library for machine learning functions
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import get_scorer_names

# Import ML pipeline:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

# Import PCA
from sklearn.decomposition import PCA

# Import standard data analysis packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import KMeans
from sklearn.cluster import MiniBatchKMeans



<font size = "5">

Cancer screening dataset


In [83]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()

X = cancer.data
y = cancer.target

# <span style="color:darkblue"> II. Pipelines </span>

<font size = "5">

Example of machine learning pipeline

- Involves multiple splitting + pre-processing steps
- Preprocessing steps involve ```.fit()``` and ```.transform()```
- Model fitting at the end only involves ```.fit()```
- Evaluate with a scorer out-of-sample

In [84]:
# Split into training/test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)

# Scale inputs
scaler        = StandardScaler().fit(X_train)
X_train_star  = scaler.transform(X_train)

# Obtain PCA (optional)
pca           = PCA(n_components=3).fit(X_train_star)
X_train_star  = pca.transform(X_train_star)

# Fit SVC model 
svm = SVC()
svm.fit(X_train_star, y_train)

# Predict test data
# Note: We scale and project the test data using the means, standard
# deviations, and loadings estimated on the training data (no refitting)
X_test_star = pca.transform(scaler.transform(X_test))

# Evaluate prediction accuracy
svm.score(X_test_star,y_test)

# Optional: Obtain predictions
# predictions = svm.predict(X_test_star)

0.916083916083916

<font size = "5">

Automatic ML pipelines

- Chain multiple functions together, applied in order
- Give each step a name, which can be used to reference it later.
- The last step uses ```.fit()```, while the intermediate <br>
steps use ```.fit()``` and ```.transform()```. Any set of functions <br>
with this structure can be used with ```Pipeline()```

In [85]:
# Main syntax
# Each step is a different element of a list
# Each step is comprised of two arguments: name + function
# (wrapped in parentheses and comma-separated)
pipe = Pipeline([("scaler",StandardScaler()),
                 ("pca",PCA(n_components= 3)),
                 ("svm",SVC())])

  
pipe.fit(X_train,y_train)
pipe.score(X_test,y_test)


0.916083916083916
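
One reason to name the steps is that a fitted component can be retrieved afterwards. A minimal sketch, assuming the same data split as above; `fitted_pca`, `explained`, and the slice `pipe[:-1]` are illustrative additions rather than part of the original lecture code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=3)),
                 ("svm", SVC())])
pipe.fit(X_train, y_train)

# Look up a fitted step by the name given in the pipeline definition
fitted_pca = pipe.named_steps["pca"]
explained = fitted_pca.explained_variance_ratio_
print(explained)

# Slicing the pipeline keeps only the preprocessing steps (everything but the SVC)
X_test_star = pipe[:-1].transform(X_test)
print(X_test_star.shape)
```

The sliced pipeline reuses the scaler and PCA already fitted on the training data, so it reproduces the manual `pca.transform(scaler.transform(X_test))` step.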

<font size = "5">

Abbreviated pipeline syntax

- Useful if you don't need to reference one of the components <br>
in subsequent analyses


In [None]:
# We can also use the abbreviated syntax
pipe_abbreviated = make_pipeline(StandardScaler(),
                                 PCA(n_components = 3),
                                 SVC())
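
Even without explicit names, each step still receives one. A short sketch to inspect them; `step_names` is an illustrative variable.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe_abbreviated = make_pipeline(StandardScaler(),
                                 PCA(n_components=3),
                                 SVC())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe_abbreviated.steps]
print(step_names)
```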

<font size = "5">

Try it yourself!

Check that the abbreviated syntax also fits a model <br>
with the same prediction accuracy

In [87]:
# Write your own code

pipe_abbreviated.fit(X_train,y_train)
pipe_abbreviated.score(X_test,y_test)


0.916083916083916

<font size = "5">

Create a pipeline without PCA. Does it have better accuracy?

In [89]:
# Write your own code

# Create the pipeline without PCA
pipe_no_pca= make_pipeline(StandardScaler(),
                                 SVC())

pipe_no_pca_option2 = Pipeline([("scaler",StandardScaler()),
                                    ("svm",SVC())])

# Fit the pipeline on the training data
pipe_no_pca.fit(X_train, y_train)

# Evaluate the accuracy on the test data
accuracy_no_pca = pipe_no_pca.score(X_test, y_test)

accuracy_no_pca


0.965034965034965

# <span style="color:darkblue"> III. Pipelines + Cross-Validation </span>

<font size = "5">

Create a grid of values to search

- In the pipe framework, use name of the step, e.g. "svm" <br>
followed by two underscores, and the name of the parameters
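- Abbreviated pipes generate the step names automatically from the <br>
lowercased class names, e.g. "svc" for ```SVC()```, so the grid key would be "svc__C"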

In [91]:
k_features = 3
param_grid = {'svm__C':     [ 0.01, 0.1],
              'svm__gamma': [0.01/(k_features), 0.1/(k_features)]}
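
If you are unsure which keys are valid in the grid, the pipeline can list them. A minimal sketch, assuming the named `pipe` defined earlier; `valid_keys` and `svm_keys` are illustrative names.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=3)),
                 ("svm", SVC())])

# get_params() returns every tunable parameter as "<step name>__<parameter>"
valid_keys = sorted(pipe.get_params().keys())

# Keep only the parameters that belong to the "svm" step
svm_keys = [key for key in valid_keys if key.startswith("svm__")]
print(svm_keys)
```

Any key in `valid_keys` can be used in `param_grid`, including preprocessing parameters such as `pca__n_components`.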


<font size = "5">

Create a CV grid search

- The advantage is that the scaling and PCA steps are refit on <br>
the training portion of each split-sample fold.
- Pipes avoid a problem known as "leakage": <br>
scaling or running PCA on the whole dataset <br>
violates the assumption of independence of the folds <br>
and could yield distorted evaluation metrics
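
A hedged sketch contrasting the two approaches with ```cross_val_score```. It assumes a scaler plus SVC only (no PCA) and 5 folds, both chosen for illustration; `leaky_scores` and `clean_scores` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: scale once using the whole dataset, then cross-validate
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_scaled_all, y, cv=5)

# Leak-free: the scaler is refit on the training part of every fold
clean_pipe = make_pipeline(StandardScaler(), SVC())
clean_scores = cross_val_score(clean_pipe, X, y, cv=5)

print(leaky_scores.mean(), clean_scores.mean())
```

On a dataset like this one the two averages may be very close. The distortion tends to matter more with small samples or with preprocessing steps that use the outcome.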



In [92]:
# Configure grid
grid = GridSearchCV(pipe, param_grid, cv=2,
                          return_train_score=True,
                          n_jobs = -1)

grid.fit(X_train,y_train)


<font size = "5">

Evaluate cross-validated model

In [93]:
print("Best estimator:\n{}".format(grid.best_estimator_))


grid.score(X_test,y_test)

Best estimator:
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=3)),
                ('svm', SVC(C=0.1, gamma=0.03333333333333333))])


0.9300699300699301
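
Beyond the best estimator, the fitted grid stores the scores of every candidate. A minimal sketch of how to tabulate them, assuming the same pipeline and grid as above; `results`, `cols`, and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=3)),
                 ("svm", SVC())])
k_features = 3
param_grid = {"svm__C": [0.01, 0.1],
              "svm__gamma": [0.01 / k_features, 0.1 / k_features]}

grid = GridSearchCV(pipe, param_grid, cv=2, return_train_score=True)
grid.fit(X_train, y_train)

# cv_results_ is a dictionary with one entry per candidate; view it as a table
results = pd.DataFrame(grid.cv_results_)
cols = ["param_svm__C", "param_svm__gamma",
        "mean_train_score", "mean_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

Comparing `mean_train_score` with `mean_test_score` gives a rough indication of over- or under-fitting for each candidate.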

# <span style="color:darkblue"> IV. Pipelines + Cross-Validation + Multimodels </span>

<font size = "5">

Define names of each step, and a default option:

In [94]:
pipe_multimodel = Pipeline([('preprocessing', StandardScaler()),
                            ('classifier', SVC())])


<font size = "5">

Define grid of models/tuning parameters

- Can vary which classifier
- Can vary whether to use preprocessing or not
- Can vary the tuning parameters within a specific model


In [95]:
# Each separate model is grouped with curly brackets

param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__C':   [ 0.01, 0.1]},
    {'classifier': [RidgeClassifier()], 'preprocessing': [StandardScaler(), None]}]
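
To see how many candidates this grid defines before running the search, it can be expanded with ```ParameterGrid```. A short sketch; `candidates` and `n_candidates` are illustrative names.

```python
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__C': [0.01, 0.1]},
    {'classifier': [RidgeClassifier()], 'preprocessing': [StandardScaler(), None]}]

# Each dictionary is expanded separately, then the candidates are pooled
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
print(n_candidates)

for candidate in candidates:
    print(candidate)
```

The SVC dictionary contributes 2 x 2 = 4 candidates and the RidgeClassifier dictionary contributes 2, for 6 in total.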

<font size = "5">

Search over all models

In [96]:
# Configure grid
grid = GridSearchCV(pipe_multimodel, param_grid, cv=2,
                          return_train_score=True,
                          n_jobs = -1)

grid.fit(X_train,y_train)

print("Best estimator:\n{}".format(grid.best_estimator_))

print(grid.score(X_test,y_test))

Best estimator:
Pipeline(steps=[('preprocessing', StandardScaler()),
                ('classifier', RidgeClassifier())])
0.965034965034965
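
The winning combination and its cross-validated score can also be read directly from the fitted grid. A minimal sketch, assuming the same multimodel pipeline and grid as above; `best_model_name` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe_multimodel = Pipeline([('preprocessing', StandardScaler()),
                            ('classifier', SVC())])
param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__C': [0.01, 0.1]},
    {'classifier': [RidgeClassifier()], 'preprocessing': [StandardScaler(), None]}]

grid = GridSearchCV(pipe_multimodel, param_grid, cv=2, return_train_score=True)
grid.fit(X_train, y_train)

# best_params_ holds the chosen classifier, preprocessing, and tuning values
best_model_name = type(grid.best_params_['classifier']).__name__
print(best_model_name)
print(grid.best_params_)

# best_score_ is the mean cross-validated score on the training data,
# which is distinct from the test score reported by grid.score()
print(grid.best_score_)
```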
