# Pipeline Steps


Reference: _Introduction to Machine Learning with Python_, [Chapter 6](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb):
- Sections 6 (Algorithm Chains and Pipelines) through 6.4 (The General Pipeline Interface)



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import mglearn

This example demonstrates how to use `Pipeline` with the breast cancer dataset used previously.

## 1. The plan
1. Load breast cancer dataset, and split off a test portion
1. Create a pipeline with pre-processing and classification
1. Use grid search to find which combination of pre-processing and classifier works best

Notes:
- All features in the breast cancer dataset are numerical, so scaling with `StandardScaler` or `MinMaxScaler` is likely to help
- We compare a linear classifier with a non-linear one
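
To make the first note concrete, the sketch below compares the raw feature ranges with the output of the two scalers. It fits the scalers on the full dataset purely for illustration (an assumption of this sketch); inside the pipeline they are fitted on training data only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Per-feature maxima span several orders of magnitude in the raw data
raw_max = X.max(axis=0)
print(f'Smallest feature maximum: {raw_max.min():.3f}')
print(f'Largest feature maximum:  {raw_max.max():.1f}')

# StandardScaler: zero mean and unit variance per feature
X_std = StandardScaler().fit_transform(X)
print(np.allclose(X_std.mean(axis=0), 0.0), np.allclose(X_std.std(axis=0), 1.0))

# MinMaxScaler: every feature rescaled to the range [0, 1]
X_mm = MinMaxScaler().fit_transform(X)
print(X_mm.min(axis=0).min(), X_mm.max(axis=0).max())
```

Distance- and margin-based models such as a linear SVM are sensitive to these scale differences, which is why scaling is a candidate pre-processing step here.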

## 2. Load and split the data 
Notes:
- Use `random_state=0` when splitting so the split is reproducible

In [3]:
from sklearn.datasets import load_breast_cancer

In [4]:
X, y = load_breast_cancer(return_X_y=True)
print(X.shape)
print(y.shape)

(569, 30)
(569,)


In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


## 3. Create the pipeline with processing and classifier steps 

Notes:
- Use `StandardScaler()` and `SVC(kernel='linear')` as placeholders
- Call `fit()` and `score()` on the training data to verify that the pipeline works
- Use sklearn's `set_config()` to display the pipeline as a diagram

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC



In [7]:
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC(kernel='linear'))])
pipe

In [8]:
from sklearn import set_config
set_config(display='diagram')
pipe

In [9]:
pipe.fit(X_train, y_train)
print(f'Training accuracy {pipe.score(X_train, y_train):.2f} on {y_train.shape} samples')

Training accuracy 0.99 on (426,) samples
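
As a rough check of what `pipe.fit()` and `pipe.score()` do, the sketch below runs the same two steps by hand and compares the result with the pipeline. The variable names `scaler`, `svc`, `manual_score` and `pipe_score` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version: fit the scaler, transform, then fit the classifier
scaler = StandardScaler().fit(X_train)
svc = SVC(kernel='linear').fit(scaler.transform(X_train), y_train)
manual_score = svc.score(scaler.transform(X_train), y_train)

# Pipeline version: one fit() call runs the same steps in order
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', SVC(kernel='linear'))])
pipe_score = pipe.fit(X_train, y_train).score(X_train, y_train)

print(f'Manual: {manual_score:.2f}, pipeline: {pipe_score:.2f}')
```

The two scores should agree. The pipeline's advantage is that the scaler is always fitted on exactly the data passed to `fit()`, which matters once cross-validation is involved.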


In [10]:
set_config(display='text')
pipe.named_steps.preprocessing

StandardScaler()
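
Beyond printing a step, `named_steps` also gives access to what each step learned during `fit()`. The following is a minimal sketch, assuming the same pipeline and split as above; indexing the pipeline by step name (`pipe['classifier']`) is an equivalent way to reach a step.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', SVC(kernel='linear'))])
pipe.fit(X_train, y_train)

# Attribute-style access through named_steps
fitted_scaler = pipe.named_steps.preprocessing
print(fitted_scaler.mean_.shape)   # one learned mean per feature

# Index-style access by step name returns the same object
fitted_clf = pipe['classifier']
print(fitted_clf.coef_.shape)      # linear SVM weights, one per feature
print(fitted_clf is pipe.named_steps.classifier)
```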

## 4. Use grid search to find best combination of pre-processing and classifier

Notes:
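- The parameter grid can be a list of dictionaries; each dictionary is searched separately
- Use `'classifier__parameter_name'` keys (step name, double underscore, parameter name) to search over a step's parameters
- Compare `SVC(kernel='linear')` and `RandomForestClassifier(random_state=43)`
- Set `preprocessing` to `[None]` for the random forest, since tree-based models do not need feature scaling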
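
The naming convention in these notes can be tried out directly with `set_params()`, and `ParameterGrid` shows how a list of dictionaries expands into candidates. This is a small sketch that uses the same grid as the cell below; the name `candidates` is introduced here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', SVC(kernel='linear'))])

# '<step name>__<parameter>' sets a parameter inside a step
pipe.set_params(classifier__C=10.0)
print(pipe.named_steps.classifier.C)

# A bare step name replaces the whole step
pipe.set_params(classifier=RandomForestClassifier(random_state=43))
print(type(pipe.named_steps.classifier).__name__)

param_grid = [{'classifier': [SVC(kernel='linear')],
               'classifier__C': [0.01, 0.1, 1.0, 10.0],
               'preprocessing': [StandardScaler(), None]},
              {'classifier': [RandomForestClassifier(random_state=43)],
               'classifier__max_depth': [3, 5, 7, 9],
               'preprocessing': [None]}]

# Each dictionary is expanded on its own, then the results are concatenated
candidates = list(ParameterGrid(param_grid))
print(len(candidates))
```

This grid expands to 4 × 2 + 4 × 1 = 12 candidates, so a search with `cv=5` performs 60 fits plus one final refit on the full training set.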

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [{'classifier': [SVC(kernel='linear')], 
               'classifier__C': [0.01, 0.1, 1.0, 10.0],
               'preprocessing': [StandardScaler(), None]
              },
              {'classifier': [RandomForestClassifier(random_state=43)], 
               'classifier__max_depth': [3, 5, 7, 9],
               'preprocessing': [None]
              }]

grid = GridSearchCV(pipe, param_grid, cv=5)


In [12]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessing', StandardScaler()),
                                       ('classifier', SVC(kernel='linear'))]),
             param_grid=[{'classifier': [SVC(kernel='linear')],
                          'classifier__C': [0.01, 0.1, 1.0, 10.0],
                          'preprocessing': [StandardScaler(), None]},
                         {'classifier': [RandomForestClassifier(random_state=43)],
                          'classifier__max_depth': [3, 5, 7, 9],
                          'preprocessing': [None]}])

In [13]:
grid.best_estimator_

Pipeline(steps=[('preprocessing', StandardScaler()),
                ('classifier', SVC(kernel='linear'))])

In [14]:
grid.best_params_

{'classifier': SVC(kernel='linear'),
 'classifier__C': 1.0,
 'preprocessing': StandardScaler()}

In [15]:
print(f'Cross-Validation accuracy {grid.best_score_:.2f}')
print(f'Test accuracy {grid.score(X_test, y_test):.2f}')

Cross-Validation accuracy 0.98
Test accuracy 0.97
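
To see how all candidates performed rather than only the best one, `cv_results_` can be loaded into a DataFrame. The sketch below deliberately uses a reduced grid that varies only `C` for the scaled linear SVM (an assumption made to keep the example short and fast); the same pattern applies to the full grid above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', SVC(kernel='linear'))])

# Reduced grid (illustrative assumption): only C is varied
grid = GridSearchCV(pipe, {'classifier__C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# One row per candidate, with mean and spread of the fold scores
results = pd.DataFrame(grid.cv_results_)
cols = ['param_classifier__C', 'mean_test_score',
        'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

`grid.best_score_` is the top `mean_test_score` in this table, while `grid.score(X_test, y_test)` evaluates the refitted best pipeline on data that was never used during the search.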
