<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#scikit-learn's-Pipeline" data-toc-modified-id="scikit-learn's-Pipeline-1">scikit-learn's Pipeline</a></span></li><li><span><a href="#Pipeline-works-well-with-Cross-Validation" data-toc-modified-id="Pipeline-works-well-with-Cross-Validation-2">Pipeline works well with Cross Validation</a></span></li><li><span><a href="#Pipeline-works-with-RandomizedSearchCV" data-toc-modified-id="Pipeline-works-with-RandomizedSearchCV-3">Pipeline works with RandomizedSearchCV</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-4">Check for understanding</a></span></li><li><span><a href="#Always-CV-Across-Features-and-Algorithms" data-toc-modified-id="Always-CV-Across-Features-and-Algorithms-5">Always CV Across Features and Algorithms</a></span></li><li><span><a href="#Search-Across-Algorithms" data-toc-modified-id="Search-Across-Algorithms-6">Search Across Algorithms</a></span></li><li><span><a href="#Many-options-for-Hyperparameter-Search" data-toc-modified-id="Many-options-for-Hyperparameter-Search-7">Many options for Hyperparameter Search</a></span></li><li><span><a href="#Protip" data-toc-modified-id="Protip-8">Protip</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-9">Takeaways</a></span></li><li><span><a href="#Further-Study" data-toc-modified-id="Further-Study-10">Further Study</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-11">Bonus Material</a></span></li></ul></div>

<center><h2>scikit-learn's Pipeline</h2></center>



In [30]:
reset -fs

In [31]:
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Pipeline works well with Cross Validation
-----

A `Pipeline` makes it easier to compose estimators and behaves like a single estimator under cross-validation: every step is refit on the training portion of each fold.

In [32]:
# Load and split the data
from sklearn.datasets        import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, 
                                                    iris.target, 
                                                    test_size=0.2)

In [33]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline      import Pipeline
from sklearn.tree          import DecisionTreeClassifier

pipe_dt = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', DecisionTreeClassifier())])

In [34]:
# Visualize pipeline
# This is a good idea for your Final Project

from sklearn import set_config

set_config(display='diagram')

pipe_dt

In [35]:
from sklearn.model_selection import cross_val_score, KFold

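# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10, shuffle=True, random_state=42)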

results = cross_val_score(pipe_dt, # Put your pipeline where an Estimator would go
                          X_train, 
                          y_train, 
                          cv=kfold)
print(f"The mean training validation accuracy - {results.mean():.4f}")

The mean training validation accuracy - 0.9000
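
To make the "simple behavior under cross-validation" concrete, the sketch below repeats what `cross_val_score` does by hand: for each fold it clones the pipeline, fits every step on the training part only, and scores on the held-out part. The fixed `random_state` values and the use of the full iris data are assumptions added here so the two runs are comparable; they are not part of the cells above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier(random_state=0))])

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# What cross_val_score does for us
auto_scores = cross_val_score(pipe, X, y, cv=kfold)

# The same loop written out by hand
manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_pipe = clone(pipe)                      # fresh, unfitted copy
    fold_pipe.fit(X[train_idx], y[train_idx])    # scaler, PCA and tree see training rows only
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

print(auto_scores)
print(manual_scores)
```

The two arrays should match, which is the point: the scaler and PCA are never fit on the validation rows.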


Source: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/ 

<center><h2>Pipeline works with RandomizedSearchCV</h2></center>

<center><img src="../images/random_search.png" width="65%"/></center>

In [36]:
from sklearn.model_selection import RandomizedSearchCV

# Random search over the hyperparameters of each step in the pipeline
#                  <step name>__<hyperparameter>
hyperparameters = dict(pca__n_components     = [1, 2, 3],
                       clf__max_depth        = range(1, 5),
                       clf__criterion        = ['gini', 'entropy'],
                       clf__min_samples_leaf = range(3, 15))

clf_rand_cv = RandomizedSearchCV(estimator=pipe_dt, 
                              param_distributions=hyperparameters, 
                              n_iter=25, #25
                              cv=5, 
                              n_jobs=-1,
                              verbose=False)

clf_rand_cv.fit(X_train, y_train)
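
The `<step name>__<hyperparameter>` keys are easy to mistype. A minimal sketch for discovering the valid names, assuming the same three-step pipeline as above (rebuilt here under the name `pipe` so the example stands alone):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scl', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', DecisionTreeClassifier())])

# Keys containing '__' address a hyperparameter of a named step
valid_names = sorted(pipe.get_params().keys())
tunable = [name for name in valid_names if '__' in name]
print(tunable)

# set_params uses the same naming scheme as the search space
pipe.set_params(pca__n_components=3, clf__max_depth=4)
```

Any key printed here can be used in the `hyperparameters` dict; a key that is not in this list makes the search fail with an invalid-parameter error.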


<center><h2>Check for understanding</h2></center>

During CV: 
- Is each combination evaluated on only a single fold?
- Or is each combination evaluated on all 5 folds, with the scores averaged?

In [37]:
clf_rand_cv = RandomizedSearchCV(estimator=pipe_dt, 
                              param_distributions=hyperparameters, 
                              n_iter=25, #25
                              cv=5, 
                              n_jobs=-1,
                              verbose=True)
clf_rand_cv.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:    0.4s finished
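
The log line `Fitting 5 folds for each of 25 candidates` answers the question: every candidate is fit on all 5 folds. The sketch below checks the averaging part through `cv_results_`. It uses a smaller, hypothetical search (5 candidates, a two-step pipeline, fixed seeds) so it runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', DecisionTreeClassifier(random_state=0))])

search = RandomizedSearchCV(estimator=pipe,
                            param_distributions={'clf__max_depth': [1, 2, 3, 4],
                                                 'clf__min_samples_leaf': [3, 5, 8, 12]},
                            n_iter=5,
                            cv=5,
                            random_state=0)
search.fit(X, y)

results = search.cv_results_

# One row per candidate, one column per fold
fold_scores = np.column_stack([results[f'split{i}_test_score'] for i in range(5)])
print(fold_scores.shape)

# mean_test_score is the per-candidate average across the folds
averaged = np.allclose(fold_scores.mean(axis=1), results['mean_test_score'])
print(averaged)
```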


In [38]:
# Best hyperparameters
clf_rand_cv.best_params_

{'pca__n_components': 2,
 'clf__min_samples_leaf': 14,
 'clf__max_depth': 3,
 'clf__criterion': 'gini'}

In [39]:
# Best pipeline: refit on all of X_train with the best hyperparameters
clf_rand_cv.best_estimator_

<center><h2>Always CV Across Features and Algorithms</h2></center>

> In practice, you almost always want to search over a pipeline, instead of a single estimator. 

> One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you would be breaking the fundamental assumption of independence between training and testing data.

> Indeed, since you pre-processed the data using the whole dataset, some information about the test sets are available to the train sets. This will lead to over-estimating the generalization power of the estimator (you can read more in [this Kaggle post](https://www.kaggle.com/alexisbcook/data-leakage)).

> Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.

Source: https://scikit-learn.org/stable/getting_started.html#automatic-parameter-searches
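
A minimal sketch of the pitfall described above. It swaps in feature selection as the pre-processing step, because the effect is easy to see, and uses synthetic noise data where no real signal exists; both choices are assumptions for illustration and do not come from the quoted source.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.randint(0, 2, size=100)    # labels unrelated to X

# Leaky: choose features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit inside each training fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=20)),
                 ('clf', LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky:  {leaky:.2f}")
print(f"honest: {honest:.2f}")
```

Since the labels are random, an honest estimate should sit near chance. The leaky version reports a much higher score because the selected features were chosen with knowledge of the validation labels.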

<center><h2>Search Across Algorithms</h2></center>

In [40]:
reset -fs

In [41]:
# Create Placeholder Estimator to use different algorithms
from sklearn.base          import BaseEstimator
from sklearn.pipeline      import Pipeline

class DummyEstimator(BaseEstimator):
    def fit(self, X, y=None): return self   # never actually used: the search swaps in a real estimator
    def score(self, X, y=None): pass
    
# Create a pipeline
pipe = Pipeline([('clf', DummyEstimator())]) # Placeholder Estimator

In [42]:
# Load and split the data
from sklearn.datasets        import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, 
                                                    iris.target, 
                                                    test_size=0.2)

In [43]:
import numpy as np
from sklearn.linear_model     import LogisticRegression
from sklearn.model_selection  import RandomizedSearchCV
from sklearn.tree             import DecisionTreeClassifier

# Create space of candidate learning algorithms and their hyperparameters
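# 'saga' supports both 'l1' and 'l2'; the default 'lbfgs' solver rejects 'l1'
search_space = [{'clf': [LogisticRegression(solver='saga', max_iter=5000)], # Actual Estimator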
                 'clf__penalty': ['l1', 'l2'],
                 'clf__C': np.logspace(0, 4, 10)},
                
                {'clf': [DecisionTreeClassifier()],  # Actual Estimator
                 'clf__criterion': ['gini', 'entropy']}]

# Good ol' random search
clf_algos_rand = RandomizedSearchCV(estimator=pipe, 
                                    param_distributions=search_space, 
                                    n_iter=25,
                                    cv=5, 
                                    n_jobs=-1,
                                    verbose=1)


# Fit random search
best_model = clf_algos_rand.fit(X_train, y_train)

# View best model
best_model.best_estimator_.get_params()['clf']

Fitting 5 folds for each of 22 candidates, totalling 110 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 110 out of 110 | elapsed:    2.3s finished


Source: https://chrisalbon.com/machine_learning/model_selection/model_selection_using_grid_search/

<center><h2>Many options for Hyperparameter Search</h2></center>
<br>
<center><img src="../images/lego boxes of model fitting.png" width="75%"/></center>

<center><h2>Protip</h2></center>

Hyperparameter search within Pipelines can get complex. Do not try to enumerate every possible option and then search across all of them.


Hints:

- Break the problem into subsections based on the project goals.
    - Fitting sub-models helps you build domain expertise.
- Feature engineering is algorithm-specific. Build algorithm-specific pipelines.
- Focus on select algorithms.
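
One way to read the hints above is to give each algorithm its own pipeline and its own small search space, then compare the winners. The sketch below assumes two candidates (scaled logistic regression, unscaled decision tree) and hypothetical hyperparameter ranges chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each algorithm gets the preprocessing that suits it
candidates = {
    'logreg': (Pipeline([('scl', StandardScaler()),
                         ('clf', LogisticRegression(max_iter=1000))]),
               {'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0]}),
    'tree':   (Pipeline([('clf', DecisionTreeClassifier(random_state=0))]),
               {'clf__max_depth': [1, 2, 3, 4],
                'clf__min_samples_leaf': [1, 3, 5, 10]}),
}

scores = {}
for name, (pipe, space) in candidates.items():
    search = RandomizedSearchCV(pipe, space, n_iter=5, cv=5, random_state=0)
    search.fit(X_train, y_train)
    scores[name] = search.best_score_
    print(f"{name}: {search.best_score_:.3f} with {search.best_params_}")
```

Keeping the searches separate keeps each space small and makes it clear which preprocessing belongs to which algorithm.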

<center><h2>Takeaways</h2></center>

- Scikit-learn's `Pipeline` is a simple way to encapsulate many steps in a machine learning workflow.
- Because a `Pipeline` behaves like a single estimator, it plugs directly into scikit-learn's tools for:
    - Cross validation 
    - Hyperparameter search (grid or randomized) across and within different algorithms

Further Study
------

- Custom transformers
- `FeatureUnion`, which concatenates the output of transformers into a composite feature space.
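
A minimal sketch of both topics together. `PetalArea` is a hypothetical custom transformer written for this example (it assumes the iris column order, with petal length and width in columns 2 and 3); `FeatureUnion` then places its output next to two PCA components.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.tree import DecisionTreeClassifier


class PetalArea(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: petal length * petal width as a single feature."""

    def __init__(self, length_col=2, width_col=3):
        self.length_col = length_col
        self.width_col = width_col

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        area = X[:, self.length_col] * X[:, self.width_col]
        return area.reshape(-1, 1)  # transformers must return a 2D array


X, y = load_iris(return_X_y=True)

features = FeatureUnion([('pca', PCA(n_components=2)),
                         ('area', PetalArea())])

combined = features.fit_transform(X)
print(combined.shape)  # (150, 3): two PCA components plus the area column

# A FeatureUnion is itself a transformer, so it can be a pipeline step
pipe = Pipeline([('features', features),
                 ('clf', DecisionTreeClassifier(random_state=0))])
pipe.fit(X, y)
```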

<center><h2>Bonus Material</h2></center>

In [45]:
# Example with text pipeline
from sklearn.ensemble                import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput             import MultiOutputClassifier
from sklearn.pipeline                import Pipeline

pipeline = Pipeline([
        ('vect',  CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf',   MultiOutputClassifier(estimator=RandomForestClassifier()))
])
                    
search_space = { 
    'vect__ngram_range': ((1, 1), (1, 2)),  
    'vect__max_df': (0.5, 0.75, 1.0),                                
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [50,100,150,200],
    'clf__estimator__max_depth': [20,50,100,200],
    'clf__estimator__random_state': [42]
}
