[MRG + 2] ENH enable setting pipeline components as parameters #1769

Merged: 29 commits, Aug 29, 2016
8bd5658  ENH enable setting pipeline components as parameters (jnothman, May 8, 2013)
71db8a4  DOC add what's new entry (jnothman, Aug 7, 2014)
cb4e533  FIX avoid dict comprehension for Py2.6 (jnothman, Aug 7, 2014)
0be857e  FIX remove debug print statements :/ (jnothman, Apr 27, 2016)
1469ba6  FIX dummy estimator must support transform to go in FeatureUnion (jnothman, Apr 27, 2016)
d946ee1  FIX Py3 metaclass; doctest name error (jnothman, Jul 10, 2016)
ae640ec  Clean up merge mess (jnothman, Jul 11, 2016)
4f95613  DOC cosmetic improvements (jnothman, Jul 11, 2016)
1806294  avoid deprecated; remove unused import; etc (jnothman, Jul 11, 2016)
ebca76f  Avoid UserWarning (jnothman, Jul 11, 2016)
762c0c0  FIX for missing None handling in score (jnothman, Jul 11, 2016)
08f918f  TST/FIX clean up tests; insert missing None handlers (jnothman, Jul 11, 2016)
1d23fe8  COSMIT/DOC/TST docstrings, variable names and comment fixes (jnothman, Jul 30, 2016)
66f83f3  DOC revert some docstrings overwritten during rebase (jnothman, Jul 30, 2016)
7e15255  DOC get_params -> get_params() (jnothman, Aug 2, 2016)
29ef413  DOC more doc X description fixes (jnothman, Aug 2, 2016)
074ad5a  DOC comment get_feature_names as property (jnothman, Aug 2, 2016)
1a68f2f  DOC Clarify setting to None in docstring (jnothman, Aug 24, 2016)
31e68c4  Revert changes to hasattr(FeatureUnion, 'get_feature_names') (jnothman, Aug 24, 2016)
6f4f46c  Comment on _iter (jnothman, Aug 24, 2016)
159651a  Fix narrative doc examples and variable naming (jnothman, Aug 24, 2016)
45f0557  TST cosmetic improvements (jnothman, Aug 24, 2016)
2a4bb61  TST/FIX improvements to pipeline test coverage (jnothman, Aug 25, 2016)
8d2c998  TST improve dummy estimator names (jnothman, Aug 25, 2016)
ffb06e6  remove debugging print statements (jnothman, Aug 25, 2016)
00338e9  TST/FIX more rigorous pipeline tests; add missing validation (jnothman, Aug 26, 2016)
33c7993  ENH support last estimator in Pipeline being None for transform (jnothman, Aug 26, 2016)
94ef5ba  TST allow tests to operate when assert_dict_equal unavailable (jnothman, Aug 27, 2016)
ed325f4  Python 2.6 str.format support (jnothman, Aug 27, 2016)
38 changes: 27 additions & 11 deletions doc/modules/pipeline.rst
@@ -37,17 +37,16 @@ is an estimator object::
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf # doctest: +NORMALIZE_WHITESPACE
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe # doctest: +NORMALIZE_WHITESPACE
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power=4,
n_components=None, random_state=None, svd_solver='auto', tol=0.0,
whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])


The utility function :func:`make_pipeline` is a shorthand
for constructing pipelines;
it takes a variable number of estimators and returns a pipeline,
@@ -64,23 +63,23 @@ filling in the names automatically::

The estimators of a pipeline are stored as a list in the ``steps`` attribute::

>>> clf.steps[0]
>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power=4, n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False))

and as a ``dict`` in ``named_steps``::

>>> clf.named_steps['reduce_dim']
>>> pipe.named_steps['reduce_dim']
PCA(copy=True, iterated_power=4, n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)

Parameters of the estimators in the pipeline can be accessed using the
``<estimator>__<parameter>`` syntax::

>>> clf.set_params(svm__C=10) # doctest: +NORMALIZE_WHITESPACE
>>> pipe.set_params(clf__C=10) # doctest: +NORMALIZE_WHITESPACE
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power=4,
n_components=None, random_state=None, svd_solver='auto', tol=0.0,
whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
@@ -90,9 +89,17 @@ This is particularly important for doing grid searches::

>>> from sklearn.model_selection import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
... svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
... clf__C=[0.1, 10, 100])
Review comment (Member): Should remain svm__C

Reply (Author): thanks.

>>> grid_search = GridSearchCV(pipe, param_grid=params)

Individual steps may also be replaced as parameters, and non-final steps may be
ignored by setting them to ``None``::

>>> from sklearn.linear_model import LogisticRegression
>>> params = dict(reduce_dim=[None, PCA(5), PCA(10)],
... clf=[SVC(), LogisticRegression()],
Review comment (Member): Did you mean svm=[SVC(), LogisticRegression()]? We should rename svm to classifier or maybe more explicitly classifier_with_C

Review comment (Member, @MechCoder, Aug 24, 2016): or rename clf that denotes the pipeline to pipe

Reply (Author): Renamed
... clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=params)
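As an editor's sketch (not part of the diff), the step-substitution grid above can be run end to end on a toy dataset. Names mirror the snippet (``reduce_dim``, ``clf``); the dataset choice and ``max_iter`` value are illustrative, and note that newer scikit-learn releases also accept ``'passthrough'`` where ``None`` is used here:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Each grid entry replaces a whole step: None disables the
# reduction step entirely rather than tuning a parameter of it.
params = dict(reduce_dim=[None, PCA(n_components=2)],
              clf=[SVC(), LogisticRegression(max_iter=1000)])
grid_search = GridSearchCV(pipe, param_grid=params)
grid_search.fit(X, y)
print(sorted(grid_search.best_params_))  # ['clf', 'reduce_dim']
```

The winning candidate is reported as whole estimators (or ``None``) under the step names, just like ordinary hyperparameters.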

.. topic:: Examples:

@@ -172,6 +179,15 @@ Like pipelines, feature unions have a shorthand constructor called
:func:`make_union` that does not require explicit naming of the components.


Like ``Pipeline``, individual steps may be replaced using ``set_params``,
and ignored by setting to ``None``::

>>> combined.set_params(kernel_pca=None) # doctest: +NORMALIZE_WHITESPACE
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
iterated_power=4, n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
transformer_weights=None)
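A runnable sketch of the same idea (editor's illustration, not part of the diff): disabling one branch of a union shrinks the transformed output. At the time of this PR the sentinel was ``None``; later releases use the string ``'drop'`` for FeatureUnion, which is what this sketch assumes. Component names and sizes are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)
combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                         ('kernel_pca', KernelPCA(n_components=2))])
# Both branches active: 2 + 2 = 4 output columns.
print(combined.fit_transform(X).shape)  # (150, 4)

# Disable one branch in place; only the remaining branch contributes.
combined.set_params(kernel_pca='drop')
print(combined.fit_transform(X).shape)  # (150, 2)
```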

.. topic:: Examples:

* :ref:`example_feature_stacker.py`
6 changes: 6 additions & 0 deletions doc/whats_new.rst
@@ -233,6 +233,12 @@ Enhancements
- Added new return type ``(data, target)`` : tuple option to :func:`load_iris` dataset. (`#7049 <https://github.com/scikit-learn/scikit-learn/pull/7049>`_)
By `Manvendra Singh`_ and `Nelson Liu`_.

- Added support for substituting or disabling :class:`pipeline.Pipeline`
and :class:`pipeline.FeatureUnion` components using the ``set_params``
interface that powers :mod:`sklearn.grid_search`.
See :ref:`example_plot_compare_reduction.py`. By `Joel Nothman`_ and
`Robert McGibbon`_.
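A minimal sketch of the enhancement described in this entry (editor's illustration, not part of the patch): a step is disabled via ``set_params`` without rebuilding the pipeline. ``None`` is the sentinel this PR introduced; ``'passthrough'`` is the later spelling. Dataset and step names are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])
pipe.fit(X, y)

# Disable the reduction step in place; the SVC now sees the raw features.
pipe.set_params(reduce_dim=None)
pipe.fit(X, y)
print(len(pipe.predict(X[:3])))  # 3
```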

Bug fixes
.........

75 changes: 75 additions & 0 deletions examples/plot_compare_reduction.py
@@ -0,0 +1,75 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
=================================================================
Selecting dimensionality reduction with Pipeline and GridSearchCV
=================================================================

This example constructs a pipeline that does dimensionality
reduction followed by prediction with a support vector
classifier. It demonstrates the use of GridSearchCV and
Pipeline to optimize over different classes of estimators in a
single CV run -- unsupervised PCA and NMF dimensionality
reductions are compared to univariate feature selection during
the grid search.
"""
# Authors: Robert McGibbon, Joel Nothman

from __future__ import print_function, division

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

print(__doc__)

pipe = Pipeline([
('reduce_dim', PCA()),
('classify', LinearSVC())
])

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
Review comment (Member, @amueller, Jul 28, 2016): awesome example!

Reply (Author): Thank @rmcgibbo!
{
'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']

grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

mean_scores = np.array(grid.results_['test_mean_score'])
# scores are in the order of param_grid iteration, which is alphabetical
mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
# select score for best C
mean_scores = mean_scores.max(axis=0)
bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *
(len(reducer_labels) + 1) + .5)

plt.figure()
COLORS = 'bgrcmyk'
for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])

plt.title("Comparing feature reduction techniques")
plt.xlabel('Reduced number of features')
plt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)
plt.ylabel('Digit classification accuracy')
plt.ylim((0, 1))
plt.legend(loc='upper left')
plt.show()
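The reshape near the end of the example relies on GridSearchCV enumerating candidates with parameters sorted alphabetically, so ``classify__C`` varies slowest across the 36 scores (4 C values x 3 reducers x 3 feature counts). A standalone sketch of that bookkeeping, with a stand-in score array in place of real grid-search output:

```python
import numpy as np

# 36 stand-in scores in grid-iteration order: alphabetical parameter
# sorting puts classify__C first, so it varies slowest.
scores = np.arange(36, dtype=float)
table = scores.reshape(4, -1, 3)   # (n_C, n_reducers, n_features)
best_over_C = table.max(axis=0)    # best C per (reducer, feature) cell
print(best_over_C.shape)           # (3, 3)
```

Each cell of ``best_over_C`` is then one bar in the plot: the best score any C value achieved for that reducer/feature-count pair.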