
[MRG+3] ENH Caching Pipeline by memoizing transformer #7990

Merged: 5 commits merged into scikit-learn:master on Feb 13, 2017

Conversation (9 participants)
glemaitre (Contributor) commented Dec 6, 2016

Reference Issue

Address the discussions in #3951
Other related issues and PR: #2086 #5082 #5080

What does this implement/fix? Explain your changes.

It implements a version of Pipeline which allows for caching transformers.

glemaitre (Contributor) commented Dec 6, 2016

@jnothman @GaelVaroquaux @amueller

From #3951, this is what I could come up with. I will promptly add the tests from Pipeline and adapt them. From what I could check, this is working for those tests.

However, I run into trouble when using the CachedPipeline in a GridSearchCV:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import samples_generator
from sklearn.decomposition import PCA
from sklearn.pipeline import CachedPipeline
from sklearn.model_selection import GridSearchCV

# generate some data to play with                                                               
X, y = samples_generator.make_classification(
    n_samples=100,
    n_informative=5, n_redundant=0, random_state=42)

pca = PCA()
clf = RandomForestClassifier()
pipeline = CachedPipeline([('pca', pca), ('rf', clf)],
                          memory='./')
parameters = {'pca__n_components': (.25, .5, .75),
              'rf__n_estimators': (10, 20, 30)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X, y)

After joblib loads the cached PCA, the transformer is seen as not fitted:

NotFittedError: This PCA instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.

I'm going to check why, but if you already have an obvious answer, I would be happy to hear it.

GaelVaroquaux (Member) commented Dec 6, 2016

Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?

glemaitre (Contributor) commented Dec 6, 2016

> Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?

Nope, this is working fine.

pca.fit(X, y)
joblib.dump(pca, 'pca.p')
pickled_pca = joblib.load('pca.p')
pickled_pca.transform(X)
Review thread on sklearn/pipeline.py (outdated):

        memory = self.memory
        if isinstance(memory, six.string_types) or memory is None:
            memory = Memory(cachedir=memory, verbose=10)
        self._fit_transform_one = memory.cache(_fit_transform_one)

GaelVaroquaux (Member) commented Dec 6, 2016

Don't do that: don't decorate methods. Use the pattern described in https://github.com/joblib/joblib/blob/master/doc/memory.rst, under the bullet point "caching methods".
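For readers following along, a minimal sketch of the pattern described there (class and helper names illustrative, not the PR's final code): memoize a plain module-level function and call it from the method, so the cache key depends only on the function's arguments, never on self:

from sklearn.externals.joblib import Memory

def _fit_transform_one(transformer, X, y, **fit_params):
    # a plain function: its arguments fully determine its result,
    # which makes it safe to memoize
    Xt = transformer.fit_transform(X, y, **fit_params)
    return Xt, transformer

class PipelineSketch(object):
    def __init__(self, steps, memory=None):
        self.steps = steps
        self.memory = memory

    def _fit(self, X, y):
        memory = self.memory
        if isinstance(memory, str) or memory is None:
            memory = Memory(cachedir=memory, verbose=0)
        # cache the function, not a bound method
        fit_transform_one_cached = memory.cache(_fit_transform_one)
        Xt = X
        for name, transformer in self.steps[:-1]:
            Xt, transformer = fit_transform_one_cached(transformer, Xt, y)
        return Xt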

glemaitre (Contributor) commented Dec 6, 2016

For the first configuration of the grid search, the only PCA that is both dumped and fitted is the first one. For the two others, the PCA is dumped but not fitted.

Review thread on sklearn/pipeline.py (outdated):

        Xt, transform = memory.cache(_fit_transform_one)(
            transform, name,
            None, Xt, y,
            **fit_params_steps[name])

agramfort (Member) commented Dec 6, 2016

I would refactor Pipeline with a private class to avoid the code dupe.

GaelVaroquaux (Member) commented Dec 6, 2016 [comment minimized]

glemaitre (Contributor) commented Dec 6, 2016

> Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?

Stupid mistake fixed: I forgot to assign the loaded pipeline.
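To make the slip concrete, a hypothetical reconstruction (the helper name follows the sketch above and is illustrative): the memoized call returns the fitted objects, so forgetting to rebind its return value leaves the local instance unfitted.

# correct: keep both the transformed data and the fitted transformer
Xt, transformer = fit_transform_one_cached(transformer, Xt, y)

# bug: the fitted result is discarded and `transformer` stays unfitted
fit_transform_one_cached(transformer, Xt, y)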

glemaitre (Contributor) commented Dec 7, 2016

@agramfort @GaelVaroquaux Speaking of the private class, does the last commit implement what you had in mind?

@GaelVaroquaux:

> I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits).

I must be missing something regarding the cloning of the objects. Could you elaborate on what is required in the implementation?

glemaitre (Contributor) commented Dec 7, 2016

> I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits).
>
> I must be missing something regarding the cloning of the objects. Could you elaborate on what is required in the implementation?

I think I got it. The cloning clears the fitted state so that the same estimator is not cached twice, once unfitted and once fitted.
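A minimal sketch of that idea, reusing the illustrative cached helper from above: cloning before the memoized call means the cached function always receives an unfitted estimator carrying only its hyper-parameters, so identical parameters and data map to the same cache entry.

from sklearn.base import clone

# clone() returns a fresh, unfitted estimator with the same parameters;
# passing the clone keeps the cache key independent of any fitted state
cloned = clone(transformer)
Xt, fitted_transformer = fit_transform_one_cached(cloned, Xt, y)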

glemaitre (Contributor) commented Dec 7, 2016

I added tests which are similar to the ones proposed in #3951. However, these tests do not check that the cache has actually been loaded; they only check the resulting array, which is kind of permissive.

Is there anything in joblib which indicates whether the cache has been read?

glemaitre (Contributor) commented Dec 7, 2016

For now, I am checking that the timestamp of a DummyTransformer, assigned at the first fit within the pipeline, is recovered across the subsequent fits.
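A sketch of what such a check can look like, using the merged API where Pipeline accepts a memory parameter (the transformer, data, and assertions here are illustrative, not the PR's exact test):

import shutil
import tempfile
import time

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class DummyTransformer(BaseEstimator, TransformerMixin):
    """Pass-through transformer that records when it was actually fitted."""
    def fit(self, X, y=None):
        self.timestamp_ = time.time()
        return self

    def transform(self, X):
        return X

rng = np.random.RandomState(42)
X, y = rng.rand(30, 4), rng.randint(0, 2, 30)

cachedir = tempfile.mkdtemp()
try:
    pipe = Pipeline([('dummy', DummyTransformer()), ('svc', SVC())],
                    memory=cachedir)
    pipe.fit(X, y)
    first_timestamp = pipe.named_steps['dummy'].timestamp_
    pipe.fit(X, y)  # same data and parameters: should load from the cache
    # the recovered timestamp proves the transformer was not refitted
    assert pipe.named_steps['dummy'].timestamp_ == first_timestamp
finally:
    shutil.rmtree(cachedir)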

glemaitre (Contributor) commented Dec 15, 2016

@agramfort Do you see any additional changes to make?

lesteve (Member) commented Dec 20, 2016

You should probably change your title to [MRG].

Can you trigger a rebuild (e.g. by doing git commit --amend and force pushing) to trigger a documentation build? It looks like CircleCI stopped running on PRs at some point last week.

Review thread on sklearn/pipeline.py (outdated):

with its name to another estimator, or a transformer removed by setting
to None.
Read more in the :ref:`User Guide <pipeline>`.

lesteve (Member) commented Dec 20, 2016

Should you not link to the cached_pipeline section?

agramfort (Member) commented Dec 21, 2016

+1 to add it to See Also

Review thread on doc/modules/pipeline.rst (outdated):

Usage
-----
Similarly to :class:`Piepeline`, the pipeline is built on the same manner. However,

lesteve (Member) commented Dec 20, 2016

Actually you have a typo here in "Piepeline". No need to do git commit --amend; changing this should trigger a CircleCI doc build.

Review thread on examples/plot_compare_reduction_cached.py (outdated):

@@ -0,0 +1,77 @@
#!/usr/bin/python

GaelVaroquaux (Member) commented Dec 20, 2016

Looks like the wrong "#!" line to me: it will always use the system Python, which is often not the right thing. The right way is "#!/usr/bin/env python".

lesteve (Member) commented Dec 20, 2016

The shebang line is only used if the example is executable, right? No example has the executable flag AFAICT. I would be in favour of removing the shebang line.

GaelVaroquaux (Member) commented Dec 20, 2016

Any way that the new example could be integrated with an existing one? The danger is that we have too many examples.

The way that I would suggest doing it is by extending an existing one (if there is one that is relevant), and using the "notebook-style examples" of sphinx-gallery to add extra cells at the bottom (with an extra title) without making the initial example more complicated. That aspect is important: the initial example should be left simple, without additional lines of code. And sphinx-gallery notebook-style formatting can be used to add discussion and code at the end.

glemaitre (Contributor) commented Dec 20, 2016

> Any way that the new example could be integrated with an existing one? The danger is that we have too many examples.

This example is a modified version of examples/plot_compare_reduction.py. I'll try to merge both as suggested.

lesteve (Member) commented Dec 20, 2016

@glemaitre the Travis failure in the doctest is because the default of decision_function_shape is None and not 'ovr' in master.

I can reproduce the failure locally. The reason we could not reproduce it earlier was that we were trying on your branch rather than on the result of merging your branch into master, d'oh ...

glemaitre (Contributor) commented Dec 20, 2016

@lesteve d'oh indeed ...

glemaitre (Contributor) commented Dec 20, 2016

@GaelVaroquaux Does the example look like what you had in mind?

Review thread on examples/plot_compare_reduction.py (outdated):

                          memory=memory)
# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, cv=3, n_jobs=2, param_grid=param_grid)

agramfort (Member) commented Dec 21, 2016

don't use n_jobs != 1 in examples

GaelVaroquaux (Member) commented Dec 21, 2016

This must still be changed.


Review thread on sklearn/tests/test_pipeline.py (outdated):

# Don't make the cachedir, Memory should be able to do that on the fly
print(80 * '_')
print('test_memory setup (%s)' % env['dir'])
print(80 * '_')

agramfort (Member) commented Dec 21, 2016

why all these print statements?

glemaitre (Contributor) commented Dec 21, 2016

In fact, that is the way it was done in joblib. However, I am having second thoughts about it. A simple try/finally statement, as here, would probably be enough for the purpose of the test.

Review thread on doc/modules/pipeline.rst (outdated):

@@ -124,6 +125,40 @@ i.e. if the last estimator is a classifier, the :class:`Pipeline` can be used
as a classifier. If the last estimator is a transformer, again, so is the
pipeline.

.. _cached_pipeline:

CachedPipeline: memoizing transformers

GaelVaroquaux (Member) commented Dec 21, 2016

I think that I would put a title that is more focused towards the problem that it solves rather than the technique used. Something like "CachedPipeline: avoiding to repeat computation".

Review thread on doc/modules/pipeline.rst (outdated):

.. currentmodule:: sklearn.pipeline

:class:`CachedPipeline` can be used instead of :class:`Pipeline` to avoid to fit

GaelVaroquaux (Member) commented Dec 21, 2016

I would say "to avoid to compute the fit".

Review thread on examples/plot_compare_reduction.py (outdated):

Selecting dimensionality reduction with Pipeline and GridSearchCV
=================================================================
=======================================================================
Selecting dimensionality reduction with Pipeline, CachedPipeline, and \

GaelVaroquaux (Member) commented Dec 21, 2016

I don't think that I would change the title here. It makes it too long, IMHO.

Review thread on examples/plot_compare_reduction.py (outdated):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import Pipeline, CachedPipeline

GaelVaroquaux (Member) commented Dec 21, 2016

I would import this only later: trying to separate a bit the two parts of the example.

We have to keep in mind that every added piece of information to an example makes it harder to understand.

Review thread on examples/plot_compare_reduction.py (outdated):

from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.externals.joblib import Memory

GaelVaroquaux (Member) commented Dec 21, 2016

Same comment.

Review thread on examples/plot_compare_reduction.py (outdated):

@@ -73,3 +90,29 @@
plt.ylim((0, 1))
plt.legend(loc='upper left')
plt.show()

GaelVaroquaux (Member) commented Dec 21, 2016

I think that "show" must be moved to the end.

glemaitre (Contributor) commented Dec 22, 2016

With the current scheme, it will generate the figure right after the code snippet and before the second section. It looks fine to me.

GaelVaroquaux (Member) commented Jan 3, 2017

It will block the execution of the script before the end. The convention is really to have it only at the end.

Review thread on examples/plot_compare_reduction.py (outdated):

# Authors: Robert McGibbon, Joel Nothman
from __future__ import print_function, division
from tempfile import mkdtemp

GaelVaroquaux (Member) commented Dec 21, 2016

Same comment about avoiding to import this early.

Review thread on examples/plot_compare_reduction.py (outdated):

# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, cv=3, n_jobs=2, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

GaelVaroquaux (Member) commented Dec 21, 2016

Maybe a comment here (in a notebook-style cell rendered by sphinx-gallery) on how this is faster and cooler. Just to conclude the example.

Review thread on sklearn/pipeline.py (outdated):

    array([False, False, True, True, False, False, True, True, False,
           True, False, True, True, False, True, False, True, True,
           False, False], dtype=bool)

class _Pipeline(six.with_metaclass(ABCMeta, _BasePipeline)):

GaelVaroquaux (Member) commented Dec 21, 2016

What is the use of this class? Why not merge it with Pipeline?

Review thread on sklearn/pipeline.py (outdated):

    """
    def __init__(self, steps, memory=Memory(cachedir=None, verbose=0)):
        self.memory = memory

GaelVaroquaux (Member) commented Dec 21, 2016

I think that it would be cool if "memory" could be either a Memory instance or a string. In the latter case it would give the path to the cachedir of a Memory that would be instantiated in the fit.

We do this in nilearn, and it makes the life of users much easier.

glemaitre (Contributor) commented Dec 21, 2016

This is the case: memory accepts both a string and a Memory instance (see _fit_single_transform). However, the default is a Memory object.
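A sketch of the normalization being discussed (the helper name is hypothetical): accept either form and build the Memory lazily, e.g. at fit time:

from sklearn.externals.joblib import Memory

def _check_memory(memory):
    """Return a joblib Memory, treating a string (or None) as the cachedir."""
    if memory is None or isinstance(memory, str):
        memory = Memory(cachedir=memory, verbose=0)
    return memory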

GaelVaroquaux (Member) commented Dec 21, 2016 [comment minimized]

GaelVaroquaux (Member) commented Dec 21, 2016

I've finished a pass on this PR. Overall, it looks great, and I am favorable for merge once the little pending issues have been addressed.

The doc/whats_new.rst file must be updated.

GaelVaroquaux (Member) commented Dec 21, 2016

Also, CachedPipeline must be added to doc/modules/classes.rst.

@glemaitre glemaitre changed the title from [WIP] Cache pipeline to [MRG] Cache pipeline Dec 22, 2016

@glemaitre glemaitre changed the title from [MRG] Cache pipeline to [MRG] ENH CachedPipeline for memoizing transformer Dec 22, 2016

@GaelVaroquaux GaelVaroquaux changed the title from [MRG] ENH CachedPipeline for memoizing transformer to [MRG+1] ENH CachedPipeline for memoizing transformer Jan 3, 2017

GaelVaroquaux (Member) commented Feb 3, 2017 [comment minimized]

GaelVaroquaux (Member) commented Feb 3, 2017 [comment minimized]

lesteve (Member) commented Feb 3, 2017

> It sounds like there's something to fix on joblib's end, probably saving to a temp file before moving to avoid corruption.
>
> +1 to both aspects of the sentence.

Yeah, that's pretty much what I said in #7990 (comment).

lesteve (Member) commented Feb 3, 2017

> However, it also seems that we're going to lose the advantage of parallelism unless GridSearchCV is explicitly pipeline-aware, or joblib implements a locking and waiting mechanism for parallel caching.
>
> I would expect to lose only a little bit, in a situation where the grid is a bit complicated, unless the number of cores is large compared to the number of grid points.

Also, one point to bear in mind is that the cache is useful across Python sessions: say you ran a GridSearch with a CachedPipeline a week ago and you want to rerun it today. The initial run may not be as efficient as it could be, but subsequent runs will be.
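A small usage sketch of that point (data, parameters, and cache path illustrative): giving the pipeline a persistent cache directory means a rerun in a new Python session reloads the fitted transformers from disk.

from sklearn.datasets import samples_generator
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = samples_generator.make_classification(n_samples=100, random_state=42)

# a persistent cache directory survives Python sessions
pipe = Pipeline([('pca', PCA(n_components=5)),
                 ('rf', RandomForestClassifier())],
                memory='/tmp/sklearn_cache')
pipe.fit(X, y)  # first run fits PCA and writes it to the cache
pipe.fit(X, y)  # later runs (even a week later) load it from disk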

GaelVaroquaux (Member) commented Feb 3, 2017 [comment minimized]

lesteve (Member) commented Feb 3, 2017

I opened joblib/joblib#490 to track the joblib issue.

jnothman (Member) commented:

A thought on this parallelism issue before I continue my review: assume we're grid searching over some estimator with a cached component whose calculation is invariant across some searched parameter settings. This is the common case this PR treats, but may be more broadly applicable. We are more likely to get cache hits if the jobs dispatched and hence fit in parallel are those run with different data not with different parameters.

At master, we iterate over candidates in the inner loop. For the sake of this PR's functionality, I think we should consider reversing those loops. However, we reversed them in #7941 so as to avoid materialising the CV splits as a list, although this was based on an incorrect bug report (not that I can find it now). Materialising CV splits takes n_samples*n_splits memory, which one would not expect to break the bank. (However, perhaps our conflicting needs in this matter are another sign that CV splitters were ill-designed with respect to determinism.)

Thoughts?

Review thread on doc/modules/pipeline.rst (outdated):

.. currentmodule:: sklearn.pipeline

Fitting transformers may sometimes be computational expensive. With its

jnothman (Member) commented Feb 8, 2017

*computationally

Drop "sometimes".

jnothman (Member) commented:

Otherwise LGTM

Review thread on doc/modules/pipeline.rst (outdated):

after calling ``fit``.
This feature is used to avoid computing the fit transformers within a pipeline
if the parameters and input data are identical. A typical example is the case of
a grid search in which the transformers can be fitted only once and reuse for

jnothman (Member) commented Feb 8, 2017

*reused

Review thread on doc/modules/pipeline.rst (outdated):

.. warning:: **Side effect of caching transfomers**
Using a :class:`Pipeline` without cache enabled, this is possible to

jnothman (Member) commented Feb 8, 2017

this -> it

Review thread on doc/modules/pipeline.rst (outdated):

.. warning:: **Side effect of caching transfomers**
Using a :class:`Pipeline` without cache enabled, this is possible to
introspect the original instance such as::

jnothman (Member) commented Feb 8, 2017

introspect -> inspect

Review thread on doc/modules/pipeline.rst (outdated):

Enabling caching triggers a clone of the transformers before fitting.
Therefore, the transformer instance given to the pipeline cannot be
introspected directly. Use the attribute ``named_steps`` to introspect

jnothman (Member) commented Feb 8, 2017

introspected -> inspected
introspect -> inspect

Review thread on doc/modules/pipeline.rst (outdated):

introspected directly. Use the attribute ``named_steps`` to introspect
estimators within the pipeline.
In this case, the :class:`PCA` instance ``pca2`` cannot be

jnothman (Member) commented Feb 8, 2017

This paragraph repeats things from the previous. Please clean a little.

Review thread on examples/plot_compare_reduction.py:

@@ -72,4 +86,45 @@
plt.ylabel('Digit classification accuracy')
plt.ylim((0, 1))
plt.legend(loc='upper left')
###############################################################################
# Caching transformers within a ``Pipeline``

jnothman (Member) commented Feb 8, 2017

It's fine here. So we need to think of our examples more like notebooks. Sounds fine.

Review thread on sklearn/pipeline.py (outdated):

@@ -107,6 +108,18 @@ class Pipeline(_BasePipeline):
        chained, in the order in which they are chained, with the last object
        an estimator.

    memory : Instance of joblib.Memory or string, optional (default=None)
        Used to cache the fitted transformers of the transformer of the
        pipeline. By default, no cache is performed.

jnothman (Member) commented Feb 8, 2017

*cache -> *caching

Review thread on sklearn/pipeline.py (outdated):

If a string is given, it is the path to the caching directory.
Enabling caching triggers a clone of the transformers before fitting.
Therefore, the transformer instance given to the pipeline cannot be
introspected directly. Use the attribute ``named_steps`` to introspect

jnothman (Member) commented Feb 8, 2017

introspected -> inspected

jnothman (Member) commented Feb 8, 2017

The reference to named_steps may make users think this is different from steps. You might as well suggest that either can be used.
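To illustrate the caveat these docstring lines describe, a hedged sketch (data, parameters, and cache path illustrative): with caching enabled the pipeline clones its transformers before fitting, so the fitted state lives on the clone inside the pipeline, not on the instance you created.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = rng.rand(40, 10), rng.randint(0, 2, 40)

pca = PCA(n_components=2)
pipe = Pipeline([('pca', pca), ('svc', LinearSVC())], memory='/tmp/cache')
pipe.fit(X, y)

# the original instance was cloned and never fitted:
hasattr(pca, 'components_')                  # False
# inspect the fitted clone through the pipeline instead:
pipe.named_steps['pca'].components_.shape    # (2, 10)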

Review thread on sklearn/tests/test_pipeline.py (outdated):

cached_pipe.fit, X, y)

def test_cached_pipeline():

jnothman (Member) commented Feb 8, 2017

rename to test_pipeline_memory

codecov bot commented Feb 9, 2017

Codecov Report

❗️ No coverage uploaded for pull request base (master@dfcf632). Click here to learn what that means.

@@            Coverage Diff            @@
##             master    #7990   +/-   ##
=========================================
  Coverage          ?   94.74%           
=========================================
  Files             ?      342           
  Lines             ?    60739           
  Branches          ?        0           
=========================================
  Hits              ?    57546           
  Misses            ?     3193           
  Partials          ?        0
Impacted Files Coverage Δ
sklearn/tests/test_pipeline.py 99.61% <100%> (ø)
sklearn/pipeline.py 99.26% <95.23%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dfcf632...1afb47f. Read the comment docs.


@jnothman jnothman changed the title from [MRG+2] ENH Caching Pipeline by memoizing transformer to [MRG+3] ENH Caching Pipeline by memoizing transformer Feb 9, 2017

jnothman (Member) commented Feb 9, 2017

Are we good for merge? Should we wait for the joblib fixes?

glemaitre (Contributor) commented Feb 9, 2017

@ogrisel wanted to wait for the joblib fixes.

jnothman (Member) commented Feb 9, 2017

@ogrisel Actually, as the default setting will not trigger the joblib race condition, I think we can merge this as is.

GaelVaroquaux (Member) commented Feb 13, 2017

Hurray. Merging. Good job, @glemaitre!

@GaelVaroquaux GaelVaroquaux merged commit b3a639f into scikit-learn:master Feb 13, 2017

4 checks passed:

- ci/circleci: Your tests passed on CircleCI!
- codecov/patch: 98.66% of diff hit (target 94.74%)
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed
ogrisel (Member) commented Feb 13, 2017

@jnothman w.r.t. your comment on GS loop ordering in #7990 (review), this would not impact the optimal design of this PR, right?

raghavrv (Member) commented Feb 13, 2017

Hurray!! 🎉

jnothman (Member) commented Feb 13, 2017

> @jnothman w.r.t. your comment on GS loop ordering in #7990 (review), this would not impact the optimal design of this PR, right?

I don't know what you're asking, @ogrisel. The ordering issue doesn't stop this change being useful, but it makes this (and other memoisation) less useful in parallel because the cache will be missed unnecessarily. The maximal number of cache hits is (n_candidates - 1) * n_splits, but under the current ordering it is more like (n_candidates - n_jobs) * n_splits.
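For concreteness, a quick check of those two counts for a hypothetical search whose cached step is invariant across candidates, using the 9-candidate, 3-split shape of the grid shown earlier in this thread:

n_candidates, n_splits, n_jobs = 9, 3, 2

best_case = (n_candidates - 1) * n_splits     # 24 hits: one miss per split
current = (n_candidates - n_jobs) * n_splits  # 21 hits under the current ordering
print(best_case, current)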

ogrisel (Member) commented Feb 13, 2017

Agreed. Once the joblib race condition is fixed on Windows, we can reinvestigate that issue in GS.

amueller (Member) commented Feb 14, 2017

OMG AMAZING!

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments

@Przemo10 Przemo10 referenced this pull request Mar 17, 2017

Closed

update fork (#1) #8606

@amueller amueller removed this from PR phase in Andy's pets May 12, 2017

lsorber commented Jun 6, 2017

Thanks for the nice work on this. May I suggest to (optionally) also cache the pipeline's last step? The last step could itself be a transformer (e.g., in a pipeline of pipelines).

In fact, I'm not even sure what the downside of caching all steps by default is. If you want the current behaviour, you could simply create a cached pipeline of all the steps you want cached, and insert that into a non-caching pipeline for the steps you don't want cached. Conversely, it is much more tedious to create a fully cached pipeline with the current implementation.
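A sketch of the workaround described above (estimators and cache path illustrative): nest the pipeline whose steps should be memoized inside an outer, non-caching one. Note that the inner pipeline's own last step is still not memoized, which is exactly the limitation being raised.

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# inner pipeline: a transformer overall, so it can serve as a step;
# its memory caches every step except its own last one
cached_part = Pipeline([('pca', PCA()), ('kbest', SelectKBest())],
                       memory='/tmp/cache')

# outer pipeline: no caching; the cached part is just one of its steps
full_pipe = Pipeline([('reduce', cached_part), ('svc', LinearSVC())])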

amueller (Member) commented Jun 6, 2017

@lsorber can you maybe open a new issue for that?

lsorber commented Jun 6, 2017

All right, will do.

jnothman (Member) commented Jun 6, 2017 [comment minimized]

lsorber commented Jun 6, 2017

@jnothman I was in fact suggesting to cache the full pipeline, including the last step. Some motivation in #9007.

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990)
* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments