WIP allow Pipeline to memoize partial results #2086

Closed

Owner

This PR adds a memory parameter to Pipeline which allows it to memoize the results of partial pipeline evaluations (fits, transforms). See [a request for this feature](http://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg07402.html).
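
A rough sketch of the intended usage (assuming the proposed memory parameter accepts a joblib Memory instance; none of this is final):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.externals.joblib import Memory

# Hypothetical usage of the proposed parameter: partial fits/transforms of
# the pipeline would be looked up in the cache instead of being recomputed,
# e.g. when only the SVC's parameters change between grid search candidates.
memory = Memory('/tmp/pipeline_cache', verbose=0)
pipe = Pipeline([('reduce', PCA(n_components=10)), ('clf', SVC())],
                memory=memory)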

Currently:

  • it is a prototype, untested and not exactly a shining example of clean code
  • it requires storing the Pipeline's training data on the instance (perhaps a hash would suffice)
  • the implementation perhaps caches data more frequently than necessary
  • fit and transform are called separately even where fit_transform is implemented
  • it memoizes the last step (whether it be fit, transform, score or predict, etc.), perhaps unnecessarily
  • it does not support passing keyword arguments to steps' fit methods as Pipeline currently does (this is the failing test)
  • it does not take advantage of the fact that for some estimators only a subset of parameters produce distinct models (e.g. for SelectKBest, only score_func should be a key for fit, while (score_func, k) affect the result of transform)

@GaelVaroquaux, is this what you had in mind?
@amueller, this isn't as much a generalised CV solution (#1626) as #2000 was; to what extent does it satisfy your use-cases?

Owner

I should clarify the mechanism a bit: for the ith step, the following are in the cache key:

  • the step names, classes and parameters up to step i
  • the most recent arguments to fit or fit_transform
  • the current arguments to whatever method is being called

For fit, the cache stores all models from the beginning of the Pipeline to step i. For transform etc., it stores the output at step i.

Thus the cache is checked recursively from the end to the beginning of the pipeline (and filled from the beginning to the end).
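
In code terms, a minimal illustration of what the key for step i contains (hypothetical names, not the PR's actual identifiers):

def step_i_cache_key(steps, i, last_fit_args, current_args):
    # the step names, classes and constructor parameters up to step i
    step_state = [(name, type(est), est.get_params(deep=False))
                  for name, est in steps[:i + 1]]
    # plus the most recent fit/fit_transform arguments and the arguments
    # of the method currently being called
    return (step_state, last_fit_args, current_args)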

Owner

I guess one thing I would like to know is: which transform methods in scikit-learn are substantially more expensive than loading a cached result?

Owner

I guess one thing I would like to know is: which transform methods in
scikit-learn are substantially more expensive than loading a cached result?

As this is estimator dependent, this could/should be sorted by adding a
'memory' keyword to the transformers, as in the case of the feature
agglomeration. The nice aspect is that it can then be chosen to be used
in the best place (think of the SVD in PCA, for instance).
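
To make that concrete, a sketch with feature agglomeration, which takes such a keyword (the exact signature depends on the scikit-learn version):

from sklearn.cluster import FeatureAgglomeration

# The expensive merge-tree computation is cached by the estimator itself,
# in the one place where its author knows caching pays off;
# `memory` accepts a cache directory path or a joblib Memory object.
agglo = FeatureAgglomeration(n_clusters=32, memory='/tmp/agglo_cache')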


@GaelVaroquaux commented on the diff Jun 23, 2013
sklearn/pipeline.py
+
+        @cache
+        def _fit(fit_args, step_states):
+            """Fit the steps covered by steps_state"""
+            X = fit_args[0]
+            other_args = fit_args[1:]
+            if len(step_states) > 1:
+                self.steps[:len(step_states) - 1] = _fit(fit_args,
+                                                         step_states[:-1])
+                X = _transform(X, other_args, fit_args, step_states[:-1])
+            step_name, step_estimator = self.steps[len(step_states) - 1]
+            step_estimator.fit(X, *other_args)
+            return self.steps[:len(step_states)]
+
+        @cache
+        def _transform(X, other_args, fit_args, step_states,
GaelVaroquaux Jun 23, 2013 Owner

Please don't define functions in closures. It makes code hard to debug.

jnothman Jun 23, 2013 Owner

I couldn't work out another way to sensibly use Memory.cache to perform this operation. In particular, I do not want the pipeline object itself to be part of the cache key. I considered an approach that uses Memory.cache's ignore argument, but I can't remember why I decided against it.

If you have a neat alternative, let me know. But this code is currently intended only as a prototype of the functionality.

@GaelVaroquaux commented on the diff Jun 23, 2013
sklearn/pipeline.py
@@ -107,81 +117,100 @@ def get_params(self, deep=True):
                     out['%s__%s' % (name, key)] = value
             return out
-    # Estimator interface
+    def _cached_run(self, method, X, *other_args):
GaelVaroquaux Jun 23, 2013 Owner

This method is hard to read because it tries to do things in a very generic way. I think that I would much prefer adding a small number of calls to 'cache' in the original code.

One problem that you face is that an estimator object can have estimated parameters. These may render the cache invalid, while they really shouldn't affect the fit. The way to do this would be something like:

def fit_model(estimator, X, y):
    estimator.fit(X, y)
    return estimator

# Later in the code
mem.cache(fit_model)(clone(estimator), X, y)
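
For illustration, those "small number of cache calls" in the pipeline's own fitting loop might look roughly like this (a sketch only; fit_transformers_cached is a made-up helper, not Pipeline code):

from sklearn.base import clone

def fit_model(estimator, X, y):
    estimator.fit(X, y)
    return estimator

def fit_transformers_cached(steps, X, y, memory):
    # Fit each transformer through `memory`. Because clone() strips the
    # estimated attributes, only the constructor parameters and the data
    # end up in joblib's cache key.
    fitted_steps = []
    Xt = X
    for name, transformer in steps:
        fitted = memory.cache(fit_model)(clone(transformer), Xt, y)
        fitted_steps.append((name, fitted))
        Xt = fitted.transform(Xt)
    return fitted_steps, Xt
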
GaelVaroquaux Jun 23, 2013 Owner

I made a few changes editing the previous comment online, sorry for the spam.

jnothman Jun 23, 2013 Owner

Estimated parameters will not render the cache invalid in the current implementation, as I'm basically using (type(estimator), estimator.get_params()) as the estimator component of the key.

jnothman Jun 23, 2013 Owner

And again, the method is hard to read because it was an organic development of the idea into a prototype. It's an admitted mess. I wanted to know whether it was notionally close-to-sensible, i.e. when you actually run a Pipeline using it, does it make the cache hits you expect?

jnothman Jun 23, 2013 Owner

See #2086 (comment), and note that the key must include precisely the parameters affecting steps prior to the current one being evaluated, and that the cached _fit currently returns all the estimators up to the given step, rather than checking the cache for them one at a time (probably an unnecessary optimisation).

jnothman Jun 23, 2013 Owner

(But of course, although it is a prototype, thank you for taking the time to check it out!)

GaelVaroquaux Jun 23, 2013 Owner

I wanted to know whether it was notionally close-to-sensible, i.e. when
you actually run a Pipeline using it, does it make the cache hits you
expect?

I am not sure, as I am not sure that I understood the code :). I think it
does something sensible.

GaelVaroquaux Jun 23, 2013 Owner

Estimated parameters will not render the cache invalid in the current
implementation, as I'm basically using (type(estimator),
estimator.get_params()) as the estimator component of the key.

I know, but the code is very hard to read.

GaelVaroquaux Jun 23, 2013 Owner

See #2086, and note that the key must include precisely the parameters
affecting steps prior to the current one being evaluated,

No: if X and y are used to compute the hash, you don't need to know where
they come from. This is actually a benefit of not trying to be clever:
changing 'n_jobs', or 'verbose' on previous transforms will trigger a
recompute down the line only if the transform is actually changed.

jnothman Jun 23, 2013 Owner

I'm not currently using the transformed X and y in the key.

I can see some benefits in using such a less clever version, though I had hoped to allow estimators to provide a _non_fit_params to indicate those that shouldn't be used in the key.

I guess if we assume Pipelines are not very long, doing 2 or 3 cache lookups, one for each component, is not much worse than doing 1 to get them all.

jnothman Jun 24, 2013 Owner

You're right: changing parameters in an earlier estimator that affect neither the output of fit nor transform would require more than just _non_fit_params.

GaelVaroquaux Jun 24, 2013 Owner

I can see some benefits in using such a less clever version, though had hoped
to allow estimators to provide a _non_fit_params to indicate those that
shouldn't be used in the key.

I would add an ignore=['verbose'] to the cache call.

jnothman Jun 24, 2013 Owner

Fair enough, and it's possible with my closure approach or with joblib/joblib#69. I don't see how I would do it with your fit_model.

Owner

As this is estimator dependent, this could/should be sorted by adding a 'memory' keyword to the transformers, as in the case of the feature agglomeration.

I thought this might be what you intended. As far as I'm concerned, configuring (and providing) a memory parameter for each transformer is an annoyance, as would be a caching MetaEstimator.

I think there is a great usability advantage in providing an easy solution for avoiding unnecessary Pipeline subsequence fits and perhaps transforms (the value of both is estimator-dependent). The question I was getting at is whether there's much value saved by caching transforms as well as fits.

And if fine-grained control is actually necessary, one could implement cache_fits and cache_transforms parameters that are boolean or an array of booleans corresponding to the Pipeline steps. But I think this solution is overkill for a convenience implementation.

Owner

As far as I'm concerned, configuring (and providing) a memory parameter
for each transformer is an annoyance, as would be a caching
MetaEstimator.

You are right in theory, but in practice, you will always get better
performance with more specific code.

I think there is a great usability advantage in providing an easy
solution for avoiding unnecessary Pipeline subsequence fits and perhaps
transforms (the value of both is estimator-dependent). The question I
was getting at is whether there's much value saved by caching
transforms as well as fits.

I agree that generic memory in pipelines would be useful. I would think
that we want to cache the transformers' fit, but probably not the transform
method.

Owner

I'm considering what you wrote here:

One problem that you face is that an estimator object can have estimated parameters. These may render the cache invalid, while they really shouldn't affect the fit.

My way of handling this was providing a closure whose arguments corresponded exactly to the cache key. Such closures are indeed not ideal, but I also think cloning the estimator is merely a workaround, not a clean solution. This is a limitation of Memory. Perhaps just as Memory.cache has an argument ignore, it should also have a way to get additional cache keys.
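
For concreteness, the closure approach is roughly as follows (hypothetical names; the point is that the estimator instance stays in the closure while the cached function's arguments are exactly the intended key):

def make_cached_fit(memory, estimator):
    # The (possibly already fitted) instance never reaches the cache key;
    # the key is exactly the arguments passed to _fit below.
    @memory.cache
    def _fit(estimator_class, params, X, y):
        est = estimator_class(**params)
        est.fit(X, y)
        return est

    def fit(X, y):
        return _fit(type(estimator), estimator.get_params(deep=False), X, y)

    return fit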

The other difficulty I see here is that fit updates the state of the object, and we don't actually want to cache its return value so much as the final state (even though they should generally be the same). This is the reason fit_transform cannot be cached.

Owner

Your cache(fit)(clone(estimator), X, y) solution also will not handle the case of excluding parameters that don't affect fit and its learnt attributes. For example, if we use a LinearSVC as a feature selector but want to play with the threshold across grid search candidates, we don't want to refit when that threshold parameter changes.

As far as I'm concerned, that eliminates clone as an option. Would you rather a closure, or some extra argument to Memory.cache?

Owner

My way of handling this was providing a closure whose arguments corresponded
exactly to the cache key. Such closures are indeed not ideal, but I also think
cloning the estimator is merely a workaround, not a clean solution. This is a
limitation of Memory. Perhaps just as Memory.cache has an argument ignore, it
should also have a way to get additional cache keys.

My experience of software development (which is somewhat substantial,
having worked on and followed many projects) is that complexity is our worst
enemy, to a point which should not be underestimated. Indeed, as a
project grows more and more complex, development slows down as adding
each feature becomes harder, and fewer and fewer people are qualified to
modify it.

Simpler code should be preferred to elegant code. Features that add a lot
of complexity should be included only if they are mission critical.

These rules, I believe, are excellent guidelines to making a project
successful in the long run. And indeed, I review code with them in mind:
if there is a simpler solution (fewer abstractions, fewer lines of code), I
will always push for it.

Owner

Your cache(fit)(clone(estimator), X, y) solution also will not handle
the case of excluding parameters that don't affect fit and its learnt
attributes. For example, using a LinearSVC as a feature selector, but
wanting to play with the threshold in different grid search candidates,
we don't want to refit when that threshold parameter changes.

Yes, that's correct. I don't want to build a full pipeline with parameter
tracking. Experience shows that it is a very costly enterprise that
really slows down package development. I would suggest that people with
such a need implement a 'mem' inside the estimator object: a simpler
solution that is not generic.

Owner

I understand where you're coming from, but removing redundant work in a grid search is a frequent, and sensible, request. If you can consider a way to do it without modifying each underlying estimator (should users really be expected to do so?) that does not add complexity to the code or the interface, do let me know.

Owner
ogrisel commented Oct 31, 2013

I have been thinking about this use case and I think it should be possible to generate clean cache keys recursively to support nested estimators, see: https://gist.github.com/ogrisel/7091781
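
Roughly the idea (a sketch, not the gist's actual code): derive the key recursively from the constructor parameters, so that nested estimators contribute their own parameters rather than their object identity:

from sklearn.base import BaseEstimator

def estimator_key(obj):
    # Hypothetical helper: a content-based representation of an estimator
    # built from its class and constructor parameters, recursing into
    # nested estimators (e.g. the base estimator of a meta-estimator).
    if not isinstance(obj, BaseEstimator):
        return obj
    params = obj.get_params(deep=False)
    return (type(obj).__name__,
            tuple(sorted((name, estimator_key(value))
                         for name, value in params.items())))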

My feature extraction step takes up the vast bulk of my pipeline execution time. This feature would really help me out.

Owner

One option you could use is here. Unfortunately it requires a small change to the scikit-learn codebase, after which you can just wrap the models whose fitting you want to memoize with remember_model (or remember_transform).

However, you might be better off just performing feature extraction as a preprocessing step.

I ended up writing a wrapper estimator that pickles the fitted estimator to a file on the first run and on subsequent runs just unpickles it, but a more general solution does seem like it would be useful in many situations.
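
Roughly along these lines (a sketch only; the names and details here are made up, and it does not check that the data or parameters match what produced the pickle):

import os
import pickle

from sklearn.base import BaseEstimator, clone

class PickleCachedEstimator(BaseEstimator):
    # Fit once, pickle the fitted estimator to `path`, and unpickle it on
    # subsequent runs instead of refitting.

    def __init__(self, estimator, path):
        self.estimator = estimator
        self.path = path

    def fit(self, X, y=None):
        if os.path.exists(self.path):
            with open(self.path, 'rb') as f:
                self.estimator_ = pickle.load(f)
        else:
            self.estimator_ = clone(self.estimator).fit(X, y)
            with open(self.path, 'wb') as f:
                pickle.dump(self.estimator_, f)
        return self

    def transform(self, X):
        return self.estimator_.transform(X)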

Owner

Well, if there are no other comments, I'll merge in a day or so. The CI
failures appear spurious.

please don't merge before getting 2 +1 by others.

Owner

Ah, @agramfort, I intended that for another thread! :)

Owner

I ended up writing a wrapper estimator that pickles the fitted estimator to a file on the first run and on subsequent runs just unpickles it, but a more general solution does seem like it would be useful in many situations.

remember_model does basically that; however, in that case I found the cloning behaviour in cross-validation broke it, and so I had to handle it specially.

Owner

Perhaps the cloning behaviour only broke when you wanted to memoize the fit, but there were parameters etc. that didn't depend on fit. There may be better ways to do remember_model in any case.

@jnothman closed this Nov 9, 2014
Owner

Sorry I didn't follow the discussion closely, could you summarize why you closed this one?

Owner

I have PRs that are older than this one that I actually expect to receive
reviews on and be merged. This one I don't expect to. The issue is still an
issue: too much work is being redone in cross-validated pipelines by
default, but I don't think this is the right solution. I think being able
to tell any estimator to cache its model with respect to all or a subset of
its parameters and training data (by way of a metaestimator, mixin, built-in
parameter in BaseEstimator or whatever) is a good way to keep the feature
modular. Supporting a metaestimator form only requires a change to
sklearn.base.clone to support polymorphism (see
jnothman@de0f86d
).


@jnothman I'm after this feature as well. So what was the problem with your other approach, jnothman@de0f86d, that kept it from being merged? It looks like something I can already use in my project. The only problem I see is the possibly slow cache comparisons when the inputs (X) are large, since in pipelines only the initial input really needs to be compared.

Owner

I think we should actually investigate https://github.com/ContinuumIO/dask for doing this, though this would be a bit of a dependency.
