Issues encountered with the memory option in Pipeline #10068

Open
fcharras opened this issue Nov 4, 2017 · 10 comments
Labels
Needs Benchmarks (a tag for issues and PRs which require some benchmarks), Performance

Comments

@fcharras
Contributor

fcharras commented Nov 4, 2017

I'm very interested in using the memory option in Pipeline so I can organise my code in a simple fashion. However, I've found that it does not scale well, and there are some caveats one might not expect (a minimal sketch of the setup follows this list):

  • if the input training data is too big (~several GB), joblib.Memory takes a very long time to hash it, which can slow down execution considerably and unexpectedly.

  • the documentation hints that "Caching the transformers is advantageous when fitting is time consuming." However, it is fit_transform that is cached, so not only the fitted transformer but also the transformed training data seems to be stored. That makes caching advantageous when transforming is time consuming too, but it can quickly add up to a considerable amount of space taken on the hard drive.

  • finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of _fit_transform_one does not). The user should be warned about this (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise a previous version of a transformer can be loaded from the cache by mistake.
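
For concreteness, here is a minimal sketch of the kind of setup these points refer to (not my actual code; the shapes, step names, and cache path are placeholders):

```python
# Minimal sketch: a Pipeline with `memory` set. joblib hashes the arguments of
# the cached `_fit_transform_one` call (including the full training data) to
# build the cache key, and stores each transformer's fit_transform output on disk.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100_000, 50)            # with ~several GB, hashing X gets slow
y = np.random.randint(0, 2, size=100_000)

pipe = Pipeline(
    [("scale", StandardScaler()),
     ("pca", PCA(n_components=10)),
     ("clf", LogisticRegression())],
    memory="/tmp/sklearn_cache",           # placeholder cache directory
)

pipe.fit(X, y)   # first fit: transformed data + fitted transformers written to disk
pipe.fit(X, y)   # second fit: X is re-hashed, then the cached results are loaded
```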

@jnothman
Member

jnothman commented Nov 4, 2017

Thanks for this. It's very useful to get feedback on new features...

All potential optimisations need to be taken with care. They often help someone and hurt another.

In the first instance, let's improve documentation. I don't even think we currently make it clear that memory is persisted across sessions.

In addition:

  • I think we could minimise the first problem by finding a way to make the fit a function of the once-hashed pipeline input and any preceding steps (assuming all are deterministic). Is hashing the input once reasonable? (And this would be easier to implement with joblib/joblib#69, "ENH support callable to modify Memory caching key", I think...)
  • Not caching fit_transform means that we require fit_transform(X) == fit(X).transform(X) for all X. We can make this assumption if est.fit_transform is BaseEstimator.fit_transform but not otherwise. I can see the code getting quite intricate... Also, we will probably reduce performance substantially for current users where the transformation is as or more expensive than the fit (e.g. nearest-neighbors-based transformations), but maybe we can live with that.
  • One thing that could mitigate all of these is to allow the user to specify which steps are memoized. The same can be achieved now with nested pipelines (a rough sketch follows this comment), but that is, IMO, far from friendly.

But whether adding API complexity is worthwhile is another issue... Ping @glemaitre, @lesteve.
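
To make the nested-pipelines workaround concrete, here is a rough sketch (placeholder step names and cache path; not a proposed API):

```python
# Rough sketch of the nested-pipelines workaround: only the inner pipeline gets
# `memory`, so only its steps are memoized by joblib.
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cached_steps = Pipeline(
    [("pca", PCA(n_components=10))],       # the expensive step(s) to memoize
    memory="/tmp/sklearn_cache",           # placeholder cache directory
)

pipe = Pipeline(
    [("scale", StandardScaler()),          # cheap step, left uncached
     ("cached", cached_steps),             # everything inside this sub-pipeline is cached
     ("clf", SVC())],                      # the final estimator is never cached
)
```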

@FarahSaeed
Contributor

FarahSaeed commented Nov 6, 2017

I would like to work on this. Can I take up this issue?

@lesteve
Member

lesteve commented Nov 6, 2017

Thanks @FarahSaeed for your interest! I would recommend that you find an issue with a "good first issue" or "Easy" tag to get familiar with contributing to scikit-learn first. Also, please have a look at our contribution guidelines. The open source guides are a great reference too.

@glemaitre
Member

@jnothman I will let @lesteve answer your first point since he knows joblib much better than me. However, hashing the input once (it could still take some time) or a subset of the input data (it might fail sometimes) should minimize the first problem.

Regarding points 2-3, I see the point, but I have trouble seeing a friendly way to implement an on-demand Memory. When building a Pipeline, is it too tedious to pass a Memory object for each transformer?

@fcharras
Contributor Author

fcharras commented Nov 6, 2017

Maybe separate issues should be opened for each of those points? The issue of joblib loading results from a previously cached, now-outdated version of a transformer sounds important too, as long as it's not documented.

When building a Pipeline, is it too tedious to pass a Memory object for each transformer?

I for one ended up writing a meta-estimator CachedTransformer that lets me choose whether to cache the fit, the transform, or both (and, if both, fit_transform too). The code is heavier this way, but I rarely need to cache more than 1 or 2 transformers in a pipeline, so I found it acceptable.
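
Something along these lines (a hypothetical sketch, not my actual implementation; the class name, parameters, and behaviour are only indicative):

```python
# Hypothetical sketch of a per-transformer caching wrapper: joblib.Memory is
# used to memoize fit and/or transform of a single wrapped transformer.
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone


def _fit_one(transformer, X, y):
    # Module-level helper so joblib.Memory can cache it cleanly.
    return clone(transformer).fit(X, y)


def _transform_one(transformer, X):
    return transformer.transform(X)


class CachedTransformer(BaseEstimator, TransformerMixin):
    """Wrap a transformer and memoize its fit and/or transform on disk."""

    def __init__(self, transformer, cachedir, cache_fit=True, cache_transform=False):
        self.transformer = transformer
        self.cachedir = cachedir
        self.cache_fit = cache_fit
        self.cache_transform = cache_transform

    def fit(self, X, y=None):
        memory = Memory(self.cachedir, verbose=0)
        fit_one = memory.cache(_fit_one) if self.cache_fit else _fit_one
        self.transformer_ = fit_one(self.transformer, X, y)
        return self

    def transform(self, X):
        memory = Memory(self.cachedir, verbose=0)
        transform_one = memory.cache(_transform_one) if self.cache_transform else _transform_one
        return transform_one(self.transformer_, X)


# Usage: drop it into a regular Pipeline step, e.g.
# ("pca", CachedTransformer(PCA(n_components=10), "/tmp/sklearn_cache",
#                           cache_fit=True, cache_transform=True))
```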

hashing the input once (it could still take some time) or a subset of the input data

Both options sound very good.

@jnothman
Member

jnothman commented Nov 6, 2017

A separate memoize_steps=['step1'] would be how I'd allow the user to select only some transformations to be memoized. (Hey, I've long been a proponent of a separate wrapper to memoize a step, but Pipeline.memory is what received consensus ...)

I'm hesitant to hash only a subset of the data without the user controlling that... which might be a joblib-level parameter??

As an articulate early adopter, I'd really appreciate you contributing improvements to the documentation as a first step, @fcharras. Is that possible?

@fcharras
Contributor Author

fcharras commented Nov 7, 2017

Yes, I'd be happy to contribute; I'll see what I can do for the documentation. Also, I'll open a new issue with a minimal example.

@lesteve
Member

lesteve commented Nov 7, 2017

Also I'll open a new issue with a minimal example.

It would indeed be great to have a small example where the use of memory is detrimental.

@jimmywan
Contributor

It would indeed be great to have a small example where the use of memory is detrimental.

FWIW, I tried to run a benchmark comparing a simple pipeline built three ways: with plain scikit-learn Pipelines, with scikit-learn Pipelines using memory, and with dask-ml. I was unable to reliably get a benefit from using memory, and often found it slightly worse. The benchmark wasn't of my own design and was mostly a slight repurposing of some pre-existing code, but I did expect using cached results to be slightly faster than re-running the StandardScaler:

dask/dask-ml#88 (comment)
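
For reference, a hypothetical minimal timing comparison in the same spirit (not the linked benchmark; sizes and the cache path are placeholders) could look like this:

```python
# Hypothetical timing sketch: when the transformer is cheap (StandardScaler),
# re-hashing X and loading the pickled cached result can cost as much as, or
# more than, simply refitting and re-transforming.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200_000, 100)
y = np.random.randint(0, 2, size=200_000)

for memory in (None, "/tmp/sklearn_cache"):       # placeholder cache directory
    pipe = Pipeline(
        [("scale", StandardScaler()), ("clf", LogisticRegression())],
        memory=memory,
    )
    pipe.fit(X, y)                                # warm-up fit (fills the cache, if any)
    start = time.perf_counter()
    pipe.fit(X, y)                                # timed fit: the cached run re-hashes X first
    print(f"memory={memory!r}: {time.perf_counter() - start:.2f}s")
```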

@jnothman
Member

jnothman commented Nov 29, 2017 via email
