Issues encountered with the memory option in Pipeline #10068

Open
fcharras opened this issue Nov 4, 2017 · 10 comments
Labels
Needs Benchmarks (a tag for issues and PRs which require some benchmarks), Performance

Comments

@fcharras
Contributor

fcharras commented Nov 4, 2017

I'm very interested in using the memory option in Pipeline so I can organise my code in a simple fashion. However, I've found that it does not scale well, and there are some caveats one might not expect (a minimal sketch of the setup follows this list):

  • if the input training data is too big (~several GB), joblib.Memory takes a very long time to hash it, which can slow down execution considerably and unexpectedly.

  • the documentation hints that "Caching the transformers is advantageous when fitting is time consuming." However, it is fit_transform that is cached, so not only the fitted transformer but also the transformed training data seems to be stored. That makes caching advantageous when transforming is time consuming too, but it can quickly add up to a considerable amount of space taken on the hard drive.

  • finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of _fit_transform_one does not). The user should be warned about this (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise a previous version of a transformer can be loaded from the cache by mistake.
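
For concreteness, here is a minimal sketch of the kind of setup these points refer to (not my actual code; the shapes, step names, and cache path are placeholders):

```python
# Minimal sketch: a Pipeline with `memory` set. joblib hashes the arguments of
# the cached `_fit_transform_one` call (including the full training data) to
# build the cache key, and stores each transformer's fit_transform output on disk.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100_000, 50)            # with ~several GB, hashing X gets slow
y = np.random.randint(0, 2, size=100_000)

pipe = Pipeline(
    [("scale", StandardScaler()),
     ("pca", PCA(n_components=10)),
     ("clf", LogisticRegression())],
    memory="/tmp/sklearn_cache",           # placeholder cache directory
)

pipe.fit(X, y)   # first fit: transformed data + fitted transformers written to disk
pipe.fit(X, y)   # second fit: X is re-hashed, then the cached results are loaded
```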

@jnothman
Member

jnothman commented Nov 4, 2017

Thanks for this. It's very useful to get feedback on new features...

All potential optimisations need to be taken with care. They often help someone and hurt another.

In the first instance, let's improve documentation. I don't even think we currently make it clear that memory is persisted across sessions.

In addition:

  • I think we could minimise the first problem by finding a way to make the fit a function of the once-hashed pipeline input and any preceding steps (assuming all are deterministic). Is hashing the input once reasonable? (And this would be easier to implement with joblib/joblib#69, "ENH support callable to modify Memory caching key", I think...)
  • Not caching fit_transform means that we require fit_transform(X) == fit(X).transform(X) for all X. We can make this assumption if est.fit_transform is BaseEstimator.fit_transform but not otherwise. I can see the code getting quite intricate... Also, we will probably reduce performance substantially for current users where the transformation is as or more expensive than the fit (e.g. nearest-neighbors-based transformations), but maybe we can live with that.
  • One thing that could mitigate all of these is to allow the user to specify which steps are memoized. The same can be achieved now with nested pipelines (a rough sketch follows this comment), but that is, IMO, far from friendly.

But whether adding API complexity is worthwhile is another issue... Ping @glemaitre, @lesteve.
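
To make the nested-pipelines workaround concrete, here is a rough sketch (placeholder step names and cache path; not a proposed API):

```python
# Rough sketch of the nested-pipelines workaround: only the inner pipeline gets
# `memory`, so only its steps are memoized by joblib.
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cached_steps = Pipeline(
    [("pca", PCA(n_components=10))],       # the expensive step(s) to memoize
    memory="/tmp/sklearn_cache",           # placeholder cache directory
)

pipe = Pipeline(
    [("scale", StandardScaler()),          # cheap step, left uncached
     ("cached", cached_steps),             # everything inside this sub-pipeline is cached
     ("clf", SVC())],                      # the final estimator is never cached
)
```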

@FarahSaeed
Contributor

FarahSaeed commented Nov 6, 2017

I would like to work on this. Can I take up this issue?

@lesteve
Member

lesteve commented Nov 6, 2017

Thanks @FarahSaeed for your interest! I would recommend that you find an issue with a "good first issue" or "Easy" tag to get familiar with contributing to scikit-learn first. Also, please have a look at our contribution guidelines. The open source guides are a great reference too.

@glemaitre
Member

@jnothman I will let @lesteve answer your first point since he knows joblib much better than me. However, hashing the input once (it could still take some time) or a subset of the input data (it might fail sometimes) should minimize the first problem.

Regarding points 2-3, I see the point, but I have trouble seeing a friendly way to implement an on-demand Memory. When building a Pipeline, is it too tedious to pass a Memory object for each transformer?

@fcharras
Contributor Author

fcharras commented Nov 6, 2017

Maybe separate issues should be opened for each of those points? The issue of joblib loading results from a previously cached, now-outdated version of a transformer sounds important too, as long as it's not documented.

When building a Pipeline, is it too tedious to pass a Memory object for each transformer?

I for one ended up writing a meta-estimator CachedTransformer that lets me choose whether to cache the fit, the transform, or both (and, if both, fit_transform too). The code is heavier this way, but I rarely need to cache more than 1 or 2 transformers in a pipeline, so I found it acceptable.
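
Something along these lines (a hypothetical sketch, not my actual implementation; the class name, parameters, and behaviour are only indicative):

```python
# Hypothetical sketch of a per-transformer caching wrapper: joblib.Memory is
# used to memoize fit and/or transform of a single wrapped transformer.
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone


def _fit_one(transformer, X, y):
    # Module-level helper so joblib.Memory can cache it cleanly.
    return clone(transformer).fit(X, y)


def _transform_one(transformer, X):
    return transformer.transform(X)


class CachedTransformer(BaseEstimator, TransformerMixin):
    """Wrap a transformer and memoize its fit and/or transform on disk."""

    def __init__(self, transformer, cachedir, cache_fit=True, cache_transform=False):
        self.transformer = transformer
        self.cachedir = cachedir
        self.cache_fit = cache_fit
        self.cache_transform = cache_transform

    def fit(self, X, y=None):
        memory = Memory(self.cachedir, verbose=0)
        fit_one = memory.cache(_fit_one) if self.cache_fit else _fit_one
        self.transformer_ = fit_one(self.transformer, X, y)
        return self

    def transform(self, X):
        memory = Memory(self.cachedir, verbose=0)
        transform_one = memory.cache(_transform_one) if self.cache_transform else _transform_one
        return transform_one(self.transformer_, X)


# Usage: drop it into a regular Pipeline step, e.g.
# ("pca", CachedTransformer(PCA(n_components=10), "/tmp/sklearn_cache",
#                           cache_fit=True, cache_transform=True))
```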

hashing the input once (it could still take some time) or a subset of the input data

Both options sound very good.

@jnothman
Member

jnothman commented Nov 6, 2017

A separate memoize_steps=['step1'] would be how I'd allow the user to select only some transformations to be memoized. (Hey, I've long been a proponent of a separate wrapper to memoize a step, but Pipeline.memory is what received consensus ...)

I'm hesitant to hash only a subset of the data without the user controlling that... which might be a joblib-level parameter??

As an articulate early adopter, I'd really appreciate you contributing improvements to the documentation as a first step, @fcharras. Is that possible?

@fcharras
Contributor Author

fcharras commented Nov 7, 2017

Yes, I'd be happy to contribute; I'll see what I can do for the documentation. Also, I'll open a new issue with a minimal example.

@lesteve
Member

lesteve commented Nov 7, 2017

Also I'll open a new issue with a minimal example.

It would indeed be great to have a small example where the use of memory is detrimental.

@jimmywan
Contributor

It would indeed be great to have a small example where the use of memory is detrimental.

FWIW, I tried to run a benchmark comparing a simple pipeline built three ways: with plain scikit-learn Pipelines, with scikit-learn Pipelines using memory, and with dask-ml. I was unable to reliably get a benefit from using memory, and often found it slightly worse. The benchmark wasn't of my own design and was mostly a slight repurposing of some pre-existing code, but I did expect using cached results to be slightly faster than re-running the StandardScaler:

dask/dask-ml#88 (comment)
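
For reference, a hypothetical minimal timing comparison in the same spirit (not the linked benchmark; sizes and the cache path are placeholders) could look like this:

```python
# Hypothetical timing sketch: when the transformer is cheap (StandardScaler),
# re-hashing X and loading the pickled cached result can cost as much as, or
# more than, simply refitting and re-transforming.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200_000, 100)
y = np.random.randint(0, 2, size=200_000)

for memory in (None, "/tmp/sklearn_cache"):       # placeholder cache directory
    pipe = Pipeline(
        [("scale", StandardScaler()), ("clf", LogisticRegression())],
        memory=memory,
    )
    pipe.fit(X, y)                                # warm-up fit (fills the cache, if any)
    start = time.perf_counter()
    pipe.fit(X, y)                                # timed fit: the cached run re-hashes X first
    print(f"memory={memory!r}: {time.perf_counter() - start:.2f}s")
```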

@jnothman
Member

jnothman commented Nov 29, 2017 via email
