Issues encountered with the memory option in Pipeline #10068
Comments
Thanks for this. It's very useful to get feedback on new features... All potential optimisations need to be taken with care. They often help someone and hurt another. In the first instance, let's improve documentation. I don't even think we currently make it clear that memory is persisted across sessions. In addition:
But whether adding API complexity is worthwhile is another issue... Ping @glemaitre, @lesteve.
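As a minimal sketch of what "persisted across sessions" means in practice (the cache directory name `./pipeline_cache` is an arbitrary assumption here): the cache is written to disk and outlives the Python process, so a second run of the same script reuses the cached `fit_transform` results instead of refitting.

```python
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# The cache is stored on disk, so it survives the interpreter exiting:
# running this script a second time loads PCA's fit_transform result from
# ./pipeline_cache instead of recomputing it.
memory = Memory(location="./pipeline_cache", verbose=1)
pipe = Pipeline(
    [("reduce", PCA(n_components=10)), ("clf", LogisticRegression())],
    memory=memory,
)
pipe.fit(X, y)
```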
I would like to work on this. Can I take up this issue?
Thanks @FarahSaeed for your interest! I would recommend that you find an issue with a "good first issue" or "Easy" tag to get familiar with contributing to scikit-learn first. Also please have a look at our contribution guidelines. The open source guides are a great reference too.
@jnothman I will let @lesteve answer your first point since he knows joblib much better than I do. However, hashing the input only once (it could still take some time) or hashing a subset of the input data (it might sometimes fail) should mitigate the first problem. Regarding points 2-3, I see the point, but I have trouble seeing a user-friendly way to implement an on-demand Memory. When building a Pipeline, is it really too tedious to pass a Memory object for each transformer?
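To make the hashing cost of the first point concrete, here is a rough sketch that times `joblib.hash` on a moderately large array (this is the same hashing `Memory` applies to the arguments of each cached call); the array shape, roughly 0.8 GB of float64 data, is an arbitrary assumption.

```python
import time

import numpy as np
from joblib import hash as joblib_hash

# ~0.8 GB of float64 data: Memory hashes the full array on every cached call
# to decide whether this input has been seen before.
X = np.random.RandomState(0).rand(100_000, 1_000)

start = time.perf_counter()
digest = joblib_hash(X)
print(f"hashing {X.nbytes / 1e9:.1f} GB took {time.perf_counter() - start:.2f} s")
```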
Maybe different issues should be opened for each of those points? The issue with joblib loading results that come from a previously cached, deprecated version of a transformer sounds important too, at least as long as it's not documented.
I for one ended up writing a meta estimator
Both options sound very good.
A separate issue sounds good. I'm hesitant to hash only a subset of the data without the user controlling that... which might be a joblib-level parameter? As an articulate early adopter, I'd really appreciate your contributing improvements to the documentation as a first step, @fcharras. Is that possible?
Yes I'd be happy to contribute, I'll see what I can do for the documentation. Also I'll open a new issue with a minimal example. |
It would be great to have a small example where the use of `memory` in a `Pipeline` brings a clear benefit.
FWIW, I tried to run a benchmark comparing a simple pipeline implemented with plain scikit-learn Pipelines, scikit-learn Pipelines using memory, and dask-ml. I was unable to reliably get a benefit from using memory, and often found it slightly worse. The benchmark wasn't of my own design and was mostly just a slight repurposing of some pre-existing code, but I did expect it to be slightly faster to use cached results than to re-run the StandardScaler.
You're only likely to get benefits on something that has a slow fit or transform (or both). StandardScaler is about as cheap as it gets. Try it with a CountVectorizer.
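Building on that suggestion, a rough sketch of a setup where the cache should actually pay off, assuming a grid search that only varies the classifier's parameter (the toy corpus, the `SGDClassifier`, and the `alpha` grid are arbitrary choices): the comparatively expensive `CountVectorizer.fit_transform` is computed once per CV fold and then reloaded from the cache for the remaining candidates.

```python
from tempfile import mkdtemp

from joblib import Memory
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["the cat sat on the mat", "the dog ate my homework"] * 500
labels = [0, 1] * 500

memory = Memory(location=mkdtemp(), verbose=0)
pipe = Pipeline(
    [("vect", CountVectorizer(ngram_range=(1, 3))),
     ("clf", SGDClassifier(random_state=0))],
    memory=memory,
)

# Only clf__alpha changes between candidates, so for each CV fold the
# vectorizer's fit_transform result is cached on the first candidate and
# reused for the others; with a cheap step like StandardScaler the hashing
# overhead can easily outweigh that saving.
grid = GridSearchCV(pipe, {"clf__alpha": [1e-4, 1e-3, 1e-2]}, cv=3)
grid.fit(docs, labels)
```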
I'm very interested in using the `memory` option in `Pipeline` so I can organise my code in a simple fashion. However, I've found that it does not scale well, and there are some caveats one might not expect:

- If the input train data is too big (~several GB), joblib's `Memory` takes a very long time to hash it, which can slow the execution considerably in an unexpected way.
- The documentation hints that "Caching the transformers is advantageous when fitting is time consuming." However, `fit_transform` is what gets cached, so not only the fitted transformer but also the transformed train data seems to be stored. Caching is then also advantageous when transforming is time consuming, but this can quickly add up to a considerable amount of space taken on the hard drive.
- Finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of `_fit_transform_one` does not). That is something the user could be warned about (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise a previous version of a transformer may be loaded from the cache by mistake.
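As a small sketch of a workaround for the last caveat (the cache directory name is again an arbitrary assumption): since the hash is computed from the transformer's parameters and the input data rather than from the transformer's own source code, the only safe option after editing a cached transformer's implementation is to wipe the cache explicitly.

```python
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

memory = Memory(location="./pipeline_cache", verbose=0)
pipe = Pipeline([("reduce", PCA(n_components=5)), ("clf", SVC())], memory=memory)
pipe.fit(X, y)

# The cache key depends on the transformer's parameters and the input data,
# not on the transformer's source code, so editing the implementation alone
# does not invalidate old entries.  Clearing removes everything under
# ./pipeline_cache so the next fit recomputes from scratch.
memory.clear(warn=False)
```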