New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MAINT: option to unvendor joblib #11166

Merged
merged 9 commits into from Jun 29, 2018

Conversation

Projects
None yet
8 participants
@GaelVaroquaux
Member

GaelVaroquaux commented May 30, 2018

Setting the SKLEARN_SITE_JOBLIB to a non null value will now force scikit-learn to use the site joblib rather than the vendored one.

This is going to be very useful to test the progress in joblib: the joblib team is hard at work on enabling better parallelism and scaling, but real-world test cases (by the developers themselves or advanced users) are crucial.

MAINT: option to unvendor joblib
Setting the SKLEARN_SITE_JOBLIB to a non null value will now force
scikit-learn to use the site joblib rather than the vendored one.
@jnothman

This comment has been minimized.

Member

jnothman commented May 30, 2018

Test failures still

GaelVaroquaux added some commits May 31, 2018

GaelVaroquaux added a commit to GaelVaroquaux/distributed that referenced this pull request May 31, 2018

Unvendoring joblib
Follows scikit-learn/scikit-learn#11166 to deal
with unvendoring joblib in scikit-learn
@@ -1,6 +1,9 @@
This directory contains bundled external dependencies that are updated
every once in a while.
Note to developers and advanced users: setting the SKLEARN_SITE_JOBLIB to
a non null value will force scikit-learn to use the site joblib.

This comment has been minimized.

@jnothman

jnothman May 31, 2018

Member

Maybe note that joblib.Memory cache and joblib pickles may be invalidated when switching from one to the other.

@amueller

This comment has been minimized.

Member

amueller commented May 31, 2018

This sounds good to me. It needs mention in the dev docs, though, right?

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented May 31, 2018

@amueller

This comment has been minimized.

Member

amueller commented Jun 4, 2018

I was 100% sure I replied to this... I think it should be somewhere with set_config which is currently in doc/modules/computational_performance.rst. Maybe we should have a section on configuration in the user guide? There's no explicit section on set_config and get_config right now :-/
I would also add it to the joblib entry in the glossary maybe?

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jun 4, 2018

@amueller

This comment has been minimized.

Member

amueller commented Jun 4, 2018

I think we can merge 6 and 7 but I don't understand your new TOC.
Would it make sense to distinguish RAM saving strategies from speeding up strategies from parallelization strategies?

I'm not sure I understand the Model Reshaping section. Is that feature selection?

There seems to be a bunch of model specific stuff in 7. like "sparsify" which might be better housed in the linear models section maybe?

@amueller

This comment has been minimized.

Member

amueller commented Jun 4, 2018

Also 6 and 7 are a bit hard to find imho :-/

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jun 4, 2018

@amueller

This comment has been minimized.

Member

amueller commented Jun 4, 2018

Sure, go for it.

@jnothman

This comment has been minimized.

Member

jnothman commented Jun 5, 2018

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jun 5, 2018

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jun 11, 2018

I've reorganized the docs and updated the glossary. This is ready for review and merge.

@GaelVaroquaux GaelVaroquaux changed the title from MAINT: option to unvendor joblib to [MRG] MAINT: option to unvendor joblib Jun 11, 2018

When this environment variable is set to a non zero value,
scikit-learn uses the site joblib rather than its vendored version.
Consequently, joblib must be installed for scikit-learn to run

This comment has been minimized.

@jnothman

jnothman Jun 11, 2018

Member

You don't think it's worth mentioning that Memory and dumped content may be invalidated when switching?

@jnothman

Also, please mention in what's new

@glemaitre glemaitre added this to the 0.20 milestone Jun 28, 2018

@jnothman

This comment has been minimized.

Member

jnothman commented Jun 28, 2018

glemaitre added some commits Jun 29, 2018

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jun 29, 2018

Sounds reasonable, but can also be changed (on or off) after release if we want it or don't.

OK let's do that later.

@glemaitre glemaitre merged commit 106bb9e into scikit-learn:master Jun 29, 2018

0 of 5 checks passed

LGTM analysis: Python Running analyses for revisions
Details
ci/circleci: python2 Your tests are queued behind your running builds
Details
ci/circleci: python3 Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jun 29, 2018

Thanks

@sklearn-lgtm

This comment has been minimized.

sklearn-lgtm commented Jun 29, 2018

This pull request introduces 2 alerts when merging d5de52c into f97b515 - view on LGTM.com

new alerts:

  • 2 for Unused import

Comment posted by LGTM.com

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jun 29, 2018

@jnothman

This comment has been minimized.

Member

jnothman commented Jul 2, 2018

This has broken CircleCI on master, which is trying, if I understand correctly, to load pickles which reference sklearn.externals.joblib.numpy_pickle, resulting in an ImportError.

Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

@lesteve

This comment has been minimized.

Member

lesteve commented Jul 2, 2018

Weird, here is the first CircleCI failed build in master. It only seems to be a problem in Python 2.

Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

I can probably find the way to clear the CircleCI cache let me look for this.

Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

Breaking pickles is not something we should do lightly. We should definitely double check that and see whether this can be fixed.

@jnothman

This comment has been minimized.

Member

jnothman commented Jul 2, 2018

Well it's easily fixed if we move sklearn.externals._joblib back to sklearn.externals.joblib, and import internally from ._joblib or something confusing like that.

One hack is to use sys.modules['sklearn.externals.joblib'] = joblib and same for all its sub-modules, when SKLEARN_SITE_JOBLIB is set ...

A more "correct" approach might be to modify sys.path_hooks.

@lesteve

This comment has been minimized.

Member

lesteve commented Jul 2, 2018

I can probably find the way to clear the CircleCI cache let me look for this.

So it looks like the only way to clear the cache is to change the cache key in circle/config.yml, see this.

I think clearing the cache would just hide the problem though. At the moment on master we can not import as we would do from joblib:

On master (5916d57):

import sklearn.externals.joblib.numpy_pickle

Python 3:

ModuleNotFoundError: No module named 'sklearn.externals.joblib.numpy_pickle'; 'sklearn.externals.joblib' is not a package

Python 2:

ImportError: No module named numpy_pickle

On 526aede (i.e. before this PR was merged) the same snippet runs fine.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 2, 2018

I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

@lesteve

This comment has been minimized.

Member

lesteve commented Jul 2, 2018

I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

Thanks for this! I can reproduce the backward compatibility problem locally (on Python 2.7) and I don't think #11403 fixes it. When loading the pickle sklearn.externals.joblib.numpy_pickle is imported and this import fails even with your fix:

Here is what the pickle contains for completeness (a joblib pickle of np.array([1., 2., 3.]) using 526aede):

In [3]: open('/tmp/test.joblib', 'rb').read()
Out[3]: '\x80\x02csklearn.externals.joblib.numpy_pickle\nNumpyArrayWrapper\nq\x00)\x81q\x01}q\x02(U\x05dtypeq\x03cnumpy\ndtype\nq\x04U\x02f8q\x05K\x00K\x01\x87q\x06Rq\x07(K\x03U\x01<q\x08NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\tbU\x05shapeq\nK\x03\x85q\x0bU\x05orderq\x0cU\x01Cq\rU\x08subclassq\x0ecnumpy\nndarray\nq\x0fU\nallow_mmapq\x10\x88ub\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@.'
@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 2, 2018

@lesteve

This comment has been minimized.

Member

lesteve commented Jul 2, 2018

Maybe a bit hacky, but what about having only joblib/__init__.py that does this:

import sys
import os

if os.environ.get('SKLEARN_SITE_JOBLIB', False):
    import joblib
    module = sys.modules['joblib']
else:
    import sklearn.externals._joblib
    module = sys.modules['sklearn.externals._joblib']

sys.modules[__name__] = module

I did not test this extensively besides checking that import sklearn.externals.joblib.numpy_pickle works in both cases and looking that sklearn.externals.joblib.numpy_pickle.__file__ makes sense.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 2, 2018

@jnothman

This comment has been minimized.

Member

jnothman commented Jul 2, 2018

@lesteve

This comment has been minimized.

Member

lesteve commented Jul 6, 2018

I tried a bit more and the sys.modules hack does not seem to solve all the pickling problems ... in particular there seems to be some problems with the compressed pickles that I haven't fully understood.

While I am at it, a less hacky way of doing it may be to use sys.meta_path. It does not seem that easy, but maybe good examples to look at would be python-future and/or six.

@rth

This comment has been minimized.

Member

rth commented Jul 6, 2018

Yeah, but how likely is it that it would still work with some other packages that use sys.module / sys.meta_path hacks (for instance pyinstaller/pyinstaller#2246 )?

It seems that our only option is to invert _joblib and joblib. It seems
the wrong API, but I don't see another way of doing it without breaking
backward compat.

If the documentations repeats often enough that everything under sklearn.externals is private and should not be used in user code maybe that would be enough?

Another way of looking at this, could be to raise a warning if the bundled joblib (e.g. sklearn.externals.joblib) is imported from outside of sklearn. That would allow keeping just one joblib module and remove the issue with the public/private naming convention, but I'm not not sure if it's technically feasible (might also involve some sys hacks) and even if it is, desirable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment