
[MRG] Fix LinearModelsCV for loky backend. #14264

Merged

Conversation

jeremiedbb
Member

Fixes #14249

LinearModelsCV performs in-place operations on its input, which can cause an error when using the loky or multiprocessing backends if the input is large enough (over 1MB) for joblib to memmap it.
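The failure mode can be sketched without joblib (this is a simplified illustration, not the scikit-learn code): for inputs larger than max_nbytes (1 MB by default), joblib's loky backend hands workers a read-only memmap of the data, so any in-place operation on it raises a ValueError.

```python
import os
import tempfile
import numpy as np

# Simulate what joblib does for large inputs: dump the array to disk and
# give workers a read-only memmap view of it.
path = os.path.join(tempfile.mkdtemp(), "X.mmap")
np.memmap(path, dtype=np.float64, mode="w+", shape=(2000, 100))[:] = 1.0

X = np.memmap(path, dtype=np.float64, mode="r", shape=(2000, 100))
try:
    X -= X.mean(axis=0)  # in-place centering, like the preprocessing step
    outcome = "ok"
except ValueError:
    outcome = "read-only"
print(outcome)  # read-only
```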

@jeremiedbb
Member Author

should we backport this to the 0.20.X branch ?

@rth rth left a comment (Member)

Do you understand why this doesn't fail in common tests in check_classifiers_train(.., readonly_memmap=True) ?

It would be preferable if it was tested in common tests somehow, because the fact that this was possible means that the same issue may be present in other models.

@jeremiedbb
Member Author

Do you understand why this doesn't fail in common tests in check_classifiers_train(.., readonly_memmap=True) ?

It does not occur at the same moment.
copy_X defaults to True, so a copy is made, and joblib is then called on that copied X; if it's too large, joblib will memmap it.

I'm not sure we want to run these tests for all estimators and all backends: we'd need a sufficiently large X to trigger memmapping, which would make the tests run significantly longer.
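The masking effect can be shown in a few lines (a standalone illustration, not the actual common-test code):

```python
import numpy as np

# Why check_classifiers_train(.., readonly_memmap=True) misses this bug:
# with copy_X=True (the default) a fresh writable copy is made first.
X = np.ones((10, 5))
X.setflags(write=False)          # what the common test passes in

X_copy = X.copy()                # copy_X=True: a writable copy is made
assert X_copy.flags.writeable    # the read-only flag does not survive the copy
# The failure only appears later, when joblib re-memmaps this copy — which
# requires X to exceed the 1 MB threshold, and a small test X never does.
```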

@jeremiedbb
Member Author

prefer='threads' is only used in random forests and IsolationForest. Otherwise the default backend is loky, so this kind of bug would have been hit by many users by now.

Pierre has run several benchmarks with many problem sizes for all scikit-learn estimators, and the LinearModelsCV estimators are the only ones he caught.

@rth
Member

rth commented Jul 5, 2019

If it's too large it will memmap it.

This is not the first time we've run into a bug due to such size-dependent behavior.

In a test, could we force or monkeypatch joblib to always mmap and run this test on a very small dataset?

@jeremiedbb
Member Author

In a test, could we force or monkeypatch joblib to always mmap and run this test on a very small dataset?

Yes, through the max_nbytes parameter. That would indeed make the tests run faster.

@jeremiedbb
Member Author

through the max_nbytes parameter

Actually, this is a parameter of Parallel, not parallel_backend, and we can't set it from the outside: when scikit-learn calls Parallel, it is overridden by Parallel's default max_nbytes.

The solution would be to make parallel_backend accept max_nbytes. In the meantime I suggest merging it as is; it's only one test, so it should not noticeably impact the total test suite time.
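To show what the max_nbytes knob does when one can reach Parallel directly (a sketch; the limitation above is that scikit-learn's internal Parallel calls don't expose this parameter):

```python
import numpy as np
from joblib import Parallel, delayed

def received_memmap(a):
    # Workers see a np.memmap instance when joblib decided to memmap the input.
    return isinstance(a, np.memmap)

# Lowering max_nbytes on Parallel forces memmapping even for a tiny array,
# so a test could exercise the memmap code path without a large dataset.
X = np.ones(100)  # ~800 bytes, far below the default 1 MB threshold
flags = Parallel(n_jobs=2, max_nbytes=1)(
    delayed(received_memmap)(X) for _ in range(2))
print(flags)
```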

@jeremiedbb
Member Author

I don't understand the coverage decrease...

**_joblib_parallel_args(prefer="threads"))(jobs)
mse_paths = Parallel(
n_jobs=self.n_jobs, verbose=self.verbose,
**_joblib_parallel_args(require="sharedmem"))(jobs)
Member

Wait, but this will make sure threading is used and allow different threads to modify the shared X object.

Is this really what we want? I would have thought we wanted to avoid in-place operations in _path_residuals (e.g. by triggering a copy)?

Member Author

I don't understand. prefer='threads' was already doing that, unless the user specifies another backend, right?

@rth
Member

rth commented Jul 25, 2019

@jeremiedbb So were different threads making in-place changes to different chunks of the input X, or to overlapping ones?

@jeremiedbb
Member Author

_path_residuals does this:

X_train = X[train]
y_train = y[train]
X_test = X[test]
y_test = y[test]

where train and test are lists. So copies are always made.
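The copy semantics can be checked directly (a standalone illustration, independent of scikit-learn):

```python
import numpy as np

# Fancy indexing with a list of indices always returns a copy, unlike basic
# slicing, which returns a view into the original array.
X = np.arange(12).reshape(4, 3)
train = [0, 2]

X_train = X[train]
assert not np.shares_memory(X_train, X)  # fancy indexing: a copy
assert np.shares_memory(X[:2], X)        # slicing, by contrast: a view
```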

@rth
Member

rth commented Jul 26, 2019

If copies are made, when does it then fail when trying to modify read-only data? Or is the read-only property preserved by copies?

@jeremiedbb
Member Author

OK, so if X is a read-only memmap, fancy indexing creates a read-only copy. I think it's a bug, and I opened numpy/numpy#14132

So I reverted the sharedmem requirement and made the indexed array writeable instead.
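The PR's actual fix flips the writeable flag on the indexed array; below is a version-independent sketch of the same idea, falling back to a fresh copy so it works whether or not the NumPy bug (numpy/numpy#14132) is present:

```python
import os
import tempfile
import numpy as np

# With the NumPy versions discussed here, fancy indexing a read-only memmap
# can return a copy that is itself flagged read-only.
path = os.path.join(tempfile.mkdtemp(), "X.mmap")
np.memmap(path, dtype=np.float64, mode="w+", shape=(6, 3))[:] = 1.0
X = np.memmap(path, dtype=np.float64, mode="r", shape=(6, 3))

X_train = X[[0, 2, 4]]            # a copy, but possibly read-only (the bug)
if not X_train.flags.writeable:
    X_train = np.array(X_train)   # force a fresh writable copy instead
X_train -= X_train.mean(axis=0)   # in-place ops now succeed
print(float(X_train[0, 0]))  # 0.0
```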

@jeremiedbb
Member Author

jeremiedbb commented Jul 26, 2019

This brings up another issue: the _preprocess_data step performs in-place operations, and ElasticNetCV has a copy_X parameter, so with copy_X=False it fails if X is read-only.

X, y = make_regression(20000, 10) 
X.setflags(write=False) 
ElasticNetCV(copy_X=False, cv=3).fit(X, y)

We should force a copy when X is read-only. I'll do that in a separate PR.

(this is not caught by the common tests because the default is copy_X=True)
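A minimal sketch of the fix proposed for the separate PR — the helper name is hypothetical, not scikit-learn API:

```python
import numpy as np

def ensure_writable(X, copy_X):
    # Hypothetical helper: honor copy_X=False only when X is actually
    # writable; otherwise force a copy so in-place preprocessing can proceed.
    if not copy_X and not X.flags.writeable:
        X = X.copy()
    return X

X = np.ones((4, 2))
X.setflags(write=False)
X2 = ensure_writable(X, copy_X=False)
assert X2.flags.writeable and X2 is not X
```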

sklearn/linear_model/coordinate_descent.py (outdated)
sklearn/linear_model/tests/test_coordinate_descent.py (outdated)
rth
rth approved these changes Jul 31, 2019
@rth rth left a comment (Member)

LGTM, thanks!

sklearn/linear_model/tests/test_coordinate_descent.py (outdated)
ogrisel
ogrisel approved these changes Oct 2, 2019
@ogrisel ogrisel left a comment (Member)

LGTM as well. Let me have a look at the conflicts.

@ogrisel
Member

ogrisel commented Oct 2, 2019

The two failing tests are 60s timeouts when running the multiprocessing test under pytest-xdist with -n 2. I could not reproduce them on my local laptop though...

@thomasjpfan
Member

I can reproduce this on my desktop, and on the CI with pytest-xdist turned off. test_linear_models_cv_fit_for_all_backends stalls when backend=multiprocessing.

The other two backends work.

@rth
Member

rth commented Mar 3, 2020

@jeremiedbb Could you please resolve conflicts?

@jeremiedbb
Member Author

@jeremiedbb Could you please resolve conflicts?

Done. I left the debugging stuff in for now. Last time we tried, CI was timing out for the multiprocessing backend.

@jeremiedbb
Member Author

So the reason for the hang is a bad interaction between MKL (specifically Intel OpenMP) and Python multiprocessing. Below is a reproducible example:

import numpy as np
from multiprocessing import Pool

x = np.ones(10000)
x @ x  # BLAS call initializes MKL / Intel OpenMP in the parent process

def f(i):
    a = np.ones(10000)
    a @ a  # the same BLAS call hangs in the forked child process

pool = Pool(2)  # fork-based pool inherits the parent's OpenMP state
pool.map(f, [1, 2])

It's kind of related to numpy/numpy#10060, numpy/numpy#10993 and numpy/numpy#11734, but might be a bit different: first, it occurs with different versions of libiomp (I tested with 2020.0 and 2019.4), and second, the workaround KMP_INIT_AT_FORK=FALSE does not work.

@oleksandr-pavlyk do you have any idea ?

conftest.py Outdated

def pytest_runtest_teardown(item, nextitem):
    if isinstance(item, DoctestItem):
        set_config(print_changed_only=False)
    print(" {:.3f}s ".format(time() - tic), end="")
Member

Please remove debugging lines

@@ -21,7 +21,7 @@ except ImportError:
python -c "import multiprocessing as mp; print('%d CPUs' % mp.cpu_count())"
pip list

TEST_CMD="python -m pytest --showlocals --durations=20 --junitxml=$JUNITXML"
TEST_CMD="python -m pytest -v --showlocals --durations=20 --junitxml=$JUNITXML"
Member

Maybe remove this.

@rth rth added this to the 0.23 milestone Apr 19, 2020
@adrinjalali
Member

This could also go in 0.23.1 as a fix if it doesn't get merged for the release.

@agramfort
Member

yes let's try to merge this quickly. @jeremiedbb can you have a look? thanks

@jeremiedbb
Member Author

I resolved the conflicts. But there's an unresolved issue with the multiprocessing backend: it seems to be a bad interaction between Python multiprocessing and MKL, see this comment.

I suggest we set aside the multiprocessing backend for now, since it's not something we can fix on our side, and consider this PR a fix for the loky backend only, so we can merge it quickly.

@jeremiedbb jeremiedbb changed the title [MRG] Fix LinearModelsCV for all joblib backends. [MRG] Fix LinearModelsCV for loky backend. Apr 23, 2020
@jeremiedbb
Member Author

The coverage issue is because the only job where we have joblib < 0.12 does not have coverage enabled. Should we care?

@thomasjpfan
Member

The coverage issue is because the only job where we have joblib < 0.12 does not have coverage enabled. Should we care ?

In this case it's the pytest.skip line that is not getting covered? I am okay with this.

@RemiLacroix-IDRIS

RemiLacroix-IDRIS commented Apr 23, 2020

So the reason of the hang is a bad interaction of MKL (especially intel openmp) and python multiprocessing.
(...)
the workaround KMP_INIT_AT_FORK=FALSE does not work.

Have you tried setting KMP_INIT_AT_FORK=true? Weirdly enough, that fixed the problem for me...

Edit: In fact it seems random with KMP_INIT_AT_FORK=true...

@jnothman
Member

Should we be merging this? Are we able to protect it from that MKL-multiprocessing interaction (I've not looked into it)?

@jeremiedbb
Member Author

It's ready to merge.

Are we able to protect it from that MKL-multiprocessing interaction (I've not looked into it)?

No, but it's still an improvement over master. Currently it only works with the threading backend; after this, it will also work with the loky backend (but not with the multiprocessing backend).
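For context, a sketch of the reproducer from #14249 that now works under the default loky backend (sizes chosen to exceed joblib's 1 MB memmap threshold):

```python
import numpy as np
from joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# X is ~1.6 MB, large enough for joblib to memmap it when dispatching to
# loky workers; before this fix, fitting raised a read-only error.
X, y = make_regression(20000, 10, random_state=0)
with parallel_backend("loky"):
    model = ElasticNetCV(cv=3, n_jobs=2).fit(X, y)
print(model.alpha_)
```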

@jnothman jnothman merged commit 1d3a553 into scikit-learn:master Apr 28, 2020
6 of 7 checks passed
@adrinjalali
Member

#17010

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Successfully merging this pull request may close these issues.

ElasticNetCV fails under loky backend
8 participants