[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898
What does this implement/fix? Explain your changes.
When using permutation_importance with a large enough pandas DataFrame and worker processes (n_jobs != 1), joblib switches to read-only memmap mode, which then raises an error, because permutation_importance tries to assign to the DataFrame.
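For context, here is a minimal standalone illustration of that mechanism (not code from the PR; the threshold is joblib's documented max_nbytes default of '1M'): arrays above the threshold reach the workers as read-only memmaps, so any in-place write raises.

```python
import numpy as np
from joblib import Parallel, delayed

# ~1.1 MB, above joblib's default max_nbytes ('1M'), so with a process-based
# backend the workers receive a read-only np.memmap instead of the array.
big = np.zeros((7000, 20))

def try_write(arr):
    try:
        arr[0, 0] = 1.0  # the kind of in-place assignment permutation_importance does
        return "writable (%s)" % type(arr).__name__
    except ValueError as exc:
        return "read-only (%s)" % exc

print(Parallel(n_jobs=2)(delayed(try_write)(big) for _ in range(2)))
```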
Any other comments?
Verified using the snippet below:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
df = pd.DataFrame(X)
clf = RandomForestClassifier().fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=-1)
```
The bug was fixed by setting max_nbytes to None.
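For illustration only, a sketch of the effect of that change using plain joblib (not the PR's actual diff): with max_nbytes=None, joblib pickles the array to the workers instead of memmapping it, so the worker-side write succeeds.

```python
import numpy as np
from joblib import Parallel, delayed

big = np.zeros((7000, 20))

def write_ok(arr):
    arr[0, 0] = 1.0  # succeeds: arr is a regular pickled ndarray, not a memmap
    return type(arr).__name__

# max_nbytes=None disables joblib's automatic memmapping of large inputs.
print(Parallel(n_jobs=2, max_nbytes=None)(delayed(write_ok)(big) for _ in range(2)))
```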
thomasjpfan left a comment
Please add an
Thanks! I think it could be an acceptable workaround; however, the initial issue is that we share data between processes and then perform in-place modification of that data.
Triggering inter-process serialization is one way around it; another could be to keep using mmap and trigger a copy manually. Ideally there may be a way to make a copy of the dataframe where only one column (the changed one) is a copy and the other columns are still views, as sketched below. Also, I'm not fully sure how the improvements to inter-process serialization in Python 3.8 would affect the choice of solution.
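A rough sketch of that one-column-copy idea (illustrative only, not what the PR implements; under recent pandas with copy-on-write semantics this behaves as shown, while older versions may differ in what stays a view):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list("abc"))

X_permuted = df.copy(deep=False)      # new frame, columns still backed by df's data
shuffled = df["b"].to_numpy().copy()  # copy only the column being permuted
rng.shuffle(shuffled)
X_permuted["b"] = shuffled            # rebind "b" in the copy only

print(df["b"].tolist())               # [1.0, 4.0, 7.0, 10.0]: the parent is untouched
```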
I would always recommend having a test written for a known bug - that’s how I caught this error when we swapped our implementation for scikit-learn’s - but there is a cost of course.
The failing example above is fairly cheap though, as the cost is linear per example.
Does the test suite have a
Alternatively, since the root cause is that the implementation modifies data in place, I am wondering out loud whether there is a way to test for that instead, which might be cheaper?
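One way to read that suggestion (a hypothetical test with an invented name; it only checks that the caller's data comes back unmodified and would not by itself exercise the read-only memmap path):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

def test_permutation_importance_leaves_input_unmodified():
    X, y = make_classification(n_samples=100, random_state=0)
    df = pd.DataFrame(X)
    df_before = df.copy()
    clf = DummyClassifier(strategy="prior").fit(df, y)
    permutation_importance(clf, df, y, n_repeats=2, random_state=0)
    pd.testing.assert_frame_equal(df, df_before)
```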
For this specific case, the following (smallish) example also fails on master:
```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)
clf = DummyClassifier(strategy='prior')
clf.fit(df, y)
r = permutation_importance(clf, df, y, n_jobs=2)
```
We could make this into a test.
We should also check for the output:
```python
import pytest
from numpy.testing import assert_allclose
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance


def test_permutation_importance_memmapping_dataframe():
    pd = pytest.importorskip("pandas")
    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)
    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)
    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )
    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )
```
(we have something weird happening with the
To make the test suggested in #15898 (comment) faster yet not too trivial, you can use
Or alternatively use a regression problem with […] and a […]