
[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898

Conversation

shivamgargsya (Contributor) commented Dec 16, 2019

Reference Issues/PRs

Fixes #15810

What does this implement/fix? Explain your changes.

When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which then raises an error when permutation_importance tries to assign to the DataFrame.
The bug is fixed by setting max_nbytes to None.

Any other comments?

Verified using the snippet below:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
df = pd.DataFrame(data=data.data.repeat(1000, axis=0))
y = data.target.repeat(1000)

clf = RandomForestClassifier()
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=-1)
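For context, the effect of the workaround can be shown with a self-contained toy (a sketch of the mechanism only, not scikit-learn's actual internals; shuffle_column is a made-up helper):

import numpy as np
from joblib import Parallel, delayed

def shuffle_column(X, col):
    # In-place write: with joblib's default memmapping this fails with
    # "assignment destination is read-only", because X arrives in the
    # worker as a read-only memmap.
    X[:, col] = np.random.permutation(X[:, col])
    return X[:, col].mean()

X = np.zeros((300000, 4))  # ~9.6 MB, above joblib's default '1M' threshold

# max_nbytes=None disables the automatic memmapping, so each worker
# receives an ordinary writable copy of X instead.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(shuffle_column)(X, i) for i in range(4)
)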

shivamgargsya changed the title When using permutation_importance with a large enough pandas DataFram… Permutation importance fail fix #15810 Dec 16, 2019
rth requested a review from thomasjpfan Dec 16, 2019
thomasjpfan (Member) left a comment

Please add a Fix entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
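For illustration, such an entry might look like this (the wording and placement are hypothetical; :pr: and :user: are the roles used throughout that file):

:mod:`sklearn.inspection`
.........................

- |Fix| :func:`inspection.permutation_importance` no longer errors when a
  large pandas DataFrame is memmapped read-only by joblib.
  :pr:`15898` by :user:`shivamgargsya`.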

rth (Member) commented Dec 16, 2019

Thanks! I think it could be an acceptable workaround; however, the underlying issue is that we share data between processes and then modify that data in place.

Triggering inter-process serialization is one way around it; another could be to keep using memmapping and trigger a copy manually. Ideally there may be a way to make a copy of the dataframe where only one column (the changed one) is a copy and the others are still views. Also, I'm not fully sure how the improvements to inter-process serialization in Python 3.8 would impact the choice of solution.

Any other thoughts @jeremiedbb @thomasjpfan ?

shivamgargsya (Contributor, Author) commented Dec 16, 2019

Please add a Fix entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

Thanks @thomasjpfan

andersbogsnes commented Dec 16, 2019

Since it only seems to fail on DataFrames, would it be an idea to conditionally turn off memmapping based on whether or not the input is a dataframe?

shivamgargsya (Contributor, Author) commented Dec 17, 2019

@andersbogsnes I have added the check for DataFrame instances now. Please take a look. I don't think a check for the label column is necessary.

shivamgargsya changed the title Permutation importance fail fix #15810 [MRG]Permutation importance fail fix #15810 Dec 17, 2019
andersbogsnes commented on 4588267 Dec 17, 2019

A small change I might suggest would be to conditionally set the max_nbytes parameter:

max_nbytes = None if hasattr(X, 'loc') else '1M'

shivamgargsya (Contributor, Author) commented Dec 17, 2019

@andersbogsnes aren't we checking for the same thing by checking the type of the input X on line 120:
type(X).__name__ == "DataFrame":

andersbogsnes commented Dec 17, 2019

That doesn’t allow for duck typing: if I were to subclass a DataFrame (or use another DataFrame-like library), the type check would not work. Using hasattr is also consistent with the check in _safe_column_indexing.
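For illustration, a hypothetical subclass shows the difference between the two checks:

import numpy as np
import pandas as pd

class MyFrame(pd.DataFrame):
    # Hypothetical subclass, standing in for a DataFrame-like library.
    pass

df = MyFrame(np.zeros((3, 3)))

print(type(df).__name__ == "DataFrame")  # False: the exact-name check misses it
print(hasattr(df, "loc"))                # True: the duck-typed check still matches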

shivamgargsya (Contributor, Author) commented Dec 17, 2019

Updated the check to be based on an attribute of the DataFrame class.

jnothman (Member) commented Dec 17, 2019

Thanks, @andersbogsnes for the help!

Should we have a way to include such costly test cases, but ordinarily skip them?

andersbogsnes commented Dec 18, 2019

I would always recommend having a test written for a known bug - that’s how I caught this error when we swapped our implementation for scikit-learn’s - but there is a cost of course.

The failing example above is fairly cheap though, as the cost is linear per example.

Does the test suite have a slow marker or something similar?

Alternatively, since the root cause is that the implementation modifies data in place, I'm wondering out loud whether there is a way to test for that directly, which might be cheaper.

jnothman added this to the 0.22.1 milestone Dec 18, 2019
jnothman (Member) commented Dec 18, 2019

Yes, I think it should be fine to add that test. It's not so slow, and we can make the RF fitting faster (by using iris without repetition, for example).

andersbogsnes commented Dec 18, 2019

Well, you need the repetition to hit the 1M limit where joblib turns on the memmap - I didn’t do any calculations on where that was, just that 100 wasn’t enough but 1000 worked :-)
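Back-of-envelope (float64, so 8 bytes per value), the repetition factor maps onto that threshold like this:

# iris: 150 rows x 4 columns x 8 bytes = 4800 bytes per copy
print(150 * 4 * 8 * 100)    # 480000 bytes (~0.5 MB): below the ~1 MB default
print(150 * 4 * 8 * 1000)   # 4800000 bytes (~4.8 MB): above it, memmapping kicks in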

thomasjpfan (Member) commented Dec 18, 2019

For this specific case, the following (smallish) example also fails on master:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)

clf = DummyClassifier(strategy='prior')
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=2)

We could make this into a test.

@glemaitre glemaitre self-assigned this Dec 19, 2019
@glemaitre glemaitre changed the title [MRG]Permutation importance fail fix #15810 [MRG] BUG avoid memmaping large dataframe in permutation_importance Dec 19, 2019
glemaitre (Contributor) commented Dec 19, 2019

We should also check the output:

import pytest
from numpy.testing import assert_allclose

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: required while experimental (v0.22)
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance


def test_permutation_importance_memmapping_dataframe():
    pd = pytest.importorskip("pandas")

    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)

    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)

    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )

    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )

(we have something weird happening with the random_state, though)

ogrisel (Member) commented Dec 19, 2019

To make the test suggested in #15898 (comment) faster yet not too trivial, you can use max_leaf_nodes=3, max_iter=5.

Or alternatively use a regression problem and a Ridge model.

DummyClassifier/DummyRegressor are too trivial: they do not use X, so permutation importances should always be exactly zero, and they cannot highlight problems related to non-shared random state instances when n_jobs > 1.
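A sketch of that Ridge-based variant (the test name and shapes are hypothetical; n_features=20 keeps the DataFrame above the ~1 MB memmapping threshold):

import pytest
from numpy.testing import assert_allclose

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance


def test_permutation_importance_memmapping_dataframe_ridge():
    pd = pytest.importorskip("pandas")

    # 7000 x 20 float64 ~= 1.12 MB, enough to trigger joblib's memmapping
    X, y = make_regression(n_samples=7000, n_features=20, random_state=0)
    df = pd.DataFrame(X)

    reg = Ridge().fit(df, y)

    imp_parallel = permutation_importance(
        reg, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    imp_sequential = permutation_importance(
        reg, df, y, n_jobs=1, random_state=0, n_repeats=2
    )

    assert_allclose(imp_parallel['importances'], imp_sequential['importances'])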

ogrisel (Member) commented Dec 19, 2019

The issue with different results with n_jobs=1 and n_jobs>1 is being dealt with in #15933 to keep concerns separated (second review welcome).

ogrisel (Member) commented Dec 19, 2019

Actually, reading this code made me realize that it is not thread safe. In particular, it breaks with the threading backend.

I have updated #15933 to actually fix both issues at the same time.

Therefore I would rather close this PR in favor of #15933 if people agree that this is a better fix.

thomasjpfan (Member) commented Dec 19, 2019

Therefore I would rather close this PR in favor of #15933 if people agree that this is a better fix.

I agree

glemaitre (Contributor) commented Dec 19, 2019

I'm +1 for closing. @shivamgargsya This is a bit unfortunate, but we discovered the underlying bug because of your work in this PR. Don't feel like it was wasted effort.

ogrisel (Member) commented Dec 20, 2019

Thanks for your contribution @shivamgargsya! Closing.

ogrisel closed this Dec 20, 2019