Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] BUG avoid memmaping large dataframe in permutation_importance #15898

Conversation

shivamgargsya
Copy link
Contributor

@shivamgargsya shivamgargsya commented Dec 16, 2019

Reference Issues/PRs

Fixes #15810

What does this implement/fix? Explain your changes.

When using permutation_importance with a large enough pandas DataFrame and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame.
The bug was fixed by setting the bug by setting max_nbytes to None.

Any other comments?

Verified using below snippet
`import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
df = pd.DataFrame(data=data.data.repeat(1000, axis=0))
y = data.target.repeat(1000)

clf = RandomForestClassifier()
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=-1)`

…e and n_jobs > 0, joblib switches to read-only memmap mode, which proceeds to raise, as permutation_importance tries to assign to the DataFrame.

The bug was fixed by setting the bug by setting max_nbytes to None.
@shivamgargsya shivamgargsya changed the title When using permutation_importance with a large enough pandas DataFram… Permutation importance fail fix #15810 Dec 16, 2019
@rth rth requested a review from thomasjpfan Dec 16, 2019
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Please add an Fix to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

@rth
Copy link
Member

@rth rth commented Dec 16, 2019

Thanks! I think it could be an acceptable workaround, however the initial issue is that we share data between processes, then perform inplace modification of this data.

Triggering inter-process serialization is one way around it, another one could be to keep using mmap and trigger a copy manually. Ideally there may be a way to make copy of the dataframe, where only one columns (the changed one) is a copy and other ones are still views. Also I'm not fully sure how the improvement of inter-process serialization in Python 3.8 would impact the choice of the solution.

Any other thoughts @jeremiedbb @thomasjpfan ?

@shivamgargsya
Copy link
Contributor Author

@shivamgargsya shivamgargsya commented Dec 16, 2019

Please add an Fix to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

Thanks @thomasjpfan

@andersbogsnes
Copy link

@andersbogsnes andersbogsnes commented Dec 16, 2019

Since it only seems to fail on DataFrames, would it be an idea to conditionally turn off memmapping based on whether or not the input is a dataframe?

@shivamgargsya
Copy link
Contributor Author

@shivamgargsya shivamgargsya commented Dec 17, 2019

@andersbogsnes I have added the check for Dataframe instance now. Please take a look don't think check for label column is necessary

@shivamgargsya shivamgargsya changed the title Permutation importance fail fix #15810 [MRG]Permutation importance fail fix #15810 Dec 17, 2019
andersbogsnes
Copy link

@andersbogsnes andersbogsnes commented on 4588267 Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small change I might suggest would be to conditionally set the max_nbytes parameter:

max_nbytes = None if hasattr(X, 'loc') else '1M'

@shivamgargsya
Copy link
Contributor Author

@shivamgargsya shivamgargsya commented Dec 17, 2019

@andersbogsnes aren't we checking for same by checking type of input X. in line 120
type(X).__name__ == "DataFrame":

@andersbogsnes
Copy link

@andersbogsnes andersbogsnes commented Dec 17, 2019

That doesn’t allow for duck typing, so if I were to subclass a DataFrame (or use another DataFrame like library) then type would not work. It’s also consistent with the check in _safe_column_indexing

@shivamgargsya
Copy link
Contributor Author

@shivamgargsya shivamgargsya commented Dec 17, 2019

Updated the check based on attribute of data frame class.

doc/whats_new/v0.22.rst Outdated Show resolved Hide resolved
@jnothman
Copy link
Member

@jnothman jnothman commented Dec 17, 2019

Thanks, @andersbogsnes for the help!

Should we have a way to include such costly test cases, but ordinarily skip them?

@andersbogsnes
Copy link

@andersbogsnes andersbogsnes commented Dec 18, 2019

I would always recommend having a test written for a known bug - that’s how I caught this error when we swapped our implementation for scikit-learn’s - but there is a cost of course.

The failing example above is fairly cheap though, as the cost is linear per example.

Does the test suite have a slow marker or something similar?

Alternatively, the root cause is that the implementation modifies in-place, wondering out loud if there is a way to test for that instead which might be cheaper?

@jnothman jnothman added this to the 0.22.1 milestone Dec 18, 2019
@jnothman
Copy link
Member

@jnothman jnothman commented Dec 18, 2019

Yes, I think it should be fine to add that test. It's not so slow, and we can make the RF fitting faster (by using iris without repetition, for example)

@andersbogsnes
Copy link

@andersbogsnes andersbogsnes commented Dec 18, 2019

Well, you need the repetition to hit the 1M limit where joblib turns on the memmap - I didn’t do any calculations on where that was, just that 100 wasn’t enough but 1000 worked :-)

@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Dec 18, 2019

For this specific case, the following (smallish) example also fails on master:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=7000)
df = pd.DataFrame(X)

clf = DummyClassifier(strategy='prior')
clf.fit(df, y)

r = permutation_importance(clf, df, y, n_jobs=2)

We could make this into a test.

@glemaitre glemaitre self-assigned this Dec 19, 2019
@glemaitre glemaitre changed the title [MRG]Permutation importance fail fix #15810 [MRG] BUG avoid memmaping large dataframe in permutation_importance Dec 19, 2019
@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Dec 19, 2019

We should also check for the output:

def test_permuation_importance_memmaping_dataframe():
    pd = pytest.importorskip("pandas")

    X, y = make_classification(n_samples=7000)
    df = pd.DataFrame(X)

    clf = HistGradientBoostingClassifier(max_iter=5, random_state=42)
    clf.fit(df, y)

    importances_parallel = permutation_importance(
        clf, df, y, n_jobs=2, random_state=0, n_repeats=2
    )
    importances_sequential  = permutation_importance(
        clf, df, y, n_jobs=1, random_state=0, n_repeats=2
    )

    assert_allclose(
        importances_parallel['importances'],
        importances_sequential['importances']
    )

(we have something weird happening with the random_state thought)

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Dec 19, 2019

To make the test suggested in #15898 (comment) faster yet not too trivial, you can use max_leaf_nodes=3, max_iter=5.

Or alternatively use a regression problem with and a Ridge model.

DummyClassifier/DummyRegressor are too trivial: they do not use X so permutation importances should always be exactly zero and they cannot highlight problem related to non-shared random state instances when n_jobs > 1.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Dec 19, 2019

The issue with different results with n_jobs=1 and n_jobs>1 is being dealt with in #15933 to keep concerns separated (second review welcome).

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Dec 19, 2019

Actually, reading this code made me realize that the code is not thread safe. In particular it break with the threading backend.

I have updated #15933 to actually fix both issues at the same time.

Therefore I would rather close this PR in favor of #15933 if people agree that this is a better fix.

@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Dec 19, 2019

Therefore I would rather close this PR in favor of #15933 if people agree that this is a better fix.

I agree

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Dec 19, 2019

I'm + 1 for closing. @shivamgargsya This is a bit unfortunate but we discovered the underlying bug because of your work in this PR. Don't feel like it was a vain effort.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Dec 20, 2019

Thanks for your contribution @shivamgargsya! Closing.

@ogrisel ogrisel closed this Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants