Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX permutation_importance with polars dataframe raises warning on feature names #28513

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

Charlie-XIAO
Copy link
Contributor

@Charlie-XIAO Charlie-XIAO commented Feb 23, 2024

Reference Issues/PRs

Closes #28488.

What does this implement/fix? Explain your changes.

In permutation_importance we forcefully convert polars dataframes to arrays, causing UserWarning: X does not have valid feature names, but XXX was fitted with feature names.

I added no extra tests, except for extending all tests on pandas to also include polars.

Copy link

github-actions bot commented Feb 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0fc76e3. Link to the linter CI: here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glemaitre Here are some similar issues to what you were working on. WDYT?

sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
@Charlie-XIAO
Copy link
Contributor Author

Not sure if I'm using the adapters correctly with this new commit? (Haven't added tests for them yet.)

@glemaitre glemaitre self-requested a review February 28, 2024 16:50
@lesteve
Copy link
Member

lesteve commented Mar 18, 2024

Seems like it could be useful and could potentially fix #28488. There are some conflicts to fix and I guess this may need a bit more work as well?

@Charlie-XIAO
Copy link
Contributor Author

Sorry I totally forgot about this PR; will revisit in a few days.

@Charlie-XIAO
Copy link
Contributor Author

I've resolved the merge conflicts and passed the previously failing checks. But note that it is merely a "working" version and I'm not sure whether I'm using adapters in the correct way. @adrinjalali @glemaitre @lesteve do you have bandwidth to review this one :)

@glemaitre
Copy link
Member

I'll have a look.

@adrinjalali
Copy link
Member

ping @glemaitre

doc/whats_new/v1.5.rst Outdated Show resolved Hide resolved
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll like to have the thoughts on @thomasjpfan regarding this PR: we are stretching the _set_output scope because now, we just using it as a "dispatcher" for the different container library. This is quite different from the original purpose.

sklearn/utils/_set_output.py Outdated Show resolved Hide resolved
# pandas may match the indices of `X` and `col` on certain platforms, but we
# want the column replacement to behave as if `col` is an array without index,
# so we reset the index of `col` in the first place
col.index = X.index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the trick annoying. What I find weird is that this was trigger in the debian 32 bits that should not run with pandas. So I'm not really sure what is the reason for the bug. I'll probably try to debug more by pushing a couple of commits.

sklearn/inspection/_permutation_importance.py Outdated Show resolved Hide resolved
@thomasjpfan
Copy link
Member

I'll like to have the thoughts on @thomasjpfan regarding this PR: we are stretching the _set_output scope because now, we just using it as a "dispatcher" for the different container library.

At a glance, I do not like it. I originally intended ContainerAdapterProtocol to the a very minimal set of methods to get any Array container output to work. Adding more methods will make it harder to add more containers in the future.

We have been hiding the complexity of our dataframe operations in functions such as _safe_indexing and _safe_assign. Can we continue to do that with this PR?

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented May 22, 2024

Indeed having a nice _safe_assign can avoid bloating ContainerAdapterProtocol, but _safe_assign seems to support very limited use cases at this moment. I may take some time to look at how to extend _safe_assign before revisiting this PR.

@Charlie-XIAO
Copy link
Contributor Author

But I can't think of a better way of doing .copy() (polars uses .clone() instead), and I think it should be easy to implement for any container that we may support in the future.

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented May 23, 2024

I believe I'm making _safe_assign too bloated, but really cannot find a very nice way when indexing polars. If it's just setting a column it's simple, but if it's arbitrary indexing-and-setting, then things become really complicated.

I think this currently is a "working" version, but in a sense of numerical arrays/dataframes. When it comes to object or categorical data something could be broken especially for polars. @glemaitre @thomasjpfan Would you possibly provide some insights/suggestions when you have time?

@adrinjalali
Copy link
Member

We might also consider https://github.com/narwhals-dev/narwhals soon here (cc @MarcoGorelli ) 😉

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented May 25, 2024

Looks pretty promising (seems like dataframe version of array-api-compat); indeed I don't think we as a downstream consumer should spend too much effort maintaining these (and I'm clearly writing smelly code that works only sometimes 🤣)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Improve "polars" integration (error, warning & linting examples)
5 participants