Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX safe indexing for polars Series #28521

Merged
merged 10 commits into from Mar 6, 2024

Conversation

Charlie-XIAO
Copy link
Contributor

@Charlie-XIAO Charlie-XIAO commented Feb 23, 2024

Towards #28488.

The initial goal of this PR is to make _safe_indexing work for polars Series and changing _is_polars_df into _is_polars_df_or_series suffices.

However when extending the tests for pandas Series and DataFrame to polars, I found some other places that may need to be fixed (e.g., _polars_indexing).

Copy link

github-actions bot commented Feb 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 411b659. Link to the linter CI: here

@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review February 24, 2024 02:06
@Charlie-XIAO Charlie-XIAO changed the title Safe indexing polars FIX safe indexing for polars Feb 26, 2024
@betatim betatim changed the title FIX safe indexing for polars FIX safe indexing for polars Series Feb 27, 2024
@betatim
Copy link
Member

betatim commented Feb 27, 2024

I found some other places that may need to be fixed

Did you fixed the ones you found? Or are there more in addition to what is in this PR?

A general though I had after reading the diff: should we move to using pandas_dataframe, pandas_series, polars_series, etc? Instead of having some prefixed by the library and some not? It would make things more explicit, but probably also increase the diff of this PR by quite a bit? What do you think?

@Charlie-XIAO
Copy link
Contributor Author

Did you fixed the ones you found? Or are there more in addition to what is in this PR?

Sorry that my wording is not accurate. The fixes I made are simply to pass the _safe_indexing-related tests that involve pandas Series (when extending to polars Series).

A general though I had after reading the diff: should we move to using pandas_dataframe, pandas_series, polars_series, etc? Instead of having some prefixed by the library and some not? It would make things more explicit, but probably also increase the diff of this PR by quite a bit? What do you think?

Yes I would prefer explicit naming as well, and since this is not public API changing the names should not cause problems. But somehow "pandas" and "dataframe" both means pandas.DataFrame right now so I'm worrying if there are complex reasons behind it that I do not know. If there is not such reasons I'm happy to switch to the names you proposed :)

but probably also increase the diff of this PR by quite a bit?

If we decide to make namings specific I think we should do it in another PR and then come back to this one. After all _convert_container is not strongly related to this PR (I'm only relying on it to create the tests), and as you said it would indeed increase the diff a lot (and hide what really matters in this PR).

@glemaitre glemaitre self-requested a review February 29, 2024 14:49
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of suggestion. In general, it looks good. I think this is better to add a couple of comments because there is some implicit knowledge in the branching.

doc/whats_new/v1.5.rst Outdated Show resolved Hide resolved
"series", "index", "slice", "sparse_csr", "sparse_csc"}
"series", "index", "slice", "sparse_csr", "sparse_csc", \
"sparse_csr_array", "sparse_csc_array", "pyarrow", "polars", \
"polars_series"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this function start to not be maintainable. We will need to modify the signature and split contructor_name into different parameter such as container_type, constuctor_lib, and something to reflect dense vs. sparse.

So here, let's keep the code that you suggest but we need to follow-up on this code.

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/__init__.py Outdated Show resolved Hide resolved
@lorentzenchr
Copy link
Member

lorentzenchr commented Mar 1, 2024

I think we must deliberately decide which version of polars we support. Depending on that different methods/syntax are used for indexing, see, e.g., pola-rs/polars#4924.

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented Mar 1, 2024

It seems that polars may remove things that used to work, but are currently not creating things that did not work? So does it mean that if our solution works on the current latest version then it should work on all previous versions? I might be mistaken though.

@lorentzenchr
Copy link
Member

lorentzenchr commented Mar 1, 2024

So does it mean that if our solution works on the current latest version then it should work on all previous versions?

Not at all. Over the past year, a lot of deprecations happened.
This is best tested in a test matrix with, at least, oldest supported (from us) and newest release as we do for our required dependencies like numpy and scipy.

@glemaitre
Copy link
Member

I think that we defined for the moment a mean dependency of 0.19.12 and we are testing for it as well as the latest one available. So if there is a change of behaviour we should catch it either via a deprecation warning or a failure (if polars does not warn first).

I assume that regarding the version, we will have to be flexible to bump easily the minimum version since polars is releasing fast for the moment.

@glemaitre
Copy link
Member

If polars really break something in the indexing, I assume that we have to decide either to:

  • fix it with branching depending of the version or if we can have a workaround to get the expected behaviour that makes it consistent with pandas in our use-case
  • bump the dependency to a newer version

@lorentzenchr do you envisage something different?

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented Mar 1, 2024

I tested locally that min and latest versions both work. Just out of curiosity in which job(s) do we test against the minimum versions of those libraries?

@glemaitre
Copy link
Member

The only place that we test for the minimum pandas or polars are only in the doc_min_dependencies build.

@lorentzenchr
Copy link
Member

Do we also test latest polars and pandas?

@Charlie-XIAO
Copy link
Contributor Author

The only place that we test for the minimum pandas or polars are only in the doc_min_dependencies build.

Emm but that one does not seem to run the test suite?

Do we also test latest polars and pandas?

For instance pylatest_conda_forge_mkl_linux-64 I think.

@glemaitre
Copy link
Member

Emm but that one does not seem to run the test suite?

Indeed, it will be some indirect testing through examples.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think that we can first merge this as-is and reconsider the way we build our matrix for the CI in another PR.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
I’ll merge under the assumption of follow-up PRs.

sklearn/utils/__init__.py Show resolved Hide resolved
@lorentzenchr lorentzenchr merged commit de7a43f into scikit-learn:main Mar 6, 2024
30 checks passed
@Charlie-XIAO Charlie-XIAO deleted the safe-indexing-polars branch March 6, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants