FIX safe indexing for polars `Series` #28521

Charlie-XIAO · 2024-02-23T18:14:01Z

Towards #28488.

The initial goal of this PR is to make _safe_indexing work for polars Series; changing _is_polars_df into _is_polars_df_or_series suffices for that.

However, when extending the tests for pandas Series and DataFrame to polars, I found some other places that may need to be fixed (e.g., _polars_indexing).

github-actions · 2024-02-23T18:16:55Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 411b659.

betatim · 2024-02-27T13:02:13Z

> I found some other places that may need to be fixed

Did you fix the ones you found? Or are there more in addition to what is in this PR?

A general thought I had after reading the diff: should we move to using pandas_dataframe, pandas_series, polars_series, etc? Instead of having some prefixed by the library and some not? It would make things more explicit, but probably also increase the diff of this PR by quite a bit? What do you think?

Charlie-XIAO · 2024-02-27T13:22:52Z

> Did you fix the ones you found? Or are there more in addition to what is in this PR?

Sorry, my wording was not accurate. The fixes I made are only those needed to pass the _safe_indexing-related tests for pandas Series once they are extended to polars Series.

> A general thought I had after reading the diff: should we move to using pandas_dataframe, pandas_series, polars_series, etc? Instead of having some prefixed by the library and some not? It would make things more explicit, but probably also increase the diff of this PR by quite a bit? What do you think?

Yes, I would prefer explicit naming as well, and since this is not public API, changing the names should not cause problems. But right now "pandas" and "dataframe" both mean pandas.DataFrame, so I'm worried there may be complex reasons behind it that I do not know. If there are no such reasons, I'm happy to switch to the names you proposed :)

> but probably also increase the diff of this PR by quite a bit?

If we decide to make the names specific, I think we should do it in another PR and then come back to this one. After all, _convert_container is not strongly related to this PR (I'm only relying on it to create the tests), and as you said, it would indeed increase the diff a lot (and hide what really matters in this PR).

glemaitre

A couple of suggestions. In general, it looks good. I think it is better to add a couple of comments because there is some implicit knowledge in the branching.

doc/whats_new/v1.5.rst

glemaitre · 2024-02-29T15:09:44Z

sklearn/utils/_testing.py

-            "series", "index", "slice", "sparse_csr", "sparse_csc"}
+            "series", "index", "slice", "sparse_csr", "sparse_csc", \
+            "sparse_csr_array", "sparse_csc_array", "pyarrow", "polars", \
+            "polars_series"}


I see that this function is starting to become unmaintainable. We will need to modify the signature and split constructor_name into several parameters such as container_type, constructor_lib, and something to reflect dense vs. sparse.

So here, let's keep the code that you suggest, but we need to follow up on this code.
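
To make the proposed split concrete, here is a rough sketch of how the single constructor_name string could be decomposed into separate fields. Everything in it is hypothetical: ContainerSpec, LEGACY_NAMES, spec_from_legacy_name, and the specific field values are invented for illustration, and only a subset of the names from the diff above is mapped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContainerSpec:
    """Hypothetical replacement for a single ``constructor_name`` string."""

    container_type: str  # e.g. "series", "dataframe", "matrix", "array"
    constructor_lib: str  # e.g. "pandas", "polars", "scipy"
    sparse_format: Optional[str] = None  # e.g. "csr", "csc"; None means dense

    @property
    def is_sparse(self):
        return self.sparse_format is not None


# Assumed mapping from a few of the existing names to the split form.
LEGACY_NAMES = {
    "series": ContainerSpec("series", "pandas"),
    "polars": ContainerSpec("dataframe", "polars"),
    "polars_series": ContainerSpec("series", "polars"),
    "sparse_csr": ContainerSpec("matrix", "scipy", "csr"),
    "sparse_csc": ContainerSpec("matrix", "scipy", "csc"),
    "sparse_csr_array": ContainerSpec("array", "scipy", "csr"),
    "sparse_csc_array": ContainerSpec("array", "scipy", "csc"),
}


def spec_from_legacy_name(constructor_name):
    """Translate a legacy name into its explicit components."""
    try:
        return LEGACY_NAMES[constructor_name]
    except KeyError:
        raise ValueError(f"Unknown constructor_name: {constructor_name!r}") from None
```

A translation table like this would let existing tests keep their current names during a transition, while new code passes the explicit fields.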

sklearn/utils/__init__.py

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

lorentzenchr · 2024-03-01T14:06:39Z

I think we must deliberately decide which version of polars we support. Depending on that different methods/syntax are used for indexing, see, e.g., pola-rs/polars#4924.

Charlie-XIAO · 2024-03-01T14:31:55Z

It seems that polars may remove things that used to work, but is not currently adding support for things that previously did not work? So does it mean that if our solution works on the current latest version, then it should work on all previous versions? I might be mistaken though.

lorentzenchr · 2024-03-01T15:56:22Z

> So does it mean that if our solution works on the current latest version then it should work on all previous versions?

Not at all. Over the past year, a lot of deprecations happened.
This is best tested in a test matrix with, at least, the oldest version we support and the newest release, as we do for our required dependencies like numpy and scipy.

glemaitre · 2024-03-01T17:31:51Z

I think that for the moment we have defined a minimum dependency of 0.19.12, and we are testing against it as well as the latest version available. So if there is a change of behaviour, we should catch it either via a deprecation warning or a failure (if polars does not warn first).

I assume that regarding the version, we will have to be flexible and bump the minimum version easily, since polars is releasing fast at the moment.

glemaitre · 2024-03-01T17:35:46Z

If polars really breaks something in the indexing, I assume that we have to decide to either:

- fix it with branching depending on the version or, if possible, with a workaround that gives the expected behaviour and keeps it consistent with pandas in our use case
- bump the dependency to a newer version

@lorentzenchr do you envisage something different?
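
As an illustration of the first option, here is a minimal sketch of version-based branching. The 0.19.12 minimum is the one quoted above; parse_release, choose_strategy, the breaking_version argument, and the returned labels are all hypothetical, and a real implementation would more likely rely on a proper version parser than on this simplified one.

```python
import re

MIN_POLARS_VERSION = "0.19.12"  # minimum version quoted in this thread


def parse_release(version):
    """Return the leading numeric part of a version string as a 3-tuple or longer.

    Pre-release suffixes such as "rc1" are ignored in this simplified sketch.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Cannot parse version: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad so that "0.20" and "0.20.0" compare as equal.
    return parts + (0,) * max(0, 3 - len(parts))


def choose_strategy(installed, breaking_version, minimum=MIN_POLARS_VERSION):
    """Pick a code path from the installed version (labels are hypothetical)."""
    release = parse_release(installed)
    if release < parse_release(minimum):
        return "unsupported"  # the alternative is to bump the minimum version
    if release < parse_release(breaking_version):
        return "legacy"  # behaviour before the hypothetical breaking change
    return "current"
```

For example, choose_strategy("0.19.12", "0.20.0") selects the legacy path, while any version at or above the assumed breaking release selects the current one.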

Charlie-XIAO · 2024-03-01T17:40:03Z

I tested locally that min and latest versions both work. Just out of curiosity in which job(s) do we test against the minimum versions of those libraries?

glemaitre · 2024-03-01T17:58:45Z

The only place where we test the minimum pandas or polars versions is the doc_min_dependencies build.

lorentzenchr · 2024-03-01T18:02:27Z

Do we also test latest polars and pandas?

Charlie-XIAO · 2024-03-01T18:07:30Z

> The only place where we test the minimum pandas or polars versions is the doc_min_dependencies build.

Emm, but that one does not seem to run the test suite?

> Do we also test latest polars and pandas?

For instance pylatest_conda_forge_mkl_linux-64, I think.

glemaitre · 2024-03-01T20:17:40Z

> Emm but that one does not seem to run the test suite?

Indeed, it will be some indirect testing through examples.

glemaitre

LGTM. I think that we can first merge this as-is and reconsider the way we build our matrix for the CI in another PR.

lorentzenchr

LGTM
I’ll merge under the assumption of follow-up PRs.

sklearn/utils/__init__.py

Charlie-XIAO added 4 commits February 22, 2024 23:05

FIX _safe_indexing does not work for polars series

392e316

Merge remote-tracking branch 'upstream/main' into safe-indexing-polars

f7053f2

extend tests to polars for safe indexing

4835a0b

remove some failing tests

864b981

github-actions bot added the module:utils label Feb 23, 2024

Charlie-XIAO added 2 commits February 24, 2024 10:03

changelog added

f761d43

Merge remote-tracking branch 'upstream/main' into safe-indexing-polars

7cd4092

Charlie-XIAO marked this pull request as ready for review February 24, 2024 02:06

Charlie-XIAO changed the title ~~Safe indexing polars~~ FIX safe indexing for polars Feb 26, 2024

Merge branch 'main' into safe-indexing-polars

28b7fc2

betatim changed the title ~~FIX safe indexing for polars~~ FIX safe indexing for polars Series Feb 27, 2024

glemaitre self-requested a review February 29, 2024 14:49

glemaitre reviewed Mar 1, 2024


Charlie-XIAO and others added 2 commits March 1, 2024 21:13

handle the branching more elegantly

fc90d9c

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Merge remote-tracking branch 'upstream/main' into safe-indexing-polars

1138bae

rewording changelog and some comments

411b659

glemaitre approved these changes Mar 4, 2024


lorentzenchr approved these changes Mar 6, 2024


lorentzenchr merged commit de7a43f into scikit-learn:main Mar 6, 2024
30 checks passed

Charlie-XIAO deleted the safe-indexing-polars branch March 6, 2024 15:09

Charlie-XIAO mentioned this pull request Mar 23, 2024

RFC _convert_container #28681

Open
