ENH Adds polars output support to `set_output` API #27315

thomasjpfan · 2023-09-08T01:34:57Z

Reference Issues/PRs

Related to #25896
Related to #26683
Related to #27258
Related to #26835

What does this implement/fix? Explain your changes.

This PR adds set_output(transform="polars") to all transformers. Overall, this PR abstracts the dataframe-specific API requirements of set_output into a ContainerAdapterProtocol. ContainerAdapterProtocol is generic enough to support other containers. In principle, Xarray support will only require another class that implements the ContainerAdapterProtocol, and everything else should "just work".

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas DataFrames do not have this issue because pandas uses a block manager for 2D ndarrays.
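To make the adapter idea concrete, here is a minimal sketch, not the code in this PR. The names `AdapterSketch`, `DictOfColumnsAdapter`, and `wrap_output`, the simplified method signatures, and the toy dict-of-columns container are all assumptions chosen for illustration; the real protocol lives in sklearn/utils/_set_output.py.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class AdapterSketch(Protocol):
    """Hypothetical, simplified stand-in for a container adapter protocol."""

    container_lib: str

    def create_container(self, X_output, columns):
        """Wrap a row-major 2D result into the target container."""

    def is_supported_container(self, X):
        """Return True if X is already the target container."""


class DictOfColumnsAdapter:
    """Toy adapter: the "container" is a dict mapping column name -> list."""

    container_lib = "dictcols"

    def create_container(self, X_output, columns):
        # Transposing rows into columns copies the data, much like wrapping
        # a row-major array into a columnar dataframe can.
        return {name: [row[j] for row in X_output] for j, name in enumerate(columns)}

    def is_supported_container(self, X):
        return isinstance(X, dict)


def wrap_output(X_output, columns, adapter):
    """Return X_output in the adapter's container, wrapping only if needed."""
    if adapter.is_supported_container(X_output):
        return X_output
    return adapter.create_container(X_output, columns)


wrapped = wrap_output([[1.0, 2.0], [3.0, 4.0]], ["a", "b"], DictOfColumnsAdapter())
print(wrapped)  # {'a': [1.0, 3.0], 'b': [2.0, 4.0]}
```

Supporting another container in this sketch only means writing another class with the same members; `wrap_output` itself does not change.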

Any other comments?

Merging #27258 or #26683 will make this PR smaller. This PR uses code from those two PRs.

github-actions · 2023-09-08T01:36:09Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6e640ec. Link to the linter CI: here

sklearn/utils/_set_output.py

glemaitre

This looks great. Now that you have introduced a protocol, I am wondering if we should implement an adapter manager (experimental at first) that would have all the expected methods and would allow registering any new adapter in the future, even from outside scikit-learn.

doc/whats_new/v1.4.rst

sklearn/utils/estimator_checks.py

sklearn/utils/tests/test_set_output.py

sklearn/preprocessing/_function_transformer.py

sklearn/feature_selection/_base.py

sklearn/compose/_column_transformer.py

sklearn/pipeline.py

glemaitre · 2023-10-31T14:46:29Z

@thomasjpfan I missed that #27258 and #26683 were breakdowns of this PR. I assume that my reviews can be ported to those PRs.

To be honest, I don't think that the changes in this PR are too large.

glemaitre · 2023-11-09T09:58:31Z

sklearn/utils/_set_output.py

+    def supported_outputs(self):
+        return {"default"} | set(self.adapters)
+
+    def register(self, adapter):


Suggested change

-    def register(self, adapter):
+    def register(self, adapter):
+        """Register a container adapter.
+        Parameters
+        ----------
+        adapter : :class:`ContainerAdapterProtocol`
+            A container adapter that follows the protocol defined by
+            :class:`ContainerAdapterProtocol`.

In this regard, do we want to enforce something like:

    if not isinstance(adapter, ContainerAdapterProtocol):
        raise TypeError(
            "The adapter does not follow the ContainerAdapterProtocol."
        )

Or do we want to be lenient?
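For illustration, a strict `register` could look like the sketch below. `ContainerAdapterSketch`, `AdapterManagerSketch`, `ToyAdapter`, and the use of a `container_lib` attribute as the registry key are assumptions, not the code under review; only `supported_outputs` and the `TypeError` message mirror the snippets above.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ContainerAdapterSketch(Protocol):
    """Hypothetical, reduced protocol used only for this example."""

    container_lib: str

    def create_container(self, X_output, columns): ...


class AdapterManagerSketch:
    """Hypothetical registry that maps an output name to an adapter."""

    def __init__(self):
        self.adapters = {}

    @property
    def supported_outputs(self):
        return {"default"} | set(self.adapters)

    def register(self, adapter):
        # Strict variant: reject objects that lack the protocol's members.
        if not isinstance(adapter, ContainerAdapterSketch):
            msg = "The adapter does not follow the ContainerAdapterProtocol."
            raise TypeError(msg)
        self.adapters[adapter.container_lib] = adapter


class ToyAdapter:
    container_lib = "toy"

    def create_container(self, X_output, columns):
        return {"columns": list(columns), "data": X_output}


manager = AdapterManagerSketch()
manager.register(ToyAdapter())
print(sorted(manager.supported_outputs))  # ['default', 'toy']
```

Note that `isinstance` against a `runtime_checkable` protocol only checks that the members exist, not their signatures, so even the strict variant stays fairly lenient.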

glemaitre

It looks good to me. Thank you @thomasjpfan for the time spent on this one.

I am thinking that in the future, we could start to expose the manager publicly.

lorentzenchr

Partial review up to and including utils/__init__.py.

In another PR, we should move all the stuff from __init__.py into a separate file, in my opinion.

lorentzenchr · 2023-11-09T18:27:56Z

sklearn/preprocessing/_encoders.py

-                ' `ohe.set_output(transform="default").'
+                f"{capitalize_transform_output} output does not support sparse data."
+                f" Set sparse_output=False to output {transform_output} DataFrames or"
+                ' disable pandas output via `ohe.set_output(transform="default").'


Is it only pandas?

sklearn/utils/__init__.py

lorentzenchr · 2023-11-09T18:35:05Z

sklearn/utils/__init__.py

+        idx = _safe_indexing(np.arange(n_columns), key)
+    except IndexError as e:
+        raise ValueError(
+            "all features must be in [0, {}] or [-{}, 0]".format(


Nitpick: ruff once told me that raise statements should only contain plain strings, to avoid raising another error, e.g. from a malformed f-string.
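A small sketch of that pattern: build the message first, then raise with only the name. The helper `check_feature_index` and its arguments are hypothetical and only loosely modeled on the snippet above.

```python
def check_feature_index(idx, n_columns):
    """Hypothetical helper: validate one integer column index."""
    if not -n_columns <= idx < n_columns:
        # The message is built before the raise statement, so any formatting
        # mistake surfaces on this line rather than inside the raise itself.
        msg = f"all features must be in [0, {n_columns - 1}] or [-{n_columns}, 0]"
        raise ValueError(msg)
    # Normalize negative indices to their non-negative equivalent.
    return idx % n_columns


print(check_feature_index(-1, 3))  # 2
```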

lorentzenchr · 2023-11-09T18:37:59Z

sklearn/utils/__init__.py

+    """Same as _get_column_indices but for X with __dataframe__ protocol."""
+    n_columns = X_interchange.num_columns()
+
+    if isinstance(key, (list, tuple)) and not key:


Why do we return an empty list here?

It was to match the behavior of _get_column_indices on main:

scikit-learn/sklearn/utils/__init__.py

Lines 415 to 417 in 5d83a2e

    if isinstance(key, (list, tuple)) and not key:
        # we get an empty list
        return []
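As a rough illustration of the early return, the sketch below resolves column names to positions and short-circuits on an empty key. `column_indices_sketch` is a hypothetical helper that handles only string keys, not the scikit-learn function.

```python
def column_indices_sketch(column_names, key):
    """Hypothetical helper: map column names in `key` to integer positions."""
    if isinstance(key, (list, tuple)) and not key:
        # Nothing was selected, so there is nothing to look up.
        return []
    names = [key] if isinstance(key, str) else list(key)
    missing = [name for name in names if name not in column_names]
    if missing:
        msg = f"columns not found: {missing}"
        raise ValueError(msg)
    return [column_names.index(name) for name in names]


columns = ["age", "height", "weight"]
print(column_indices_sketch(columns, []))  # []
print(column_indices_sketch(columns, ["weight", "age"]))  # [2, 0]
```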

lorentzenchr

2nd review part

lorentzenchr · 2023-11-10T07:13:33Z

sklearn/utils/_set_output.py

-    index : array-like, default=None
-        Index for data. `index` is ignored if `data_to_wrap` is already a DataFrame.
+@runtime_checkable
+class ContainerAdapterProtocol(Protocol):


Would it make sense to make ContainerAdapterProtocol an abstract base class and then inherit from it in each XXXProtocol to ensure that we implement all the methods?

Basically you are pointing out protocol vs. abstract class. The advantage of the protocol is that we don't force people to import anything from scikit-learn to create one, while we can still do isinstance(my_custom_protocol, ContainerAdapterProtocol) in our codebase.

I think this is the right place to use a protocol.

NB: I like this blog post -> https://jellis18.github.io/post/2022-01-11-abc-vs-protocol/

I learned something!
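The trade-off discussed in this thread can be sketched with the standard library alone. All class names below are hypothetical; the point is only how `isinstance` and instantiation behave for an abstract base class versus a `runtime_checkable` protocol.

```python
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


class AdapterABC(ABC):
    @abstractmethod
    def create_container(self, X_output, columns): ...


@runtime_checkable
class AdapterProtocol(Protocol):
    def create_container(self, X_output, columns): ...


class ThirdPartyAdapter:
    """Written without importing or inheriting from either class above."""

    def create_container(self, X_output, columns):
        return dict(zip(columns, zip(*X_output)))


class IncompleteAdapter(AdapterABC):
    """Inherits from the ABC but forgets to implement the method."""


adapter = ThirdPartyAdapter()
print(isinstance(adapter, AdapterProtocol))  # True: structural check
print(isinstance(adapter, AdapterABC))  # False: no inheritance or registration

try:
    IncompleteAdapter()
except TypeError as exc:
    # The ABC refuses to instantiate a subclass with missing abstract methods.
    print(f"TypeError: {exc}")
```

The ABC catches a missing method at instantiation time but requires inheriting from it; the protocol needs no import on the third-party side and is only checked where `isinstance` is called.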

sklearn/utils/_set_output.py

sklearn/utils/tests/test_validation.py

sklearn/utils/estimator_checks.py

sklearn/utils/tests/test_utils.py

baggiponte · 2023-11-11T16:11:57Z

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas DataFrames do not have this issue because pandas uses a block manager for 2D ndarrays.

We have the same problem with functime when we pass the training data (polars.DataFrames) into Estimators. 1D array conversion should mostly be zero-copy, unless say there are null values. For 2D arrays, @topher-lo wrote this function that uses zarr to spill the data to disk and returns a numpy array.

ritchie46 · 2023-11-12T15:22:16Z

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas DataFrames do not have this issue because pandas uses a block manager for 2D ndarrays.

I made conversion from numpy to polars zero-copy in: pola-rs/polars#12403

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

glemaitre · 2023-11-14T09:06:17Z

Thanks @ritchie46 for this nice feature ;)

ogrisel · 2023-11-15T14:24:10Z

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

I assume that this would be the case only for Fortran contiguous 2D numpy arrays, right?

ritchie46 · 2023-11-17T17:44:29Z

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

I assume that this would be the case only for Fortran contiguous 2D numpy arrays, right?

Depends on how you slice them. 😁
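The NumPy side of this exchange can be checked directly. The sketch below only inspects memory layout with NumPy; whether a dataframe library can adopt a given slice without copying is up to that library and is not shown here.

```python
import numpy as np

X_c = np.arange(6, dtype=np.float64).reshape(3, 2)  # C (row-major) order
X_f = np.asfortranarray(X_c)  # same values, Fortran (column-major) order

col_c = X_c[:, 0]  # column of a C-ordered array: strided view
col_f = X_f[:, 0]  # column of an F-ordered array: contiguous view
row_c = X_c[0, :]  # row of a C-ordered array: contiguous view

# All three slices are views that share memory with their parent array.
print(np.shares_memory(col_c, X_c), np.shares_memory(col_f, X_f))  # True True

# Only some of them are contiguous 1D buffers.
print(col_c.flags["C_CONTIGUOUS"])  # False
print(col_f.flags["C_CONTIGUOUS"])  # True
print(row_c.flags["C_CONTIGUOUS"])  # True
```

So a 1D slice is contiguous when it runs along the array's fast axis: columns for Fortran order, rows for C order.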

ogrisel

Here is my pass of review. Mostly minor things. LGTM!

sklearn/utils/__init__.py

sklearn/utils/tests/test_utils.py

sklearn/preprocessing/_encoders.py

sklearn/utils/estimator_checks.py

sklearn/utils/validation.py

sklearn/utils/estimator_checks.py

lorentzenchr · 2023-11-19T21:47:32Z

This might turn out to be very useful. Thanks @thomasjpfan!

glemaitre · 2023-11-19T21:48:04Z

Thanks @thomasjpfan for the effort.

jjerphan · 2023-11-20T19:47:04Z

Thank you, @thomasjpfan.

chrstfer · 2023-12-02T22:32:35Z

Thank you very much @thomasjpfan, seriously helpful.

sklearn/pipeline.py

thomasjpfan added 2 commits September 7, 2023 21:28

ENH Adds polars output support

eab658f

DOC Adds PR number

fb8a386

Vincent-Maladiere reviewed Oct 30, 2023

glemaitre self-requested a review October 31, 2023 13:29

glemaitre reviewed Oct 31, 2023

thomasjpfan added 9 commits November 5, 2023 19:15

Merge remote-tracking branch 'upstream/main' into polars_output

abc9940

FIX Fixes error message

fa2e66a

CLN Address comments

20af5cf

TST Improve coverage

6ba32fb

TST Improve coverage for check_library_installed

424bee3

TST Improve coverage

82faddf

CLN Rename method

52562ba

DOC Adds period

3d87bf1

Add an adapter manager

23cadf1

glemaitre self-requested a review November 9, 2023 09:51

glemaitre reviewed Nov 9, 2023

glemaitre approved these changes Nov 9, 2023

glemaitre added this to the 1.4 milestone Nov 9, 2023

lorentzenchr reviewed Nov 9, 2023

lorentzenchr reviewed Nov 10, 2023

Vincent-Maladiere mentioned this pull request Nov 15, 2023

Test polars support skrub-data/skrub#826

Merged

thomasjpfan added 3 commits November 17, 2023 09:16

CLN First round of reviews

28a9b31

CLN Second round of reviews

e283623

TST Fixes failing tests

aacc6aa

ogrisel approved these changes Nov 17, 2023

ogrisel reviewed Nov 17, 2023

lorentzenchr approved these changes Nov 17, 2023

thomasjpfan added 2 commits November 19, 2023 12:14

Merge remote-tracking branch 'upstream/main' into polars_output

0d803ac

CLN Address comments

6e640ec

lorentzenchr merged commit 831c49a into scikit-learn:main Nov 19, 2023
27 checks passed

lorentzenchr mentioned this pull request Nov 23, 2023

ENH Adds polars output support to ColumnTransformer #26683

Merged

he7d3r reviewed Dec 20, 2023

thomasjpfan mentioned this pull request Dec 20, 2023

DOC Fixes pipeline docstring for pandas output #27992

Merged

krz mentioned this pull request Mar 25, 2024

Support polars data frames py-why/dowhy#1151

Open

yuanx749 mentioned this pull request Mar 31, 2024

FIX warning using polars DataFrames in DecisionBoundaryDisplay.from_estimator #28718

Merged
