Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds polars output support to set_output API #27315

Merged
merged 16 commits into from Nov 19, 2023

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Sep 8, 2023

Reference Issues/PRs

Related to #25896
Related to #26683
Related to #27258
Related to #26835

What does this implement/fix? Explain your changes.

This PR adds set_output="polars" to all transformers. Overall this PR abstracts the dataframe specific API requirements to set_output into a ContainerAdapaterProtocol. ContainerAdapaterProtocol is generic enough to support other containers. In principle, Xarray support will only require another class that implements the ContainerAdapaterProtocol and everything else should "just work".

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas dataframes does not have this issue because it uses block manager for 2d ndarrays.

Any other comments?

Merging #27258 or #26683 will make this PR smaller. This PR uses code from those two PRs.

@github-actions
Copy link

github-actions bot commented Sep 8, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6e640ec. Link to the linter CI: here

@glemaitre glemaitre self-requested a review October 31, 2023 13:29
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Now that you introduced a protocol I am wondering if we should implement an adapater manager (experimentally at first) but that could have all the expected method and that could allow to register any new adapter in the future even outside of scikit-learn.

doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Show resolved Hide resolved
sklearn/utils/estimator_checks.py Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_set_output.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_function_transformer.py Outdated Show resolved Hide resolved
sklearn/feature_selection/_base.py Outdated Show resolved Hide resolved
sklearn/compose/_column_transformer.py Outdated Show resolved Hide resolved
sklearn/pipeline.py Outdated Show resolved Hide resolved
@glemaitre
Copy link
Member

@thomasjpfan I missed that #27258 and #26683 where some breakdown. I assume that my reviews can be ported to those PRs.

To be honest, I don't think that the changes in this PRs are too large.

@glemaitre glemaitre self-requested a review November 9, 2023 09:51
def supported_outputs(self):
return {"default"} | set(self.adapters)

def register(self, adapter):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def register(self, adapter):
def register(self, adapter):
"""Register a container adapter.
Parameters
----------
adapter : :class:`ContainerAdapterProtocol`
A container adapter that follows the protocol defined by
:class:`ContainerAdapterProtocol`.

In this regards, do we want to enforce something like:

if not isintance(adapter, ContainerAdapterProtocol):
    raise TypeError(
        "The adapter does not follow the ContainerAdapterProtocol."
    )

Or we want to be lenient?

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. Thanks you @thomasjpfan for the time spend on this one.

I am thinking that in the future, we could start to expose the manager publicly.

@glemaitre glemaitre added this to the 1.4 milestone Nov 9, 2023
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review until (including) utils/init.py.

In another PR, we should move all the stuff from init into a separate file, in my opinion.

' `ohe.set_output(transform="default").'
f"{capitalize_transform_output} output does not support sparse data."
f" Set sparse_output=False to output {transform_output} DataFrames or"
' disable pandas output via `ohe.set_output(transform="default").'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only pandas?

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
idx = _safe_indexing(np.arange(n_columns), key)
except IndexError as e:
raise ValueError(
"all features must be in [0, {}] or [-{}, 0]".format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Once ruff told me that raise Error statements should only contain strings to avoid raising another error, e.g. in a wrong f-string.

"""Same as _get_column_indices but for X with __dataframe__ protocol."""
n_columns = X_interchange.num_columns()

if isinstance(key, (list, tuple)) and not key:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we return an empty list here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to match the behavior of _get_column_indices on main:

if isinstance(key, (list, tuple)) and not key:
# we get an empty list
return []

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd review part

index : array-like, default=None
Index for data. `index` is ignored if `data_to_wrap` is already a DataFrame.
@runtime_checkable
class ContainerAdapterProtocol(Protocol):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make ContainerAdapterProtocol an abstract base class and then inherit from it in each XXXProtocol to ensure that we implement all the methods?

Copy link
Member

@glemaitre glemaitre Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically your are pointing out protocol vs abstract class. The advantage of the protocol is that we don't force people to import anything from scikit-learn to create one while we are can still do isinstance(my_custom_protocol, ContainerAdapterProtocol) in our codebase.

I would think this is the right place to use the protocol, here.

NB: I like this blog post -> https://jellis18.github.io/post/2022-01-11-abc-vs-protocol/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learned something!

sklearn/utils/_set_output.py Outdated Show resolved Hide resolved
sklearn/utils/_set_output.py Show resolved Hide resolved
sklearn/utils/_set_output.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_validation.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
@baggiponte
Copy link

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas dataframes does not have this issue because it uses block manager for 2d ndarrays.

We have the same problem with functime when we pass the training data (polars.DataFrames) into Estimators. 1D array conversion should mostly be zero-copy, unless say there are null values. For 2D arrays, @topher-lo wrote this function that uses zarr to spill the data to disk and returns a numpy array.

@ritchie46
Copy link

Note that polars does not have a "zero round trip" between ndarrays and pl.DataFrame. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas dataframes does not have this issue because it uses block manager for 2d ndarrays.

I made conversion from numpy to polars zero-copy in: pola-rs/polars#12403

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

@glemaitre
Copy link
Member

Thanks @ritchie46 for this nice feature ;)

@ogrisel
Copy link
Member

ogrisel commented Nov 15, 2023

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

I assume that this would be the case only for Fortran contiguous 2D numpy arrays, right?

@ritchie46
Copy link

If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the DataFrame constructor. This should still be zero copy.

I assume that this would be the case only for Fortran contiguous 2D numpy arrays, right?

Depends on how you slice them. 😁

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my pass of review. Mostly minor things. LGTM!

sklearn/utils/__init__.py Show resolved Hide resolved
sklearn/utils/__init__.py Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
sklearn/utils/tests/test_utils.py Outdated Show resolved Hide resolved
sklearn/preprocessing/_encoders.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Outdated Show resolved Hide resolved
sklearn/utils/estimator_checks.py Show resolved Hide resolved
@lorentzenchr lorentzenchr merged commit 831c49a into scikit-learn:main Nov 19, 2023
27 checks passed
@lorentzenchr
Copy link
Member

This might turn out to be very useful. Thank @thomasjpfan!

@glemaitre
Copy link
Member

Thanks @thomasjpfan for the effort.

@jjerphan
Copy link
Member

Thank you, @thomasjpfan.

@chrstfer
Copy link

chrstfer commented Dec 2, 2023

Thank you very much @thomasjpfan, seriously helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants