New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Adds polars output support to set_output
API
#27315
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Now that you introduced a protocol I am wondering if we should implement an adapater manager (experimentally at first) but that could have all the expected method and that could allow to register any new adapter in the future even outside of scikit-learn.
@thomasjpfan I missed that #27258 and #26683 where some breakdown. I assume that my reviews can be ported to those PRs. To be honest, I don't think that the changes in this PRs are too large. |
def supported_outputs(self): | ||
return {"default"} | set(self.adapters) | ||
|
||
def register(self, adapter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def register(self, adapter): | |
def register(self, adapter): | |
"""Register a container adapter. | |
Parameters | |
---------- | |
adapter : :class:`ContainerAdapterProtocol` | |
A container adapter that follows the protocol defined by | |
:class:`ContainerAdapterProtocol`. |
In this regards, do we want to enforce something like:
if not isintance(adapter, ContainerAdapterProtocol):
raise TypeError(
"The adapter does not follow the ContainerAdapterProtocol."
)
Or we want to be lenient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. Thanks you @thomasjpfan for the time spend on this one.
I am thinking that in the future, we could start to expose the manager publicly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review until (including) utils/init.py.
In another PR, we should move all the stuff from init into a separate file, in my opinion.
sklearn/preprocessing/_encoders.py
Outdated
' `ohe.set_output(transform="default").' | ||
f"{capitalize_transform_output} output does not support sparse data." | ||
f" Set sparse_output=False to output {transform_output} DataFrames or" | ||
' disable pandas output via `ohe.set_output(transform="default").' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it only pandas?
sklearn/utils/__init__.py
Outdated
idx = _safe_indexing(np.arange(n_columns), key) | ||
except IndexError as e: | ||
raise ValueError( | ||
"all features must be in [0, {}] or [-{}, 0]".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Once ruff told me that raise Error statements should only contain strings to avoid raising another error, e.g. in a wrong f-string.
"""Same as _get_column_indices but for X with __dataframe__ protocol.""" | ||
n_columns = X_interchange.num_columns() | ||
|
||
if isinstance(key, (list, tuple)) and not key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we return an empty list here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was to match the behavior of _get_column_indices
on main
:
scikit-learn/sklearn/utils/__init__.py
Lines 415 to 417 in 5d83a2e
if isinstance(key, (list, tuple)) and not key: | |
# we get an empty list | |
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2nd review part
index : array-like, default=None | ||
Index for data. `index` is ignored if `data_to_wrap` is already a DataFrame. | ||
@runtime_checkable | ||
class ContainerAdapterProtocol(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to make ContainerAdapterProtocol
an abstract base class and then inherit from it in each XXXProtocol to ensure that we implement all the methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically your are pointing out protocol vs abstract class. The advantage of the protocol is that we don't force people to import anything from scikit-learn to create one while we are can still do isinstance(my_custom_protocol, ContainerAdapterProtocol)
in our codebase.
I would think this is the right place to use the protocol, here.
NB: I like this blog post -> https://jellis18.github.io/post/2022-01-11-abc-vs-protocol/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned something!
We have the same problem with functime when we pass the training data ( |
I made conversion from numpy to polars zero-copy in: pola-rs/polars#12403 If you have 2D arrays, they can be sliced into the specific columnar arrays and then constructed via the |
Thanks @ritchie46 for this nice feature ;) |
I assume that this would be the case only for Fortran contiguous 2D numpy arrays, right? |
Depends on how you slice them. 😁 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my pass of review. Mostly minor things. LGTM!
This might turn out to be very useful. Thank @thomasjpfan! |
Thanks @thomasjpfan for the effort. |
Thank you, @thomasjpfan. |
Thank you very much @thomasjpfan, seriously helpful. |
Reference Issues/PRs
Related to #25896
Related to #26683
Related to #27258
Related to #26835
What does this implement/fix? Explain your changes.
This PR adds
set_output="polars"
to all transformers. Overall this PR abstracts the dataframe specific API requirements toset_output
into aContainerAdapaterProtocol
.ContainerAdapaterProtocol
is generic enough to support other containers. In principle,Xarray
support will only require another class that implements theContainerAdapaterProtocol
and everything else should "just work".Note that polars does not have a "zero round trip" between
ndarrays
andpl.DataFrame
. For transformers in a pipeline, wrapping and unwrapping a polars dataframe will result in memory copies. Pandas dataframes does not have this issue because it uses block manager for 2d ndarrays.Any other comments?
Merging #27258 or #26683 will make this PR smaller. This PR uses code from those two PRs.