-
Notifications
You must be signed in to change notification settings - Fork 3.9k
[python-package] add PyArrow Table support to get_data and add_features_from methods #6892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[python-package] add PyArrow Table support to get_data and add_features_from methods #6892
Conversation
…s raw data despite free_raw_data=False (microsoft#6891)
Thanks! We'll review more thoroughly later, but a first quick comment... please add some tests covering the new code you'd like this project to maintain. |
@microsoft-github-policy-service agree |
…d_features_from method (microsoft#6891)
Thank you for your feedback! I have added a test case for the Pyarrow Table scenario to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! I was able to review a bit more closely tonight, left another round of suggestions. I will put up a PR trying to drop support for h2o's datatable
, to hopefully make this PR a little bit simpler.
python-package/lightgbm/basic.py
Outdated
"without pyarrow installed. " | ||
"Install pyarrow and restart your session." | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this else:
. We have a slight preference in this project for just raising exceptions, to reduce unnecessary extra indentation.
Like this:
LightGBM/python-package/lightgbm/basic.py
Lines 2461 to 2462 in a725360
if not (PYARROW_INSTALLED and CFFI_INSTALLED): | |
raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.") |
That's enforced by convention today... if there's some ruff
rule that could enforce that (like no-unnecessary-else
or something), I'd support adding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this commit, I removed the unnecessary else block as suggested.
python-package/lightgbm/basic.py
Outdated
elif isinstance(other.data, pa_Table): | ||
if not PYARROW_INSTALLED: | ||
raise LightGBMError( | ||
"Cannot add features to pyarrow.Table type of raw data " | ||
"without pyarrow installed. " | ||
"Install pyarrow and restart your session." | ||
) | ||
else: | ||
self.data = dt_DataTable( | ||
np.hstack( | ||
( | ||
self.data.to_numpy(), | ||
np.column_stack( | ||
[other.data.column(i).to_numpy() for i in range(len(other.data.column_names))] | ||
), | ||
) | ||
) | ||
) | ||
else: | ||
self.data = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing more datatable
code getting added here is making me think we should just do what xgboost
did (dmlc/xgboost#11070) and just fully drop support for it now.
In #6662, I'd proposed having deprecation warnings for "2-3 releases", but I'm going to put up a PR just proposing dropping this in the next release. We got a deprecation warning into 4.6.0 (#6670), which was released about 2 months ago, and it'll probably be at least another 2 months until the next LightGBM release... I think that's enough time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #6894
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for thinking through this and for moving things forward!
I appreciate you taking the initiative to propose a clear path for dropping datatable
support.
@@ -3293,6 +3296,8 @@ def get_data(self) -> Optional[_LGBM_TrainDataType]: | |||
self.data = self.data[self.used_indices, :] | |||
elif isinstance(self.data, Sequence): | |||
self.data = self.data[self.used_indices] | |||
elif isinstance(self.data, pa_Table): | |||
self.data = self.data.take(self.used_indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This was definitely just something we'd missed.
Can you please add a test in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py just for this change to get_data()
? The other changes you made in test_basic.py
do not cover these changes.
When you do that, please check that the content of self.data
AND the returned value are correct (e.g., contain exactly the expected values and data types).
If you'd like, I'd even support opening a new pull request that only has the get_data()
changes + test (and then making this PR only about add_features_from()
). Totally your choice, I want to be respectful of your time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
I added tests for both get_data() and add_features_from() directly in test_arrow.py
as part of this PR.
Please let me know if there’s anything else you’d like me to adjust!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameslamb
I’ve opened a new pull request(#6911) that includes only the changes to get_data() along with the corresponding test. This should help keep things focused. I’d appreciate it if you could take a look when you have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll focus there.
…ix/6891-pyarrow-table-add-features
I’ll mark this PR as a draft for now. Once #6911 is merged, I’ll reopen it and follow up accordingly. |
Fixes #6891
Description
This PR adds proper support for PyArrow Table in Dataset's get_data and add_features_from methods.
Previously, when using a PyArrow Table as input data, there was no handling for:
These missing implementations caused issues when working with PyArrow Tables with free_raw_data=False,
as the data would be incorrectly set to None after operations.
Changes