Skip to content

[python-package] add PyArrow Table support to get_data and add_features_from methods #6892

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

suk1yak1
Copy link
Contributor

@suk1yak1 suk1yak1 commented Apr 23, 2025

Fixes #6891

Description

This PR adds proper support for PyArrow Table in Dataset's get_data and add_features_from methods.
Previously, when using a PyArrow Table as input data, there was no handling for:

  1. Subsetting a PyArrow Table in get_data method when used_indices is set
  2. Merging PyArrow Tables in add_features_from method

These missing implementations caused issues when working with PyArrow Tables with free_raw_data=False,
as the data would be incorrectly set to None after operations.

Changes

  1. Added PyArrow Table support in get_data method to handle subsetting with used_indices
  2. Added PyArrow Table support in add_features_from method to properly merge columns from two tables
  3. Added proper validation and error handling when working with PyArrow Tables

@jameslamb
Copy link
Collaborator

Thanks! We'll review more thoroughly later, but a first quick comment... please add some tests covering the new code you'd like this project to maintain.

@suk1yak1
Copy link
Contributor Author

@microsoft-github-policy-service agree

@jameslamb jameslamb changed the title [python-package]add PyArrow Table support to get_data and add_features_from methods [python-package] add PyArrow Table support to get_data and add_features_from methods Apr 25, 2025
@suk1yak1
Copy link
Contributor Author

Thank you for your feedback! I have added a test case for the Pyarrow Table scenario to test_add_features_from_different_sources. I also updated the code to handle the case where other.data is a Pyarrow Table.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much! I was able to review a bit more closely tonight, left another round of suggestions. I will put up a PR trying to drop support for h2o's datatable, to hopefully make this PR a little bit simpler.

"without pyarrow installed. "
"Install pyarrow and restart your session."
)
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this else:. We have a slight preference in this project for just raising exceptions, to reduce unnecessary extra indentation.

Like this:

if not (PYARROW_INSTALLED and CFFI_INSTALLED):
raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.")

That's enforced by convention today... if there's some ruff rule that could enforce that (like no-unnecessary-else or something), I'd support adding it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!
I created an issue to propose adding a Ruff rule for this pattern: #6903.
Following that, I also opened a pull request #6904 to enable the RET506 (superfluous-else-raise) rule and fix the related code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this commit, I removed the unnecessary else block as suggested.

Comment on lines 3583 to 3602
elif isinstance(other.data, pa_Table):
if not PYARROW_INSTALLED:
raise LightGBMError(
"Cannot add features to pyarrow.Table type of raw data "
"without pyarrow installed. "
"Install pyarrow and restart your session."
)
else:
self.data = dt_DataTable(
np.hstack(
(
self.data.to_numpy(),
np.column_stack(
[other.data.column(i).to_numpy() for i in range(len(other.data.column_names))]
),
)
)
)
else:
self.data = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing more datatable code getting added here is making me think we should just do what xgboost did (dmlc/xgboost#11070) and just fully drop support for it now.

In #6662, I'd proposed having deprecation warnings for "2-3 releases", but I'm going to put up a PR just proposing dropping this in the next release. We got a deprecation warning into 4.6.0 (#6670), which was released about 2 months ago, and it'll probably be at least another 2 months until the next LightGBM release... I think that's enough time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #6894

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for thinking through this and for moving things forward!
I appreciate you taking the initiative to propose a clear path for dropping datatable support.

@@ -3293,6 +3296,8 @@ def get_data(self) -> Optional[_LGBM_TrainDataType]:
self.data = self.data[self.used_indices, :]
elif isinstance(self.data, Sequence):
self.data = self.data[self.used_indices]
elif isinstance(self.data, pa_Table):
self.data = self.data.take(self.used_indices)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This was definitely just something we'd missed.

Can you please add a test in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py just for this change to get_data()? The other changes you made in test_basic.py do not cover these changes.

When you do that, please check that the content of self.data AND the returned value are correct (e.g., contain exactly the expected values and data types).

If you'd like, I'd even support opening a new pull request that only has the get_data() changes + test (and then making this PR only about add_features_from()). Totally your choice, I want to be respectful of your time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion!
I added tests for both get_data() and add_features_from() directly in test_arrow.py as part of this PR.
Please let me know if there’s anything else you’d like me to adjust!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb
I’ve opened a new pull request(#6911) that includes only the changes to get_data() along with the corresponding test. This should help keep things focused. I’d appreciate it if you could take a look when you have time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll focus there.

@suk1yak1 suk1yak1 requested a review from jameslamb April 30, 2025 00:12
@suk1yak1 suk1yak1 marked this pull request as draft May 9, 2025 14:42
@suk1yak1
Copy link
Contributor Author

suk1yak1 commented May 9, 2025

I’ll mark this PR as a draft for now. Once #6911 is merged, I’ll reopen it and follow up accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[python-package]add_features_from with PyArrow Table incorrectly frees raw data despite free_raw_data=False
2 participants