[python-package] add PyArrow Table support to get_data and add_features_from methods #6892

suk1yak1 · 2025-04-23T13:55:49Z

Description

This PR adds proper support for PyArrow Table in Dataset's get_data and add_features_from methods.
Previously, when using a PyArrow Table as input data, there was no handling for:

Subsetting a PyArrow Table in get_data method when used_indices is set
Merging PyArrow Tables in add_features_from method

Because these cases were unhandled, working with PyArrow Tables and free_raw_data=False was broken:
the raw data was incorrectly set to None after these operations.

Changes

Added PyArrow Table support in get_data method to handle subsetting with used_indices
Added PyArrow Table support in add_features_from method to properly merge columns from two tables
Added proper validation and error handling when working with PyArrow Tables

jameslamb · 2025-04-23T13:59:32Z

Thanks! We'll review more thoroughly later, but a first quick comment... please add some tests covering the new code you'd like this project to maintain.

suk1yak1 · 2025-04-23T14:00:54Z

@microsoft-github-policy-service agree

suk1yak1 · 2025-04-26T13:58:58Z

Thank you for your feedback! I have added a test case for the PyArrow Table scenario to test_add_features_from_different_sources. I also updated the code to handle the case where other.data is a PyArrow Table.

jameslamb

Thanks very much! I was able to review a bit more closely tonight, left another round of suggestions. I will put up a PR trying to drop support for h2o's datatable, to hopefully make this PR a little bit simpler.

jameslamb · 2025-04-27T05:01:24Z

python-package/lightgbm/basic.py

+                            "without pyarrow installed. "
+                            "Install pyarrow and restart your session."
+                        )
+                    else:


Please remove this else:. We have a slight preference in this project for just raising exceptions, to reduce unnecessary extra indentation.

Like this:

LightGBM/python-package/lightgbm/basic.py

Lines 2461 to 2462 in a725360

if not (PYARROW_INSTALLED and CFFI_INSTALLED):
    raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.")

That's enforced by convention today... if there's some ruff rule that could enforce that (like no-unnecessary-else or something), I'd support adding it.

Thanks for the feedback!
I created an issue to propose adding a Ruff rule for this pattern: #6903.
Following that, I also opened a pull request #6904 to enable the RET506 (superfluous-else-raise) rule and fix the related code.

In this commit, I removed the unnecessary else block as suggested.
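
To make the pattern under discussion concrete, here is a small self-contained sketch of the two styles. All names are hypothetical stand-ins (`DependencyError` plays the role of a project error such as LightGBMError), and the list concatenation is only a placeholder for the real work.

```python
class DependencyError(Exception):
    """Hypothetical stand-in for a project-specific error type."""


def merge_with_else(dep_installed: bool, left: list, right: list) -> list:
    # Style that RET506 (superfluous-else-raise) flags: the else adds
    # indentation even though the raise already leaves the function.
    if not dep_installed:
        raise DependencyError("Install the optional dependency and restart your session.")
    else:
        return left + right


def merge_guard(dep_installed: bool, left: list, right: list) -> list:
    # Preferred style: raise early, then continue at the outer indentation level.
    if not dep_installed:
        raise DependencyError("Install the optional dependency and restart your session.")
    return left + right
```

Both functions behave identically; the second just keeps the main path one level shallower.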

jameslamb · 2025-04-27T05:12:09Z

python-package/lightgbm/basic.py

+                elif isinstance(other.data, pa_Table):
+                    if not PYARROW_INSTALLED:
+                        raise LightGBMError(
+                            "Cannot add features to pyarrow.Table type of raw data "
+                            "without pyarrow installed. "
+                            "Install pyarrow and restart your session."
+                        )
+                    else:
+                        self.data = dt_DataTable(
+                            np.hstack(
+                                (
+                                    self.data.to_numpy(),
+                                    np.column_stack(
+                                        [other.data.column(i).to_numpy() for i in range(len(other.data.column_names))]
+                                    ),
+                                )
+                            )
+                        )
+                else:
+                    self.data = None


Seeing more datatable code getting added here is making me think we should just do what xgboost did (dmlc/xgboost#11070) and just fully drop support for it now.

In #6662, I'd proposed having deprecation warnings for "2-3 releases", but I'm going to put up a PR just proposing dropping this in the next release. We got a deprecation warning into 4.6.0 (#6670), which was released about 2 months ago, and it'll probably be at least another 2 months until the next LightGBM release... I think that's enough time.

Opened #6894

Thanks for thinking through this and for moving things forward!
I appreciate you taking the initiative to propose a clear path for dropping datatable support.

jameslamb · 2025-04-27T05:15:28Z

python-package/lightgbm/basic.py

@@ -3293,6 +3296,8 @@ def get_data(self) -> Optional[_LGBM_TrainDataType]:
                    self.data = self.data[self.used_indices, :]
                elif isinstance(self.data, Sequence):
                    self.data = self.data[self.used_indices]
+                elif isinstance(self.data, pa_Table):
+                    self.data = self.data.take(self.used_indices)


Thanks! This was definitely just something we'd missed.

Can you please add a test in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py just for this change to get_data()? The other changes you made in test_basic.py do not cover these changes.

When you do that, please check that the content of self.data AND the returned value are correct (e.g., contain exactly the expected values and data types).

If you'd like, I'd even support opening a new pull request that only has the get_data() changes + test (and then making this PR only about add_features_from()). Totally your choice, I want to be respectful of your time.

Thanks for the suggestion!
I added tests for both get_data() and add_features_from() directly in test_arrow.py as part of this PR.
Please let me know if there’s anything else you’d like me to adjust!

@jameslamb
I’ve opened a new pull request (#6911) that includes only the changes to get_data() along with the corresponding test. This should help keep things focused. I’d appreciate it if you could take a look when you have time.

Thanks! I'll focus there.

suk1yak1 · 2025-05-09T14:45:23Z

I’ll mark this PR as a draft for now. Once #6911 is merged, I’ll reopen it and follow up accordingly.

[python-package]add_features_from with PyArrow Table incorrectly free…

cf13aad

…s raw data despite free_raw_data=False (microsoft#6891)

suk1yak1 requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners April 23, 2025 13:55

suk1yak1 and others added 3 commits April 23, 2025 23:22

[python-package] add PyArrow Table case to test_add_features_from_dif…

99ee66f

…ferent_sources (microsoft#6891)

[python-package] fix handling and tests for PyArrow Table input in ad…

a3256a3

…d_features_from method (microsoft#6891)

Merge branch 'master' into fix/6891-pyarrow-table-add-features

9de2650

jameslamb changed the title ~~[python-package]add PyArrow Table support to get_data and add_features_from methods~~ [python-package] add PyArrow Table support to get_data and add_features_from methods Apr 25, 2025

jameslamb requested changes Apr 27, 2025

jameslamb mentioned this pull request Apr 27, 2025

[python-package] drop support for h2o datatable #6894

Merged

suk1yak1 added 2 commits April 28, 2025 14:19

delete unnecessary-else

01bc668

Merge branch 'master' of https://github.com/microsoft/LightGBM into f…

27e2545

…ix/6891-pyarrow-table-add-features

suk1yak1 mentioned this pull request Apr 28, 2025

[python-package]Add superfluous-else-raise (RET506) rule to detect unnecessary else after raise #6903

Closed

suk1yak1 added 2 commits April 28, 2025 22:53

[python-package]add test for pyarrow table in Dataset.get_data()

219e61c

[python-package]add test for add_features_from with pyarrow tables

d4de309

suk1yak1 requested a review from jameslamb April 30, 2025 00:12

Merge branch 'master' into fix/6891-pyarrow-table-add-features

6686a60

suk1yak1 marked this pull request as draft May 9, 2025 14:42

jameslamb mentioned this pull request Jun 13, 2025

[python-package] add_features_from fails if the dataset is built from pyarrow Tables #6937

Closed
