Skip to content

[python-package] add PyArrow Table to get_data #6911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

suk1yak1
Copy link
Contributor

@suk1yak1 suk1yak1 commented May 9, 2025

Fixes #6891

Description

This PR adds proper support for PyArrow Table in Dataset's get_data methods.
This missing implementations caused issues when working with PyArrow Tables with free_raw_data=False,
as the data would be incorrectly set to None after operations.

Changes

  • Added PyArrow Table support in get_data method to handle subsetting with used_indices

@@ -456,6 +456,31 @@ def test_arrow_feature_name_manual():
assert booster.feature_name() == ["c", "d"]


def test_get_data_arrow_table():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! But this test is not actually covering the code you've added in basic.py. I just tested, and this passes on master.

The code you've added will only be run when creating a new Dataset that is a subset of another Dataset, because of this if condition:

if self._need_slice and self.used_indices is not None and self.reference is not None:

Here is an example of some minimal code that exhibits the problematic behavior on master:

import lightgbm as lgb
import pyarrow as pa
from sklearn.datasets import make_regression

X, y = make_regression(n_rows=1_000, n_features=3)
X = pa.Table.from_arrays(
    [
        pa.array(X[:,i])
        for i in range(X.shape[1])
    ],
    names=[f"col_{i}" for i in range(X.shape[1])]
)

# create a Dataset from a pyarrow Table
ds1 = lgb.Dataset(X, y, free_raw_data=False)

# create another Dataset that is a subset of that, with the first 100 rows
ds2 = ds1.subset(used_indices=list(range(100)))

# this raises the warning
ds2.construct()

# this returns the entire 1000-row Dataset, instead of one
# built from just the first 100 rows
ds2.get_data()

ds2.construct() emits this warning:

/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/basic.py:3273: UserWarning: Cannot subset Table type of raw data.
Returning original raw data
_log_warning(
<lightgbm.basic.Dataset object at 0x16281d7d0>

And returns the entire 1000-row Dataset.

Please use this information to create a new test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb
Thank you for the detailed feedback!
I’ve added a test case to ensure that when subset of Dataset created from pyarrow.Table, the operation correctly returns only the subset of rows rather than the entire original table. Please let me know if there’s anything else you’d like me to clarify or adjust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb
I've updated the test to use generate_random_arrow_table() (1000 rows, 3 columns) instead of generate_simple_arrow_table() (5 rows) to make it more realistic and better match your example.

I also added a safe_array_equal_with_nulls() function to handle a tricky issue with PyArrow Tables - when null values are present, the standard equals comparison returns False even if the actual data is identical. This helper function ensures we can properly compare tables that contain nulls and verify they have the same data content.

@jameslamb jameslamb added the fix label May 10, 2025
@suk1yak1 suk1yak1 requested a review from jameslamb May 10, 2025 15:06
@suk1yak1
Copy link
Contributor Author

suk1yak1 commented Jun 2, 2025

@jameslamb
I've updated the test case.
Could you please take another look when you have a chance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[python-package]add_features_from with PyArrow Table incorrectly frees raw data despite free_raw_data=False
3 participants