[python-package] add PyArrow Table to get_data #6911
base: master
Conversation
@@ -456,6 +456,31 @@ def test_arrow_feature_name_manual():
    assert booster.feature_name() == ["c", "d"]


def test_get_data_arrow_table():
Thanks for doing this! But this test is not actually covering the code you've added in basic.py. I just tested, and it passes on master.

The code you've added will only run when creating a new Dataset that is a subset of another Dataset, because of this if condition:
LightGBM/python-package/lightgbm/basic.py, line 3261 in f91dcfe:

if self._need_slice and self.used_indices is not None and self.reference is not None:
Here is an example of some minimal code that exhibits the problematic behavior on master:
import lightgbm as lgb
import pyarrow as pa
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1_000, n_features=3)
X = pa.Table.from_arrays(
[
pa.array(X[:,i])
for i in range(X.shape[1])
],
names=[f"col_{i}" for i in range(X.shape[1])]
)
# create a Dataset from a pyarrow Table
ds1 = lgb.Dataset(X, y, free_raw_data=False)
# create another Dataset that is a subset of that, with the first 100 rows
ds2 = ds1.subset(used_indices=list(range(100)))
# this raises the warning
ds2.construct()
# this returns the entire 1000-row Dataset, instead of one
# built from just the first 100 rows
ds2.get_data()
ds2.construct()
emits this warning:
/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/basic.py:3273: UserWarning: Cannot subset Table type of raw data.
Returning original raw data
_log_warning(
<lightgbm.basic.Dataset object at 0x16281d7d0>
And returns the entire 1000-row Dataset.
Please use this information to create a new test case.
@jameslamb
Thank you for the detailed feedback!
I've added a test case to ensure that when a subset of a Dataset is created from a pyarrow.Table, get_data() correctly returns only the subset of rows rather than the entire original table. Please let me know if there's anything else you'd like me to clarify or adjust.
@jameslamb
I've updated the test to use generate_random_arrow_table() (1000 rows, 3 columns) instead of generate_simple_arrow_table() (5 rows) to make it more realistic and better match your example.

I also added a safe_array_equal_with_nulls() function to handle a tricky issue with PyArrow Tables: when null values are present, the standard equals comparison returns False even if the actual data is identical. This helper ensures we can properly compare tables that contain nulls and verify they have the same data content.
@jameslamb
Fixes #6891
Description
This PR adds proper support for PyArrow Table in Dataset's get_data() method. The missing implementation caused issues when working with PyArrow Tables with free_raw_data=False, as the data would incorrectly be set to None after operations.
Changes