Description
When add_features_from is called on Datasets built from PyArrow Tables with free_raw_data=False, the method frees the raw data anyway. A subsequent get_data() call then raises a LightGBMError saying the raw data was freed, even though free_raw_data=False was set explicitly.
Reproducible example
import lightgbm as lgb
import pyarrow as pa
import numpy as np
# Create sample data using PyArrow
data1 = pa.Table.from_arrays(
    [pa.array(np.random.rand(100)), pa.array(np.random.rand(100))],
    names=['feature1', 'feature2']
)
data2 = pa.Table.from_arrays(
    [pa.array(np.random.rand(100))],
    names=['feature3']
)
# Create label
label = np.random.randint(2, size=100)
# Create and construct datasets
dataset1 = lgb.Dataset(data1, label=label, free_raw_data=False).construct()
dataset2 = lgb.Dataset(data2, label=label, free_raw_data=False).construct()
# Add features
dataset1.add_features_from(dataset2)
# This raises LightGBMError despite free_raw_data=False
try:
    print(dataset1.get_data())
except lgb.basic.LightGBMError as e:
    # Outputs: "Cannot call `get_data` after freed raw data, set free_raw_data=False when construct Dataset to avoid this."
    print(e)
Environment info
LightGBM version or commit hash: 4.6.0
PyArrow version: 19.0.1
Python version: >=3.11
Command(s) you used to install LightGBM:
pip install "lightgbm>=4.6.0" "pyarrow>=19.0.1"
Additional Comments
This issue is specific to PyArrow Table input, support for which was recently added to LightGBM. Other data types (numpy arrays, pandas DataFrames, etc.) work correctly with free_raw_data=False. The problem appears to be in add_features_from, which does not seem to handle the PyArrow Table case and therefore frees the raw data even though free_raw_data=False is set.
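To illustrate the kind of failure suspected here, the following is a toy model in plain Python. It is not LightGBM's code: ToyDataset and its type dispatch are hypothetical, with lists standing in for a supported raw data type and tuples for an unhandled one. It only shows how a fallthrough branch can drop raw data without consulting free_raw_data.

```python
class ToyDataset:
    """Toy stand-in for a dataset wrapper (hypothetical, not LightGBM's implementation)."""

    def __init__(self, data, free_raw_data=True):
        self.data = data
        self.free_raw_data = free_raw_data

    def add_features_from(self, other):
        # Hypothetical dispatch: only list-of-rows input is handled explicitly.
        if isinstance(self.data, list) and isinstance(other.data, list):
            # Column-wise merge: extend each row with the other dataset's row.
            self.data = [a + b for a, b in zip(self.data, other.data)]
        else:
            # Fallthrough for unhandled container types: the raw data is
            # dropped here without ever checking self.free_raw_data.
            self.data = None
        return self

    def get_data(self):
        if self.data is None:
            raise RuntimeError("raw data was freed")
        return self.data


# Handled type (lists): raw data survives the merge.
supported = ToyDataset([[1, 2], [4, 5]], free_raw_data=False)
supported.add_features_from(ToyDataset([[3], [6]], free_raw_data=False))
print(supported.get_data())  # [[1, 2, 3], [4, 5, 6]]

# Unhandled type (tuples): raw data is lost despite free_raw_data=False.
unsupported = ToyDataset(((1, 2), (4, 5)), free_raw_data=False)
unsupported.add_features_from(ToyDataset(((3,), (6,)), free_raw_data=False))
print(unsupported.data)  # None
```

If the real cause is similar, a fix would presumably add a branch for PyArrow Tables that merges the columns instead of falling through.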
The expected behavior would be:
When free_raw_data=False, the raw data should be preserved after add_features_from.
get_data() should return the combined PyArrow Table (the original columns plus the added features) after add_features_from.