[python-package] add_features_from with PyArrow Table incorrectly frees raw data despite free_raw_data=False #6891

@suk1yak1

Description

When using add_features_from with PyArrow Table as input data and free_raw_data=False, the method incorrectly frees the raw data. This causes get_data() to raise a LightGBMError suggesting that the raw data was freed, even though free_raw_data=False was explicitly set.

Reproducible example

import lightgbm as lgb
import pyarrow as pa
import numpy as np

# Create sample data using PyArrow
data1 = pa.Table.from_arrays(
    [pa.array(np.random.rand(100)), pa.array(np.random.rand(100))],
    names=['feature1', 'feature2']
)
data2 = pa.Table.from_arrays(
    [pa.array(np.random.rand(100))],
    names=['feature3']
)

# Create label
label = np.random.randint(2, size=100)

# Create and construct datasets
dataset1 = lgb.Dataset(data1, label=label, free_raw_data=False).construct()
dataset2 = lgb.Dataset(data2, label=label, free_raw_data=False).construct()

# Add features
dataset1.add_features_from(dataset2)

# This raises LightGBMError despite free_raw_data=False
try:
    print(dataset1.get_data())
except lgb.basic.LightGBMError as e:
    print(e)  # Outputs: "Cannot call `get_data` after freed raw data, set free_raw_data=False when construct Dataset to avoid this."

Environment info

LightGBM version or commit hash:

LightGBM version: 4.6.0
PyArrow version: 19.0.1
Python version: >=3.11

Command(s) you used to install LightGBM

pip install "lightgbm>=4.6.0" "pyarrow>=19.0.1"

Additional Comments

This issue specifically affects PyArrow Table support, which was recently added to LightGBM. Other data types (numpy arrays, pandas DataFrames, etc.) work correctly with free_raw_data=False. The problem appears to be in add_features_from, which does not seem to handle the PyArrow Table case and so discards the raw data regardless of the flag.

The expected behavior would be:

- When free_raw_data=False, the raw data should be preserved after add_features_from
- get_data() should return the combined PyArrow Table after add_features_from
