[Train] Handle Arrow-backed pandas dtypes in LightGBM examples #63427

Open
pseudo-rnd-thoughts wants to merge 8 commits into ray-project:master from pseudo-rnd-thoughts:fix-lightgbm-pyarrow-types


@pseudo-rnd-thoughts pseudo-rnd-thoughts commented May 18, 2026

Description

#63017 updated Ray Data's Arrow-to-pandas conversion to preserve Arrow-backed pandas dtypes, such as int64[pyarrow], so dtypes can roundtrip more faithfully.

This exposed an incompatibility with LightGBM's pandas input path. Ray Train's LightGBM examples and legacy trainer code convert Ray Data shards to pandas before constructing lightgbm.Dataset. With Arrow-backed Ray Data inputs, those pandas DataFrames can now contain Arrow-backed dtypes, and LightGBM rejects them during pandas dtype validation even when the logical column type is numeric.

This PR updates the LightGBM paths to normalize pandas DataFrames to NumPy-nullable pandas dtypes before passing them to LightGBM. It also updates documentation and examples to show the same conversion for user-authored LightGBM training loops.

Changes

  • Normalize pandas DataFrames in the legacy LightGBMTrainer path before constructing lightgbm.Dataset.
  • Update LightGBM docs and docstring examples to call convert_dtypes(dtype_backend="numpy_nullable") after to_pandas().
  • Add a regression test covering Arrow-backed Ray Data input from ray.data.from_items(...).
  • Keep the restore test focused on trainer restore behavior by using a pandas-backed test dataset.
  • Update the LightGBM release benchmark to avoid passing Arrow-backed pandas dtypes to LightGBM.

Signed-off-by: Mark Towers <mark@anyscale.com>
@pseudo-rnd-thoughts pseudo-rnd-thoughts requested a review from a team as a code owner May 18, 2026 10:04
@pseudo-rnd-thoughts pseudo-rnd-thoughts added the train Ray Train Related Issue label May 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the use of convert_dtypes() across LightGBM trainer implementations, documentation examples, and tests to ensure compatibility with Arrow-backed pandas DataFrames from Ray Data. It also re-enables the test_trainer_restore test. The review feedback recommends simplifying the code by removing redundant materialize() calls and omitting the explicit dtype_backend argument to maintain compatibility with pandas versions prior to 2.0.0. Additionally, the reviewer suggests applying these fixes to the legacy LightGBM trainer and warns about potential performance overhead when using convert_dtypes() on very large datasets.

Comment thread python/ray/train/BUILD.bazel
Comment thread doc/source/train/getting-started-lightgbm.rst Outdated
Comment thread python/ray/train/lightgbm/lightgbm_trainer.py Outdated
Comment thread python/ray/train/lightgbm/v2.py Outdated
Comment thread python/ray/train/v2/lightgbm/lightgbm_trainer.py Outdated
Comment thread python/ray/train/v2/tests/test_lightgbm_trainer.py Outdated
Comment thread python/ray/train/v2/tests/test_lightgbm_trainer.py Outdated
Comment thread python/ray/train/v2/tests/test_local_mode.py Outdated
Comment thread python/ray/train/v2/tests/test_local_mode.py Outdated
Comment thread release/train_tests/xgboost_lightgbm/train_batch_inference_benchmark.py Outdated
Mark Towers added 4 commits May 18, 2026 13:44
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 17abd91.

elif pa.types.is_unsigned_integer(arrow_dtype):
    dtype_mapping[column] = f"UInt{arrow_dtype.bit_width}"
elif pa.types.is_floating(arrow_dtype):
    dtype_mapping[column] = f"Float{arrow_dtype.bit_width}"

Float16 arrow dtype causes TypeError in normalize function

Low Severity

pa.types.is_floating returns True for float16 (half-float) Arrow types, so the function generates "Float16" as the target pandas dtype. However, pandas only supports Float32 and Float64 as nullable float extension types — there is no Float16. This causes a TypeError at the df.astype(dtype_mapping) call if any column has pd.ArrowDtype(pa.float16()). The pa.types.is_float16() check could be used to skip or handle this case separately.

Mark Towers added 3 commits May 19, 2026 13:45
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>