[data] fix: reduce verbosity of arrow conversion warning logs by cwarre33 · Pull Request #57916 · ray-project/ray

cwarre33 · 2025-10-20T16:57:31Z

Summary

This PR addresses issue #57840 by reducing the verbosity of arrow conversion warning logs.

Problem

When Ray Data encounters Arrow conversion errors with nested datatypes, it falls back to pickled Python objects but logs extremely verbose warnings that include full array dumps (10,000+ characters), making application logs noisy and difficult to read.

Solution

Truncate error messages in warnings to 200 characters
Add indication of truncated content size: [truncated X chars]
Preserve full error details via exc_info for debugging when needed
Maintain backward compatibility
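
The approach listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `truncate_error_message`, `log_conversion_failure`, and `MAX_ERROR_LEN` are hypothetical names, and only the 200-character limit and the `[truncated X chars]` suffix come from the description.

```python
import logging

logger = logging.getLogger(__name__)

# Limit taken from the PR description; the constant name is hypothetical.
MAX_ERROR_LEN = 200


def truncate_error_message(message: str, max_len: int = MAX_ERROR_LEN) -> str:
    """Return the message unchanged if short, else a prefix plus a truncation note."""
    if len(message) <= max_len:
        return message
    omitted = len(message) - max_len
    return f"{message[:max_len]}... [truncated {omitted} chars]"


def log_conversion_failure(column: str, error: Exception) -> None:
    """Log a short warning; exc_info attaches the exception to the log record."""
    logger.warning(
        "Failed to convert column '%s' into pyarrow array due to: %s; "
        "falling back to serialize as pickled python objects",
        column,
        truncate_error_message(str(error)),
        exc_info=error,
    )
```

Short messages pass through untouched, so only oversized array dumps are affected by the helper.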

Changes Made

Modified: python/ray/air/util/tensor_extensions/arrow.py
- Added truncation logic for error messages in warning logs
- Configurable max length (200 chars)
Added: python/ray/data/tests/test_arrow_warning_truncation.py
- Test for proper truncation with large arrays
- Test that small errors aren't unnecessarily truncated

Example

Before:

WARNING arrow.py:194 -- Failed to convert column 'flat_images' into pyarrow array
due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255],
[130, 115, 252], ... (10,000+ more characters) ...

After:

WARNING arrow.py:196 -- Failed to convert column 'flat_images' into pyarrow array
due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255]...
[truncated 9834 chars]; falling back to serialize as pickled python objects

Testing

Added comprehensive tests
Tests verify truncation behavior
Tests verify small errors aren't affected
Manually verified with reproduction code from issue

Checklist

Code follows project style guidelines
Self-reviewed the code
Added comments for complex logic
No new warnings introduced
Added tests that prove fix works
Backward compatible - no breaking changes

Related Issue

Fixes #57840

When Ray Data encounters Arrow conversion errors with nested datatypes, it falls back to pickled Python objects but previously logged extremely verbose warnings including full array dumps. This made application logs noisy and difficult to read.

Changes:
- Truncate error messages in arrow conversion warnings to 200 characters
- Add indication of truncated content size
- Preserve full error details via exc_info for debugging
- Add comprehensive tests for warning truncation behavior

The fix maintains backward compatibility while significantly improving log readability. Full error details remain available through the exc_info stack trace when needed for debugging.

Fixes #57840

gemini-code-assist

Code Review

This pull request aims to reduce the verbosity of Arrow conversion warning logs by truncating long error messages. The implementation in arrow.py correctly truncates the message, and new tests are added to verify this behavior. My review includes a suggestion to improve maintainability by using a defined constant instead of a magic number for the truncation length. Additionally, I've identified issues in the new test file where a test case doesn't trigger the intended error and an assertion is incorrect, and I've provided a corrected version of the test.

python/ray/data/tests/test_arrow_warning_truncation.py

python/ray/air/util/tensor_extensions/arrow.py

richardliaw · 2025-10-21T20:49:12Z

@cwarre33 awesome!! thanks for pushing out a fix so quickly. Will review soon.

gvspraveen · 2025-10-21T20:51:19Z

@cwarre33 thanks for quick fix!

iamjustinhsu

Thanks for the contribution! some nits

iamjustinhsu · 2025-10-21T21:27:24Z

python/ray/air/util/tensor_extensions/arrow.py

+            error_str = str(ace)
+            max_error_len = 200
+            if len(error_str) > max_error_len:
+                error_str = error_str[:max_error_len] + f"... [truncated {len(error_str) - max_error_len} chars]"


nice! I noticed at least 1 more area where we truncate https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/metadata_exporter.py#L167. Do u think you could unify both areas (here and there)?

iamjustinhsu · 2025-10-21T21:29:52Z

python/ray/data/tests/test_arrow_warning_truncation.py

+    )
+
+
+def test_arrow_conversion_small_error_not_truncated(caplog):


can you combine this test with the other by using pytest.mark.parametrize? One thought that comes to mind is renaming test_arrow_conversion_warning_truncation to test_arrow_conversion_warning and then doing this:

@pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncate_msg"])
def test_arrow_conversion_warning

(warning) haven't tested

…e constants

Address reviewer feedback from PR #57840:
1. Use ArrowConversionError.MAX_DATA_STR_LEN constant instead of magic number (200)
2. Rely on ArrowConversionError's built-in truncation instead of double-truncating (the constructor already truncates the data string to MAX_DATA_STR_LEN)
3. Refactor tests to use @pytest.mark.parametrize for cleaner test organization, combining test_arrow_conversion_warning_truncation and test_arrow_conversion_small_error_not_truncated into single parametrized test

This improves code maintainability, reduces duplication, and prevents double truncation that could confuse logs.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
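
To illustrate what "built-in truncation" in the constructor might look like, here is a hedged sketch. `ConversionErrorSketch` is a hypothetical stand-in, not Ray's `ArrowConversionError`; only the constant name `MAX_DATA_STR_LEN`, the value 200, and the "Error converting data to Arrow:" prefix are taken from this thread.

```python
class ConversionErrorSketch(Exception):
    """Hypothetical stand-in for an exception that bounds its own message size."""

    # Class-level constant, so callers can reference it instead of a magic number.
    MAX_DATA_STR_LEN = 200

    def __init__(self, data_str: str):
        # Truncate once, at construction time, so every consumer of str(error)
        # sees the same bounded message and no caller needs to truncate again.
        if len(data_str) > self.MAX_DATA_STR_LEN:
            data_str = data_str[: self.MAX_DATA_STR_LEN] + "..."
        super().__init__(f"Error converting data to Arrow: {data_str}")


error = ConversionErrorSketch("9" * 10_000)
message = str(error)
```

If the exception already bounds its message like this, a second truncation in the logging path would be redundant, which is the double-truncation concern the commit message raises.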

richardliaw · 2025-10-23T16:16:20Z

could you help fix the ci @cwarre33 ? need to address DCO and the microcheck failure

cursor · 2025-10-27T19:04:47Z

python/ray/data/tests/test_arrow_warning_truncation.py

+        (1, "full_msg"),
+        (50, "truncated_msg"),
+    ],
+)


Bug: Misuse of ids in pytest.mark.parametrize

The pytest.mark.parametrize decorator is incorrectly structured. The parameter "ids" is being used as a test parameter value, but "ids" is a reserved keyword argument in pytest.mark.parametrize meant for providing custom test IDs. This creates confusion and incorrect test parametrization. The correct approach would be to either: (1) rename the parameter to something like "test_case" or "scenario", or (2) use the ids keyword argument properly: @pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncated_msg"]) and remove ids from the function parameters.
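
The second option can be sketched as below, assuming pytest is installed. `build_warning` and `test_warning_message` are hypothetical names used only for this illustration; the point is that `ids` is passed as a keyword argument to the decorator and does not appear in the test function's signature.

```python
import pytest

MAX_LEN = 200  # limit from the PR description


def build_warning(payload_len: int) -> str:
    """Hypothetical helper: build a payload and truncate it past MAX_LEN."""
    payload = "x" * payload_len
    if len(payload) > MAX_LEN:
        omitted = len(payload) - MAX_LEN
        payload = f"{payload[:MAX_LEN]}... [truncated {omitted} chars]"
    return payload


# `ids` labels the two cases in pytest's output; it is not a test parameter.
@pytest.mark.parametrize("payload_len", [10, 5000], ids=["full_msg", "truncated_msg"])
def test_warning_message(payload_len):
    message = build_warning(payload_len)
    if payload_len <= MAX_LEN:
        assert "truncated" not in message
    else:
        assert message.endswith(f"[truncated {payload_len - MAX_LEN} chars]")
```

With this layout the cases are reported as `test_warning_message[full_msg]` and `test_warning_message[truncated_msg]`.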

gvspraveen · 2025-11-05T19:05:42Z

@cwarre33 Thanks for your contributions! Can you help fix test failures and DCO?

github-actions · 2025-11-27T00:38:32Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

iamjustinhsu · 2026-02-04T19:23:15Z

Hi @cwarre33, are you still working on this?

cwarre33 requested review from a team as code owners October 20, 2025 16:57

gemini-code-assist bot reviewed Oct 20, 2025


ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Oct 20, 2025

iamjustinhsu reviewed Oct 21, 2025


cwarre33 force-pushed the fix/issue-57840-arrow-warning-verbosity branch from 8b4b9a1 to ab1a33a Compare October 21, 2025 23:34

Merge branch 'master' into fix/issue-57840-arrow-warning-verbosity

0cd018e

cursor bot reviewed Oct 27, 2025


gvspraveen added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Nov 12, 2025

github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 27, 2025

omatthew98 changed the title ~~fix: reduce verbosity of arrow conversion warning logs~~ [data] fix: reduce verbosity of arrow conversion warning logs Dec 4, 2025

github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Dec 4, 2025

cwarre33 closed this Feb 27, 2026
