[data] fix: reduce verbosity of arrow conversion warning logs #57916

Closed
cwarre33 wants to merge 3 commits into ray-project:master from cwarre33:fix/issue-57840-arrow-warning-verbosity

Conversation

@cwarre33

Summary

This PR addresses issue #57840 by reducing the verbosity of arrow conversion warning logs.

Problem

When Ray Data encounters Arrow conversion errors with nested datatypes, it falls back to pickled Python objects but logs extremely verbose warnings that include full array dumps (10,000+ characters), making application logs noisy and difficult to read.

Solution

  • Truncate error messages in warnings to 200 characters
  • Add indication of truncated content size: [truncated X chars]
  • Preserve full error details via exc_info for debugging when needed
  • Maintain backward compatibility
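
The truncation described in this list can be sketched as follows. This is a minimal illustration, not the code from the PR: the names `truncate_error`, `warn_conversion_failure`, and `MAX_ERROR_LEN` are hypothetical, and the 200-character limit is taken from the description above.

```python
import logging

logger = logging.getLogger(__name__)

MAX_ERROR_LEN = 200  # assumed limit, taken from the PR description


def truncate_error(message: str, max_len: int = MAX_ERROR_LEN) -> str:
    """Return the message unchanged if short, else its prefix plus a size note."""
    if len(message) <= max_len:
        return message
    dropped = len(message) - max_len
    return f"{message[:max_len]}... [truncated {dropped} chars]"


def warn_conversion_failure(column: str, error: Exception) -> None:
    """Log a short warning; exc_info attaches the exception to the log record."""
    logger.warning(
        "Failed to convert column '%s' into pyarrow array due to: %s; "
        "falling back to serialize as pickled python objects",
        column,
        truncate_error(str(error)),
        exc_info=error,  # logging accepts an exception instance here
    )
```

With these assumptions, a 10,034-character error message keeps its first 200 characters and ends with `[truncated 9834 chars]`, while a message of 200 characters or fewer passes through untouched.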

Changes Made

  • Modified: python/ray/air/util/tensor_extensions/arrow.py
    • Added truncation logic for error messages in warning logs
    • Configurable max length (200 chars)
  • Added: python/ray/data/tests/test_arrow_warning_truncation.py
    • Test for proper truncation with large arrays
    • Test that small errors aren't unnecessarily truncated

Example

Before:

WARNING arrow.py:194 -- Failed to convert column 'flat_images' into pyarrow array
due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255],
[130, 115, 252], ... (10,000+ more characters) ...

After:

WARNING arrow.py:196 -- Failed to convert column 'flat_images' into pyarrow array
due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255]...
[truncated 9834 chars]; falling back to serialize as pickled python objects

Testing

  • Added comprehensive tests
  • Tests verify truncation behavior
  • Tests verify small errors aren't affected
  • Manually verified with reproduction code from issue

Checklist

  • Code follows project style guidelines
  • Self-reviewed the code
  • Added comments for complex logic
  • No new warnings introduced
  • Added tests that prove fix works
  • Backward compatible - no breaking changes

Related Issue

Fixes #57840

When Ray Data encounters Arrow conversion errors with nested datatypes,
it falls back to pickled Python objects but previously logged extremely
verbose warnings including full array dumps. This made application logs
noisy and difficult to read.

Changes:
- Truncate error messages in arrow conversion warnings to 200 characters
- Add indication of truncated content size
- Preserve full error details via exc_info for debugging
- Add comprehensive tests for warning truncation behavior

The fix maintains backward compatibility while significantly improving
log readability. Full error details remain available through the
exc_info stack trace when needed for debugging.

Fixes #57840
@cwarre33 cwarre33 requested review from a team as code owners October 20, 2025 16:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to reduce the verbosity of Arrow conversion warning logs by truncating long error messages. The implementation in arrow.py correctly truncates the message, and new tests are added to verify this behavior. My review includes a suggestion to improve maintainability by using a defined constant instead of a magic number for the truncation length. Additionally, I've identified issues in the new test file where a test case doesn't trigger the intended error and an assertion is incorrect, and I've provided a corrected version of the test.

@ray-gardener bot added the labels data (Ray Data-related issues) and community-contribution (Contributed by the community) on Oct 20, 2025
@richardliaw
Contributor

@cwarre33 awesome!! thanks for pushing out a fix so quickly. Will review soon.

@gvspraveen
Contributor

@cwarre33 thanks for quick fix!

Contributor

@iamjustinhsu iamjustinhsu left a comment

Thanks for the contribution! some nits

error_str = str(ace)
max_error_len = 200
if len(error_str) > max_error_len:
    error_str = error_str[:max_error_len] + f"... [truncated {len(error_str) - max_error_len} chars]"
Contributor

@iamjustinhsu iamjustinhsu Oct 21, 2025

nice! I noticed at least 1 more area where we truncate https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/metadata_exporter.py#L167. Do u think you could unify both areas (here and there)?

)


def test_arrow_conversion_small_error_not_truncated(caplog):
Contributor

@iamjustinhsu iamjustinhsu Oct 21, 2025

can you combine this test with the other by using pytest.mark.parametrize? One thought that comes to mind is renaming test_arrow_conversion_warning_truncation to test_arrow_conversion_warning and then doing this:

@pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncate_msg"])
def test_arrow_conversion_warning(caplog, dataset_size):
    ...

(warning) haven't tested

cursor[bot]

This comment was marked as outdated.

@cwarre33 cwarre33 force-pushed the fix/issue-57840-arrow-warning-verbosity branch from 8b4b9a1 to ab1a33a Compare October 21, 2025 23:34
…e constants

Address reviewer feedback from PR #57840:

1. Use ArrowConversionError.MAX_DATA_STR_LEN constant instead of magic number (200)
2. Rely on ArrowConversionError's built-in truncation instead of double-truncating
   (the constructor already truncates the data string to MAX_DATA_STR_LEN)
3. Refactor tests to use @pytest.mark.parametrize for cleaner test organization,
   combining test_arrow_conversion_warning_truncation and
   test_arrow_conversion_small_error_not_truncated into single parametrized test

This improves code maintainability, reduces duplication, and prevents double
truncation that could confuse logs.
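
As a rough illustration of point 2, the sketch below shows an exception type that bounds its own message in the constructor, so the logging call site does not need a second cut. `SketchConversionError` and `format_warning` are hypothetical stand-ins, not Ray's `ArrowConversionError`, and the value 200 for `MAX_DATA_STR_LEN` is an assumption based on the earlier revision of this PR.

```python
class SketchConversionError(Exception):
    """Hypothetical stand-in for an error type that truncates its own payload."""

    MAX_DATA_STR_LEN = 200  # assumed value; the real constant may differ

    def __init__(self, data_str: str):
        # Bound the payload once, at construction time.
        if len(data_str) > self.MAX_DATA_STR_LEN:
            data_str = data_str[: self.MAX_DATA_STR_LEN] + "..."
        super().__init__(f"Error converting data to Arrow: {data_str}")


def format_warning(column: str, error: SketchConversionError) -> str:
    # The message is already bounded by the constructor, so no second cut here.
    return (
        f"Failed to convert column '{column}' into pyarrow array due to: {error}; "
        "falling back to serialize as pickled python objects"
    )
```

Keeping the limit on the exception class gives one place to change it, and the warning text never carries two different truncation markers.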

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@richardliaw
Contributor

could you help fix the ci @cwarre33 ? need to address DCO and the microcheck failure

(1, "full_msg"),
(50, "truncated_msg"),
],
)
Bug: Misuse of ids in pytest.mark.parametrize

The pytest.mark.parametrize decorator is incorrectly structured. The parameter "ids" is being used as a test parameter value, but "ids" is a reserved keyword argument in pytest.mark.parametrize meant for providing custom test IDs. This creates confusion and incorrect test parametrization. The correct approach would be to either: (1) rename the parameter to something like "test_case" or "scenario", or (2) use the ids keyword argument properly: @pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncated_msg"]) and remove ids from the function parameters.
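
Option (2) could look roughly like the sketch below. The `shorten` helper and `MAX_LEN` are hypothetical placeholders for the code under test, and the test body is illustrative only; the point is that `ids` is passed as a keyword argument and does not appear in the function signature.

```python
import pytest

MAX_LEN = 200  # assumed truncation limit, for illustration only


def shorten(text: str) -> str:
    """Hypothetical stand-in for the code under test."""
    return text if len(text) <= MAX_LEN else text[:MAX_LEN] + "..."


# `ids` is a keyword argument of parametrize: it only labels the generated
# test cases and is never passed to the test function.
@pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncated_msg"])
def test_arrow_conversion_warning(dataset_size):
    message = "pixel,pixel," * dataset_size  # 12 chars per repetition
    result = shorten(message)
    if len(message) <= MAX_LEN:
        assert result == message
    else:
        assert result.endswith("...")
        assert len(result) == MAX_LEN + 3
```

Under pytest this produces two cases, reported as `test_arrow_conversion_warning[full_msg]` and `test_arrow_conversion_warning[truncated_msg]`.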

@gvspraveen
Contributor

@cwarre33 Thanks for your contributions! Can you help fix test failures and DCO?

@gvspraveen added the @external-author-action-required label (Alternate tag for PRs where the author doesn't have labeling permission.) on Nov 12, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) on Nov 27, 2025
@omatthew98 changed the title from "fix: reduce verbosity of arrow conversion warning logs" to "[data] fix: reduce verbosity of arrow conversion warning logs" on Dec 4, 2025
@github-actions bot added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label on Dec 4, 2025
@iamjustinhsu
Contributor

Hi @cwarre33, are you still working on this?

@cwarre33 cwarre33 closed this Feb 27, 2026

Labels

  • community-contribution: Contributed by the community
  • data: Ray Data-related issues
  • @external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Development

Successfully merging this pull request may close these issues.

[Data] Make arrow conversion warning logs less verbose

4 participants