[data] fix: reduce verbosity of arrow conversion warning logs#57916
[data] fix: reduce verbosity of arrow conversion warning logs#57916cwarre33 wants to merge 3 commits intoray-project:masterfrom
Conversation
When Ray Data encounters Arrow conversion errors with nested datatypes, it falls back to pickled Python objects but previously logged extremely verbose warnings including full array dumps. This made application logs noisy and difficult to read. Changes: - Truncate error messages in arrow conversion warnings to 200 characters - Add indication of truncated content size - Preserve full error details via exc_info for debugging - Add comprehensive tests for warning truncation behavior The fix maintains backward compatibility while significantly improving log readability. Full error details remain available through the exc_info stack trace when needed for debugging. Fixes #57840
There was a problem hiding this comment.
Code Review
This pull request aims to reduce the verbosity of Arrow conversion warning logs by truncating long error messages. The implementation in arrow.py correctly truncates the message, and new tests are added to verify this behavior. My review includes a suggestion to improve maintainability by using a defined constant instead of a magic number for the truncation length. Additionally, I've identified issues in the new test file where a test case doesn't trigger the intended error and an assertion is incorrect, and I've provided a corrected version of the test.
|
@cwarre33 awesome!! thanks for pushing out a fix so quickly. Will review soon. |
|
@cwarre33 thanks for quick fix! |
iamjustinhsu
left a comment
There was a problem hiding this comment.
Thanks for the contribution! some nits
| error_str = str(ace) | ||
| max_error_len = 200 | ||
| if len(error_str) > max_error_len: | ||
| error_str = error_str[:max_error_len] + f"... [truncated {len(error_str) - max_error_len} chars]" |
There was a problem hiding this comment.
nice! I noticed at least 1 more area where we truncate https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/metadata_exporter.py#L167. Do u think you could unify both areas (here and there)?
| ) | ||
|
|
||
|
|
||
| def test_arrow_conversion_small_error_not_truncated(caplog): |
There was a problem hiding this comment.
can you combine this test with the other by using pytest.mark.parametrize? One thought that comes to mind is renaming test_arrow_conversion_warning_truncation to test_arrow_conversion_warning and then doing this:
@pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncate_msg"])
def test_arrow_conversion_warning(warning) haven't tested
8b4b9a1 to
ab1a33a
Compare
…e constants Address reviewer feedback from PR #57840: 1. Use ArrowConversionError.MAX_DATA_STR_LEN constant instead of magic number (200) 2. Rely on ArrowConversionError's built-in truncation instead of double-truncating (the constructor already truncates the data string to MAX_DATA_STR_LEN) 3. Refactor tests to use @pytest.mark.parametrize for cleaner test organization, combining test_arrow_conversion_warning_truncation and test_arrow_conversion_small_error_not_truncated into single parametrized test This improves code maintainability, reduces duplication, and prevents double truncation that could confuse logs. 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
|
could you help fix the ci @cwarre33 ? need to address DCO and the microcheck failure |
| (1, "full_msg"), | ||
| (50, "truncated_msg"), | ||
| ], | ||
| ) |
There was a problem hiding this comment.
Bug: Misuse of ids in pytest.mark.parametrize
The pytest.mark.parametrize decorator is incorrectly structured. The parameter "ids" is being used as a test parameter value, but "ids" is a reserved keyword argument in pytest.mark.parametrize meant for providing custom test IDs. This creates confusion and incorrect test parametrization. The correct approach would be to either: (1) rename the parameter to something like "test_case" or "scenario", or (2) use the ids keyword argument properly: @pytest.mark.parametrize("dataset_size", [1, 50], ids=["full_msg", "truncated_msg"]) and remove ids from the function parameters.
|
@cwarre33 Thanks for your contributions! Can you help fix test failures and DCO? |
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
Hi @cwarre33, are you still working on this? |
Summary
This PR addresses issue #57840 by reducing the verbosity of arrow conversion warning logs.
Problem
When Ray Data encounters Arrow conversion errors with nested datatypes, it falls back to pickled Python objects but logs extremely verbose warnings that include full array dumps (10,000+ characters), making application logs noisy and difficult to read.
Solution
[truncated X chars]exc_infofor debugging when neededChanges Made
python/ray/air/util/tensor_extensions/arrow.pypython/ray/data/tests/test_arrow_warning_truncation.pyExample
Before:
After:
Testing
Checklist
Related Issue
Fixes #57840