Skip to content

[core][tune] fix RayTaskError (de)serialization logic #54396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

matthewdeng
Copy link
Contributor

Summary

Fixes a (de)serialization issue in RayTaskError that happens when the original cause has its own __reduce__ method defined.

Closes #54379.

Details

The existing implementation relies on the definition of BaseException.__reduce__:

# BaseException implements a __reduce__ method that returns
# a tuple with the type and the value of self.args.
# https://stackoverflow.com/a/49715949/2213289

However, since it's also subclassing the cause_cls, if __reduce__ is also defined there then that will override the original definition from BaseException.

This surfaced in #54379 because Ray Train V2 implements exception classes with custom __reduce__ definitions (e.g. TrainingFailedError).

The fix is to explicitly define __reduce__. Also removed the comment since it's no longer an assumption that's needed to understand the logic.

Testing

Added test_errors() test case in ray/tune/tests/test_train_v2_integration.py.

Signed-off-by: Matthew Deng <matt@anyscale.com>
@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 00:43
@matthewdeng matthewdeng requested a review from a team as a code owner July 8, 2025 00:43
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Fix serialization of RayTaskError when the cause exception defines its own __reduce__ and add an integration test for error capture.

  • Explicitly define __reduce__ in dynamically generated dual-inheritance exceptions
  • Remove outdated comment about relying on BaseException.__reduce__
  • Add test_errors in test_train_v2_integration.py and a new Ray fixture

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
python/ray/exceptions.py Add __reduce__ methods in both dual-exception closures to fix de/serialization
python/ray/tune/tests/test_train_v2_integration.py Introduce ray_start_4_cpus fixture and test_errors to validate error capture
Comments suppressed due to low confidence (2)

python/ray/tune/tests/test_train_v2_integration.py:114

  • The test only exercises a standard RuntimeError, but doesn't verify serialization for exceptions with custom __reduce__ methods (e.g., TrainingFailedError). Add a test case for a custom exception subclass to ensure the fix covers that scenario.
def test_errors(ray_start_4_cpus):

python/ray/tune/tests/test_train_v2_integration.py:17

  • Using a function-scoped fixture for ray_start_4_cpus causes Ray to init/shutdown for each test, which may slow down the test suite. Consider using scope="module" to reduce overhead.
@pytest.fixture()

self.args = (cause,)

def __reduce__(self):
return (cls, self.args)
Copy link
Preview

Copilot AI Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This __reduce__ implementation is duplicated in a second closure at line 201. Consider extracting it into a shared helper or mixin to adhere to DRY principles.

Suggested change
return (cls, self.args)
return _reduce_helper(cls, self.args)

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Train V2 + Tune] Exception in Tune while processing results from failed runs using TorchTrainer
1 participant