[train][tests] Abort cancels validation tasks and has deterministic behavior for resumption#61510
Conversation
…ehavior Signed-off-by: Timothy Seah <tseah@anyscale.com>
python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py
Show resolved
Hide resolved
Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request aims to fix a race condition during validation task cancellation upon run abortion. The changes introduce a before_controller_abort hook to explicitly cancel pending validation tasks and make the corresponding test more deterministic by using SIGINT instead of ray.cancel. The overall approach is sound, but I've found a critical issue where the new before_controller_abort hook is not actually called, which means the fix for cancelling validation tasks is incomplete. I've provided a detailed comment with a suggested fix.
Note: Security Review did not run due to the size of the PR.
python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py
Show resolved
Hide resolved
justinvyu
left a comment
There was a problem hiding this comment.
Here's a summary of the problem:
Good timing (test passes):
1. Validation task still running
2. after_controller_state_update() → polls validation → still pending, nothing to do
3. State → ABORTED
4. Validation task gets cancelled (result thrown away)
5. Second run resumes the unfinished validation ✓
Bad timing (test fails):
1. Validation task gets cancelled (returns a result/error)
2. after_controller_state_update() → polls validation → sees it "finished", saves metrics
3. State → ABORTED
4. Second run finds validation already processed, nothing to resume ✗
The main problem here was the "recursive cancelling" done by ray.cancel(). Can this happen in a regular abort scenario? In that case, the running validations tasks only go out of scope once the controller has exited, so this error may not happen in practice.
Yeah this was my exact thought process, thanks for formalizing it! |
Summary
We observed the following flaky behavior in
test_report_validation_fn_resumptionEssentially what happened was
ray.cancelSIGINTto the trainer and recursively cancelled the validation taskSIGINTmade Ray Train go through the abortion code, which includes transitioning to theABORTEDstateafter_controller_state_updatesometimes (I think there's a race between whenafter_controller_state_updateis called and when the validation task is cancelled) saw the cancelled validation task and updated the metrics accordingly.This PR avoids this race by:
after_controller_state_updateto do nothing if the state is terminal. We will process validation tasks fromFINISHEDandERROREDtrain runs inbefore_controller_shutdownand cancel validation tasks inABORTEDtrain runs inbefore_controller_abort.test_report_validation_fn_resumptionto SIGINT a process (the same pattern astest_sigint_abort) rather thanray.cancela task (the old pattern) to deterministically test the graceful abortion path.Testing
Unit tests.