[train][tests] Abort cancels validation tasks and has deterministic behavior for resumption by TimothySeah · Pull Request #61510 · ray-project/ray

TimothySeah · 2026-03-05T03:31:15Z

Summary

We observed the following flaky behavior in test_report_validation_fn_resumption:

  [2026-03-04T02:25:25Z] (TrainController pid=6108) Launched async validation task for checkpoint Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/validation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)
  [2026-03-04T02:25:25Z] (run_trainer pid=6028) Received SIGINT. Gracefully aborting the training run — this may take a few seconds. To forcefully abort immediately, you can send a different signal, such as SIGKILL.
  [2026-03-04T02:25:25Z] (TrainController pid=6108) Finished async validation task(s) for checkpoint(s): [Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/validation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)].
  ...
  [2026-03-04T02:25:25Z] (TrainController pid=6108) ray.exceptions.TaskCancelledError: Task: TaskID(4e6f2dd62797ace7ffffffffffffffffffffffff01000000) was cancelled.
  [2026-03-04T02:25:25Z] (TrainController pid=6362) A run snapshot was found in storage folder at: '/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/validation_fn_resumption'
  [2026-03-04T02:25:25Z] (TrainController pid=6362) This snapshot contains a list of checkpoints reported via `ray.train.report` and will be loaded. This allows the latest checkpoint found in the snapshot to be accessible within your training function via `ray.train.get_checkpoint`.

Essentially, what happened was:

1. The unit test called ray.cancel.
2. This sent a SIGINT to the trainer and recursively cancelled the validation task.
3. The SIGINT made Ray Train go through its abort path, which includes transitioning to the ABORTED state.
4. after_controller_state_update sometimes saw the cancelled validation task and updated the metrics accordingly. I think there is a race between when after_controller_state_update is called and when the validation task is cancelled.
5. As a result, the second train run had no validations to resume.

This PR avoids this race by:

1. Changing after_controller_state_update to do nothing if the state is terminal. Validation tasks from FINISHED and ERRORED train runs are processed in before_controller_shutdown, and validation tasks in ABORTED train runs are cancelled in before_controller_abort.
2. Changing test_report_validation_fn_resumption to SIGINT a process (the same pattern as test_sigint_abort) rather than ray.cancel a task (the old pattern), so that the graceful abort path is tested deterministically.
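
To make the first change concrete, here is a minimal sketch of the guard, not the actual validation_manager.py code. The State enum, FakeValidationTask, and ValidationManagerSketch are hypothetical stand-ins invented for illustration; only the three hook names come from this PR, and their signatures here are assumptions.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical controller states; only the names mirror the PR description.
    RUNNING = auto()
    FINISHED = auto()
    ERRORED = auto()
    ABORTED = auto()


TERMINAL_STATES = {State.FINISHED, State.ERRORED, State.ABORTED}


class FakeValidationTask:
    """Hypothetical stand-in for an async validation task."""

    def __init__(self, checkpoint):
        self.checkpoint = checkpoint
        self.status = "pending"  # one of "pending", "done", "cancelled"

    def cancel(self):
        if self.status == "pending":
            self.status = "cancelled"


class ValidationManagerSketch:
    """Illustrative only: shows where polling is skipped, not how Ray does it."""

    def __init__(self):
        self.pending = []   # tasks not yet recorded
        self.recorded = []  # checkpoints whose validation was recorded

    def _poll(self):
        # Record every task that is no longer pending. A cancelled task also
        # counts as "no longer pending", which is what made the race harmful.
        still_pending = []
        for task in self.pending:
            if task.status == "pending":
                still_pending.append(task)
            else:
                self.recorded.append(task.checkpoint)
        self.pending = still_pending

    def after_controller_state_update(self, previous, current):
        # The guard: do nothing once the controller reaches a terminal state.
        if current in TERMINAL_STATES:
            return
        self._poll()

    def before_controller_shutdown(self):
        # FINISHED and ERRORED runs still process their validation results.
        self._poll()

    def before_controller_abort(self):
        # ABORTED runs cancel tasks without recording them, so the next run
        # can pick them up again.
        for task in self.pending:
            task.cancel()
```

With this guard, a cancellation that lands before the transition to ABORTED is no longer recorded as a finished validation, so the outcome does not depend on timing.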

Testing

Unit tests.
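
For readers unfamiliar with the "SIGINT a process" pattern, the following is a rough, standard-library-only sketch of the idea. It is not the code of test_report_validation_fn_resumption or test_sigint_abort: the child script, the run_and_interrupt helper, and the "ready"/"aborted" handshake are all assumptions made for illustration, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child script: stands in for the process that runs the trainer.
CHILD = textwrap.dedent(
    """
    import signal, sys, time
    # Make sure SIGINT raises KeyboardInterrupt even if the parent ignores it.
    signal.signal(signal.SIGINT, signal.default_int_handler)
    try:
        print("ready", flush=True)  # tell the parent we are running
        time.sleep(60)              # stand-in for a long training run
    except KeyboardInterrupt:
        print("aborted", flush=True)  # stand-in for the graceful abort path
        sys.exit(0)
    """
)


def run_and_interrupt():
    """Start the child, wait until it is running, then send it SIGINT."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Block until the child reports that it is inside its try block.
    assert proc.stdout.readline().strip() == "ready"
    proc.send_signal(signal.SIGINT)
    remaining_output, _ = proc.communicate(timeout=30)
    return proc.returncode, remaining_output.strip()


if __name__ == "__main__":
    print(run_and_interrupt())
```

Signalling the process directly exercises only the graceful abort handler, whereas ray.cancel also cancels child tasks recursively, which is where the nondeterminism described above came from.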

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py

gemini-code-assist

Code Review

This pull request aims to fix a race condition during validation task cancellation upon run abortion. The changes introduce a before_controller_abort hook to explicitly cancel pending validation tasks and make the corresponding test more deterministic by using SIGINT instead of ray.cancel. The overall approach is sound, but I've found a critical issue where the new before_controller_abort hook is not actually called, which means the fix for cancelling validation tasks is incomplete. I've provided a detailed comment with a suggested fix.

Note: Security Review did not run due to the size of the PR.

python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py

justinvyu

Here's a summary of the problem:

  Good timing (test passes):
  1. Validation task still running
  2. after_controller_state_update() → polls validation → still pending, nothing to do
  3. State → ABORTED
  4. Validation task gets cancelled (result thrown away)
  5. Second run resumes the unfinished validation ✓

  Bad timing (test fails):
  1. Validation task gets cancelled (returns a result/error)
  2. after_controller_state_update() → polls validation → sees it "finished", saves metrics
  3. State → ABORTED
  4. Second run finds validation already processed, nothing to resume ✗

The main problem here was the "recursive cancelling" done by ray.cancel(). Can this happen in a regular abort scenario? In that case, the running validation tasks only go out of scope once the controller has exited, so this error may not happen in practice.
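
The two orderings above can be replayed with a toy model. This is only a sketch of the reasoning, not Ray code: the replay function and the "poll" and "cancel" event names are hypothetical, with "poll" standing in for the polling done by after_controller_state_update and "cancel" for the recursive cancellation.

```python
def replay(events):
    """Replay a sequence of events against a single validation task.

    Returns what the second run would observe on resumption.
    """
    task_status = "pending"
    recorded = False  # True once the first run has saved a result for the task
    for event in events:
        if event == "cancel":
            if task_status == "pending":
                task_status = "cancelled"
        elif event == "poll":
            # A cancelled task looks "finished" to the poller, so it is recorded.
            if task_status != "pending":
                recorded = True
        else:
            raise ValueError(f"unknown event: {event}")
    return "nothing to resume" if recorded else "validation resumed"


GOOD_TIMING = ["poll", "cancel"]  # poll sees a pending task, cancel comes later
BAD_TIMING = ["cancel", "poll"]   # cancel lands first, poll records the result

if __name__ == "__main__":
    print("good timing:", replay(GOOD_TIMING))
    print("bad timing:", replay(BAD_TIMING))
```

The same two events produce different outcomes depending only on their order, which is why the test was flaky.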

TimothySeah · 2026-03-12T22:04:49Z

Yeah this was my exact thought process, thanks for formalizing it!

[train][tests] Abort cancels validation tasks and has deterministic behavior

c072097

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from a team as a code owner March 5, 2026 03:31

cursor bot reviewed Mar 5, 2026

python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py

upstream before_controller_abort changes

7a8fef5

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist bot reviewed Mar 5, 2026

python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py

ray-gardener bot added the train (Ray Train Related Issue) and stability labels Mar 5, 2026

TimothySeah changed the title ~~[train][tests] Abort cancels validation tasks and has deterministic behavior~~ [train][tests] Abort cancels validation tasks and has deterministic behavior for resumption Mar 9, 2026

justinvyu reviewed Mar 10, 2026

justinvyu approved these changes Mar 10, 2026

TimothySeah added the go (add ONLY when ready to merge, run all tests) label Mar 12, 2026

justinvyu enabled auto-merge (squash) March 12, 2026 22:32

justinvyu merged commit 1926bb0 into ray-project:master Mar 12, 2026
8 checks passed

justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/fix-ray-train-flaky-tests