Skip to content

[train][tests] Abort cancels validation tasks and has deterministic behavior for resumption#61510

Merged
justinvyu merged 2 commits intoray-project:masterfrom
TimothySeah:tseah/fix-ray-train-flaky-tests
Mar 12, 2026
Merged

[train][tests] Abort cancels validation tasks and has deterministic behavior for resumption#61510
justinvyu merged 2 commits intoray-project:masterfrom
TimothySeah:tseah/fix-ray-train-flaky-tests

Conversation

@TimothySeah
Copy link
Contributor

@TimothySeah TimothySeah commented Mar 5, 2026

Summary

We observed the following flaky behavior in test_report_validation_fn_resumption

  [2026-03-04T02:25:25Z] (TrainController pid=6108) Launched async validation task for checkpoint      
  Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/valid 
  ation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)                                           
  [2026-03-04T02:25:25Z] (run_trainer pid=6028) Received SIGINT. Gracefully aborting the training run  
  — this may take a few seconds. To forcefully abort immediately, you can send a different signal,     
  such as SIGKILL.                                                                                     
  [2026-03-04T02:25:25Z] (TrainController pid=6108) Finished async validation task(s) for              
  checkpoint(s): [Checkpoint(filesystem=local, path=/tmp/pytest-of-root/pytest-0/test_report_validatio 
  n_fn_resu0/validation_fn_resumption/checkpoint_2026-03-04_02-21-22.230987)].                         
 ...                                
  [2026-03-04T02:25:25Z] (TrainController pid=6108) ray.exceptions.TaskCancelledError: Task:           
  TaskID(4e6f2dd62797ace7ffffffffffffffffffffffff01000000) was cancelled.                              
  [2026-03-04T02:25:25Z] (TrainController pid=6362) A run snapshot was found in storage folder at:     
  '/tmp/pytest-of-root/pytest-0/test_report_validation_fn_resu0/validation_fn_resumption'              
  [2026-03-04T02:25:25Z] (TrainController pid=6362) This snapshot contains a list of checkpoints       
  reported via `ray.train.report` and will be loaded. This allows the latest checkpoint found in the   
  snapshot to be accessible within your training function via `ray.train.get_checkpoint`.              

Essentially what happened was

  1. The unit test called ray.cancel
  2. This sent a SIGINT to the trainer and recursively cancelled the validation task
  3. SIGINT made Ray Train go through the abortion code, which includes transitioning to the ABORTED state
  4. after_controller_state_update sometimes (I think there's a race between when after_controller_state_update is called and when the validation task is cancelled) saw the cancelled validation task and updated the metrics accordingly.
  5. The second train run had no validations to resume.

This PR avoids this race by:

  1. Changing after_controller_state_update to do nothing if the state is terminal. We will process validation tasks from FINISHED and ERRORED train runs in before_controller_shutdown and cancel validation tasks in ABORTED train runs in before_controller_abort.
  2. Changing test_report_validation_fn_resumption to SIGINT a process (the same pattern as test_sigint_abort) rather than ray.cancel a task (the old pattern) to deterministically test the graceful abortion path.

Testing

Unit tests.

…ehavior

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner March 5, 2026 03:31
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a race condition during validation task cancellation upon run abortion. The changes introduce a before_controller_abort hook to explicitly cancel pending validation tasks and make the corresponding test more deterministic by using SIGINT instead of ray.cancel. The overall approach is sound, but I've found a critical issue where the new before_controller_abort hook is not actually called, which means the fix for cancelling validation tasks is incomplete. I've provided a detailed comment with a suggested fix.

Note: Security Review did not run due to the size of the PR.

@ray-gardener ray-gardener bot added train Ray Train Related Issue stability labels Mar 5, 2026
@TimothySeah TimothySeah changed the title [train][tests] Abort cancels validation tasks and has deterministic behavior [train][tests] Abort cancels validation tasks and has deterministic behavior for resumption Mar 9, 2026
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a summary of the problem:

  Good timing (test passes):
  1. Validation task still running
  2. after_controller_state_update() → polls validation → still pending, nothing to do
  3. State → ABORTED
  4. Validation task gets cancelled (result thrown away)
  5. Second run resumes the unfinished validation ✓

  Bad timing (test fails):
  1. Validation task gets cancelled (returns a result/error)
  2. after_controller_state_update() → polls validation → sees it "finished", saves metrics
  3. State → ABORTED
  4. Second run finds validation already processed, nothing to resume ✗

The main problem here was the "recursive cancelling" done by ray.cancel(). Can this happen in a regular abort scenario? In that case, the running validations tasks only go out of scope once the controller has exited, so this error may not happen in practice.

@TimothySeah
Copy link
Contributor Author

Here's a summary of the problem:

  Good timing (test passes):
  1. Validation task still running
  2. after_controller_state_update() → polls validation → still pending, nothing to do
  3. State → ABORTED
  4. Validation task gets cancelled (result thrown away)
  5. Second run resumes the unfinished validation ✓

  Bad timing (test fails):
  1. Validation task gets cancelled (returns a result/error)
  2. after_controller_state_update() → polls validation → sees it "finished", saves metrics
  3. State → ABORTED
  4. Second run finds validation already processed, nothing to resume ✗

The main problem here was the "recursive cancelling" done by ray.cancel(). Can this happen in a regular abort scenario? In that case, the running validations tasks only go out of scope once the controller has exited, so this error may not happen in practice.

Yeah this was my exact thought process, thanks for formalizing it!

@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Mar 12, 2026
@justinvyu justinvyu enabled auto-merge (squash) March 12, 2026 22:32
@justinvyu justinvyu merged commit 1926bb0 into ray-project:master Mar 12, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests stability train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants