Skip to content

[Train] Add timeouts to validation functions of ray.train.report#62916

Merged
matthewdeng merged 13 commits into
ray-project:masterfrom
pseudo-rnd-thoughts:train-801-reimplement
May 11, 2026
Merged

[Train] Add timeouts to validation functions of ray.train.report#62916
matthewdeng merged 13 commits into
ray-project:masterfrom
pseudo-rnd-thoughts:train-801-reimplement

Conversation

@pseudo-rnd-thoughts
Copy link
Copy Markdown
Member

Description

For long-running validation tasks its possible that users want the ability to time them out.

This PR adds ValidationTaskConfig.timeout_s: Optional[float] to configure the timeout length either per validation task or for all tasks of a trainer.
To help users spot when a task has timed out then ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED have been added.

To check that this has been implemented correctly, I've added test_report_validation_fn_timeout that a validation task that sleep forever is cancelled and test_trainer_resumption_with_checkpoint_status that all the checkpoints are remembered when a trainer fails and restarted.

Implementation details

The timing out of validation tasks are handled by the ValidationManager where we now record when a task began, if its got a time out.

The CheckpointManager records the validation task's status for each checkpoint therefore, we need to pass back the updated metrics (if there are any), the validation's status (VALIDATED, VALIDATION_FAILED and VALIDATION_TIMEOUT). Further, we need to record this data to the manager state for resumption handling so we save to different lists those that validated successfully, timed out and failed

Signed-off-by: Mark Towers <mark@anyscale.com>
@pseudo-rnd-thoughts pseudo-rnd-thoughts requested a review from a team as a code owner April 24, 2026 19:09
@pseudo-rnd-thoughts pseudo-rnd-thoughts added the train Ray Train Related Issue label Apr 24, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces timeout and failure handling for asynchronous checkpoint validation. Key changes include the addition of VALIDATION_TIMEOUT and VALIDATION_FAILED statuses, updates to the CheckpointManager to track and persist these terminal states, and logic in the ValidationManager to proactively cancel tasks that exceed their defined timeout_s. Feedback focuses on a logic error in _resolve_timeout_s where a value of -1 (intended to disable timeouts) would lead to immediate task cancellation, and an improvement opportunity to process all finished validations in a loop to avoid potential bottlenecks.

Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Signed-off-by: Mark Towers <mark@anyscale.com>
…-review

Signed-off-by: Mark Towers <mark@anyscale.com>
Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Signed-off-by: Mark Towers <mark@anyscale.com>
@pseudo-rnd-thoughts pseudo-rnd-thoughts added go add ONLY when ready to merge, run all tests and removed go add ONLY when ready to merge, run all tests labels Apr 27, 2026
Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - probably just needs 1 more pass after this.

Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Comment thread python/ray/train/v2/api/reported_checkpoint.py Outdated
Comment thread python/ray/train/v2/tests/test_async_checkpointing_validation.py Outdated

# Assert checkpoint state after all tasks are done
checkpoint_manager.update_checkpoints_with_metrics(
checkpoint_manager.update_checkpoints_with_validation_result(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add test(s) the call update_checkpoints_with_validation_result with other ReportedCheckpointStatuses?

Copy link
Copy Markdown
Member Author

@pseudo-rnd-thoughts pseudo-rnd-thoughts Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly but I prefer E2E tests like those in test_async_checkpoint_validation rather than component specific tests like these. Happy to add if you think necessary or tests something different

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I agree that the component-specific tests can feel repetitive e.g. test_validation_manager.test_checkpoint_validation_management_reordering is basically the same as test_checkpointmanager.test_pending_checkpoint_management which is basically the same as the e2e tests in test_async_checkpointing_validation.py

assert vm._kick_off_validations() == 0
checkpoint_manager.update_checkpoints_with_metrics.assert_called_with(
{training_reports[2].checkpoint: {"score": 2}}
checkpoint_manager.update_checkpoints_with_validation_result.assert_called_with(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a quick test with ReportedCheckpointStatus.TIMEOUT

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above and I don't think it would fit into this test and would need to be a new test on this

Mark Towers added 2 commits April 28, 2026 14:18
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Signed-off-by: Mark Towers <mark@anyscale.com>

# Assert checkpoint state after all tasks are done
checkpoint_manager.update_checkpoints_with_metrics(
checkpoint_manager.update_checkpoints_with_validation_result(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I agree that the component-specific tests can feel repetitive e.g. test_validation_manager.test_checkpoint_validation_management_reordering is basically the same as test_checkpointmanager.test_pending_checkpoint_management which is basically the same as the e2e tests in test_async_checkpointing_validation.py

@pseudo-rnd-thoughts pseudo-rnd-thoughts added the go add ONLY when ready to merge, run all tests label Apr 30, 2026
Mark Towers added 2 commits May 1, 2026 12:16
Comment on lines +189 to +206
def _resolve_timeout_s(
self, validation: Union[bool, ValidationTaskConfig]
) -> Optional[float]:
"""Resolve the effective timeout for a validation task.

Per-task ``timeout_s=None`` falls back to the ValidationConfig default.
``timeout_s=-1`` disables the timeout even if a default is set.
"""
if (
isinstance(validation, ValidationTaskConfig)
and validation.timeout_s is not None
):
return None if validation.timeout_s < 0 else validation.timeout_s

default_timeout_s = self._validation_config.task_config.timeout_s
if default_timeout_s is None or default_timeout_s < 0:
return None
return default_timeout_s
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this validation logic with checks for -1? If validation is a ValidationTaskConfig could we always just use what is set there and ignore the default?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking something like this:

Suggested change
def _resolve_timeout_s(
self, validation: Union[bool, ValidationTaskConfig]
) -> Optional[float]:
"""Resolve the effective timeout for a validation task.
Per-task ``timeout_s=None`` falls back to the ValidationConfig default.
``timeout_s=-1`` disables the timeout even if a default is set.
"""
if (
isinstance(validation, ValidationTaskConfig)
and validation.timeout_s is not None
):
return None if validation.timeout_s < 0 else validation.timeout_s
default_timeout_s = self._validation_config.task_config.timeout_s
if default_timeout_s is None or default_timeout_s < 0:
return None
return default_timeout_s
def _resolve_timeout_s(
self, validation: Union[bool, ValidationTaskConfig]
) -> Optional[float]:
"""Resolve the effective timeout for a validation task.
"""
if isinstance(validation, ValidationTaskConfig):
return validation.timeout_s
return self._validation_config.task_config.timeout_s

Copy link
Copy Markdown
Member Author

@pseudo-rnd-thoughts pseudo-rnd-thoughts May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your right, I was over thinking the solution to where users want to use a ValidationTaskConfig but not overwrite the default validation task config however I think if a user specifies a task config then we should just rely on it completely

Will update to your solution

Signed-off-by: Mark Towers <mark@anyscale.com>
Comment thread python/ray/train/v2/api/validation_config.py Outdated
Comment thread python/ray/train/v2/_internal/execution/checkpoint/validation_manager.py Outdated
Signed-off-by: Mark Towers <mark@anyscale.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 0444a4f. Configure here.

Comment thread python/ray/train/v2/api/validation_config.py Outdated
Comment thread python/ray/train/v2/api/validation_config.py Outdated
Signed-off-by: Mark Towers <mark@anyscale.com>
@matthewdeng matthewdeng merged commit fca8cf7 into ray-project:master May 11, 2026
6 checks passed
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…ay-project#62916)

## Description
For long-running validation tasks its possible that users want the
ability to time them out.

This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to
configure the timeout length either per validation task or for all tasks
of a trainer.
To help users spot when a task has timed out then
`ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have
been added.

To check that this has been implemented correctly, I've added
`test_report_validation_fn_timeout` that a validation task that sleep
forever is cancelled and
`test_trainer_resumption_with_checkpoint_status` that all the
checkpoints are remembered when a trainer fails and restarted.

### Implementation details
The timing out of validation tasks are handled by the
`ValidationManager` where we now record when a task began, if its got a
time out.

The `CheckpointManager` records the validation task's status for each
checkpoint therefore, we need to pass back the updated metrics (if there
are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and
`VALIDATION_TIMEOUT`). Further, we need to record this data to the
manager state for resumption handling so we save to different lists
those that validated successfully, timed out and failed

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
…ay-project#62916)

## Description
For long-running validation tasks its possible that users want the
ability to time them out.

This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to
configure the timeout length either per validation task or for all tasks
of a trainer.
To help users spot when a task has timed out then
`ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have
been added.

To check that this has been implemented correctly, I've added
`test_report_validation_fn_timeout` that a validation task that sleep
forever is cancelled and
`test_trainer_resumption_with_checkpoint_status` that all the
checkpoints are remembered when a trainer fails and restarted.

### Implementation details
The timing out of validation tasks are handled by the
`ValidationManager` where we now record when a task began, if its got a
time out.

The `CheckpointManager` records the validation task's status for each
checkpoint therefore, we need to pass back the updated metrics (if there
are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and
`VALIDATION_TIMEOUT`). Further, we need to record this data to the
manager state for resumption handling so we save to different lists
those that validated successfully, timed out and failed

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ay-project#62916)

## Description
For long-running validation tasks its possible that users want the
ability to time them out.

This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to
configure the timeout length either per validation task or for all tasks
of a trainer.
To help users spot when a task has timed out then
`ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have
been added.

To check that this has been implemented correctly, I've added
`test_report_validation_fn_timeout` that a validation task that sleep
forever is cancelled and
`test_trainer_resumption_with_checkpoint_status` that all the
checkpoints are remembered when a trainer fails and restarted.

### Implementation details
The timing out of validation tasks are handled by the
`ValidationManager` where we now record when a task began, if its got a
time out.

The `CheckpointManager` records the validation task's status for each
checkpoint therefore, we need to pass back the updated metrics (if there
are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and
`VALIDATION_TIMEOUT`). Further, we need to record this data to the
manager state for resumption handling so we save to different lists
those that validated successfully, timed out and failed

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants