[Train] Add timeouts to validation functions of ray.train.report#62916
Conversation
Signed-off-by: Mark Towers <mark@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces timeout and failure handling for asynchronous checkpoint validation. Key changes include the addition of VALIDATION_TIMEOUT and VALIDATION_FAILED statuses, updates to the CheckpointManager to track and persist these terminal states, and logic in the ValidationManager to proactively cancel tasks that exceed their defined timeout_s. Feedback focuses on a logic error in _resolve_timeout_s where a value of -1 (intended to disable timeouts) would lead to immediate task cancellation, and an improvement opportunity to process all finished validations in a loop to avoid potential bottlenecks.
Signed-off-by: Mark Towers <mark@anyscale.com>
…-review Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
TimothySeah
left a comment
There was a problem hiding this comment.
LGTM - probably just needs 1 more pass after this.
|
|
||
| # Assert checkpoint state after all tasks are done | ||
| checkpoint_manager.update_checkpoints_with_metrics( | ||
| checkpoint_manager.update_checkpoints_with_validation_result( |
There was a problem hiding this comment.
Should we also add test(s) the call update_checkpoints_with_validation_result with other ReportedCheckpointStatuses?
There was a problem hiding this comment.
Possibly but I prefer E2E tests like those in test_async_checkpoint_validation rather than component specific tests like these. Happy to add if you think necessary or tests something different
There was a problem hiding this comment.
Fair enough, I agree that the component-specific tests can feel repetitive e.g. test_validation_manager.test_checkpoint_validation_management_reordering is basically the same as test_checkpointmanager.test_pending_checkpoint_management which is basically the same as the e2e tests in test_async_checkpointing_validation.py
| assert vm._kick_off_validations() == 0 | ||
| checkpoint_manager.update_checkpoints_with_metrics.assert_called_with( | ||
| {training_reports[2].checkpoint: {"score": 2}} | ||
| checkpoint_manager.update_checkpoints_with_validation_result.assert_called_with( |
There was a problem hiding this comment.
Might be worth adding a quick test with ReportedCheckpointStatus.TIMEOUT
There was a problem hiding this comment.
See comment above and I don't think it would fit into this test and would need to be a new test on this
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
|
|
||
| # Assert checkpoint state after all tasks are done | ||
| checkpoint_manager.update_checkpoints_with_metrics( | ||
| checkpoint_manager.update_checkpoints_with_validation_result( |
There was a problem hiding this comment.
Fair enough, I agree that the component-specific tests can feel repetitive e.g. test_validation_manager.test_checkpoint_validation_management_reordering is basically the same as test_checkpointmanager.test_pending_checkpoint_management which is basically the same as the e2e tests in test_async_checkpointing_validation.py
Signed-off-by: Mark Towers <mark@anyscale.com>
| def _resolve_timeout_s( | ||
| self, validation: Union[bool, ValidationTaskConfig] | ||
| ) -> Optional[float]: | ||
| """Resolve the effective timeout for a validation task. | ||
|
|
||
| Per-task ``timeout_s=None`` falls back to the ValidationConfig default. | ||
| ``timeout_s=-1`` disables the timeout even if a default is set. | ||
| """ | ||
| if ( | ||
| isinstance(validation, ValidationTaskConfig) | ||
| and validation.timeout_s is not None | ||
| ): | ||
| return None if validation.timeout_s < 0 else validation.timeout_s | ||
|
|
||
| default_timeout_s = self._validation_config.task_config.timeout_s | ||
| if default_timeout_s is None or default_timeout_s < 0: | ||
| return None | ||
| return default_timeout_s |
There was a problem hiding this comment.
Do we need this validation logic with checks for -1? If validation is a ValidationTaskConfig could we always just use what is set there and ignore the default?
There was a problem hiding this comment.
Was thinking something like this:
| def _resolve_timeout_s( | |
| self, validation: Union[bool, ValidationTaskConfig] | |
| ) -> Optional[float]: | |
| """Resolve the effective timeout for a validation task. | |
| Per-task ``timeout_s=None`` falls back to the ValidationConfig default. | |
| ``timeout_s=-1`` disables the timeout even if a default is set. | |
| """ | |
| if ( | |
| isinstance(validation, ValidationTaskConfig) | |
| and validation.timeout_s is not None | |
| ): | |
| return None if validation.timeout_s < 0 else validation.timeout_s | |
| default_timeout_s = self._validation_config.task_config.timeout_s | |
| if default_timeout_s is None or default_timeout_s < 0: | |
| return None | |
| return default_timeout_s | |
| def _resolve_timeout_s( | |
| self, validation: Union[bool, ValidationTaskConfig] | |
| ) -> Optional[float]: | |
| """Resolve the effective timeout for a validation task. | |
| """ | |
| if isinstance(validation, ValidationTaskConfig): | |
| return validation.timeout_s | |
| return self._validation_config.task_config.timeout_s |
There was a problem hiding this comment.
I think your right, I was over thinking the solution to where users want to use a ValidationTaskConfig but not overwrite the default validation task config however I think if a user specifies a task config then we should just rely on it completely
Will update to your solution
Signed-off-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 0444a4f. Configure here.
Signed-off-by: Mark Towers <mark@anyscale.com>
…ay-project#62916) ## Description For long-running validation tasks its possible that users want the ability to time them out. This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to configure the timeout length either per validation task or for all tasks of a trainer. To help users spot when a task has timed out then `ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have been added. To check that this has been implemented correctly, I've added `test_report_validation_fn_timeout` that a validation task that sleep forever is cancelled and `test_trainer_resumption_with_checkpoint_status` that all the checkpoints are remembered when a trainer fails and restarted. ### Implementation details The timing out of validation tasks are handled by the `ValidationManager` where we now record when a task began, if its got a time out. The `CheckpointManager` records the validation task's status for each checkpoint therefore, we need to pass back the updated metrics (if there are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and `VALIDATION_TIMEOUT`). Further, we need to record this data to the manager state for resumption handling so we save to different lists those that validated successfully, timed out and failed --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>
…ay-project#62916) ## Description For long-running validation tasks its possible that users want the ability to time them out. This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to configure the timeout length either per validation task or for all tasks of a trainer. To help users spot when a task has timed out then `ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have been added. To check that this has been implemented correctly, I've added `test_report_validation_fn_timeout` that a validation task that sleep forever is cancelled and `test_trainer_resumption_with_checkpoint_status` that all the checkpoints are remembered when a trainer fails and restarted. ### Implementation details The timing out of validation tasks are handled by the `ValidationManager` where we now record when a task began, if its got a time out. The `CheckpointManager` records the validation task's status for each checkpoint therefore, we need to pass back the updated metrics (if there are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and `VALIDATION_TIMEOUT`). Further, we need to record this data to the manager state for resumption handling so we save to different lists those that validated successfully, timed out and failed --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
…ay-project#62916) ## Description For long-running validation tasks its possible that users want the ability to time them out. This PR adds `ValidationTaskConfig.timeout_s: Optional[float]` to configure the timeout length either per validation task or for all tasks of a trainer. To help users spot when a task has timed out then `ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILED` have been added. To check that this has been implemented correctly, I've added `test_report_validation_fn_timeout` that a validation task that sleep forever is cancelled and `test_trainer_resumption_with_checkpoint_status` that all the checkpoints are remembered when a trainer fails and restarted. ### Implementation details The timing out of validation tasks are handled by the `ValidationManager` where we now record when a task began, if its got a time out. The `CheckpointManager` records the validation task's status for each checkpoint therefore, we need to pass back the updated metrics (if there are any), the validation's status (`VALIDATED`, `VALIDATION_FAILED` and `VALIDATION_TIMEOUT`). Further, we need to record this data to the manager state for resumption handling so we save to different lists those that validated successfully, timed out and failed --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com>

Description
For long-running validation tasks its possible that users want the ability to time them out.
This PR adds
ValidationTaskConfig.timeout_s: Optional[float]to configure the timeout length either per validation task or for all tasks of a trainer.To help users spot when a task has timed out then
ReportedCheckpointStatus.VALIDATION_TIMEOUT / VALIDATION_FAILEDhave been added.To check that this has been implemented correctly, I've added
test_report_validation_fn_timeoutthat a validation task that sleep forever is cancelled andtest_trainer_resumption_with_checkpoint_statusthat all the checkpoints are remembered when a trainer fails and restarted.Implementation details
The timing out of validation tasks are handled by the
ValidationManagerwhere we now record when a task began, if its got a time out.The
CheckpointManagerrecords the validation task's status for each checkpoint therefore, we need to pass back the updated metrics (if there are any), the validation's status (VALIDATED,VALIDATION_FAILEDandVALIDATION_TIMEOUT). Further, we need to record this data to the manager state for resumption handling so we save to different lists those that validated successfully, timed out and failed