[Core] Allow us to configure new memory and cpu config upon subsequent retries #45257
Labels
core
Issues that should be addressed in Ray Core
enhancement
Request for new feature and/or capability
P2
Important issue, but not time-critical
Description
`ray.remote` takes in `memory`, `num_cpus`, `max_retries`, and `retry_exceptions`. We have seen that the most common cause of task failures is out-of-memory (OOM) errors. If we retry such a task with the same memory config, it will fail again. Hence, a feature to change the resource config on subsequent retries of a task would be tremendously useful. We already have a workaround that does this on our side, but this is a generic improvement that would benefit Ray users and simplify our code.
Use case
We will be able to overcome out-of-memory errors by retrying tasks with increased memory on each subsequent retry.
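To make the request concrete, here is a minimal sketch of the kind of caller-side workaround described above. It is not the workaround from this issue and not an existing Ray API: `run_with_escalating_resources`, `fake_task`, and the resource schedule are hypothetical names made up for illustration, and `MemoryError` stands in for whatever exception signals an OOM in a real cluster.

```python
from typing import Any, Callable, Iterable, Mapping, Tuple, Type


def run_with_escalating_resources(
    submit: Callable[[Mapping[str, Any]], Any],
    resource_schedule: Iterable[Mapping[str, Any]],
    retry_on: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Any:
    """Call submit(resources) once per schedule entry until one succeeds.

    Hypothetical helper: each entry in resource_schedule is the resource
    config for one attempt, typically larger than the previous one.
    """
    last_error = None
    for resources in resource_schedule:
        try:
            return submit(resources)
        except retry_on as err:
            # Remember the failure and move on to the next, larger config.
            last_error = err
    if last_error is None:
        raise ValueError("resource_schedule must not be empty")
    # Every config in the schedule failed: surface the last error.
    raise last_error


# Stand-in for a remote task (assumption: it needs at least 2 GiB to finish).
def fake_task(resources: Mapping[str, Any]) -> str:
    if resources["memory"] < 2 * 1024**3:
        raise MemoryError("simulated OOM")
    return f"ok with {resources['memory']} bytes"


GIB = 1024**3
schedule = [
    {"memory": 1 * GIB, "num_cpus": 1},  # first attempt
    {"memory": 2 * GIB, "num_cpus": 1},  # first retry: double the memory
    {"memory": 4 * GIB, "num_cpus": 2},  # second retry: more memory and CPU
]

attempts = []


def submit(resources: Mapping[str, Any]) -> str:
    attempts.append(dict(resources))  # record which configs were tried
    return fake_task(resources)


result = run_with_escalating_resources(submit, schedule)
print(result, "after", len(attempts), "attempts")
```

With Ray, `submit` could plausibly be something like `lambda r: ray.get(my_task.options(**r).remote(arg))`, with `retry_on` set to Ray's OOM exception type and `max_retries=0` on the task so that Ray does not first retry with the unchanged config. That wiring is an assumption and is untested here. The feature requested in this issue would move this loop into Ray itself.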