
[Core] Allow us to configure new memory and cpu config upon subsequent retries #45257

Open
raghumdani opened this issue May 10, 2024 · 2 comments
Labels
core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability P2 Important issue, but not time-critical

Comments

@raghumdani

Description

ray.remote takes memory, num_cpus, max_retries, and retry_exceptions. In our experience, the most common cause of task failures is OOM. If a task is retried with the same memory config, it will fail again. A way to change the resource config on subsequent retries of a task would therefore be tremendously useful. We already have our own workaround, but this is a generic improvement that would benefit Ray users and simplify code on our side.

Use case

We would be able to overcome out-of-memory errors by increasing a task's memory request on each subsequent retry.
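For illustration, a user-side workaround of the kind mentioned above could look roughly like the sketch below. It is not part of Ray and not the author's actual code: `retry_with_more_memory`, `submit`, `growth_factor`, and the doubling policy are hypothetical names and choices made for this example. The helper only assumes a callable that runs the task with a given memory request and raises on an out-of-memory failure.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_more_memory(
    submit: Callable[[int], T],
    initial_memory: int,
    max_retries: int = 3,
    growth_factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> T:
    """Call submit(memory_bytes), growing the request after each OOM-style failure.

    Hypothetical helper: `submit` is expected to run one attempt of the task
    with the given memory request and to raise one of `retry_on` on failure.
    """
    memory = initial_memory
    for attempt in range(max_retries + 1):
        try:
            return submit(memory)
        except retry_on:
            if attempt == max_retries:
                # Retries exhausted: surface the last failure to the caller.
                raise
            # Ask for more memory on the next attempt.
            memory = int(memory * growth_factor)
    raise AssertionError("unreachable")
```

With Ray, `submit` might be something like `lambda mem: ray.get(process.options(memory=mem, max_retries=0).remote(path))`, with `retry_on=(ray.exceptions.OutOfMemoryError,)`. Here `process` and `path` are placeholders, and the exact exception type raised on an OOM kill should be checked against the Ray version in use.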

@raghumdani raghumdani added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 10, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 13, 2024
@rynewang
Contributor

Note that memory and CPU are logical resources: even if you declare a small memory requirement for a task, it can still use all the memory on the machine.

Would you mind sharing your use case again? Do you have a dynamic amount of resources needed for each task? If not, you can typically benchmark your memory usage and set a large enough value for it, and it will schedule well. We don't have plans for "auto-piloting" resource usage for now.

@rynewang rynewang added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
@raghumdani
Author

raghumdani commented May 20, 2024

Do you have dynamic amount of resources needed for each task?

Yes, memory requirements are calculated from the sizes of the files each task reads. We also have different types of tasks that require different amounts of memory. We often underestimate memory, so we have to retry the same task with a larger memory requirement the next time.
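As a rough sketch of what sizing a task from its input files can look like: the function below is hypothetical, and `expansion_factor` and `base_overhead` are made-up tuning knobs rather than values from this thread. Underestimates happen when the real in-memory expansion exceeds the assumed factor.

```python
from typing import Iterable


def estimate_memory_bytes(
    file_sizes: Iterable[int],
    expansion_factor: float = 3.0,
    base_overhead: int = 256 * 1024**2,
) -> int:
    """Estimate a task's memory request from the on-disk sizes of its inputs.

    Assumptions (illustrative only): data expands by `expansion_factor` once
    loaded into memory, and each task needs a fixed `base_overhead` on top.
    """
    total_on_disk = sum(file_sizes)
    return int(total_on_disk * expansion_factor) + base_overhead
```

The sizes could come from `os.path.getsize(path)` for local files. The resulting estimate would then be passed as the task's memory request, for example through `.options(memory=...)` on a Ray remote function.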
