[autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. #13776
Checkout the comment.
# Avoid launching GPU nodes if there aren't any GPU tasks at all. Note that
# if there *is* a GPU task, then CPU tasks can be scheduled as well.
if is_gpu_node and not any_gpu_task:
One issue here is that if all the available node types are GPU nodes, you will never scale up when you have only CPU tasks. Instead, the user will see the "The autoscaler could not find a node type to satisfy the ..." message, which is bad.
Please fix that before merging this PR.
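To make the concern concrete, here is a minimal sketch of the check under discussion. It is not the PR's actual code: `launchable_node_types`, the `ResourceDict` alias, and the shape of `node_types` and `demands` are assumptions for illustration; only the `is_gpu_node and not any_gpu_task` condition comes from the diff above.

```python
from typing import Dict, List

# Hypothetical alias: maps a resource name (e.g. "CPU", "GPU") to a quantity.
ResourceDict = Dict[str, float]


def launchable_node_types(
    node_types: Dict[str, ResourceDict],
    demands: List[ResourceDict],
) -> List[str]:
    """Return the node type names that survive the GPU-avoidance check."""
    # True if at least one pending task requests a GPU.
    any_gpu_task = any(d.get("GPU", 0) > 0 for d in demands)
    allowed = []
    for name, resources in node_types.items():
        is_gpu_node = resources.get("GPU", 0) > 0
        # Same condition as in the diff: skip GPU nodes when no task needs one.
        if is_gpu_node and not any_gpu_task:
            continue
        allowed.append(name)
    return allowed


# Illustrative config in which every node type has a GPU.
gpu_only = {"gpu_small": {"CPU": 8, "GPU": 1}, "gpu_large": {"CPU": 64, "GPU": 4}}
cpu_demands = [{"CPU": 1}, {"CPU": 2}]
no_candidates = launchable_node_types(gpu_only, cpu_demands)
```

With a GPU-only config and CPU-only demands, `no_candidates` is empty, which is the situation where the autoscaler has nothing to launch and falls through to the error message.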
I thought about that, and think in this case this is the right behavior. The user should be adding some CPU nodes to their config, rather than Ray auto launching expensive GPU nodes.
I don't agree with this policy. If there are no CPU-only nodes available, launching a node with a GPU seems reasonable. For example, an r5dn.16xlarge costs $5.20/hr while a g3.16xlarge costs $4.56/hr for the same number of CPUs, so it would actually be cheaper to get a GPU in this case.
I obviously cherry-picked instances here, but the point is that I don't think the price difference is extreme enough to warrant this policy.
If we do implement this, can we feature flag it? My example is at least one good use case where you'd want to turn this policy off.
Sure, I added a feature flag. However, I don't think this will be a popular flag.
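For readers following along, gating the check behind a flag could look roughly like the sketch below. The environment variable name `EXAMPLE_CONSERVE_GPU_NODES` and both helper functions are hypothetical and are not taken from the PR; the thread does not say how the real flag is named or read.

```python
import os
from typing import Dict, Mapping, Optional


def conserve_gpu_nodes_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a hypothetical on-by-default flag; setting it to "0" disables it."""
    env = os.environ if env is None else env
    return env.get("EXAMPLE_CONSERVE_GPU_NODES", "1") != "0"


def should_skip_node_type(
    resources: Dict[str, float],
    any_gpu_task: bool,
    conserve_gpu_nodes: bool,
) -> bool:
    """Skip a GPU node type only when the policy is on and no task needs a GPU."""
    is_gpu_node = resources.get("GPU", 0) > 0
    return conserve_gpu_nodes and is_gpu_node and not any_gpu_task


gpu_node = {"CPU": 64, "GPU": 4}
skip_with_policy = should_skip_node_type(gpu_node, any_gpu_task=False, conserve_gpu_nodes=True)
skip_without_policy = should_skip_node_type(gpu_node, any_gpu_task=False, conserve_gpu_nodes=False)
```

Turning the flag off restores the behavior requested above: a GPU node type stays eligible for CPU-only workloads, which covers cases like the r5dn.16xlarge versus g3.16xlarge pricing example.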
Still against this policy, but this seems like a reasonable way to implement it.
… tasks. (ray-project#13776) * wip * avoid gpus * update * update
… has CPU tasks. (ray-project#13776)" This reverts commit 7ad571e.
This issue was reported by @robertnishihara dogfooding a numpy workload.