[autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. #13776

Merged
merged 4 commits into from
Jan 29, 2021

@ericl ericl commented Jan 29, 2021

This issue was reported by @robertnishihara dogfooding a numpy workload.

@AmeerHajAli AmeerHajAli left a comment

Check out the comment.


# Avoid launching GPU nodes if there aren't any GPU tasks at all. Note that
# if there *is* a GPU task, then CPU tasks can be scheduled as well.
if is_gpu_node and not any_gpu_task:
Contributor

One issue here is that if all the available node types are GPU nodes, you will never scale up when you have only CPU tasks. Instead, the user will see the "The autoscaler could not find a node type to satisfy the ..." message, which is bad.
Please fix that before merging this PR.
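
To make the concern concrete, here is a minimal sketch of the filtering rule. It is not the autoscaler's actual code: `candidate_node_types`, the resource dictionaries, and the node type names are hypothetical, and only the `is_gpu_node and not any_gpu_task` condition comes from the snippet under review.

```python
from typing import Dict, List

# Hypothetical shape: node type name -> resources that node type provides.
NodeTypes = Dict[str, Dict[str, float]]


def candidate_node_types(node_types: NodeTypes,
                         demands: List[Dict[str, float]]) -> List[str]:
    """Return the node types still eligible for launch under the rule."""
    any_gpu_task = any(d.get("GPU", 0) > 0 for d in demands)
    eligible = []
    for name, resources in node_types.items():
        is_gpu_node = resources.get("GPU", 0) > 0
        # Same condition as the snippet under review.
        if is_gpu_node and not any_gpu_task:
            continue
        eligible.append(name)
    return eligible


cpu_only_demand = [{"CPU": 1}, {"CPU": 2}]
gpu_only_config = {
    "gpu_small": {"CPU": 8, "GPU": 1},
    "gpu_large": {"CPU": 64, "GPU": 4},
}
mixed_config = {
    "cpu_worker": {"CPU": 16},
    "gpu_small": {"CPU": 8, "GPU": 1},
}

# With only GPU node types configured, nothing is left to launch.
print(candidate_node_types(gpu_only_config, cpu_only_demand))  # []
# With a CPU node type configured, it is the only candidate.
print(candidate_node_types(mixed_config, cpu_only_demand))  # ['cpu_worker']
```

In the first case every node type is filtered out, which is the situation where the user would see the "could not find a node type" message.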

Contributor Author

I thought about that, and I think this is the right behavior in this case. The user should add some CPU nodes to their config, rather than have Ray automatically launch expensive GPU nodes.

@wuisawesome wuisawesome left a comment

I don't agree with this policy. If there are no CPU-only nodes available, launching a node with a GPU seems reasonable. For example, an r5dn.16xlarge costs $5.20/hr while a g3.16xlarge costs $4.56/hr for the same number of CPUs, so it would actually be cheaper to get a GPU in this case.

I obviously cherry-picked instances here, but the point is that I don't think the price difference is extreme enough to warrant this policy.
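
A quick sketch of that arithmetic, using the two hourly prices quoted above. The vCPU count of 64 for both 16xlarge sizes is an assumption added for illustration, and the helper names are hypothetical.

```python
# Hourly prices quoted in the comment above (USD).
R5DN_16XLARGE_PER_HOUR = 5.20  # CPU-only instance
G3_16XLARGE_PER_HOUR = 4.56    # GPU instance

# Assumption (not stated in the thread): both sizes expose 64 vCPUs.
VCPUS = 64


def hourly_savings(cpu_price: float, gpu_price: float) -> float:
    """Dollars per hour saved by choosing the GPU instance (negative if pricier)."""
    return round(cpu_price - gpu_price, 2)


def price_per_vcpu(price: float, vcpus: int) -> float:
    """Hourly price divided evenly across vCPUs."""
    return price / vcpus


savings = hourly_savings(R5DN_16XLARGE_PER_HOUR, G3_16XLARGE_PER_HOUR)
print(f"GPU instance is ${savings:.2f}/hr cheaper")
print(f"r5dn: ${price_per_vcpu(R5DN_16XLARGE_PER_HOUR, VCPUS):.4f} per vCPU-hour")
print(f"g3:   ${price_per_vcpu(G3_16XLARGE_PER_HOUR, VCPUS):.4f} per vCPU-hour")
```

For this particular pair the GPU instance comes out about $0.64/hr cheaper, which is the cherry-picked case the comment describes.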

If we do implement this, can we feature flag it? My example is at least one good use case where you'd want to turn this policy off.

ericl commented Jan 29, 2021

Sure, I added a feature flag. However, I don't think this will be a popular flag.
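
For readers following along, here is a rough sketch of how such a flag could gate the check. It is an illustration only: the environment variable name `EXAMPLE_CONSERVE_GPU_NODES`, the default of "on", and both helper functions are assumptions, not the names or behavior used in this PR.

```python
import os
from typing import Mapping, Optional


def conserve_gpu_nodes_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the hypothetical flag; the policy is assumed to be on by default."""
    env = os.environ if env is None else env
    return env.get("EXAMPLE_CONSERVE_GPU_NODES", "1") != "0"


def should_skip_node_type(is_gpu_node: bool, any_gpu_task: bool,
                          conserve_gpu_nodes: bool) -> bool:
    """Skip GPU node types for CPU-only workloads, but only when the flag is on."""
    return conserve_gpu_nodes and is_gpu_node and not any_gpu_task


# Flag on (assumed default): GPU nodes are skipped for a CPU-only workload.
print(should_skip_node_type(True, False, conserve_gpu_nodes_enabled({})))
# Flag off: GPU nodes remain candidates even with no GPU tasks.
print(should_skip_node_type(
    True, False, conserve_gpu_nodes_enabled({"EXAMPLE_CONSERVE_GPU_NODES": "0"})))
```

Turning the flag off restores the previous behavior, which covers the case where a GPU instance happens to be the cheaper way to get CPUs.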

@wuisawesome wuisawesome left a comment

Still against this policy, but this seems like a reasonable way to implement it.

@ericl ericl merged commit b20a38f into ray-project:master Jan 29, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021