[autoscaler] Flag flip for resource_demand_scheduler should take into account queue #11615
This reverts commit 818a63a.
Here are a few data points (I updated the PR accordingly):
Bottom line: if we want to be on the safe side (always less than 1 second) with 100-1000 nodes, we can set the backlog size to 1k. I also prefer this option since it is less aggressive.
@ericl @wuisawesome, I added a couple more tests covering different resource demand vector sizes and different numbers of available nodes. Note that I changed the monitor function not to just trim the backlog bundles but also the
Hey, we need to block on merging this until we make sure we aren't scaling too aggressively. I will describe this in more detail in my simulator PR.
Do you mean beyond the upscaling throttling? (e.g., launch at most 5 instances starting off, then at most 20% pending launches)
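For readers following the thread, here is a rough sketch of the throttling rule as described in the comment above. It is not the autoscaler's actual code: the function name `max_new_launches`, its parameters, and the reading of "20%" as a fraction of currently running nodes are all assumptions made for illustration.

```python
def max_new_launches(
    running_nodes: int,
    pending_launches: int,
    initial_cap: int = 5,       # assumed: "at most 5 instances starting off"
    pending_percent: int = 20,  # assumed: "at most 20% pending launches"
) -> int:
    """Hypothetical helper: how many additional nodes may be launched now.

    The allowed number of in-flight launches is the larger of a fixed
    starting cap and a percentage of the nodes that are already running.
    """
    if running_nodes < 0 or pending_launches < 0:
        raise ValueError("node counts must be non-negative")
    # Integer arithmetic avoids floating-point rounding surprises.
    allowed_pending = max(initial_cap, running_nodes * pending_percent // 100)
    # Never return a negative budget if we are already over the limit.
    return max(0, allowed_pending - pending_launches)
```

Under these assumptions an empty cluster may start up to 5 launches, while a 100-node cluster with 15 launches already pending may start 5 more.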
This may actually need some investigating (there's a chance this is a bug in the simulator still). I will get back to y'all on this by EOD.
Isn't that handled in the max launch concurrency?
Looks good, but let's raise the timeouts to prevent possible test flakiness.
I varied the max backlog size from 1 to 1M and it did not seem to affect performance much. I have set it to 10k for now; we can change that in the future if necessary.
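As a minimal illustration of what capping the backlog means, the sketch below truncates a list of pending resource bundles to a maximum size before it is handed to the scheduler. The names `trim_backlog`, `ResourceBundle`, and `max_backlog_size` are hypothetical and are not taken from the PR; only the 10k default comes from the comment above.

```python
from typing import Dict, List

# Assumed shape of a resource demand, e.g. {"CPU": 1, "GPU": 0.5}.
ResourceBundle = Dict[str, float]


def trim_backlog(
    bundles: List[ResourceBundle],
    max_backlog_size: int = 10_000,  # the value mentioned in this thread
) -> List[ResourceBundle]:
    """Hypothetical helper: keep at most max_backlog_size queued bundles.

    Returns a new list and leaves the input untouched, so the caller can
    still report the full queue length elsewhere if needed.
    """
    if max_backlog_size < 0:
        raise ValueError("max_backlog_size must be non-negative")
    return bundles[:max_backlog_size]
```

Bounding the list this way keeps the scheduling pass over the backlog proportional to the cap rather than to the full queue length, which is the trade-off the timing comments in this thread are about.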
Related issue number
Checks
Run scripts/format.sh to lint the changes in this PR.