
[autoscaler] Flag flip for resource_demand_scheduler should take into account queue #11615

Merged: 67 commits merged into ray-project:master on Nov 2, 2020

Conversation

AmeerHajAli (Contributor):

I increased the max backlog size from 1 to 1M and it did not seem to affect performance much. I have set it to 10k for now; we can change that in the future if necessary.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ericl added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Oct 26, 2020
@AmeerHajAli (Contributor, Author) commented Oct 27, 2020

Here are a few data points (I updated the PR accordingly):

backlog_size = 10k
number of nodes = 2k (1k cpu nodes and 1k gpu nodes)
demands = [{"GPU": 1}, {"CPU":1}] * (backlog_size=10k)
  • The time get_nodes_to_launch takes is 4.6 seconds.
  • Reducing backlog_size to 1k makes it take 0.32 seconds.
  • Reducing the number of nodes to 200 (100 cpu nodes and 100 gpu nodes) makes it take 1.02 seconds.
  • Reducing backlog_size to 1k and the number of nodes to 200 (100 cpu nodes and 100 gpu nodes) makes it take 0.06 seconds.

Bottom line: if we want to be on the safe side (always under 1 second) with 100-1000 nodes, we can set the backlog size to 1k. I also prefer this option since it is less aggressive.
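
For anyone who wants to reproduce the shape of these measurements, here is a rough, self-contained Python sketch. The greedy packing loop is only a stand-in for the autoscaler's get_nodes_to_launch (it is not the actual algorithm), and the per-node capacities are made up; it just illustrates why runtime grows with both the demand vector length and the node count:

```python
import time

# Synthetic inputs mirroring the numbers above. Names and per-node
# capacities are illustrative; this is NOT the Ray autoscaler API.
backlog_size = 1_000          # try 10_000 to mimic the larger run
node_types = [{"CPU": 16} for _ in range(100)] + [{"GPU": 4} for _ in range(100)]
demands = [{"GPU": 1}, {"CPU": 1}] * backlog_size

def greedy_pack(nodes, demands):
    """Place each demand on the first node that still fits it; return
    whatever does not fit. Purely illustrative stand-in for the real
    bin-packing inside get_nodes_to_launch."""
    remaining = [dict(n) for n in nodes]
    unfulfilled = []
    for demand in demands:
        for node in remaining:
            if all(node.get(res, 0) >= qty for res, qty in demand.items()):
                for res, qty in demand.items():
                    node[res] -= qty
                break
        else:
            unfulfilled.append(demand)
    return unfulfilled

start = time.time()
leftover = greedy_pack(node_types, demands)
print(f"{len(leftover)} unfulfilled demands in {time.time() - start:.2f}s")
```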

AmeerHajAli removed the @author-action-required label on Oct 27, 2020
Review threads on python/ray/ray_constants.py and python/ray/monitor.py (outdated, resolved)
ericl added the @author-action-required label on Oct 27, 2020
AmeerHajAli removed the @author-action-required label on Oct 28, 2020
@AmeerHajAli (Contributor, Author) commented:

@ericl @wuisawesome, I added a couple more tests to cover different resource demand vector sizes and different numbers of available nodes. Note that I changed the monitor function to trim not just the backlog bundles but also the waiting_bundles + infeasible_bundles, so that we really bound the execution time of get_nodes_to_launch.
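
For readers following along, a minimal sketch of the kind of trimming described above; the constant and function names are hypothetical, not the actual code in python/ray/monitor.py:

```python
# Hypothetical names, for illustration only; the real change lives in
# python/ray/monitor.py and python/ray/ray_constants.py.
AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000  # the 1k cap discussed above

def bounded_resource_demands(waiting_bundles, infeasible_bundles, backlog_bundles):
    """Combine every demand source, then cap the total length so that
    get_nodes_to_launch runs in bounded time regardless of queue size."""
    demands = waiting_bundles + infeasible_bundles + backlog_bundles
    return demands[:AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE]
```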

@wuisawesome (Contributor) left a comment:

Hey, we need to block on merging this until we make sure we aren't scaling too aggressively. I will describe this in more detail in my simulator PR.

@ericl (Contributor) commented Oct 28, 2020

> Hey, we need to block on merging this until we make sure we aren't scaling too aggressively. I will describe this in more detail in my simulator PR.

Do you mean beyond the upscaling throttling (e.g., launch at most 5 instances starting off, then allow at most 20% pending launches)?
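
(For context, the throttling rule described here can be sketched roughly as follows; the function name and wiring are illustrative, not the autoscaler's actual code:)

```python
def max_allowed_pending_launches(num_current_nodes: int) -> int:
    """Upscaling throttle as described above: allow at most 5 launches
    when starting from a small cluster, otherwise keep pending launches
    to at most 20% of the current node count (illustrative sketch)."""
    return max(5, int(0.2 * num_current_nodes))

# e.g. 0 running nodes -> up to 5 launches; 100 running -> up to 20.
```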

@wuisawesome (Contributor) commented:

This may actually need some investigation (there's a chance this is still a bug in the simulator). I will get back to y'all on this by EOD.

@AmeerHajAli (Contributor, Author) commented:

Isn't that handled in the max launch concurrency?

ericl added the @author-action-required label on Oct 29, 2020
@ericl (Contributor) commented Oct 30, 2020

Looks good, but let's raise the timeouts to prevent possible test flakiness.

AmeerHajAli added the tests-ok label and removed the @author-action-required label on Nov 1, 2020
ericl merged commit 8d74a04 into ray-project:master on Nov 2, 2020
Labels: tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)
Projects: none yet
Linked issues: none yet
Participants: 3