
[autoscaler] Flag flip for resource_demand_scheduler should take into account queue #11615

Merged: 67 commits merged into ray-project:master on Nov 2, 2020

Conversation

AmeerHajAli (Contributor):

I increased the max backlog size from 1 to 1M and it did not seem to affect performance much. I have set it to 10k for now; we can change that in the future if necessary.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ericl added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Oct 26, 2020
@AmeerHajAli (Contributor, Author) commented Oct 27, 2020

Here are a few data points (I updated the PR accordingly):

backlog_size = 10k
number of nodes = 2k (1k cpu nodes and 1k gpu nodes)
demands = [{"GPU": 1}, {"CPU":1}] * (backlog_size=10k)
  • The time get_nodes_to_launch takes is 4.6 seconds.
  • Reducing backlog_size to 1k makes it take 0.32 seconds.
  • Reducing the number of nodes to 200 (100 cpu nodes and 100 gpu nodes) makes it take 1.02 seconds.
  • Reducing backlog_size to 1k and the number of nodes to 200 (100 cpu nodes and 100 gpu nodes) makes it take 0.06 seconds.

Bottom line: if we want to be on the safe side (always under 1 second) with 100-1000 nodes, we can set the backlog size to 1k. I also prefer this option since it is less aggressive.
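
For anyone who wants to reproduce the shape of these measurements, here is a rough, self-contained Python sketch. The greedy packing loop is only a stand-in for the autoscaler's get_nodes_to_launch (it is not the actual algorithm), and the per-node capacities are made up; it just illustrates why runtime grows with both the demand vector length and the node count:

```python
import time

# Synthetic inputs mirroring the numbers above. Names and per-node
# capacities are illustrative; this is NOT the Ray autoscaler API.
backlog_size = 1_000          # try 10_000 to mimic the larger run
node_types = [{"CPU": 16} for _ in range(100)] + [{"GPU": 4} for _ in range(100)]
demands = [{"GPU": 1}, {"CPU": 1}] * backlog_size

def greedy_pack(nodes, demands):
    """Place each demand on the first node that still fits it; return
    whatever does not fit. Purely illustrative stand-in for the real
    bin-packing inside get_nodes_to_launch."""
    remaining = [dict(n) for n in nodes]
    unfulfilled = []
    for demand in demands:
        for node in remaining:
            if all(node.get(res, 0) >= qty for res, qty in demand.items()):
                for res, qty in demand.items():
                    node[res] -= qty
                break
        else:
            unfulfilled.append(demand)
    return unfulfilled

start = time.time()
leftover = greedy_pack(node_types, demands)
print(f"{len(leftover)} unfulfilled demands in {time.time() - start:.2f}s")
```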

AmeerHajAli removed the @author-action-required label on Oct 27, 2020
Review threads on python/ray/ray_constants.py and python/ray/monitor.py (outdated, resolved)
ericl added the @author-action-required label on Oct 27, 2020
AmeerHajAli removed the @author-action-required label on Oct 28, 2020
@AmeerHajAli (Contributor, Author) commented:

@ericl @wuisawesome, I added a couple more tests to cover different resource demand vector sizes and different numbers of available nodes. Note that I changed the monitor function to trim not just the backlog bundles but also the waiting_bundles + infeasible_bundles, so that we really bound the execution time of get_nodes_to_launch.
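
For readers following along, a minimal sketch of the kind of trimming described above; the constant and function names are hypothetical, not the actual code in python/ray/monitor.py:

```python
# Hypothetical names, for illustration only; the real change lives in
# python/ray/monitor.py and python/ray/ray_constants.py.
AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE = 1000  # the 1k cap discussed above

def bounded_resource_demands(waiting_bundles, infeasible_bundles, backlog_bundles):
    """Combine every demand source, then cap the total length so that
    get_nodes_to_launch runs in bounded time regardless of queue size."""
    demands = waiting_bundles + infeasible_bundles + backlog_bundles
    return demands[:AUTOSCALER_MAX_RESOURCE_DEMAND_VECTOR_SIZE]
```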

@wuisawesome (Contributor) left a comment:

Hey, we need to block on merging this until we make sure we aren't scaling too aggressively. I will describe this in more detail in my simulator PR.

@ericl (Contributor) commented Oct 28, 2020

> Hey, we need to block on merging this until we make sure we aren't scaling too aggressively. I will describe this in more detail in my simulator PR.

Do you mean beyond the upscaling throttling (e.g., launch at most 5 instances starting off, then allow at most 20% pending launches)?
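
(For context, the throttling rule described here can be sketched roughly as follows; the function name and wiring are illustrative, not the autoscaler's actual code:)

```python
def max_allowed_pending_launches(num_current_nodes: int) -> int:
    """Upscaling throttle as described above: allow at most 5 launches
    when starting from a small cluster, otherwise keep pending launches
    to at most 20% of the current node count (illustrative sketch)."""
    return max(5, int(0.2 * num_current_nodes))

# e.g. 0 running nodes -> up to 5 launches; 100 running -> up to 20.
```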

@wuisawesome (Contributor) commented:

This may actually need some investigation (there's a chance this is still a bug in the simulator). I will get back to y'all on this by EOD.

@AmeerHajAli (Contributor, Author) commented:

Isn't that handled in the max launch concurrency?

ericl added the @author-action-required label on Oct 29, 2020
@ericl (Contributor) commented Oct 30, 2020

Looks good, but let's raise the timeouts to prevent possible test flakiness.

AmeerHajAli added the tests-ok label and removed the @author-action-required label on Nov 1, 2020
ericl merged commit 8d74a04 into ray-project:master on Nov 2, 2020
Labels: tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)
Projects: none yet
Linked issues: none yet
Participants: 3