[hotfix][autoscaler] Request resources refactor2 #12661

Merged
merged 86 commits into from
Dec 9, 2020

Conversation

AmeerHajAli
Contributor

@AmeerHajAli AmeerHajAli commented Dec 8, 2020

Closes #12498, #12005, and #12503.
Refactors request_resources() by "rewriting" each request into its min_workers equivalent.
request_resources() is now handled by adding whatever additional resources are needed when the scheduler computes _add_min_workers_nodes. This makes handling request_resources() similar to maintaining min_workers.
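
To illustrate the "rewriting" idea, here is a minimal sketch and not the code in this PR. It assumes requested bundles are packed against the total capacity of the nodes the cluster will have once min_workers is satisfied, and that extra nodes are added only for bundles that still do not fit. The names `fits`, `first_fit`, and `nodes_to_add`, and the shape of the `node_types` dict, are hypothetical.

```python
from typing import Dict, List

ResourceDict = Dict[str, float]


def fits(available: ResourceDict, bundle: ResourceDict) -> bool:
    """Return True if every resource in the bundle is available."""
    return all(available.get(k, 0) >= v for k, v in bundle.items())


def first_fit(nodes: List[ResourceDict],
              bundles: List[ResourceDict]) -> List[ResourceDict]:
    """Greedily place bundles on nodes; return the bundles that did not fit."""
    free = [dict(n) for n in nodes]
    unfulfilled = []
    for bundle in bundles:
        for node in free:
            if fits(node, bundle):
                for k, v in bundle.items():
                    node[k] -= v
                break
        else:
            unfulfilled.append(bundle)
    return unfulfilled


def nodes_to_add(node_types: Dict[str, dict],
                 current_counts: Dict[str, int],
                 requested_bundles: List[ResourceDict]) -> Dict[str, int]:
    """Hypothetical helper: node counts to launch, per node type."""
    to_add: Dict[str, int] = {}

    # Step 1: honor min_workers for every node type.
    for name, spec in node_types.items():
        missing = spec.get("min_workers", 0) - current_counts.get(name, 0)
        if missing > 0:
            to_add[name] = missing

    def capacity() -> List[ResourceDict]:
        # Total (not free) resources of existing plus to-be-launched nodes.
        nodes = []
        for name, spec in node_types.items():
            count = current_counts.get(name, 0) + to_add.get(name, 0)
            nodes.extend(dict(spec["resources"]) for _ in range(count))
        return nodes

    # Step 2: treat the request like an extra min_workers constraint and
    # add nodes until it fits, respecting max_workers.
    unfulfilled = first_fit(capacity(), requested_bundles)
    while unfulfilled:
        bundle = unfulfilled[0]
        for name, spec in node_types.items():
            total = current_counts.get(name, 0) + to_add.get(name, 0)
            if fits(spec["resources"], bundle) and total < spec.get("max_workers", 0):
                to_add[name] = to_add.get(name, 0) + 1
                break
        else:
            break  # No node type can host this bundle; stop adding nodes.
        unfulfilled = first_fit(capacity(), requested_bundles)
    return to_add
```

With one node type of 4 CPUs, `min_workers=1`, and a request for three 4-CPU bundles on an empty cluster, this sketch asks for three nodes in total: the min_workers node absorbs one bundle, so only two more are added.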

The PR also prioritizes connected nodes, sorted by last use, when deciding which nodes to keep, so that resources for min_workers and request_resources() are always immediately available.
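
The keep-or-terminate decision can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the autoscaler's implementation: it assumes "sorted by last use" means most recently used first, and the function name `split_keep_terminate` and the node dict layout are hypothetical.

```python
from typing import Dict, List, Tuple

ResourceDict = Dict[str, float]


def split_keep_terminate(nodes: List[dict],
                         min_workers: Dict[str, int],
                         requested_bundles: List[ResourceDict],
                         now: float,
                         idle_timeout_s: float) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: decide which connected nodes to keep.

    Each node is a dict with "id", "type", "resources", and "last_used".
    """
    keep: List[str] = []
    terminate: List[str] = []
    kept_per_type: Dict[str, int] = {}
    pending = [dict(b) for b in requested_bundles]

    # Assumption: most recently used nodes are considered first.
    for node in sorted(nodes, key=lambda n: n["last_used"], reverse=True):
        node_type = node["type"]
        needed_for_min = (
            kept_per_type.get(node_type, 0) < min_workers.get(node_type, 0))

        # See which still-pending requested bundles this node could hold.
        free = dict(node["resources"])
        remaining = []
        absorbed_any = False
        for bundle in pending:
            if all(free.get(k, 0) >= v for k, v in bundle.items()):
                for k, v in bundle.items():
                    free[k] -= v
                absorbed_any = True
            else:
                remaining.append(bundle)
        pending = remaining

        idle = now - node["last_used"] > idle_timeout_s
        if needed_for_min or absorbed_any or not idle:
            keep.append(node["id"])
            kept_per_type[node_type] = kept_per_type.get(node_type, 0) + 1
        else:
            terminate.append(node["id"])
    return keep, terminate
```

In this sketch an idle node is terminated only if it is needed neither for min_workers nor for any outstanding request_resources() bundle.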

The PR also includes the code needed to keep the idle nodes that request_resources() requires.
More than 200 LOC are tests.

The PR uncovered multiple bugs in the tests and the autoscaler. Fixing them required addressing race conditions involving the command runner in the tests, and automatically terminating nodes that failed to initialize or update.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@AmeerHajAli AmeerHajAli added the P0 Issues that should be fixed in short order label Dec 8, 2020
@AmeerHajAli AmeerHajAli added this to the Serverless Autoscaling milestone Dec 8, 2020
@wuisawesome
Contributor

Which part is the bug here? This looks really big for a hotfix. Can you also describe what "rewriting" means here?

@AmeerHajAli
Contributor Author

AmeerHajAli commented Dec 8, 2020

My bad, @wuisawesome, I updated the description.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 8, 2020
Contributor

@ericl ericl left a comment


LGTM, please leave TODO comments for followups before merging.

@AmeerHajAli AmeerHajAli removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 9, 2020
@ericl ericl merged commit a4dbb27 into ray-project:master Dec 9, 2020
mfitton pushed a commit that referenced this pull request Dec 10, 2020
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* request_resources -> min workers

* test fixes

* add race condition tests

* Eric

* fixes

* semi final

* semi final

* lint

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Labels
P0 Issues that should be fixed in short order
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[autoscaler] Request_resources and actual actors are counted double
3 participants