[autoscaler] Add an aggressive_autoscaling flag #4285
Conversation
Looking forward to testing the dashboard again with the binding set to 0.0.0.0; hoping I can get that to work through nginx to a service on Kubernetes.
python/ray/autoscaler/autoscaler.py
Outdated
@@ -53,6 +53,10 @@
     # The number of workers to launch initially, in addition to the head node.
     "initial_workers": (int, OPTIONAL),

+    # Whether or not to scale aggressively, e.g. to jump back to at least
+    # initial_workers if we're ever below it and are scaling up
+    "aggressive_autoscaling": (bool, OPTIONAL),
In anticipation of further modes, how about "mode": "aggressive"?
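For illustration, a string-valued mode key could be validated along these lines. This is a hypothetical sketch of the suggestion, not code from this PR; the helper name and mode names are assumptions:

```python
# Hypothetical sketch: a string "mode" key instead of a boolean flag,
# leaving room for future scaling modes without one flag per mode.
VALID_MODES = ("default", "aggressive")

def autoscaling_mode(config):
    """Return the configured scaling mode, defaulting to "default"."""
    mode = config.get("mode", "default")
    if mode not in VALID_MODES:
        raise ValueError("Unknown autoscaling mode: {!r}".format(mode))
    return mode

print(autoscaling_mode({"mode": "aggressive"}))  # aggressive
print(autoscaling_mode({}))                      # default
```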
@ls-daniel This should be resolved before we merge this. I agree with @ericl's suggestion.
This is the only outstanding comment I think?
Oops. Right, I'll change this soon; my ability to test changes is broken at the moment.
    ideal_num_workers = max(ideal_num_workers, initial_workers)
elif aggressive and ideal_num_workers >= 0:
    # If we want any workers, we want at least initial_workers
    ideal_num_workers = max(ideal_num_workers, initial_workers)
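The branch above can be read as a small pure function. Here is a simplified stand-alone sketch, not the actual autoscaler code: the real condition sits inside a larger if/elif chain, and I use a strictly-positive check here so an idle cluster stays at zero:

```python
def target_workers(ideal_num_workers, initial_workers, aggressive):
    """Simplified sketch of the aggressive scaling rule: if any workers
    are wanted at all, jump straight to at least initial_workers."""
    if aggressive and ideal_num_workers > 0:
        return max(ideal_num_workers, initial_workers)
    return ideal_num_workers

# With aggressive scaling, a single busy worker triggers a jump to
# initial_workers; with the flag off, growth is unchanged.
print(target_workers(1, 10, aggressive=True))   # 10
print(target_workers(1, 10, aggressive=False))  # 1
```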
Could we add a unit test for this mode?
@ls-daniel are we planning to add a test for this in this PR? I would prefer to have a test added if that's possible.
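Such a test could be fairly small. A hypothetical pytest-style sketch follows; the target_workers helper is an inlined simplification so the file runs stand-alone, not the real autoscaler API:

```python
def target_workers(ideal, initial, aggressive):
    # Inlined simplification of the rule under review (hypothetical
    # helper, not the real API): when aggressive and any workers are
    # wanted, jump to at least the initial_workers count.
    if aggressive and ideal > 0:
        return max(ideal, initial)
    return ideal

def test_aggressive_jumps_to_initial_workers():
    assert target_workers(1, 8, aggressive=True) == 8

def test_aggressive_idle_cluster_stays_idle():
    assert target_workers(0, 8, aggressive=True) == 0

def test_flag_off_leaves_growth_unchanged():
    assert target_workers(1, 8, aggressive=False) == 1
```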
Test FAILed.
python/ray/scripts/scripts.py
Outdated
@@ -334,9 +335,6 @@ def start(node_ip_address, redis_address, redis_port, num_redis_shards,
     if redis_max_clients is not None:
         raise Exception("If --head is not passed in, --redis-max-clients "
                         "must not be provided.")
-    if include_webui:
-        raise Exception("If --head is not passed in, the --include-webui "
Why remove this exception?
I guess you could start a dashboard on any (or every machine), right?
Starting the reporter process is gated behind include_webui.
This exception makes it impossible to start the reporters on the worker nodes.
Good catch. I just pushed a small change to remove the gating. That seemed like the right approach to me (instead of passing in --include-webui everywhere). In principle we should be able to start the actual dashboard on any machine, so it could make sense to remove this exception, but we can do that later, I suppose.
Tests failing with
Can one of the admins verify this patch?
Jenkins, ok to test.
Test PASSed.
@ericl @hartikainen can this be merged, or are there unit tests that should be added?
Agreed, I'll take a look at this when I have some time later.
I've rebased on master. A few fixes have been made to close file descriptors that the tests were warning about. The test failures look unrelated to this change to me; I've written a testAggressiveAutoscaling case which passes. I've just pushed a change to fix flake8.

On the lint failures: I don't understand the procedure for formatting. As before, when I run format.sh, it runs around reformatting files everywhere. A different version of yapf? A different .style.yapf? I'm baffled by this. Example change it makes, completely unrelated to this PR:
Any ideas? Can we merge this, as it looks like the test failures are unrelated?
Now? Tests seem to be failing due to unrelated issues.
This modifies the behaviour of the autoscaler so that, when enabled, it scales up more aggressively.
It's fairly simple: if the flag is on and the cluster is ever not idle, it spins up at least initial_workers worth of nodes straight away, so it doesn't have to slowly grind its way up from 1, to 2, to 3, and so on.
There are also some minor fixes in here:
- The dashboard/reporter was broken (--include-webui didn't work and its help string was wrong).
- The dashboard didn't bind to 0.0.0.0 by default (now it does).
This fixes #4319.
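For reference, with the flag as named in the diff in this PR, a config fragment enabling the behaviour might look roughly like this. The values are illustrative; only initial_workers and the new aggressive_autoscaling key come from the diff, the other keys are assumptions:

```python
# Illustrative autoscaler config fragment (values are made up; only
# "initial_workers" and the new "aggressive_autoscaling" key come from
# the diff in this PR).
config = {
    "min_workers": 0,
    "max_workers": 20,
    "initial_workers": 10,
    # New flag: when scaling up a non-idle cluster, jump straight to at
    # least initial_workers instead of adding nodes one at a time.
    "aggressive_autoscaling": True,
}

print(config["aggressive_autoscaling"])  # True
```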