
[autoscaler] Add an aggressive_autoscaling flag #4285

Merged
17 commits merged on Apr 14, 2019

Conversation

ls-daniel
Contributor

@ls-daniel ls-daniel commented Mar 6, 2019

This modifies the behaviour of the autoscaler so that, when the flag is enabled, it autoscales more aggressively.

Fairly simple one, this: if the flag is on and the cluster is ever not idle, it spins up at least initial_workers worth of nodes straight away, so it doesn't have to slowly grind its way up from 1, to 2, to 3, ....

There are also some minor fixes in here relating to:

Dashboard/reporter being broken (--include-webui didn't work and its help string was wrong)
Dashboard didn't bind to 0.0.0.0 by default (now it does).

This fixes #4319.
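The rule described above can be sketched as a small standalone function (a simplified, hypothetical helper for illustration; the real logic lives inside the autoscaler's target-worker computation):

```python
def apply_aggressive_floor(ideal_num_workers, initial_workers, aggressive):
    """Sketch of the aggressive scale-up rule: if aggressive mode is on
    and the cluster wants any workers at all, jump straight to at least
    initial_workers instead of growing one node at a time."""
    if aggressive and ideal_num_workers >= 0:
        return max(ideal_num_workers, initial_workers)
    return ideal_num_workers
```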

@virtualluke
Contributor

Looking forward to testing the dashboard again with the binding set to 0.0.0.0; hoping I can get that to work on Kubernetes through nginx to a service.

@@ -53,6 +53,10 @@
# The number of workers to launch initially, in addition to the head node.
"initial_workers": (int, OPTIONAL),

# Whether or not to scale aggressively, e.g. to jump back to at least
# initial_workers if we're ever below it and are scaling up
"aggressive_autoscaling": (bool, OPTIONAL),
Contributor

In the anticipation of further modes, how about "mode": "aggressive"?

Contributor

@ls-daniel This should be resolved before we merge this. I agree with @ericl's suggestion.

Contributor

This is the only outstanding comment I think?

Contributor Author

Oops. Right, I'll change this soon, my ability to test changes is broken at the moment.

ideal_num_workers = max(ideal_num_workers, initial_workers)
elif aggressive and ideal_num_workers >= 0:
# If we want any workers, we want at least initial_workers
ideal_num_workers = max(ideal_num_workers, initial_workers)
Contributor

Could we add a unit test for this mode?

Contributor

@ls-daniel are we planning to add a test for this in this PR? I would prefer to have a test added if that's possible.
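A minimal test along the lines requested might look something like this; it exercises a standalone helper that mirrors the new branch rather than the real StandardAutoscaler API, so the names here are illustrative only:

```python
def target_workers(ideal, initial, aggressive):
    # Mirrors the new branch: aggressive mode jumps to at least
    # `initial` whenever any workers are wanted at all.
    if aggressive and ideal >= 0:
        ideal = max(ideal, initial)
    return ideal

def test_aggressive_autoscaling():
    assert target_workers(1, 5, aggressive=False) == 1  # normal: step up slowly
    assert target_workers(1, 5, aggressive=True) == 5   # aggressive: jump to initial
    assert target_workers(8, 5, aggressive=True) == 8   # already above initial

test_aggressive_autoscaling()
```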

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12615/
Test FAILed.

@@ -334,9 +335,6 @@ def start(node_ip_address, redis_address, redis_port, num_redis_shards,
if redis_max_clients is not None:
raise Exception("If --head is not passed in, --redis-max-clients "
"must not be provided.")
if include_webui:
raise Exception("If --head is not passed in, the --include-webui "
Collaborator

Why remove this exception?

Collaborator

I guess you could start a dashboard on any (or every machine), right?

Contributor Author

Starting the reporter process is gated behind include_webui.
This exception makes it impossible to start the reporters on the worker nodes.

Collaborator

Good catch. I just pushed a small change to remove the gating. That seemed like the right approach to me (instead of passing in --include-webui everywhere). In principle we should be able to start the actual dashboard on any machine, so it could make sense to remove this exception, but we can do that later, I suppose.

@robertnishihara
Collaborator

Tests failing with

2019-03-06 18:20:06,459	ERROR autoscaler.py:383 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 381, in update
    self._update()
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 402, in _update
    self.log_info_string(nodes)
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 633, in log_info_string
    logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 649, in info_string
    len(nodes), self.target_num_workers(), suffix)
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 520, in target_num_workers
    aggressive = self.config["aggressive_autoscaling"]
KeyError: 'aggressive_autoscaling'
2019-03-06 18:20:06,467	CRITICAL autoscaler.py:387 -- StandardAutoscaler: Too many errors, abort.
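One way to avoid the KeyError shown above, assuming the key is meant to be optional, is to fall back to a default instead of indexing the config directly (illustrative only; the actual fix may instead fill in defaults during config validation):

```python
def read_aggressive_flag(config):
    # Treat the optional "aggressive_autoscaling" key as False when absent,
    # rather than config["aggressive_autoscaling"], which raises KeyError.
    return config.get("aggressive_autoscaling", False)
```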

@AmplabJenkins

Can one of the admins verify this patch?

1 similar comment

@robertnishihara
Collaborator

Jenkins, ok to test.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/57/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12799/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12810/
Test PASSed.

@robertnishihara
Collaborator

@ericl @hartikainen can this be merged or are there unit tests that should be added?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/78/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12853/
Test FAILed.

@ls-daniel
Contributor Author

Agreed, I'll take a look at this when I have some time later.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/236/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13157/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/238/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13163/
Test FAILed.

@ls-daniel
Contributor Author

I've rebased on master.

A few fixes have been made to close file descriptors that the tests were warning about.

The test failures look unrelated to this change to me; I've written a testAggressiveAutoscaling case, which passes.

I've just pushed a change to fix flake8.

On lint failures: I don't understand the procedure for formatting. As before, when I run format.sh, it reformats files all over the place; a different version of yapf? A different .style.yapf? I'm baffled by this. An example change it makes, completely unrelated to this PR:

-        trials = run_experiments(
-            {
-                "foo": {
-                    "run": create_resettable_class(),
-                    "num_samples": 4,
-                    "config": {},
-                }
-            },
-            reuse_actors=True,
-            scheduler=FrequentPausesScheduler())
+        trials = run_experiments({
+            "foo": {
+                "run": create_resettable_class(),
+                "num_samples": 4,
+                "config": {},
+            }
+        },
+                                 reuse_actors=True,
+                                 scheduler=FrequentPausesScheduler())

Any ideas? Can we merge this, since the test failures look unrelated?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/247/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/248/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/396/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13635/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/397/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13637/
Test FAILed.

@ls-daniel
Contributor Author

Now? Tests seem to be failing due to unrelated issues.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/416/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13700/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/473/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13788/
Test FAILed.

Successfully merging this pull request may close these issues.

Help for ray start --include-webui flag wasn't updated
7 participants