
[autoscaler] Add an aggressive_autoscaling flag #4285

Merged
17 commits merged on Apr 14, 2019

Conversation

ls-daniel
Contributor

@ls-daniel ls-daniel commented Mar 6, 2019

This modifies the behaviour of the autoscaler so that, when the flag is enabled, it autoscales more aggressively.

Fairly simple one, this: if the flag is on and the cluster is ever not idle, it spins up at least initial_workers worth of nodes straight away, so it doesn't have to slowly grind its way up from 1, to 2, to 3, ....

There are also some minor fixes in here relating to:

Dashboard/reporter being broken (--include-webui didn't work and its help string was wrong)
Dashboard didn't bind to 0.0.0.0 by default (now it does).

This fixes #4319.
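The rule described above can be sketched as a small standalone function (a simplified, hypothetical helper for illustration; the real logic lives inside the autoscaler's target-worker computation):

```python
def apply_aggressive_floor(ideal_num_workers, initial_workers, aggressive):
    """Sketch of the aggressive scale-up rule: if aggressive mode is on
    and the cluster wants any workers at all, jump straight to at least
    initial_workers instead of growing one node at a time."""
    if aggressive and ideal_num_workers >= 0:
        return max(ideal_num_workers, initial_workers)
    return ideal_num_workers
```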

@virtualluke
Contributor

Looking forward to testing the dashboard again with the binding set to 0.0.0.0; hoping I can get that to work on Kubernetes through nginx to a service.

@@ -53,6 +53,10 @@
# The number of workers to launch initially, in addition to the head node.
"initial_workers": (int, OPTIONAL),

# Whether or not to scale aggressively, e.g. to jump back to at least
# initial_workers if we're ever below it and are scaling up
"aggressive_autoscaling": (bool, OPTIONAL),
Contributor

In the anticipation of further modes, how about "mode": "aggressive"?

Contributor

@ls-daniel This should be resolved before we merge this. I agree with @ericl's suggestion.

Contributor

This is the only outstanding comment I think?

Contributor Author

Oops. Right, I'll change this soon, my ability to test changes is broken at the moment.

ideal_num_workers = max(ideal_num_workers, initial_workers)
elif aggressive and ideal_num_workers >= 0:
# If we want any workers, we want at least initial_workers
ideal_num_workers = max(ideal_num_workers, initial_workers)
Contributor

Could we add a unit test for this mode?

Contributor

@ls-daniel are we planning to add a test for this in this PR? I would prefer to have a test added if that's possible.
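A minimal test along the lines requested might look something like this; it exercises a standalone helper that mirrors the new branch rather than the real StandardAutoscaler API, so the names here are illustrative only:

```python
def target_workers(ideal, initial, aggressive):
    # Mirrors the new branch: aggressive mode jumps to at least
    # `initial` whenever any workers are wanted at all.
    if aggressive and ideal >= 0:
        ideal = max(ideal, initial)
    return ideal

def test_aggressive_autoscaling():
    assert target_workers(1, 5, aggressive=False) == 1  # normal: step up slowly
    assert target_workers(1, 5, aggressive=True) == 5   # aggressive: jump to initial
    assert target_workers(8, 5, aggressive=True) == 8   # already above initial

test_aggressive_autoscaling()
```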

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12615/
Test FAILed.

@@ -334,9 +335,6 @@ def start(node_ip_address, redis_address, redis_port, num_redis_shards,
if redis_max_clients is not None:
raise Exception("If --head is not passed in, --redis-max-clients "
"must not be provided.")
if include_webui:
raise Exception("If --head is not passed in, the --include-webui "
Collaborator

Why remove this exception?

Collaborator

I guess you could start a dashboard on any (or every machine), right?

Contributor Author

Starting the reporter process is gated behind include_webui.
This exception makes it impossible to start the reporters on the worker nodes.

Collaborator

Good catch. I just pushed a small change to remove the gating. That seemed like the right approach to me (instead of passing in --include-webui everywhere). In principle we should be able to start the actual dashboard on any machine, so it could make sense to remove this exception, but we can do that later, I suppose.

@robertnishihara
Collaborator

Tests failing with

2019-03-06 18:20:06,459	ERROR autoscaler.py:383 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 381, in update
    self._update()
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 402, in _update
    self.log_info_string(nodes)
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 633, in log_info_string
    logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 649, in info_string
    len(nodes), self.target_num_workers(), suffix)
  File "/Users/travis/build/ray-project/ray/python/ray/autoscaler/autoscaler.py", line 520, in target_num_workers
    aggressive = self.config["aggressive_autoscaling"]
KeyError: 'aggressive_autoscaling'
2019-03-06 18:20:06,467	CRITICAL autoscaler.py:387 -- StandardAutoscaler: Too many errors, abort.
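One way to avoid the KeyError shown above, assuming the key is meant to be optional, is to fall back to a default instead of indexing the config directly (illustrative only; the actual fix may instead fill in defaults during config validation):

```python
def read_aggressive_flag(config):
    # Treat the optional "aggressive_autoscaling" key as False when absent,
    # rather than config["aggressive_autoscaling"], which raises KeyError.
    return config.get("aggressive_autoscaling", False)
```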

@AmplabJenkins

Can one of the admins verify this patch?

1 similar comment

@robertnishihara
Collaborator

Jenkins, ok to test.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/57/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12799/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12810/
Test PASSed.

@robertnishihara
Collaborator

@ericl @hartikainen can this be merged or are there unit tests that should be added?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/78/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12853/
Test FAILed.

@ls-daniel
Contributor Author

Agreed, I'll take a look at this when I have some time later.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/236/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13157/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/238/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13163/
Test FAILed.

@ls-daniel
Contributor Author

I've rebased on master.

A few fixes have been made to close file descriptors that the tests were warning about.

The test failures look unrelated to this change to me; I've written a testAggressiveAutoscaling case, which passes.

I've just pushed a change to fix flake8.

On lint failures: I don't understand the procedure for formatting. As before, when I run format.sh, it reformats files all over the place; a different version of yapf? A different .style.yapf? I'm baffled by this. An example change it makes, completely unrelated to this PR:

-        trials = run_experiments(
-            {
-                "foo": {
-                    "run": create_resettable_class(),
-                    "num_samples": 4,
-                    "config": {},
-                }
-            },
-            reuse_actors=True,
-            scheduler=FrequentPausesScheduler())
+        trials = run_experiments({
+            "foo": {
+                "run": create_resettable_class(),
+                "num_samples": 4,
+                "config": {},
+            }
+        },
+                                 reuse_actors=True,
+                                 scheduler=FrequentPausesScheduler())

Any ideas? Can we merge this, since the test failures look unrelated?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/247/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/248/
Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/396/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13635/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/397/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13637/
Test FAILed.

@ls-daniel
Contributor Author

Now? Tests seem to be failing due to unrelated issues.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/416/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13700/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/473/
Test PASSed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13788/
Test FAILed.

Successfully merging this pull request may close these issues.

Help for ray start --include-webui flag wasn't updated
7 participants