
[autoscaler] Fix tag cache bug, don't kill workers on error #14424

Merged · 9 commits · Mar 2, 2021

Conversation

wuisawesome (Contributor)

Why are these changes needed?

This PR fixes two related issues:

  1. There is a use-after-free-style bug: the AWS node provider frees its tag cache too eagerly, which leads to KeyErrors.
  2. Even when we encounter this bug, or any other bug, we shouldn't terminate all the nodes in the cluster just because we couldn't kill one of them (a sketch of this idea follows the list).
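To illustrate the second point, here is a minimal sketch (not the PR's actual diff) of isolating per-node termination failures so that one bad node does not abort handling of the rest; the provider object and helper name are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def terminate_nodes_best_effort(provider, node_ids):
    """Hypothetical helper, not the PR's code: terminate each node
    independently, logging and skipping failures instead of letting one
    bad node abort the whole loop (or trigger a cluster-wide teardown)."""
    failed = []
    for node_id in node_ids:
        try:
            provider.terminate_node(node_id)
        except Exception:
            logger.exception("Failed to terminate node %s; continuing", node_id)
            failed.append(node_id)
    return failed
```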

Related issue number

Closes #14264

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome (Contributor, Author)

@@ -414,12 +414,19 @@ def _print(self,
record.levelname = _level_str
rendered_message = self._formatter.format(record)

# We aren't using standard python logging convention, so we hardcode
# the log levels for now.
Contributor Author

Make sure that errors actually appear in monitor.err
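For context, a minimal sketch (assumed, not the PR's actual code) of hardcoding a mapping from the autoscaler's custom level strings onto standard Python logging levels, so that error-level output is emitted at ERROR and therefore lands in the stderr log (monitor.err); the level names and mapping here are assumptions:

```python
import logging

# Assumed mapping from custom level strings to standard logging levels;
# the real cli_logger level names may differ.
_LEVEL_MAP = {
    "VINFO": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERR": logging.ERROR,
    "PANIC": logging.CRITICAL,
}


def emit(logger: logging.Logger, level_str: str, rendered_message: str) -> None:
    """Emit an already-rendered message at the level implied by level_str,
    defaulting to INFO for unknown labels."""
    logger.log(_LEVEL_MAP.get(level_str, logging.INFO), rendered_message)
```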

@@ -239,10 +239,6 @@ def destroy_autoscaler_workers(self):

def _handle_failure(self, error):
logger.exception("Error in monitor loop")
if self.autoscaler is not None:
@ericl (Contributor) commented Mar 1, 2021

Can we please preserve this but put it behind an env var?
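A minimal sketch of what gating the teardown behind an environment variable could look like; the flag name is hypothetical and this is not necessarily the PR's final code:

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical flag name; the variable actually used by the autoscaler may differ.
KILL_WORKERS_ON_FAILURE = (
    os.environ.get("AUTOSCALER_KILL_WORKERS_ON_FAILURE", "0") == "1")


def handle_failure(monitor, error):
    """Always log the monitor-loop error; only tear the cluster down when
    the operator has explicitly opted in via the environment flag."""
    logger.exception("Error in monitor loop")
    if monitor.autoscaler is not None and KILL_WORKERS_ON_FAILURE:
        monitor.destroy_autoscaler_workers()
```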

@@ -120,6 +120,22 @@ def __init__(self, provider_config, cluster_name):
# excessive DescribeInstances requests.
self.cached_nodes = {}

def _gc_tag_cache(self):
Contributor

This seems complicated. Why not instead make all accesses to the tag cache fall back on looking up the tags in case of failure?

Contributor Author

You mean making an API call to get the tags?

Contributor

Yes, or returning empty dict if the node is terminated.

Hmm, another option is to never GC the tags; it seems harmless enough to keep them forever, since the number of nodes will never be that large.

Contributor Author

An empty dict would be an inconsistent state, so I'd prefer to avoid that. I guess just leaking the tag cache isn't the end of the world...
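For reference, a minimal sketch of the fallback idea discussed above (re-fetching tags from EC2 on a cache miss instead of raising a KeyError); this is illustrative only, not necessarily the approach the PR took, and the class and method names are hypothetical:

```python
import boto3


class TagCache:
    """Illustrative cache of instance-id -> tag dict that falls back to an
    EC2 lookup when an entry is missing, instead of raising KeyError."""

    def __init__(self, region_name):
        self._ec2 = boto3.resource("ec2", region_name=region_name)
        self._tags = {}

    def get_tags(self, node_id):
        if node_id not in self._tags:
            # Cache miss (e.g. the entry was GC'd too eagerly): refresh from
            # the API rather than failing. A terminated instance may simply
            # have no tags left, which degrades to an empty dict.
            instance = self._ec2.Instance(node_id)
            self._tags[node_id] = {
                t["Key"]: t["Value"] for t in (instance.tags or [])
            }
        return self._tags[node_id]
```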

@ericl (Contributor) left a comment

Looks good, but can we put the kill change behind an env flag?

@AmeerHajAli (Contributor) commented Mar 1, 2021

Is there any test that catches this behavior? @wuisawesome

@ericl self-assigned this on Mar 2, 2021
@AmeerHajAli (Contributor) left a comment

LGTM. How about adding a test for each of these cases? (I guess we have a rule that every bug fix comes with a test.)

@wuisawesome (Contributor, Author)

I don't really see how we can test this without a lot of effort. In general we don't have integration tests for the node provider, and it's not clear how to catch this with unit testing (we generally don't have unit tests for the node providers).

@ericl (Contributor) commented Mar 2, 2021

Given that this is blocking a production user, how about we merge for now and file a follow-up to better test the tag cache?

@wuisawesome (Contributor, Author)

I agree. @AmeerHajAli, given that we don't have the right framework for testing this, can you merge this if you think it's OK?

@ericl merged commit 4572c6c into ray-project:master on Mar 2, 2021
Development

Successfully merging this pull request may close these issues:

  • [autoscaler] wrongly shuts down all nodes due to one bad node.

3 participants