
[autoscaler] Fix tag cache bug, don't kill workers on error #14424

Merged · 9 commits · Mar 2, 2021

Conversation

wuisawesome (Contributor)

Why are these changes needed?

This PR fixes two related issues:

  1. There is a use-after-free-style bug: the AWS node provider frees its tag cache too eagerly, which leads to KeyErrors.
  2. Even when we encounter this bug, or any other bug, we shouldn't terminate all the nodes in the cluster just because we couldn't kill one of them (a sketch of this idea follows the list).
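To illustrate the second point, here is a minimal sketch (not the PR's actual diff) of isolating per-node termination failures so that one bad node does not abort handling of the rest; the provider object and helper name are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


def terminate_nodes_best_effort(provider, node_ids):
    """Hypothetical helper, not the PR's code: terminate each node
    independently, logging and skipping failures instead of letting one
    bad node abort the whole loop (or trigger a cluster-wide teardown)."""
    failed = []
    for node_id in node_ids:
        try:
            provider.terminate_node(node_id)
        except Exception:
            logger.exception("Failed to terminate node %s; continuing", node_id)
            failed.append(node_id)
    return failed
```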

Related issue number

Closes #14264

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome (Contributor, Author)

@@ -414,12 +414,19 @@ def _print(self,
record.levelname = _level_str
rendered_message = self._formatter.format(record)

# We aren't using standard python logging convention, so we hardcode
# the log levels for now.
Contributor Author

Make sure that errors actually appear in monitor.err
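For context, a minimal sketch (assumed, not the PR's actual code) of hardcoding a mapping from the autoscaler's custom level strings onto standard Python logging levels, so that error-level output is emitted at ERROR and therefore lands in the stderr log (monitor.err); the level names and mapping here are assumptions:

```python
import logging

# Assumed mapping from custom level strings to standard logging levels;
# the real cli_logger level names may differ.
_LEVEL_MAP = {
    "VINFO": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERR": logging.ERROR,
    "PANIC": logging.CRITICAL,
}


def emit(logger: logging.Logger, level_str: str, rendered_message: str) -> None:
    """Emit an already-rendered message at the level implied by level_str,
    defaulting to INFO for unknown labels."""
    logger.log(_LEVEL_MAP.get(level_str, logging.INFO), rendered_message)
```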

@@ -239,10 +239,6 @@ def destroy_autoscaler_workers(self):

def _handle_failure(self, error):
logger.exception("Error in monitor loop")
if self.autoscaler is not None:
@ericl (Contributor) commented Mar 1, 2021

Can we please preserve this but put it behind an env var?
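A minimal sketch of what gating the teardown behind an environment variable could look like; the flag name is hypothetical and this is not necessarily the PR's final code:

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical flag name; the variable actually used by the autoscaler may differ.
KILL_WORKERS_ON_FAILURE = (
    os.environ.get("AUTOSCALER_KILL_WORKERS_ON_FAILURE", "0") == "1")


def handle_failure(monitor, error):
    """Always log the monitor-loop error; only tear the cluster down when
    the operator has explicitly opted in via the environment flag."""
    logger.exception("Error in monitor loop")
    if monitor.autoscaler is not None and KILL_WORKERS_ON_FAILURE:
        monitor.destroy_autoscaler_workers()
```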

@@ -120,6 +120,22 @@ def __init__(self, provider_config, cluster_name):
# excessive DescribeInstances requests.
self.cached_nodes = {}

def _gc_tag_cache(self):
Contributor

This seems complicated. Why not instead make all accesses to the tag cache fall back on looking up the tags in case of failure?

Contributor Author

You mean making an API call to get the tags?

Contributor

Yes, or returning empty dict if the node is terminated.

Hmm, another option is to never GC the tags; it seems harmless enough to keep them forever, since the number of nodes will never be that large.

Contributor Author

An empty dict would be an inconsistent state, so I'd prefer to avoid that. I guess just leaking the tag cache isn't the end of the world...
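For reference, a minimal sketch of the fallback idea discussed above (re-fetching tags from EC2 on a cache miss instead of raising a KeyError); this is illustrative only, not necessarily the approach the PR took, and the class and method names are hypothetical:

```python
import boto3


class TagCache:
    """Illustrative cache of instance-id -> tag dict that falls back to an
    EC2 lookup when an entry is missing, instead of raising KeyError."""

    def __init__(self, region_name):
        self._ec2 = boto3.resource("ec2", region_name=region_name)
        self._tags = {}

    def get_tags(self, node_id):
        if node_id not in self._tags:
            # Cache miss (e.g. the entry was GC'd too eagerly): refresh from
            # the API rather than failing. A terminated instance may simply
            # have no tags left, which degrades to an empty dict.
            instance = self._ec2.Instance(node_id)
            self._tags[node_id] = {
                t["Key"]: t["Value"] for t in (instance.tags or [])
            }
        return self._tags[node_id]
```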

@ericl (Contributor) left a comment

Looks good, but can we put the kill change behind an env flag?

@AmeerHajAli (Contributor) commented Mar 1, 2021

Is there any test that catches this behavior? @wuisawesome

@ericl self-assigned this on Mar 2, 2021
@AmeerHajAli (Contributor) left a comment

LGTM. How about adding a test for each of these cases? (I guess we have a rule that every bug fix comes with a test.)

@wuisawesome (Contributor, Author)

I don't really see how we can test this without a lot of effort. In general we don't have integration tests for the node provider, and it's not clear how to catch this with unit testing (we generally don't have unit tests for the node providers).

@ericl (Contributor) commented Mar 2, 2021

Given that this is blocking a production user, how about we merge for now and file a follow-up to better test the tag cache?

@wuisawesome (Contributor, Author)

I agree. @AmeerHajAli, given that we don't have the right framework for testing this, can you merge this if you think it's OK?

@ericl merged commit 4572c6c into ray-project:master on Mar 2, 2021
Development

Successfully merging this pull request may close these issues:

  • [autoscaler] wrongly shuts down all nodes due to one bad node.

3 participants