[autoscaler] Fix tag cache bug, don't kill workers on error #14424
Conversation
@@ -414,12 +414,19 @@ def _print(self,
        record.levelname = _level_str
        rendered_message = self._formatter.format(record)

        # We aren't using standard python logging convention, so we hardcode
        # the log levels for now.
Make sure that errors actually appear in monitor.err
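The elided part of this hunk is where the level-to-stream routing happens. A minimal sketch of the idea, assuming the goal is that WARNING-and-above output reaches stderr (the level names and stream choice here are assumptions, not this PR's actual diff):

```python
import sys

# Sketch only: send warnings and errors to stderr so that, when the
# monitor's stderr is redirected to monitor.err, they actually land there.
if _level_str in ("WARNING", "ERROR", "PANIC"):
    stream = sys.stderr
else:
    stream = sys.stdout
stream.write(rendered_message + "\n")
stream.flush()
```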
@@ -239,10 +239,6 @@ def destroy_autoscaler_workers(self):

    def _handle_failure(self, error):
        logger.exception("Error in monitor loop")
        if self.autoscaler is not None:
Can we please preserve this but put it behind an env var?
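For illustration, an env-var gate could look like this sketch (the flag name RAY_AUTOSCALER_FATESHARE_WORKERS is hypothetical, purely to show the shape of the gate):

```python
import os

def _handle_failure(self, error):
    logger.exception("Error in monitor loop")
    # Hypothetical opt-in flag: only tear down workers on a monitor
    # failure when the user explicitly asks for the old behavior.
    if (self.autoscaler is not None
            and os.environ.get("RAY_AUTOSCALER_FATESHARE_WORKERS") == "1"):
        self.destroy_autoscaler_workers()
```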
@@ -120,6 +120,22 @@ def __init__(self, provider_config, cluster_name):
        # excessive DescribeInstances requests.
        self.cached_nodes = {}

    def _gc_tag_cache(self):
This seems complicated. Why not instead make all accesses to the tag cache fall back on looking up the tags in case of failure?
You mean making an API call to get the tags?
Yes, or returning empty dict if the node is terminated.
Hmm another option is to never GC the tags; it seems harmless enough to keep them forever, since the number of nodes will never be that large.
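For reference, the fall-back-on-lookup option might look roughly like this sketch against the AWS provider's tag_cache and _get_node internals (not code from this PR):

```python
def node_tags(self, node_id):
    # Fast path: serve tags from the in-memory cache.
    with self.tag_cache_lock:
        tags = self.tag_cache.get(node_id)
    if tags is not None:
        return tags
    # Cache miss (e.g. the entry was GC'd or never populated): fall back
    # to fetching the instance and reading its tags from EC2.
    node = self._get_node(node_id)
    return {t["Key"]: t["Value"] for t in (node.tags or [])}
```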
An empty dict would be an inconsistent state, so I'd prefer to avoid that. I guess just leaking the tag cache isn't the end of the world...
Looks good, but can we put the kill change behind an env flag?
Is there any test that catches this behavior?
LGTM. How about adding a test for each of these cases? (I believe we have a rule that every bug fix comes with a test.)
I don't really see how we can test this without a lot of effort. In general we don't have integration tests for the node provider, and it's not clear how to catch this with unit testing (we generally don't have unit tests for the node providers).
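If someone does pick up a followup, a fake provider along these lines might be enough to pin the intended behavior; all names here are illustrative, not the real test harness:

```python
# Rough sketch of a regression test: the monitor wiring is replaced by a
# stand-in function, so this only pins the intended contract.
class FakeProvider:
    def __init__(self):
        self.terminated = []

    def non_terminated_nodes(self, tag_filters):
        return ["node-1", "node-2"]

    def terminate_node(self, node_id):
        self.terminated.append(node_id)


def handle_failure(provider, error, fateshare=False):
    # Stand-in for Monitor._handle_failure after this PR: log-and-continue
    # by default; tear workers down only when explicitly opted in.
    if fateshare:
        for node_id in provider.non_terminated_nodes({}):
            provider.terminate_node(node_id)


def test_error_does_not_kill_workers():
    provider = FakeProvider()
    handle_failure(provider, RuntimeError("boom"))
    assert provider.terminated == []
```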
Given this is blocking a production user, how about we merge for now and file a followup to better test the tag cache?
I agree. @AmeerHajAli, given that we don't have the right framework for testing this, can you merge this if you think it's OK?
Why are these changes needed?
This PR fixes two related issues: a tag cache bug in the AWS node provider, and the monitor killing workers when it hits an error in its loop.
Related issue number
Closes #14264
Checks
I've run scripts/format.sh to lint the changes in this PR.