Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug 1627769] Fix slow leak of wedged workers #2837

Merged
merged 3 commits into from
May 20, 2020
Merged

[Bug 1627769] Fix slow leak of wedged workers #2837

merged 3 commits into from
May 20, 2020

Conversation

imbstack
Copy link
Contributor

Previous to this, the default lifecycles weren't getting applied. This was
fine in the 99% case where the worker succesfully registered as at that time
the terminateAfter date gets set to the expires of the creds returned.

However for workers that did not register, the reregisterTimeout was null
and these workers were allowed to live forever. To apply this to all currently
existing pools, we should run ci-admin apply/tc-admin apply.

We had to add a default to all places where the lifecycle $ref was used. However Ajv does not allow the default to be specified outside of properties and items so we just do it where it is "imported".

This also changes our reference validation because in Ajv, $ref doesn't actually have to be alone. It can have siblings that it will merge with.

Bugzilla Bug: 1627769

@imbstack imbstack requested a review from a team as a code owner May 15, 2020 01:03
Copy link
Collaborator

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just re-running ci-admin/tc-admin won't actually call updateWorkerPool unless the worker pools have changed.

If I'm understanding the issue correctly, it's that config.lifecycle is omitted in most worker-pool definitions, and that leads to interpretLifecycle returning null's.

In addition to updating the defaults in the schemas, could that function be modified to return default values in that case instead?

@imbstack
Copy link
Contributor Author

Yeah, the only thing about having defaults in two places is that i worried about making sure we would update both places later if we got around to it but that’s a good point I’ll add some defaults there too

@djmitche
Copy link
Collaborator

I think we've also tried to avoid defaults in schemas because their behavior is kind of wonky -- as you've seen here. But I am just parroting jonas here, so who knows :)

@imbstack
Copy link
Contributor Author

I think we fixed a lot of the wonkiness with the patch we made to it forever ago. The defaults are nice for docs at least. Would you rather I just remove the schema default entirely here and just have the default in the code or do both?

@djmitche
Copy link
Collaborator

Both seems good -- the schema defaults are just to {} whereas the code will apply the defaults for the particular properties, so I think they aren't overlapping too badly. The critical change is to treat undefined as {} in the code.

@imbstack
Copy link
Contributor Author

Interestingly setting the default {} allows the defaults of the lifecycle options themselves to take effect so the default is essentially {reregistrationTimeout: NNNN} but that makes sense. Will get something up shortly.

@@ -21,7 +21,7 @@ properties:
reregistrationTimeout:
title: Checkin Timeout
type: integer
default: 345600 # 3 days
default: 345600 # 4 days (note this is also set in interpretLifecycle; update both if changing)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@imbstack
Copy link
Contributor Author

Oh wait, the default schema stuff is going to wreak havoc on tc-admin/ci-admin isn't it? I should just do the code-only bits I guess unless the admin tools understand this?

@djmitche
Copy link
Collaborator

I'm not sure how tc-admin or ci-admin would be affected -- neither tool uses the schemas, that I know of. What are you seeing?

@imbstack
Copy link
Contributor Author

I haven't tried it yet, I'm just thinking this won't work. Since the default is applied to the input, the the actual config that is stored changes and the tc-admin diff will always detect a change until the local config is updated as well.

@djmitche
Copy link
Collaborator

Oh, good thinking. We could work around that in those tools, if it's worthwhile for the better display in the schema.

@imbstack imbstack added this to In progress in Worker Lifecycle Management via automation May 15, 2020
@imbstack
Copy link
Contributor Author

Adding this to worker-lifecycle management sprint and we can decide what the nicest way to do that then. I'm leaning towards just having a default in the code at this point.

@imbstack
Copy link
Contributor Author

ok, just gave up on making it show up in schemas. the schema docs will still show the default though

Copy link
Collaborator

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@djmitche
Copy link
Collaborator

https://bugzilla.mozilla.org/show_bug.cgi?id=1639365 filed for the intermittency

@imbstack imbstack merged commit 6200d45 into master May 20, 2020
Worker Lifecycle Management automation moved this from In progress to Done May 20, 2020
@imbstack imbstack deleted the bug-1627769 branch May 20, 2020 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants