
Stop instances that are not doing anything useful anymore #6333

Merged: 5 commits into main from feat/wm-zombie-instances, Jun 26, 2023

Conversation

@lotas (Contributor) commented Jun 16, 2023

This can happen when a docker-worker or generic-worker process dies and stops calling queue.claimWork/queue.reclaimTask.
Such workers would either not be known to the queue at all, or would have their last_date_active far in the past.
In such cases, killing the instance is the only option to free up resources.

This PR introduces queueInactivityTimeout as a worker lifecycle parameter (the default being 2h).

It works in the following scenarios:

  • if a worker is registered but never called queue.claimWork, it is terminated after queueInactivityTimeout (firstClaim is not set)
  • if a worker claimed work but stopped calling queue.reclaimTask, or died and stopped calling claimWork, it is terminated once lastDateActive is older than queueInactivityTimeout

Quarantined workers are not affected, since they keep calling claimWork even though the queue ignores those claims.
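For illustration, the lifecycle section of a worker pool definition using the new parameter might look roughly like the snippet below; the other keys, the seconds-based units, and the comments are assumptions for this sketch rather than something taken from the PR diff.

// Illustrative worker pool config excerpt; key names other than
// queueInactivityTimeout, and the seconds-based units, are assumptions.
const workerPoolConfig = {
  lifecycle: {
    registrationTimeout: 1800,      // existing: time allowed to call registerWorker
    reRegistrationTimeout: 345600,  // existing: the 96h window discussed below
    queueInactivityTimeout: 7200,   // new: 2h without claimWork/reclaimTask before termination
  },
  // ...instance type, capacity, and other provider config
};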

Fixes #6142

@lotas force-pushed the feat/wm-zombie-instances branch 2 times, most recently from c4b9dc8 to 9f9bec2, on June 16, 2023 14:18
@djmitche (Collaborator) left a comment

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

This PR would terminate workers that were successfully re-registering, but were not claiming work. Typically worker-runner handles the worker registration, while of course the worker handles the claiming. The cases where I have seen a worker re-registering but not claiming are when docker-worker is confused into thinking that it has zero capacity (such as when it starts without enough disk space). So I think this is a valid situation for worker-manager to monitor for.

Comment on lines 102 to 103
activityTimeout = activityTimeout || 1000 * 60 * 60 * 2; // last active within 2 hours
createdTimeout = createdTimeout || 1000 * 60 * 60 * 1; // created at least 1 hour ago
@djmitche (Collaborator):

These should probably be values specified in the lifecycle property of the worker pool, rather than hard-coded. As mentioned in the meeting today, they can probably be much smaller for AWS and Google.

@lotas (Contributor, Author):

Thanks Dustin, yes, I moved them into the lifecycle and left a single claimTimeout parameter
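In rough terms, reading the value from the pool's lifecycle configuration instead of hard-coding it looks something like the sketch below; the function name is hypothetical, and the assumption that the config value is in seconds under workerPool.config.lifecycle is illustrative, not the merged diff.

// Illustrative sketch, not the merged code: read the timeout from the worker
// pool's lifecycle configuration (assumed to be in seconds), falling back to
// the previous hard-coded 2h default, and convert it to milliseconds.
function getQueueInactivityTimeoutMs(workerPool) {
  const { lifecycle = {} } = workerPool.config || {};
  return (lifecycle.queueInactivityTimeout || 7200) * 1000;
}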

@lotas (Contributor, Author) commented Jun 19, 2023

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

Do you think we should lower that default here now as well? Do you remember why 96h was chosen initially?

I think it still makes sense to have claimTimeout as it "reacts faster" to the situations where queue interaction is broken for either reason.

@@ -359,6 +360,10 @@ class AwsProvider extends Provider {
await this.removeWorker({ worker, reason: 'terminateAfter time exceeded' });
}
}
const { isZombie, reason } = Provider.isZombie({ worker });
@lotas (Contributor, Author):

Naming is hard... but I think zombie is a good analogy, the same as zombie processes.

Member:

I think it is ok too, but we should probably document somewhere what zombie means.
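As a rough illustration of what "zombie" means here: an instance that is still running but has stopped interacting with the queue. The sketch below assumes the worker record exposes the queue-provided firstClaim and lastDateActive timestamps plus a configured inactivity timeout; everything in the body is illustrative, not the merged Provider.isZombie.

// Illustrative sketch only, not the merged implementation.
class Provider {
  static isZombie({ worker }) {
    // Where the timeout is stored on the worker is an assumption; default to 2h in milliseconds.
    const timeout = (worker.providerData && worker.providerData.queueInactivityTimeout) || 2 * 60 * 60 * 1000;
    const now = Date.now();

    if (!worker.firstClaim) {
      // registered (or spawned) but never called queue.claimWork
      if (now - new Date(worker.created).getTime() > timeout) {
        return { isZombie: true, reason: 'worker never claimed work within the inactivity timeout' };
      }
    } else if (now - new Date(worker.lastDateActive).getTime() > timeout) {
      // claimed work at some point, but stopped calling claimWork/reclaimTask
      return { isZombie: true, reason: 'worker has not interacted with the queue within the inactivity timeout' };
    }

    return { isZombie: false, reason: null };
  }
}

Provider scan loops (like the AwsProvider hunk above) would then act on this, calling removeWorker with the returned reason when isZombie is true.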

@djmitche (Collaborator) commented:

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

Do you think we should lower that default here now as well? Do you remember why 96h was chosen initially?

I think it still makes sense to have claimTimeout as it "reacts faster" to the situations where queue interaction is broken for either reason.

The 96h was originally hard-coded in the workers. The aws-provisioner-v1 service would call TerminateInstances after 96 hours as a backup, so workers terminated themselves before that.

We set the lifecycle defaults to match that so that enabling them didn't change behavior, with the idea that someone would follow up and reduce those timeouts :)

@petemoore (Member) left a comment

Thanks Yarik!

I like that you used the term "non-stopped-workers" to mean all workers not in the stopped state, since other terms like "active" would be more vague, so +1 for that! I think claimTimeout is a bit confusing, though: when workers claim a task they already get a claimTimeout, and it means something different (how long until the task is resolved as exception/claim-expired if they haven't reclaimed it). Other than that, some clarity about lastDateActive would be useful, although maybe that already exists elsewhere in the documentation. Thanks!


Get non-stopped workers filtered by the optional arguments,
ordered by `worker_pool_id`, `worker_group`, and `worker_id`.
If the pagination arguments are both NULL, all rows are returned.
Otherwise, page_size rows are returned at offset `page_offset`.
The `quarantine_until` contains NULL or a date in the past if the
worker is not quarantined, otherwise the date until which it is
quarantined. `providers_filter_cond` and `providers_filter_value` are used to
filter workers by provider. `first_claim` and `last_date_active` contain information
known to the queue service about the worker.
@petemoore (Member):

How is last date active defined? Last time any API request was made to the queue, or only a specific subset?

Contributor:

This looks like he's just using what I had added previously https://github.com/taskcluster/taskcluster/pull/5365/files#diff-3dcdf963e193e0ce3968b21e9a477bf3cb166b7480385482d217b5d9773f3116R60-R66

Updated each time a worker calls queue.claimWork, queue.reclaimTask, and queue.declareWorker for this task queue.

Contributor:

Which I then used for implementing #5448.

@lotas (Contributor, Author):

Yes, this is handled by the queue: whenever a claimWork or reclaimTask call happens, the queue marks the worker this way, so it knows the worker was doing something at that point in time.
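To picture that queue-side bookkeeping, here is a heavily simplified sketch; recordWorkerSeen, upsertQueueWorker, and the exact fields are hypothetical stand-ins, not the actual queue service internals.

// Hypothetical sketch of the bookkeeping behind first_claim / last_date_active.
// Function and field names here are illustrative, not the real queue code.
async function recordWorkerSeen(db, { taskQueueId, workerGroup, workerId }) {
  const now = new Date();
  await db.upsertQueueWorker({
    taskQueueId,
    workerGroup,
    workerId,
    firstClaim: now,      // only set the first time the worker is seen
    lastDateActive: now,  // refreshed on every claimWork / reclaimTask / declareWorker
  });
}

// The claimWork, reclaimTask, and declareWorker handlers would call this,
// which is what later surfaces to worker-manager as last_date_active.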

services/worker-manager/schemas/v1/worker-lifecycle.yml (outdated; thread resolved)

services/worker-manager/src/providers/provider.js (outdated; thread resolved)
@lotas marked this pull request as ready for review on June 20, 2023 12:23
@lotas requested a review from a team as a code owner on June 20, 2023 12:23
@lotas requested reviews from petemoore and matt-boris and removed the request for a team on June 20, 2023 12:23
@matt-boris (Contributor) left a comment

Nice, looking good! 🎉

working on a task. If worker process dies, or it stops calling `claimWork`
or `reclaimTask` it should be considered dead and terminated.

Minimum allowed value is 10 minutes, to give worker chance to start claiming tasks.
@matt-boris (Contributor):

This worries me a little bit. Did you get this 10 min from anywhere, or are you just hoping that this will be the case? I'd prefer to go a little more conservative here. Wdyt?

@lotas (Contributor, Author):

Yeah, this is totally debatable. I just wanted something to prevent a misconfiguration from leading to workers being killed before they even get a chance to do something.
I can imagine it takes time to start up and prepare. This would also include reboots; on Windows it probably takes longer.

@matt-boris do you suggest we increase this number or decrease it?

@petemoore from your past experience, do you know what the average maximum time between tasks should be (GC, reboot, etc.)?

@matt-boris (Contributor):

@matt-boris do you suggest we increase this number or decrease?

I would think 30 minutes is a safer bet. That is quite a bit more conservative, which might be overkill, but it's hard to tell for sure.

@lotas (Contributor, Author) commented Jun 21, 2023

I've been running some queries on the data we have, and noticed that reboots might take 5-15 minutes on average, with some outliers of 60+ minutes.
Maybe those pools are special, or maybe the data is just not very representative.

The data source is already an aggregation (total number of seconds spent on reboots and number of instances), so I'm drawing my conclusions from reboots/instances to see what those numbers are.

Now I'm also wondering if 2h is a good default 🤔

Checking median values, there seem to be only a few pools that can go on 2h+ reboots.

@lotas (Contributor, Author):

OK, if I only query the pools with 20+ instances, then the 75th percentile has all reboots finishing within 2h, so I think it should be safe.

@djmitche (Collaborator):

I think there's a difference between a minimum value in the platform and the configured value. If for example we had a provisioner that ran QEMU VMs on demand, 10 minutes might be far too long. It's probably short for a default value, but not for a minimum value.

It's worth wondering if a host that takes 2+ hours to reboot is going to ever manage to run a build or test task in a reasonable amount of time. But, that decision should be up to the deployers (and thus not encoded in the schema) IMHO.

@lotas (Contributor, Author):

Agree, I'll put something smaller.

@matt-boris (Contributor) left a comment

Some small, additional cleanup.

services/worker-manager/schemas/v1/worker-lifecycle.yml (outdated; thread resolved)
services/worker-manager/test/provider_test.js (outdated; thread resolved)
@lotas force-pushed the feat/wm-zombie-instances branch 3 times, most recently from 2c04cfe to dd7a9a3, on June 21, 2023 14:51
matt-boris previously approved these changes Jun 21, 2023
@matt-boris (Contributor) left a comment

LGTM thanks Yarik!! 🚀

lotas and others added 5 commits June 26, 2023 16:46
This can happen when docker/generic worker process dies and stops
calling queue.claimWork/queue.reclaimTask
Such workers would either be not known to the queue, or would have their
last_date_active way in the past.
In such cases killing instance is the only option to free up resources.
Co-authored-by: Matt Boris <92693437+matt-boris@users.noreply.github.com>
@lotas merged commit 0f4ed17 into main on Jun 26, 2023
66 checks passed
@lotas deleted the feat/wm-zombie-instances branch on June 26, 2023 15:51
Successfully merging this pull request may close these issues.

Worker Manager not killing rogue/zombie instances
4 participants