
Stop instances that are not doing anything useful anymore #6333

Merged: 5 commits into main from feat/wm-zombie-instances, Jun 26, 2023

Conversation

@lotas (Contributor) commented Jun 16, 2023

This can happen when a docker-worker or generic-worker process dies and stops calling queue.claimWork/queue.reclaimTask.
Such workers would either not be known to the queue at all, or would have their last_date_active far in the past.
In such cases, killing the instance is the only option to free up resources.

This PR introduces queueInactivityTimeout as a worker lifecycle parameter (the default being 2h).

It works in the following scenarios:

  • if a worker is registered but never called queue.claimWork, it is terminated after queueInactivityTimeout (firstClaim is not set)
  • if a worker claimed work but stopped calling queue.reclaimTask, or died and stopped calling claimWork, it is terminated once lastDateActive is older than queueInactivityTimeout

Quarantined workers are not affected, since they keep calling claimWork even though the queue ignores those claims.
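For illustration, the lifecycle section of a worker pool definition using the new parameter might look roughly like the snippet below; the other keys, the seconds-based units, and the comments are assumptions for this sketch rather than something taken from the PR diff.

// Illustrative worker pool config excerpt; key names other than
// queueInactivityTimeout, and the seconds-based units, are assumptions.
const workerPoolConfig = {
  lifecycle: {
    registrationTimeout: 1800,      // existing: time allowed to call registerWorker
    reRegistrationTimeout: 345600,  // existing: the 96h window discussed below
    queueInactivityTimeout: 7200,   // new: 2h without claimWork/reclaimTask before termination
  },
  // ...instance type, capacity, and other provider config
};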

Fixes #6142

@lotas force-pushed the feat/wm-zombie-instances branch 2 times, most recently from c4b9dc8 to 9f9bec2, on June 16, 2023 14:18
@djmitche (Collaborator) left a comment

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

This PR would terminate workers that were successfully re-registering, but were not claiming work. Typically worker-runner handles the worker registration, while of course the worker handles the claiming. The cases where I have seen a worker re-registering but not claiming are when docker-worker is confused into thinking that it has zero capacity (such as when it starts without enough disk space). So I think this is a valid situation for worker-manager to monitor for.

Comment on lines 102 to 103
activityTimeout = activityTimeout || 1000 * 60 * 60 * 2; // last active within 2 hours
createdTimeout = createdTimeout || 1000 * 60 * 60 * 1; // created at least 1 hour ago
@djmitche (Collaborator):

These should probably be values specified in the lifecycle property of the worker pool, rather than hard-coded. As mentioned in the meeting today, they can probably be much smaller for AWS and Google.

@lotas (Contributor, Author):

Thanks Dustin, yes, I moved them into the lifecycle and left a single claimTimeout parameter
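In rough terms, reading the value from the pool's lifecycle configuration instead of hard-coding it looks something like the sketch below; the function name is hypothetical, and the assumption that the config value is in seconds under workerPool.config.lifecycle is illustrative, not the merged diff.

// Illustrative sketch, not the merged code: read the timeout from the worker
// pool's lifecycle configuration (assumed to be in seconds), falling back to
// the previous hard-coded 2h default, and convert it to milliseconds.
function getQueueInactivityTimeoutMs(workerPool) {
  const { lifecycle = {} } = workerPool.config || {};
  return (lifecycle.queueInactivityTimeout || 7200) * 1000;
}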

@lotas (Contributor, Author) commented Jun 19, 2023

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

Do you think we should lower that default here now as well? Do you remember why 96h was chosen initially?

I think it still makes sense to have claimTimeout as it "reacts faster" to the situations where queue interaction is broken for either reason.

@@ -359,6 +360,10 @@ class AwsProvider extends Provider {
await this.removeWorker({ worker, reason: 'terminateAfter time exceeded' });
}
}
const { isZombie, reason } = Provider.isZombie({ worker });
@lotas (Contributor, Author):

Naming is hard... but I think zombie is a good analogy, the same as zombie processes.

Member:

I think it is ok too, but we should probably document somewhere what zombie means.
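As a rough illustration of what "zombie" means here: an instance that is still running but has stopped interacting with the queue. The sketch below assumes the worker record exposes the queue-provided firstClaim and lastDateActive timestamps plus a configured inactivity timeout; everything in the body is illustrative, not the merged Provider.isZombie.

// Illustrative sketch only, not the merged implementation.
class Provider {
  static isZombie({ worker }) {
    // Where the timeout is stored on the worker is an assumption; default to 2h in milliseconds.
    const timeout = (worker.providerData && worker.providerData.queueInactivityTimeout) || 2 * 60 * 60 * 1000;
    const now = Date.now();

    if (!worker.firstClaim) {
      // registered (or spawned) but never called queue.claimWork
      if (now - new Date(worker.created).getTime() > timeout) {
        return { isZombie: true, reason: 'worker never claimed work within the inactivity timeout' };
      }
    } else if (now - new Date(worker.lastDateActive).getTime() > timeout) {
      // claimed work at some point, but stopped calling claimWork/reclaimTask
      return { isZombie: true, reason: 'worker has not interacted with the queue within the inactivity timeout' };
    }

    return { isZombie: false, reason: null };
  }
}

Provider scan loops (like the AwsProvider hunk above) would then act on this, calling removeWorker with the returned reason when isZombie is true.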

@djmitche (Collaborator) commented:

I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.

Do you think we should lower that default here now as well? Do you remember why 96h was chosen initially?

I think it still makes sense to have claimTimeout as it "reacts faster" to the situations where queue interaction is broken for either reason.

The 96h was originally hard-coded in the workers. The aws-provisioner-v1 service would call TerminateInstances after 96 hours as a backup, so workers terminated themselves before that.

We set the lifecycle defaults to match that so that enabling them didn't change behavior, with the idea that someone would follow up and reduce those timeouts :)

@petemoore (Member) left a comment

Thanks Yarik!

I like that you used the term "non-stopped-workers" to mean all workers not in the stopped state, since other terms like "active" would be more vague, so +1 for that! I think claimTimeout is a bit confusing, though: when workers claim a task they already get a claimTimeout, and it means something different (how long until the task is resolved as exception/claim-expired if they haven't reclaimed it). Other than that, some clarity about lastDateActive would be useful, although maybe that already exists elsewhere in the documentation. Thanks!


Get non-stopped workers filtered by the optional arguments,
ordered by `worker_pool_id`, `worker_group`, and `worker_id`.
If the pagination arguments are both NULL, all rows are returned.
Otherwise, page_size rows are returned at offset `page_offset`.
The `quarantine_until` contains NULL or a date in the past if the
worker is not quarantined, otherwise the date until which it is
quarantined. `providers_filter_cond` and `providers_filter_value` are used to
filter workers by provider. `first_claim` and `last_date_active` contain information
known to the queue service about the worker.
@petemoore (Member):

How is last date active defined? Last time any API request was made to the queue, or only a specific subset?

Contributor:

This looks like he's just using what I had added previously https://github.com/taskcluster/taskcluster/pull/5365/files#diff-3dcdf963e193e0ce3968b21e9a477bf3cb166b7480385482d217b5d9773f3116R60-R66

Updated each time a worker calls queue.claimWork, queue.reclaimTask, and queue.declareWorker for this task queue.

Contributor:

Which I then used for implementing #5448.

@lotas (Contributor, Author):

Yes, this is handled by the queue: whenever a claimWork or reclaimTask call happens, the queue marks the worker this way, so it knows the worker was doing something at that point in time.
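To picture that queue-side bookkeeping, here is a heavily simplified sketch; recordWorkerSeen, upsertQueueWorker, and the exact fields are hypothetical stand-ins, not the actual queue service internals.

// Hypothetical sketch of the bookkeeping behind first_claim / last_date_active.
// Function and field names here are illustrative, not the real queue code.
async function recordWorkerSeen(db, { taskQueueId, workerGroup, workerId }) {
  const now = new Date();
  await db.upsertQueueWorker({
    taskQueueId,
    workerGroup,
    workerId,
    firstClaim: now,      // only set the first time the worker is seen
    lastDateActive: now,  // refreshed on every claimWork / reclaimTask / declareWorker
  });
}

// The claimWork, reclaimTask, and declareWorker handlers would call this,
// which is what later surfaces to worker-manager as last_date_active.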

services/worker-manager/schemas/v1/worker-lifecycle.yml (outdated; thread resolved)

services/worker-manager/src/providers/provider.js (outdated; thread resolved)
@lotas marked this pull request as ready for review on June 20, 2023 12:23
@lotas requested a review from a team as a code owner on June 20, 2023 12:23
@lotas requested reviews from petemoore and matt-boris and removed the request for a team on June 20, 2023 12:23
@matt-boris (Contributor) left a comment

Nice, looking good! 🎉

working on a task. If worker process dies, or it stops calling `claimWork`
or `reclaimTask` it should be considered dead and terminated.

Minimum allowed value is 10 minutes, to give worker chance to start claiming tasks.
@matt-boris (Contributor):

This worries me a little bit. Did you get this 10 min from anywhere, or are you just hoping that this will be the case? I'd prefer to go a little more conservative here. Wdyt?

@lotas (Contributor, Author):

Yeah, this is totally debatable. I just wanted something to prevent a misconfiguration from leading to workers being killed before they even get a chance to do something.
I can imagine it takes time to start up and prepare. This would also include reboots; on Windows it probably takes longer.

@matt-boris do you suggest we increase this number or decrease it?

@petemoore from your past experience, do you know what the average maximum time between tasks should be (GC, reboot, etc.)?

@matt-boris (Contributor):

@matt-boris do you suggest we increase this number or decrease?

I would think 30 minutes is a safer bet. That is quite a bit more conservative, which might be overkill, but it's hard to tell for sure.

@lotas (Contributor, Author) commented Jun 21, 2023

I've been running some queries on the data we have, and noticed that reboots might take 5-15 minutes on average, with some outliers of 60+ minutes.
Maybe those pools are special, or maybe the data is just not very representative.

The data source is already an aggregation (total number of seconds spent on reboots and number of instances), so I'm drawing my conclusions from reboots/instances to see what those numbers are.

Now I'm also wondering if 2h is a good default 🤔

Checking median values, there seem to be only a few pools that can go on 2h+ reboots.

@lotas (Contributor, Author):

OK, if I only query the pools with 20+ instances, then the 75th percentile has all reboots finishing within 2h, so I think it should be safe.

@djmitche (Collaborator):

I think there's a difference between a minimum value in the platform and the configured value. If for example we had a provisioner that ran QEMU VMs on demand, 10 minutes might be far too long. It's probably short for a default value, but not for a minimum value.

It's worth wondering if a host that takes 2+ hours to reboot is going to ever manage to run a build or test task in a reasonable amount of time. But, that decision should be up to the deployers (and thus not encoded in the schema) IMHO.

@lotas (Contributor, Author):

Agree, I'll put something smaller.

@matt-boris (Contributor) left a comment

Some small, additional cleanup.

services/worker-manager/schemas/v1/worker-lifecycle.yml (outdated; thread resolved)
services/worker-manager/test/provider_test.js (outdated; thread resolved)
@lotas force-pushed the feat/wm-zombie-instances branch 3 times, most recently from 2c04cfe to dd7a9a3, on June 21, 2023 14:51
matt-boris previously approved these changes Jun 21, 2023
@matt-boris (Contributor) left a comment

LGTM thanks Yarik!! 🚀

lotas and others added 5 commits June 26, 2023 16:46
This can happen when docker/generic worker process dies and stops
calling queue.claimWork/queue.reclaimTask
Such workers would either be not known to the queue, or would have their
last_date_active way in the past.
In such cases killing instance is the only option to free up resources.
Co-authored-by: Matt Boris <92693437+matt-boris@users.noreply.github.com>
@lotas merged commit 0f4ed17 into main on Jun 26, 2023
66 checks passed
@lotas deleted the feat/wm-zombie-instances branch on June 26, 2023 15:51
Successfully merging this pull request may close these issues.

Worker Manager not killing rogue/zombie instances
4 participants