Stop instances that are not doing anything useful anymore #6333
Conversation
Force-pushed from c4b9dc8 to 9f9bec2
I mentioned in the meeting that changing the existing lifetime parameters to something a lot smaller than 96h would have a similar effect -- it will terminate workers that have not (re)registered in that timeframe.
This PR would terminate workers that were successfully re-registering, but were not claiming work. Typically worker-runner handles the worker registration, while of course the worker handles the claiming. The cases where I have seen a worker re-registering but not claiming are when docker-worker is confused into thinking that it has zero capacity (such as when it starts without enough disk space). So I think this is a valid situation for worker-manager to monitor for.
activityTimeout = activityTimeout || 1000 * 60 * 60 * 2; // last active within 2 hours
createdTimeout = createdTimeout || 1000 * 60 * 60 * 1; // created at least 1 hour ago
These should probably be values specified in the `lifecycle` property of the worker pool, rather than hard-coded. As mentioned in the meeting today, they can probably be much smaller for AWS and Google.
Thanks Dustin, yes, I moved them into the `lifecycle` and left a single `claimTimeout` parameter.
Force-pushed from 9f9bec2 to 07f9942
Do you think we should lower that default here now as well? Do you remember why I think it still makes sense to have
@@ -359,6 +360,10 @@ class AwsProvider extends Provider {
      await this.removeWorker({ worker, reason: 'terminateAfter time exceeded' });
    }
  }
  const { isZombie, reason } = Provider.isZombie({ worker });
naming is hard .. but I think `zombie` is a good analogy, same as zombie-processes
I think it is ok too, but we should probably document somewhere what zombie means.
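For the record, here is a minimal sketch of what such a check could look like, assuming the two conditions discussed in this thread (never claimed vs. stopped claiming); the actual `Provider.isZombie` implementation may differ, and the millisecond units are an assumption:

```js
class Provider {
  // Hedged sketch, not the actual implementation: a worker is a "zombie" when
  // its instance is still running but it shows no queue activity in time.
  static isZombie({ worker }) {
    // queueInactivityTimeout is the lifecycle parameter introduced in this PR;
    // 2h default, assumed here to be stored in milliseconds.
    const timeout = worker.providerData.queueInactivityTimeout || 2 * 60 * 60 * 1000;
    const now = Date.now();

    if (!worker.firstClaim) {
      // Case 1: the worker registered but never called queue.claimWork.
      if (now - worker.created.getTime() > timeout) {
        return { isZombie: true, reason: `worker never claimed work within ${timeout}ms` };
      }
    } else if (worker.lastDateActive && now - worker.lastDateActive.getTime() > timeout) {
      // Case 2: the worker claimed at some point but then stopped calling
      // claimWork/reclaimTask (e.g. the worker process died).
      return { isZombie: true, reason: `worker inactive since ${worker.lastDateActive.toJSON()}` };
    }
    return { isZombie: false, reason: null };
  }
}
```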
The 96h was originally hard-coded in the workers. We set the lifecycle defaults to match that so that enabling them didn't change behavior, with the idea that someone would follow up and reduce those timeouts :)
Thanks Yarik!
I like that you used the term "non-stopped-workers" to mean all workers not in the stopped state, since other terms like "active" would be more vague. So +1 for that! I think `claimTimeout`, though, is a bit confusing: when workers claim a task, they already get a `claimTimeout`, and it means something different (how long until the task is resolved as exception/claim-expired if the worker hasn't reclaimed it). Other than that, some clarity over `lastActiveDate` would be useful, although maybe that already exists elsewhere in documentation. Thanks!
Get non-stopped workers filtered by the optional arguments,
ordered by `worker_pool_id`, `worker_group`, and `worker_id`.
If the pagination arguments are both NULL, all rows are returned.
Otherwise, `page_size` rows are returned at offset `page_offset`.
The `quarantine_until` contains NULL or a date in the past if the
worker is not quarantined, otherwise the date until which it is
quarantined. `providers_filter_cond` and `providers_filter_value` are used to
filter workers by provider. `first_claim` and `last_date_active` contain
information known to the queue service about the worker.
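As a usage sketch only (the function name, argument order, and row shape here are assumptions for illustration, not the actual db version definition):

```js
// Hypothetical call: NULL filters and NULL pagination mean "all workers".
// Assumes an async context and a taskcluster-lib-postgres style `db` handle.
const rows = await db.fns.get_non_stopped_workers(
  null, null, null, // worker_pool_id, worker_group, worker_id filters
  null, null,       // providers_filter_cond, providers_filter_value
  null, null,       // page_size, page_offset
);
for (const row of rows) {
  // quarantine_until of NULL or a past date means "not quarantined";
  // last_date_active is what a zombie check would compare against.
  console.log(row.worker_id, row.quarantine_until, row.last_date_active);
}
```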
How is last date active defined? Last time any API request was made to the queue, or only a specific subset?
This looks like he's just using what I had added previously: https://github.com/taskcluster/taskcluster/pull/5365/files#diff-3dcdf963e193e0ce3968b21e9a477bf3cb166b7480385482d217b5d9773f3116R60-R66
Updated each time a worker calls `queue.claimWork`, `queue.reclaimTask`, and `queue.declareWorker` for this task queue.
Which I then used for implementing #5448.
Yes, this is handled by the queue: whenever a `claimWork` or `reclaimTask` call happens, the queue marks workers this way, so it knows they were doing something at that point in time.
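In other words (a conceptual sketch; `markWorkerSeen` and the db function name are made up for illustration, not the real queue code):

```js
// Conceptually, each of claimWork / reclaimTask / declareWorker does this
// before handling the rest of the request:
async function markWorkerSeen(db, { workerPoolId, workerGroup, workerId }) {
  // Hypothetical db function: bumps the worker's last_date_active to now().
  await db.fns.queue_worker_seen({ workerPoolId, workerGroup, workerId });
}
```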
Force-pushed from 07f9942 to 83abe9c
Nice, looking good! 🎉
working on a task. If the worker process dies, or it stops calling `claimWork`
or `reclaimTask`, it should be considered dead and terminated.

Minimum allowed value is 10 minutes, to give the worker a chance to start claiming tasks.
This worries me a little bit. Did you get this 10 min from anywhere, or are you just hoping that this'll be the case? I'd prefer to go a little more conservative here. Wdyt?
Yeah, this is totally debatable. I just wanted something to prevent a misconfiguration from leading to workers being killed before they even got a chance to do something.
I can imagine it takes time to start up and prepare. Also, this would include reboots; on Windows it probably takes longer.
@matt-boris do you suggest we increase this number or decrease it?
@petemoore from your past experience, do you know what the average max time between tasks should be? GC, reboot, etc.
@matt-boris do you suggest we increase this number or decrease?
I would think 30 min is a safer bet. This is quite a bit more conservative, which might be overkill, but it's hard to tell for sure.
I've been running some queries on the data we have, and noticed that reboots might take 5-15 min on average, with some outliers of up to 60+ minutes.
Maybe those pools are special, or maybe the data is just not very representative.
The data source is already an aggregation - total number of seconds spent on reboots and the number of instances - so I'm drawing my conclusions from reboots/instances to see what those numbers are.
Now I'm also wondering if 2h is a good default 🤔
Checking median values, there seem to be only a few pools that can go on 2h+ reboots.
ok, if I only query for the pools with 20+ instances, then the 75th percentile puts all reboots within 2h, so I think it should be safe
I think there's a difference between a minimum value in the platform and the configured value. If, for example, we had a provisioner that ran QEMU VMs on demand, 10 minutes might be far too long. It's probably short for a default value, but not for a minimum value.
It's worth wondering whether a host that takes 2+ hours to reboot is ever going to manage to run a build or test task in a reasonable amount of time. But that decision should be up to the deployers (and thus not encoded in the schema) IMHO.
Agree, I'll put something smaller.
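For reference, the floor being discussed could be expressed in the lifecycle JSON schema roughly like this (a hedged fragment; the exact schema layout and the seconds unit are assumptions):

```js
// Fragment of a hypothetical worker pool lifecycle schema.
queueInactivityTimeout: {
  type: 'integer',
  title: 'Queue Inactivity Timeout',
  description: 'Terminate a worker that shows no queue activity for this many seconds.',
  // The minimum guards against a misconfiguration killing workers before
  // they can claim anything; the value itself is what this thread debates.
  minimum: 600, // 10 minutes
},
```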
Force-pushed from 83abe9c to 36cf222
Some small, additional cleanup.
Force-pushed from 2c04cfe to dd7a9a3
LGTM thanks Yarik!! 🚀
Force-pushed from dd7a9a3 to 4f87748
This can happen when the docker/generic worker process dies and stops calling queue.claimWork/queue.reclaimTask. Such workers would either not be known to the queue, or would have their last_date_active way in the past. In such cases, killing the instance is the only option to free up resources.
Co-authored-by: Matt Boris <92693437+matt-boris@users.noreply.github.com>
Force-pushed from 4f87748 to dba510d
This can happen when the docker/generic worker process dies and stops calling `queue.claimWork`/`queue.reclaimTask`. Such workers would either not be known to the queue, or would have their `last_date_active` way in the past. In such cases, killing the instance is the only option to free up resources.

This PR introduces a `queueInactivityTimeout` parameter for the worker `lifecycle` (the default being `2h`). It works in the following scenarios:

- If a worker never called `queue.claimWork`, it would be terminated after `queueInactivityTimeout` (`firstClaim` is not set).
- If a worker stopped calling `queue.reclaimTask`, or died and stopped calling `claimWork`, it would be terminated after 2 hours (`lastDateActive` is older than `queueInactivityTimeout`).

Quarantined workers are not affected, since they keep calling `claimWork` but are being ignored there.

Fixes #6142