Provide worker counts and capacity by state for worker pools #4942

jwhitlock · 2021-08-04T15:46:04Z

A worker manager pool provides "Current Capacity", which is the sum of the related worker capacities when the state is not stopped. This is calculated in the database function get_worker_pool_with_capacity, so it should be reasonably efficient. A worker can handle a single task at a time (capacity=1) or multiple tasks.

stopped is one of four possible worker states:

requested - A new worker is being provisioned and is not yet running
running - A worker is ready to run tasks, and may be running tasks
stopping - A worker is being shut down and related resources are being released
stopped - A worker has been shut down
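
As a rough illustration of how "Current Capacity" follows from these states, here is a minimal Python sketch. The `Worker` record and the `current_capacity` helper are hypothetical names used only for illustration; the real calculation lives in the database function get_worker_pool_with_capacity and may differ in detail.

```python
from dataclasses import dataclass


# Hypothetical in-memory record; the field names are assumptions.
@dataclass
class Worker:
    worker_id: str
    state: str     # one of: requested, running, stopping, stopped
    capacity: int  # number of tasks the worker can handle at once


def current_capacity(workers):
    """Sum the capacity of every worker whose state is not 'stopped'."""
    return sum(w.capacity for w in workers if w.state != "stopped")


workers = [
    Worker("w1", "requested", 1),
    Worker("w2", "running", 4),
    Worker("w3", "stopping", 1),
    Worker("w4", "stopped", 4),
]
print(current_capacity(workers))  # 6 (the stopped worker is excluded)
```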

Operations is often interested in counts of workers by state. @davehouse uses the Taskcluster APIs to fetch all related workers for statically-created pools, counts by state, and feeds the data to Grafana to visualize worker pools over time. M. Cornmesser is debugging issues with an Azure provisioner in bug 1723789, and a breakdown by worker state may help diagnose delays, backlogs of pending tasks, and other provisioning issues. Getting these statistics currently requires iterating over the workers in each pool, which transfers a great deal of data over many requests just to get a few aggregated numbers.

This suggests new data returned by DB function get_worker_pool_with_capacity and related functions, or their replacements:

Worker count in requested state
Worker count in running state
Worker count in stopping state

These related stats could be included for completeness:

Worker count in stopped state
Total worker count
Worker capacity in requested state
Worker capacity in running state
Worker capacity in stopping state
Worker capacity in stopped state
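
To make the proposed statistics concrete, here is a minimal Python sketch of the aggregation. It assumes each worker is a dict with `state` and `capacity` keys; that shape, the `summarize_pool` name, and the layout of the result are assumptions for illustration, not the actual return value of get_worker_pool_with_capacity.

```python
from collections import Counter

STATES = ("requested", "running", "stopping", "stopped")


def summarize_pool(workers):
    """Aggregate worker counts and capacities by state.

    `workers` is an iterable of dicts with 'state' and 'capacity' keys
    (an assumed shape, chosen for this sketch).
    """
    counts = Counter()
    capacities = Counter()
    for w in workers:
        counts[w["state"]] += 1
        capacities[w["state"]] += w["capacity"]
    return {
        "total_count": sum(counts.values()),
        # Report every known state, including those with zero workers.
        "count": {s: counts[s] for s in STATES},
        "capacity": {s: capacities[s] for s in STATES},
    }


pool = [
    {"state": "requested", "capacity": 1},
    {"state": "running", "capacity": 4},
    {"state": "running", "capacity": 4},
    {"state": "stopped", "capacity": 1},
]
summary = summarize_pool(pool)
print(summary["count"]["running"], summary["capacity"]["running"])  # 2 8
```

Doing this aggregation in the database, rather than in a client as above, is what avoids transferring every worker record to compute a handful of numbers.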

The new data could be present in the worker manager API calls:

workerPool - Get details of a worker pool
listWorkerPools - Paginated list of all worker pools

The data could be displayed in the UI as part of the worker pool detail view (such as gecko-t/win10-64-2004).

The data could potentially be columns in the worker manager pool list view. The worker capacity in the running state seems to be the most useful data for this view, and could be labelled "Running Capacity" or similar.


davehouse · 2022-04-04T20:27:49Z

I think the most valuable metric is how many workers are actively running tasks (that is, currently running a task, not merely running and "ready"). "ready" is interesting for seeing whether workers are neglecting or failing to pull tasks from the queue, but it is not necessary. "requested" is interesting for seeing whether a pool is being scaled up, and "stopping" for seeing whether it is being scaled down.
The primary users of any pool metrics are the sheriffs.

@Archaeopteryx What metrics does your team use or want for workers+pools?

aerickson · 2022-04-05T20:00:19Z

The plan sounds good.

Will the workerManager API be able to provide data on static pools (we still have quite a few of those)?

Archaeopteryx · 2022-04-06T09:14:53Z

Sheriffs investigate delays in scheduled but not yet running tasks with the taskcluster worker pool list (hardware pool on different page) and the Grafana worker page.

Worker pool not at max capacity? Check the provisioner's error messages.
Worker pool at max capacity?
- Are the workers running tasks right now?
  - Yes: check what they run and identify what causes the backlog (more tasks scheduled than usual, longer run times for tasks, ...)
  - No: check how long they have been inactive, if there is a pattern in the last tasks executed on them, escalate to RelOps/RelEng.

Having the counts as additional columns in the taskcluster worker pool list will provide a faster picture of the situation.
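
The triage steps above can be sketched as a small decision function. This is only an illustration: `triage_backlog` and its parameters are hypothetical, and treating "are the workers running tasks?" as "at least one worker is active" is a simplifying assumption.

```python
def triage_backlog(current_capacity, max_capacity, active_workers):
    """Return the next investigation step for a backlog of pending tasks.

    All three parameters are assumed inputs, e.g. read off the worker
    pool list; none of them is an existing API field name.
    """
    if current_capacity < max_capacity:
        # Pool is not full: provisioning itself may be failing.
        return "check provisioner error messages"
    if active_workers > 0:
        # Pool is full and busy: the backlog comes from the workload.
        return "check what the workers run and identify the cause of the backlog"
    # Pool is full but idle: the workers themselves look unhealthy.
    return "check inactivity and last-task patterns, escalate to RelOps/RelEng"


print(triage_backlog(current_capacity=10, max_capacity=50, active_workers=8))
```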

matt-boris · 2022-04-06T18:21:58Z

@Archaeopteryx which of the new counts/capacity metrics would you like to see displayed in that worker pool list as additional columns?

The initial issue mentions adding running capacity as another column. Would love your thoughts here as well, knowing we'll have worker counts in {requested, running, stopping, stopped} states as well as worker capacities in those same states.

I also want to be sure that the list view doesn't get cluttered with too many columns. I will definitely put all the data in the worker pool detail view as additional chips at the top of the page.

davehouse · 2022-04-06T18:36:43Z

Can we add a state for actively running a task? Maybe I'm misunderstanding, but my impression is that the "running" state in the PR means the worker is running, without saying whether it is idle or active. That active/idle state (and count) is something I think we're often looking for. We currently count this by iterating over the workers in a pool and checking the last task state.

Archaeopteryx · 2022-04-06T19:14:15Z

Code sheriffs look at how many workers are in the pool, then check whether they are active (= running a task). Having a column which displays this count is the most interesting feature for us, because the debugging of workers stuck in other states is done by other teams (which are interested in those other states; @davehouse already provided very good insight).

The 2 new columns we are interested in:

running (= worker in running state; currently only the count of all workers in the pool, independent of state, is shown, as far as I tracked the discussion)
active (= task is being executed on worker)

djmitche · 2022-04-06T22:49:58Z

There were plans to create a kind of "unified" worker view that contains both data from the queue (last time it tried to (re)claim a task, number of current tasks, last task) and data from the worker manager (worker state, last time it renewed its credentials). Together, these can provide quite a bit of information about a specific worker (for example, if a worker is renewing with worker-manager but not claiming tasks and not running any tasks, that leads to a particular set of possibilities).

However, this issue is about worker pools, which is a bit different because it does not correlate queue- and worker-manager-related data about a single worker. I think the number @davehouse is requesting is actually a purely queue-related question: how many tasks are running in this particular task queue?

davehouse · 2022-04-06T23:19:06Z

+1 there is an overlap, but "how many workers are active?" and "how many tasks are currently being executed?" aren't always the same.
So these could be separated:
counting active tasks for a queue
count of workers that are active

We (relops) also monitor which workers are active, but that may not be needed for sheriffs.
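
A small Python sketch of why the two numbers can differ: a worker with capacity greater than 1 may run several tasks at once, so one active worker can account for multiple running tasks. The input mapping and the `active_stats` name are hypothetical, chosen only for illustration.

```python
def active_stats(running_tasks_by_worker):
    """Return (active_workers, running_tasks).

    `running_tasks_by_worker` maps a worker id to the number of tasks it
    is currently running (an assumed input shape).
    """
    active_workers = sum(1 for n in running_tasks_by_worker.values() if n > 0)
    running_tasks = sum(running_tasks_by_worker.values())
    return active_workers, running_tasks


# One multi-capacity worker running 3 tasks, one idle worker, one busy worker.
print(active_stats({"w1": 3, "w2": 0, "w3": 1}))  # (2, 4)
```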

jwhitlock added the worker management label Aug 4, 2021

lotas added this to Backlog in Worker Manager Azure improvements Mar 30, 2022

lotas removed this from Backlog in Worker Manager Azure improvements Mar 30, 2022

lotas added this to To do in Worker Manager monitoring via automation Mar 30, 2022

matt-boris self-assigned this Apr 1, 2022

matt-boris moved this from To do to In progress in Worker Manager monitoring Apr 1, 2022

matt-boris mentioned this issue Apr 4, 2022

feat(db): provide worker counts and capacities by state for worker pools #5340

Merged

matt-boris mentioned this issue Apr 12, 2022

feat(ui): add worker capacity by state to ui #5349

Merged

matt-boris moved this from In progress to In review in Worker Manager monitoring Apr 12, 2022

matt-boris moved this from In review to Done in Worker Manager monitoring Apr 13, 2022

matt-boris closed this as completed Apr 26, 2022
