Provide worker counts and capacity by state for worker pools #4942

Closed
jwhitlock opened this issue Aug 4, 2021 · 8 comments

@jwhitlock
Contributor

jwhitlock commented Aug 4, 2021

A worker manager pool provides "Current Capacity", which is the sum of the related worker capacities when the state is not stopped. This is calculated in the database function get_worker_pool_with_capacity, so it should be reasonably efficient. A worker can handle a single task at a time (capacity=1) or multiple tasks.

stopped is one of four possible worker states:

  • requested - A new worker is being provisioned and is not yet running
  • running - A worker is ready to run tasks, and may be running tasks
  • stopping - A worker is being shut down and its related resources released
  • stopped - A worker has been shut down
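
As a minimal sketch of the "Current Capacity" definition above (this is not the actual get_worker_pool_with_capacity implementation), the following sums worker capacities for every worker whose state is not stopped. The dictionary field names `state` and `capacity` and the sample data are assumptions made for illustration.

```python
from typing import Iterable, Mapping

# The four worker states listed in this issue.
WORKER_STATES = ("requested", "running", "stopping", "stopped")


def current_capacity(workers: Iterable[Mapping]) -> int:
    """Sum the capacity of every worker whose state is not "stopped"."""
    return sum(w["capacity"] for w in workers if w["state"] != "stopped")


# Hypothetical sample pool; field names are assumptions, not the real schema.
sample_workers = [
    {"state": "requested", "capacity": 1},
    {"state": "running", "capacity": 4},
    {"state": "stopping", "capacity": 1},
    {"state": "stopped", "capacity": 4},
]

print(current_capacity(sample_workers))
```

The stopped worker's capacity of 4 is excluded, so only the requested, running, and stopping workers contribute.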

Operations is often interested in counts of workers by state. @davehouse uses the Taskcluster APIs to fetch all related workers for statically-created pools, counts them by state, and feeds the data to Grafana to visualize worker pools over time. M. Cornmesser is debugging issues with an Azure provisioner in bug 1723789, and a breakdown by worker state may help diagnose delays, backlogs of pending tasks, and other provisioning issues. Getting these statistics currently requires iterating over the workers in each pool, which transfers a great deal of data over many requests to get a few aggregated numbers.

This suggests new data returned by DB function get_worker_pool_with_capacity and related functions, or their replacements:

  • Worker count in requested state
  • Worker count in running state
  • Worker count in stopping state

These related stats could be included for completeness:

  • Worker count in stopped state
  • Total worker count
  • Worker capacity in requested state
  • Worker capacity in running state
  • Worker capacity in stopping state
  • Worker capacity in stopped state
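
The requested statistics could be sketched as a single pass over a pool's workers. This is only an illustration of the aggregation, not the proposed database function; the output key names (such as `runningCount` and `runningCapacity`) and the input field names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

STATES = ("requested", "running", "stopping", "stopped")


def pool_stats(workers: Iterable[Mapping]) -> Dict[str, int]:
    """Aggregate worker counts and capacities by state for one pool."""
    counts: Counter = Counter()
    capacities: Counter = Counter()
    for w in workers:
        counts[w["state"]] += 1
        capacities[w["state"]] += w["capacity"]

    stats: Dict[str, int] = {}
    for state in STATES:
        # Hypothetical key names; the real API fields may differ.
        stats[f"{state}Count"] = counts[state]
        stats[f"{state}Capacity"] = capacities[state]
    stats["totalCount"] = sum(counts[s] for s in STATES)
    # "Current Capacity" excludes stopped workers, as described in this issue.
    stats["currentCapacity"] = sum(capacities[s] for s in STATES if s != "stopped")
    return stats


# Hypothetical sample pool.
pool = [
    {"state": "running", "capacity": 4},
    {"state": "running", "capacity": 1},
    {"state": "requested", "capacity": 4},
    {"state": "stopped", "capacity": 4},
]
stats = pool_stats(pool)
print(stats)
```

Computing all of these in one aggregation is the point of the request: the caller receives a handful of numbers per pool instead of paging through every worker.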

The new data could be present in the worker manager API calls:

  • workerPool - Get details of a worker pool
  • listWorkerPools - Paginated list of all worker pools

The data could be displayed in the UI as part of the worker pool detail view (such as gecko-t/win10-64-2004).

The data could potentially be columns in the worker manager pool list view. The worker capacity in the running state seems to be the most useful data for this view, and could be labelled "Running Capacity" or similar.

@davehouse
Contributor

I think the most valuable metric is how many workers are actively running tasks (that is, not merely running and "ready", but currently running a task). A "ready" count is interesting for seeing whether workers are neglecting or failing to pull tasks from the queue, but it is not necessary. The "requested" count is interesting for seeing whether a pool is being scaled up, and "stopping" for seeing whether it is being scaled down.
The primary user for any metrics on pools is the sheriffs.

@Archaeopteryx What metrics does your team use or want for workers+pools?

@aerickson
Contributor

The plan sounds good.

Will the workerManager API be able to provide data on static pools (we still have quite a few of those)?

@Archaeopteryx

Sheriffs investigate delays in scheduled but not yet running tasks with the taskcluster worker pool list (hardware pool on different page) and the Grafana worker page.

  • worker pool not at max capacity? Check the provisioner's error messages.
  • worker pool at max capacity?
    • Are the workers running tasks right now?
      • Yes: check what they run and identify what causes the backlog (more tasks scheduled than usual, longer run time for tasks, ....)
      • No: check how long they have been inactive, if there is a pattern in the last tasks executed on them, escalate to RelOps/RelEng.
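
The checklist above can be read as a small decision procedure. The sketch below is only an illustration of that flow; the function name, its parameters, and the returned strings are hypothetical, and the inputs would have to come from the counts discussed in this issue.

```python
def triage_backlog(current_capacity: int, max_capacity: int, active_workers: int) -> str:
    """Suggest the next investigation step for a backlog of pending tasks."""
    if current_capacity < max_capacity:
        # The pool is not at max capacity, so look at provisioning first.
        return "check provisioner error messages"
    if active_workers > 0:
        # The pool is full and busy: the backlog comes from the workload itself.
        return "check what the workers run and identify the cause of the backlog"
    # The pool is full but idle: the workers themselves are suspect.
    return "check idle time and last tasks, then escalate to RelOps/RelEng"


print(triage_backlog(current_capacity=10, max_capacity=50, active_workers=10))
print(triage_backlog(current_capacity=50, max_capacity=50, active_workers=0))
```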

Having the counts as additional columns in the taskcluster worker pool list will provide a faster picture of the situation.

@matt-boris
Contributor

@Archaeopteryx which of the new counts/capacity metrics would you like to see displayed in that worker pool list as additional columns?

The initial issue mentions adding running capacity as another column. Would love your thoughts here as well, knowing we'll have worker counts in {requested, running, stopping, stopped} states as well as worker capacities in those same states.

I want to also be sure that list view doesn't get cluttered with too many columns. I will definitely put all the data in the worker pool detail view as additional chips at the top of the page.

@davehouse
Contributor

Can we add a state for actively running a task? Maybe I'm misunderstanding--My impression is that the "running" state in the PR is that the worker is "running" but it is unknown if it is idle or active. That active/idle state (and count) is something I think we're often looking for. We currently count this by iterating over workers in a pool and checking the last task state.

@Archaeopteryx

Code sheriffs look at how many workers are in the pool, then check whether they are active (= running a task). A column that displays this count is the most interesting feature for us, because the debugging of workers stuck in other states is done by other teams (which are interested in those other states; @davehouse already provided very good insight).

The 2 new columns we are interested in:

  • running (= workers in the running state; as far as I have tracked the discussion, currently only the count of all workers in the pool, independent of state, is shown)
  • active (= task is being executed on worker)

@djmitche
Collaborator

djmitche commented Apr 6, 2022

There were plans to create a kind of "unified" worker view that contains both data from the queue (last time it tried to (re)claim a task, number of current tasks, last task) and data from the worker manager (worker state, last time it renewed its credentials). Together, these can provide quite a bit of information about a specific worker (for example, if a worker is renewing with worker-manager but not claiming tasks and not running any tasks, that leads to a particular set of possibilities).

However, this issue is about worker pools, which is a bit different because it does not correlate queue- and worker-manager-related data about a single worker. I think the number @davehouse is requesting is actually a purely queue-related question: how many tasks are running in this particular task queue?

@davehouse
Contributor

+1 there is an overlap, but "how many workers are active?" and "how many tasks are currently being executed?" aren't always the same.
So these could be separated:

  • a count of active tasks for a queue
  • a count of workers that are active

We (RelOps) also monitor which workers are active, but that may not be needed for sheriffs.
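
A small sketch of why the two numbers can differ: a worker with capacity greater than 1 may run several tasks at once, so the active-task count can exceed the active-worker count. The `running_tasks` field and the sample data are hypothetical, chosen only to illustrate the distinction.

```python
from typing import Iterable, Mapping


def active_worker_count(workers: Iterable[Mapping]) -> int:
    """Count workers that are executing at least one task."""
    return sum(1 for w in workers if w["running_tasks"] > 0)


def active_task_count(workers: Iterable[Mapping]) -> int:
    """Count tasks currently being executed across all workers."""
    return sum(w["running_tasks"] for w in workers)


# Hypothetical pool: w1 runs three tasks at once, w2 is idle, w3 runs one task.
pool_workers = [
    {"workerId": "w1", "capacity": 4, "running_tasks": 3},
    {"workerId": "w2", "capacity": 4, "running_tasks": 0},
    {"workerId": "w3", "capacity": 1, "running_tasks": 1},
]

print(active_worker_count(pool_workers), active_task_count(pool_workers))
```

Here two workers are active while four tasks are running, which is why the two counts would need to be reported separately.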

6 participants