Provide worker counts and capacity by state for worker pools #4942
Comments
I think the most valuable metric is how many workers are actively running tasks (that is, currently running a task, not merely "ready"). The "ready" count is interesting for seeing whether workers are neglecting or failing to pull tasks from the queue, but it is not necessary. The "requested" count is interesting for seeing whether a pool is being scaled up, and "stopping" for seeing whether it is being scaled down. @Archaeopteryx What metrics does your team use or want for workers and pools?
The plan sounds good. Will the workerManager API be able to provide data on static pools (we still have quite a few of those)?
Sheriffs investigate delays in scheduled but not yet running tasks with the Taskcluster worker pool list (hardware pools are on a different page) and the Grafana worker page.
Having the counts as additional columns in the Taskcluster worker pool list will provide a faster picture of the situation.
@Archaeopteryx which of the new counts/capacity metrics would you like to see displayed in that worker pool list as additional columns? The initial issue mentions adding running capacity as another column. Would love your thoughts here as well, knowing we'll have worker counts in {requested, running, stopping, stopped} states as well as worker capacities in those same states. I want to also be sure that list view doesn't get cluttered with too many columns. I will definitely put all the data in the worker pool detail view as additional chips at the top of the page.
Can we add a state for actively running a task? Maybe I'm misunderstanding, but my impression is that the "running" state in the PR means the worker is "running" while it is unknown whether it is idle or active. That active/idle state (and count) is something I think we're often looking for. We currently count this by iterating over workers in a pool and checking the last task state.
Code sheriffs look at how many workers are in the pool, then check whether they are active (= running a task). Having a column which displays this count is the most interesting feature for us, because the debugging of workers stuck in other states is done by other teams (which are interested in those other states; @davehouse already provided very good insight). The 2 new columns we are interested in:
There were plans to create a kind of "unified" worker view that contains both data from the queue (last time it tried to (re)claim a task, number of current tasks, last task) and data from the worker manager (worker state, last time it renewed its credentials). Together, these can provide quite a bit of information about a specific worker (for example, if a worker is renewing with worker-manager but not claiming tasks and not running any tasks, that leads to a particular set of possibilities). However, this issue is about worker pools, which is a bit different because it does not correlate queue- and worker-manager-related data about a single worker. I think the number @davehouse is requesting is actually a purely queue-related question: how many tasks are running in this particular task queue?
+1, there is an overlap, but "how many workers are active?" and "how many tasks are currently being executed?" aren't always the same. We (relops) also monitor which workers are active, but that may not be needed for sheriffs.
A worker manager pool provides "Current Capacity", which is the sum of the related worker capacities when the state is not `stopped`. This is calculated in the database function `get_worker_pool_with_capacity`, so it should be reasonably efficient. A worker can handle a single task at a time (`capacity=1`) or multiple tasks.

`stopped` is one of four possible worker states:
- `requested` - A new worker is being provisioned, is not yet running
- `running` - A worker is ready to run tasks, and may be running tasks
- `stopping` - A worker is being shut down and related resources released
- `stopped` - A worker has been shut down

Operations is often interested in counts of workers by state. @davehouse uses the Taskcluster APIs to fetch all related workers for statically-created pools, counts by state, and feeds the data to Grafana to visualize worker pools over time. M. Cornmesser is debugging issues with an Azure provisioner in bug 1723789, and a breakdown by worker state may help diagnose delays, backlogs of pending tasks, and other provisioning issues. Getting these statistics requires iterating over the workers by pool, which transfers a great deal of data over many requests to get a few aggregated numbers.
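As a rough illustration of the client-side aggregation described above, here is a minimal Python sketch. It assumes each worker record has already been fetched and is a dict with `state` and `capacity` keys; the function name `summarize_workers` and the shape of the result are hypothetical, not part of any Taskcluster API. Fetching and pagination are deliberately left out.

```python
# Hypothetical helper: aggregates already-fetched worker records by state.
# The record shape ({"state": ..., "capacity": ...}) is an assumption.
STATES = ("requested", "running", "stopping", "stopped")


def summarize_workers(workers):
    """Return worker counts and capacities per state, plus current capacity."""
    counts = {state: 0 for state in STATES}
    capacities = {state: 0 for state in STATES}
    for worker in workers:
        state = worker["state"]
        counts[state] += 1
        # Assume a worker without an explicit capacity handles one task.
        capacities[state] += worker.get("capacity", 1)
    # "Current Capacity": sum of capacities for workers that are not stopped.
    current = sum(cap for state, cap in capacities.items() if state != "stopped")
    return {"counts": counts, "capacities": capacities, "currentCapacity": current}


# Illustrative data only.
sample = [
    {"workerId": "w1", "state": "requested", "capacity": 1},
    {"workerId": "w2", "state": "running", "capacity": 4},
    {"workerId": "w3", "state": "running", "capacity": 1},
    {"workerId": "w4", "state": "stopped", "capacity": 4},
]
summary = summarize_workers(sample)
```

The point of the issue is that this loop has to see every worker record to produce a handful of numbers, which is why doing the aggregation in the database looks attractive.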
This suggests new data returned by the DB function `get_worker_pool_with_capacity` and related functions, or their replacements:
- `requested` state
- `running` state
- `stopping` state

These related stats could be included for completeness:
- `stopped` state
- `requested` state
- `running` state
- `stopping` state
- `stopped` state

The new data could be present in the worker manager API calls:
- `workerPool` - Get details of a worker pool
- `listWorkerPools` - Paginated list of all worker pools

The data could be displayed in the UI as part of the worker pool detail view (such as gecko-t/win10-64-2004).

The data could potentially be columns in the worker manager pool list view. The worker capacity in the `running` state seems to be the most useful data for this view, and could be labelled "Running Capacity" or similar.
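To make the list-view idea concrete, here is a small Python sketch that renders a pool list with a "Running Capacity" column next to the existing "Current Capacity". The field names `workerPoolId`, `currentCapacity`, and `runningCapacity` are assumptions about what an extended response might contain, and the numbers are illustrative only.

```python
# Hypothetical column layout for the worker pool list view.
# Each entry is (column header, assumed field name in the pool record).
COLUMNS = (
    ("Worker Pool", "workerPoolId"),
    ("Current Capacity", "currentCapacity"),
    ("Running Capacity", "runningCapacity"),
)


def render_pool_list(pools):
    """Render pool records as a plain-text table, one pool per row."""
    rows = [[header for header, _ in COLUMNS]]
    for pool in pools:
        # Show "n/a" when a pool record lacks one of the assumed fields.
        rows.append([str(pool.get(key, "n/a")) for _, key in COLUMNS])
    widths = [max(len(row[i]) for row in rows) for i in range(len(COLUMNS))]
    lines = []
    for row in rows:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Illustrative data only.
table = render_pool_list([
    {"workerPoolId": "gecko-t/win10-64-2004", "currentCapacity": 12, "runningCapacity": 9},
])
```

Keeping the list view to a single extra column like this leaves the full per-state breakdown for the detail view.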