[RFC 160] Process monitoring #160
Conversation
> Querying system-wide cpu and memory information isn't as useful, given that an instance can have multiple tasks running. So where possible, recording per-process stats is the goal. `psutil` and equivalents often have helper functions to allow quick discovery of all the process's children, and most platforms allow collection of per-process cpu times and memory usage.
>
> Disk and Network IO are valuable metrics to collect, with the caveat that only Linux has good support for these. Even when querying per-process on Linux, the data is system-wide, so we should be careful how that is presented, and not sum the data points along with the cpu times.
You can get some per-process io metrics on Linux. See for example `pidstat -d`. I think this is sourced from `/proc/<pid>/io`.
I suspect there are similar mechanisms for OS X and Windows, since I can see Activity Monitor and Task Manager reporting per-process io.
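For reference, a minimal sketch of reading those counters directly from `/proc/<pid>/io` (Linux-only; the helper names here are illustrative, not part of any worker code):

```python
# Sketch: read Linux per-process IO counters from /proc/<pid>/io,
# the same source `pidstat -d` appears to draw from.

def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters

def read_proc_io(pid):
    """Read IO counters for a live process (needs read access to its io file)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())
```

Note that `rchar`/`wchar` count all read/write syscall traffic, while `read_bytes`/`write_bytes` reflect actual storage IO, so the two pairs answer different questions.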
Apologies, my initial testing showed that network wasn't per-process and I mixed that up in this RFC; I can correct that. `psutil` equivalents don't seem able to report on network or disk io for Mac, but we can write other components.
> 2. Every 1 second, a sample will be taken:
>    a. Determine the number of processes being run by a task's payload
>    b. Sum CPU and memory usage for each of those processes
>    c. Record system-wide available memory at this timestamp
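The per-sample step quoted above could be sketched with `psutil` roughly as follows (a minimal illustration only; function names like `sample_task` are mine, not any worker API):

```python
# Sketch of one 1-second sample: walk the payload's process tree,
# sum CPU and memory, and record system-wide available memory.
import time

def aggregate(per_process):
    """Sum per-process (cpu_seconds, rss_bytes) pairs into one sample."""
    per_process = list(per_process)
    return {
        "processes": len(per_process),
        "cpu_seconds": sum(cpu for cpu, _ in per_process),
        "rss_bytes": sum(rss for _, rss in per_process),
    }

def sample_task(root_pid):
    """Take one sample rooted at the payload's top process."""
    import psutil  # imported here so aggregate() stays usable without psutil
    root = psutil.Process(root_pid)
    pairs = []
    for p in [root] + root.children(recursive=True):
        try:
            t = p.cpu_times()
            pairs.append((t.user + t.system, p.memory_info().rss))
        except psutil.NoSuchProcess:
            pass  # child exited between discovery and query
    sample = aggregate(pairs)
    sample["timestamp"] = int(time.time())
    sample["available_bytes"] = psutil.virtual_memory().available
    return sample
```

Keeping the summation (`aggregate`) separate from the collection makes the record format easy to test without a live process tree.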
Note that any system-wide data you get will contain data from other tasks if there are multiple tasks running on a given worker.
Indeed, that's why I want to collect cpu and memory usage for the process tree, rather than the system-wide stats the current mozharness Mixin collects, and only collect system-wide data if that's the only option. Available memory is somewhat different, as it lets us estimate memory pressure at the time.
OTOH, it might be useful to also have system-wide data for CPU and memory usage so as to know whether the pressure is because there are multiple tasks using too many resources at the same time.
I've just double checked the timing information we're already collecting, and it doesn't store the instance ID, so we could either record just that and query for overlaps later on, or record extra samples at this stage. I'd lean towards the former, but could be convinced otherwise.
Would srfraser from 6 months from now remember to correlate with other tasks on the same instance? How about other people?
I was imagining it would be part of the report, rather than a manual process each time. For our own use we'd want a pipeline for this that could be left alone and alert us when something looks like it could be acted on.
In any case, we're recording the total system memory; if we have available memory and "memory I am using", then it's clear what's in use by other things, so it's a balance between collecting and storing extra data and how easy it is to work with. CPU times obviously don't work the same way, so it's a question of whether comparing system-wide cpu times to process-tree ones would show us anything actionable.
We may want to have a worker setting to disable this on hardware performance test pools.
It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.
I couldn't think of a straightforward way to add this without some awkward IPC going on, although I'm open to suggestions. As far as I'm aware the payload itself is always in a state where it could be run outside a task, and so its external communication is fairly generic: artifacts, logging, any necessary downloads. We do already process the live log and produce timing information for each phase, so it's possible to link these up afterwards. For example, for AUmp7fbpRmWB_Rniw0s5eA we have component and subcomponents discovered: component, subComponent
It's certainly an option. The existing mixin-based resource monitoring is already running for these tasks as it doesn't differentiate, but that doesn't mean it's always a good thing.
I like this proposal! A few big-picture thoughts to add:
- I agree that trying to include specific moments in the task execution in this event stream is difficult, and this probably isn't the right place to do it. The successful approach for this kind of measurement seems to be to record the data as simply as possible and then do all of the correlations and analysis after the fact. So phase correlation could be done afterward by something that reads both the task log and the monitoring time-series and extrapolates phases from the task log.
- @imbstack raised the concern with baking this functionality into workers that we'll end up doing a lot of iteration on workers to add features that different users need. For example, a game company might want to gather stats on GPU usage or metrics from a device emulator. So, I think that building a nice interface between the thing gathering the metrics and the worker is a good idea. A process boundary seems like a good choice, too. Then we can create an executable that does the commonly-useful stuff you've highlighted here, with the possibility of users substituting other binaries that meet the same interface.
- Then the path to the executable can be in the worker config, and any additional configuration for it (how often to sample, etc.) can be in the task payload.
- With that arrangement, this RFC need not cover the details of what is gathered (which will probably take some time to agree on) but can focus on the interface with the data-gathering process and how its results are made available to users (which is probably easier to agree on).
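To make that split concrete, a purely hypothetical sketch of what it could look like; none of these key names, paths, or defaults are agreed anywhere in this thread:

```yaml
# Hypothetical worker config: which monitoring binary to run.
monitoring:
  executable: /usr/local/bin/resource-monitor

# Hypothetical task payload fragment: per-task tuning.
payload:
  resourceMonitoring:
    enabled: true
    sampleIntervalSeconds: 5
```

The worker would own the binary's location and lifecycle, while tasks only tune behaviour within whatever limits the worker allows.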
> Outline:
>
> 1. Add code to the existing generic-worker and docker-worker
> 2. Every 1 second, a sample will be taken:
Firefox-CI runs about 2.5 million hours of compute per month, and if you figure 0.5k per sample, that's 4GB per week (approximate figures obviously). So the data storage for this is surprisingly small. Even so, I agree that this should be something that can be configured, probably both on/off and sampling frequency. Maybe it could be configured both at the worker-config level and at the task level?
Are you sure about that math? Unless I'm counting wrong, 2.5 million hours is 9 billion seconds. At 0.5 kilobytes of storage per second, that's 4.5 terabytes per month.
At these data volumes it might be preferable to summarize the samples client-side and store that in bigquery rather than the raw data.
Haha, I double-checked that and still got it wrong. You are correct, it is 4 billion kilobytes :)
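For posterity, the corrected arithmetic, using the thread's own figures of 2.5 million compute-hours per month and 0.5 kB per one-second sample:

```python
# Storage estimate: 2.5 million compute-hours/month, one 0.5 kB sample per second.
hours_per_month = 2.5e6
seconds_per_month = hours_per_month * 3600       # 9.0e9 sampled seconds
kilobytes_per_month = seconds_per_month * 0.5    # 4.5e9 kB
terabytes_per_month = kilobytes_per_month / 1e9  # 4.5 TB/month, not 4 GB/week
```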
> Cons:
>
> 1. Each worker will depend on a separate binary being deployed.
This is already the case -- we package things like tc-proxy and livelog for use from both docker-worker and generic-worker. So perhaps this con isn't as bad as it seems. However, communication and management of those processes has been a source of a lot of bugs and friction.
Also, monitoring docker-worker tasks is going to be a very different exercise than monitoring generic-worker tasks. So it might make sense to use different utilities in those two cases, even if they are both in Go.
```
Schema({
    Required('start'): int,  # Epoch time of earliest sample
    Required('end'): int,    # Epoch time of last sample
    # …
})
```
A top-level `version` field would be good, too.
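For illustration, that suggestion applied to the schema sketch quoted earlier might look like this (a voluptuous-style fragment; imports and the remaining fields are omitted as in the original, and the comment wording is mine):

```
Schema({
    Required('version'): int,  # Schema version, so consumers can handle format changes
    Required('start'): int,    # Epoch time of earliest sample
    Required('end'): int,      # Epoch time of last sample
    # …
})
```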
From my understanding so far:
What have I missed?
I really doubt we need to store 1s resolution resource data. Probably every 15s or 30s is sufficient for most of our use cases.
Simon, Mihai and I met today, and on reflection we think there may be more flexibility afforded and fewer barriers to delivery if indeed the metrics are collected as part of the run-task machinery. Simon had the great idea of writing the tool in go and shipping as a standalone statically linked executable per platform. This could be mounted in place by run-task like it mounts toolchains, and the go source code could live in tree. If that source code changes, it would automatically be rebuilt and run-task would automatically get the latest built version of the tool. This offers some advantages:
My feeling is that the advantages of making it part of the platform would apply if the custom behaviour was highly tied to the cloud environment that the task runs in (e.g. different behaviour required for gcp/aws/azure workers etc), or if the metrics collection needed to run with higher privileges than the task, or if the tool would be difficult to share between projects but was generic enough that we'd want to have it running everywhere, without a lot of customisation, indefinitely, with no anticipated change to the metrics collection over time. I don't think any of these conditions apply. So my vote is that we do this in-task, set up the go tool as a toolchain that gets built by the existing toolchain building mechanics we have in firefox, put all the code in mozilla-central, and make it transparent and self-serve for devs.
There's a chicken and egg problem with that approach though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case. The other problem is that docker-worker doesn't support mounts, so any change to the go program would need to be baked into docker images, alongside run-task... which is another chicken and egg problem because the docker image that is used to build toolchains itself will need to have it baked in. Well, I guess we could build the go program with an external docker image...
Hm, a circular dependency on the monitoring, but yes, as you suggest it may not have a huge resource impact.
But then you don't do resource monitoring for fetches and everything that precedes it. (I think we clone before fetches.)
Closing as stale, but still here for reference / revival.