[RFC 160] Process monitoring #160
Conversation
> Querying system-wide cpu and memory information isn't as useful, given that an instance can have multiple tasks running. So where possible, recording per-process stats is the goal. `psutil` and equivalents often have helper functions to allow quick discovery of all the process's children, and most platforms allow collection of per-process cpu times and memory usage.
>
> Disk and Network IO are valuable metrics to collect, with the caveat that only Linux has good support for these. Even when querying per-process on Linux, the data is system-wide, so we should be careful how that is presented, and not sum the data points along with the cpu times.
You can get some per-process io metrics on Linux. See for example `pidstat -d`. I think this is sourced from `/proc/<pid>/io`.
I suspect there are similar mechanisms for OS X and Windows, since I can see Activity Monitor and Task Manager reporting per-process io.
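For reference, a minimal sketch of reading those counters directly from `/proc/<pid>/io` (Linux-only; the helper names here are illustrative, not part of any worker code):

```python
# Sketch: read Linux per-process IO counters from /proc/<pid>/io,
# the same source `pidstat -d` appears to draw from.

def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters

def read_proc_io(pid):
    """Read IO counters for a live process (needs read access to its io file)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())
```

Note that `rchar`/`wchar` count all read/write syscall traffic, while `read_bytes`/`write_bytes` reflect actual storage IO, so the two pairs answer different questions.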
Apologies, my initial testing showed that network wasn't per-process and I mixed that up in this RFC; I can correct that. `psutil` equivalents don't seem able to report on network or disk io for Mac, but we can write other components.
> 2. Every 1 second, a sample will be taken:
>    a. Determine the number of processes being run by a task's payload
>    b. Sum CPU and memory usage for each of those processes
>    c. Record system-wide available memory at this timestamp
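The per-sample step quoted above could be sketched with `psutil` roughly as follows (a minimal illustration only; function names like `sample_task` are mine, not any worker API):

```python
# Sketch of one 1-second sample: walk the payload's process tree,
# sum CPU and memory, and record system-wide available memory.
import time

def aggregate(per_process):
    """Sum per-process (cpu_seconds, rss_bytes) pairs into one sample."""
    per_process = list(per_process)
    return {
        "processes": len(per_process),
        "cpu_seconds": sum(cpu for cpu, _ in per_process),
        "rss_bytes": sum(rss for _, rss in per_process),
    }

def sample_task(root_pid):
    """Take one sample rooted at the payload's top process."""
    import psutil  # imported here so aggregate() stays usable without psutil
    root = psutil.Process(root_pid)
    pairs = []
    for p in [root] + root.children(recursive=True):
        try:
            t = p.cpu_times()
            pairs.append((t.user + t.system, p.memory_info().rss))
        except psutil.NoSuchProcess:
            pass  # child exited between discovery and query
    sample = aggregate(pairs)
    sample["timestamp"] = int(time.time())
    sample["available_bytes"] = psutil.virtual_memory().available
    return sample
```

Keeping the summation (`aggregate`) separate from the collection makes the record format easy to test without a live process tree.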
Note that any system-wide data you get will contain data from other tasks if there are multiple tasks running on a given worker.
Indeed, that's why I want to collect cpu and memory usage for the process tree, rather than the system-wide stats the current mozharness Mixin collects, and only collect system-wide data if that's the only option. Available memory is somewhat different, as it lets us estimate memory pressure at the time.
OTOH, it might be useful to also have system-wide data for CPU and memory usage so as to know whether the pressure is because there are multiple tasks using too many resources at the same time.
I've just double checked the timing information we're already collecting, and it doesn't store the instance ID, so we could either record just that and query for overlaps later on, or record extra samples at this stage. I'd lean towards the former, but could be convinced otherwise.
Would srfraser from 6 months from now remember to correlate with other tasks on the same instance? How about other people?
I was imagining it would be part of the report, rather than a manual process each time. For our own use we'd want a pipeline for this that could be left alone and alert us when something looks like it could be acted on.
In any case, we're recording the total system memory; if we have available memory and "memory I am using", then it's clear what's in use by other things, so it's a balance between collecting and storing extra data and how easy it is to work with. CPU times obviously don't work the same way, so it's a question of whether comparing system-wide cpu times to process-tree ones would show us anything actionable.
We may want to have a worker setting to disable this on hardware performance test pools.
It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.
I couldn't think of a straightforward way to add this without some awkward IPC going on, although I'm open to suggestions. As far as I'm aware the payload itself is always in a state where it could be run outside a task, and so its external communication is fairly generic: artifacts, logging, any necessary downloads. We do already process the live log and produce timing information for each phase, so it's possible to link these up afterwards. For example, for AUmp7fbpRmWB_Rniw0s5eA we have component and subcomponents discovered: component, subComponent
It's certainly an option. The existing mixin-based resource monitoring is already running for these tasks as it doesn't differentiate, but that doesn't mean it's always a good thing.
I like this proposal! A few big-picture thoughts to add:
- I agree that trying to include specific moments in the task execution in this event stream is difficult, and this probably isn't the right place to do it. The successful approach for this kind of measurement seems to be to record the data as simply as possible and then do all of the correlations and analysis after the fact. So phase correlation could be done afterward by something that reads both the task log and the monitoring time-series and extrapolates phases from the task log.
- @imbstack raised the concern with baking this functionality into workers that we'll end up doing a lot of iteration on workers to add features that different users need. For example, a game company might want to gather stats on GPU usage or metrics from a device emulator. So, I think that building a nice interface between the thing gathering the metrics and the worker is a good idea. A process boundary seems like a good choice, too. Then we can create an executable that does the commonly-useful stuff you've highlighted here, with the possibility of users substituting other binaries that meet the same interface.
- Then the path to the executable can be in the worker config, and any additional configuration for it (how often to sample, etc.) can be in the task payload.
- With that arrangement, this RFC need not cover the details of what is gathered (which will probably take some time to agree on) but can focus on the interface with the data-gathering process and how its results are made available to users (which is probably easier to agree on).
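To make that split concrete, a purely hypothetical sketch of what it could look like; none of these key names, paths, or defaults are agreed anywhere in this thread:

```yaml
# Hypothetical worker config: which monitoring binary to run.
monitoring:
  executable: /usr/local/bin/resource-monitor

# Hypothetical task payload fragment: per-task tuning.
payload:
  resourceMonitoring:
    enabled: true
    sampleIntervalSeconds: 5
```

The worker would own the binary's location and lifecycle, while tasks only tune behaviour within whatever limits the worker allows.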
> Outline:
>
> 1. Add code to the existing generic-worker and docker-worker
> 2. Every 1 second, a sample will be taken:
Firefox-CI runs about 2.5 million hours of compute per month, and if you figure 0.5k per sample, that's 4GB per week (approximate figures obviously). So the data storage for this is surprisingly small. Even so, I agree that this should be something that can be configured, probably both on/off and sampling frequency. Maybe it could be configured both at the worker-config level and at the task level?
Are you sure about that math? Unless I'm counting wrong, 2.5 million hours is 9 billion seconds. At 0.5 kilobytes of storage per second, that's 4.5 terabytes per month.
At these data volumes it might be preferable to summarize the samples client-side and store that in bigquery rather than the raw data.
Haha, I double-checked that and still got it wrong. You are correct, it is 4 billion kilobytes :)
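For posterity, the corrected arithmetic, using the thread's own figures of 2.5 million compute-hours per month and 0.5 kB per one-second sample:

```python
# Storage estimate: 2.5 million compute-hours/month, one 0.5 kB sample per second.
hours_per_month = 2.5e6
seconds_per_month = hours_per_month * 3600       # 9.0e9 sampled seconds
kilobytes_per_month = seconds_per_month * 0.5    # 4.5e9 kB
terabytes_per_month = kilobytes_per_month / 1e9  # 4.5 TB/month, not 4 GB/week
```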
> Cons:
>
> 1. Each worker will depend on a separate binary being deployed.
This is already the case -- we package things like tc-proxy and livelog for use from both docker-worker and generic-worker. So perhaps this con isn't as bad as it seems. However, communication and management of those processes has been a source of a lot of bugs and friction.
Also, monitoring docker-worker tasks is going to be a very different exercise than monitoring generic-worker tasks. So it might make sense to use different utilities in those two cases, even if they are both in Go.
```
Schema({
    Required('start'): int,  # Epoch time of earliest sample
    Required('end'): int,    # Epoch time of last sample
    # …
})
```
A top-level `version` field would be good, too.
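For illustration, that suggestion applied to the schema sketch quoted earlier might look like this (a voluptuous-style fragment; imports and the remaining fields are omitted as in the original, and the comment wording is mine):

```
Schema({
    Required('version'): int,  # Schema version, so consumers can handle format changes
    Required('start'): int,    # Epoch time of earliest sample
    Required('end'): int,      # Epoch time of last sample
    # …
})
```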
From my understanding so far:
What have I missed?
I really doubt we need to store 1s resolution resource data. Probably every 15s or 30s is sufficient for most of our use cases.
Simon, Mihai and I met today, and on reflection we think there may be more flexibility afforded and fewer barriers to delivery if indeed the metrics are collected as part of the run-task machinery. Simon had the great idea of writing the tool in go and shipping as a standalone statically linked executable per platform. This could be mounted in place by run-task like it mounts toolchains, and the go source code could live in tree. If that source code changes, it would automatically be rebuilt and run-task would automatically get the latest built version of the tool. This offers some advantages:
My feeling is that the advantages of making it part of the platform would apply if the custom behaviour was highly tied to the cloud environment that the task runs in (e.g. different behaviour required for gcp/aws/azure workers etc), or if the metrics collection needed to run with higher privileges than the task, or if the tool would be difficult to share between projects but was generic enough that we'd want to have it running everywhere, without a lot of customisation, indefinitely, with no anticipated change to the metrics collection over time. I don't think any of these conditions apply. So my vote is that we do this in-task, set up the go tool as a toolchain that gets built by the existing toolchain building mechanics we have in firefox, put all the code in mozilla-central, and make it transparent and self-serve for devs.
There's a chicken and egg problem with that approach though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case. The other problem is that docker-worker doesn't support mounts, so any change to the go program would need to be baked into docker images, alongside run-task... which is another chicken and egg problem because the docker image that is used to build toolchains itself will need to have it baked in. Well, I guess we could build the go program with an external docker image...
Hm, a circular dependency on the monitoring, but yes, as you suggest it may not have a huge resource impact.
But then you don't do resource monitoring for fetches and everything that precedes it. (I think we clone before fetches.)
Closing as stale, but still here for reference / revival.