
[RFC 160] Process monitoring #160

Closed
wants to merge 4 commits

Conversation


@srfraser srfraser commented Jun 9, 2020

No description provided.

@srfraser srfraser changed the title Add process monitoring RFC Process monitoring RFC Jun 9, 2020
@srfraser srfraser changed the title Process monitoring RFC [RFC 160] Process monitoring Jun 9, 2020

Querying system-wide CPU and memory information isn't as useful, given that an instance can have multiple tasks running, so where possible the goal is to record per-process stats. `psutil` and equivalents often have helper functions to allow quick discovery of all of a process's children, and most platforms allow collection of per-process CPU times and memory usage.

Disk and network I/O are valuable metrics to collect, with the caveat that only Linux has good support for these. Even when querying per-process on Linux, the data is system-wide, so we should be careful how it is presented, and not sum those data points along with the CPU times.

You can get some per-process I/O metrics on Linux. See for example `pidstat -d`; I think this is sourced from /proc/<pid>/io.

I suspect there are similar mechanisms for OS X and Windows, since I can see Activity Monitor and Task Manager reporting per-process I/O.

@srfraser (Author)

Apologies, my initial testing showed that network wasn't per-process and I mixed that up in this RFC; I can correct that. psutil and its equivalents don't seem able to report network or disk I/O per process on macOS, but we can write other components.
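
For illustration, a minimal sketch of per-process I/O collection along the lines discussed above, assuming psutil: its `io_counters()` is backed by /proc/<pid>/io on Linux and is not implemented on macOS, so unsupported platforms are simply skipped here.

```
import psutil

def io_for_tree(root_pid):
    """Collect per-process disk I/O counters for a process and its children.

    Returns a dict of pid -> (read_bytes, write_bytes), skipping processes
    that exit mid-iteration or platforms without per-process I/O counters.
    """
    root = psutil.Process(root_pid)
    results = {}
    for proc in [root] + root.children(recursive=True):
        try:
            io = proc.io_counters()  # sourced from /proc/<pid>/io on Linux
            results[proc.pid] = (io.read_bytes, io.write_bytes)
        except (psutil.NoSuchProcess, psutil.AccessDenied, AttributeError):
            continue
    return results
```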

2. Every 1 second, a sample will be taken:
   a. Determine the number of processes being run by a task's payload
   b. Sum CPU and memory usage for each of those processes
   c. Record system-wide available memory at this timestamp

Note that any system-wide data you get will contain data from other tasks if there are multiple tasks running on a given worker.

@srfraser (Author) Jun 9, 2020

Indeed, that's why I want to collect CPU and memory usage for the process tree, rather than the system-wide stats the current mozharness mixin collects, and only collect system-wide data if that's the only option. Available memory is somewhat different, as it lets us estimate memory pressure at the time.

OTOH, it might be useful to also have system-wide data for CPU and memory usage, so as to know whether the pressure is because there are multiple tasks using too many resources at the same time.

@srfraser (Author)

I've just double-checked the timing information we're already collecting, and it doesn't store the instance ID, so we could either record just that and query for overlaps later on, or record extra samples at this stage. I'd lean towards the former, but could be convinced otherwise.

Would srfraser from 6 months from now remember to correlate with other tasks on the same instance? How about other people?

@srfraser (Author)

I was imagining it would be part of the report, rather than a manual process each time. For our own use we'd want a pipeline for this that could be left alone and alert us when something looks like it could be acted on.

In any case, we're recording the total system memory; if we have available memory and "memory I am using", then it's clear what's in use by other things, so it's a balance between collecting and storing extra data and how easy it is to work with. CPU times obviously don't work the same way, so it's a question of whether comparing system-wide CPU times to process-tree ones would show us anything actionable.
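
For illustration, a minimal sketch of the per-second sample described in the outline above (process count, summed CPU and memory for the payload's process tree, and system-wide available memory), assuming psutil; the returned field names are illustrative only.

```
import time

import psutil

def take_sample(payload_pid):
    """One sample: process-tree CPU/memory plus system-wide available memory."""
    root = psutil.Process(payload_pid)
    procs = [root] + root.children(recursive=True)
    cpu_seconds = 0.0
    rss_bytes = 0
    for proc in procs:
        try:
            cpu = proc.cpu_times()
            cpu_seconds += cpu.user + cpu.system
            rss_bytes += proc.memory_info().rss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return {
        "timestamp": int(time.time()),
        "process_count": len(procs),
        "cpu_seconds": cpu_seconds,  # cumulative CPU time, not a rate
        "rss_bytes": rss_bytes,
        "system_available_bytes": psutil.virtual_memory().available,
    }
```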

@escapewindow (Contributor)

We may want to have a worker setting to disable this on hardware performance test pools.

@glandium commented Jun 9, 2020

It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.

@srfraser (Author)

> It could be interesting to have a way for the task itself to give some form of markers. We have that for build resource usage, and it allows us to know what part of the build is being processed.

I couldn't think of a straightforward way to add this without some awkward IPC going on, although I'm open to suggestions. As far as I'm aware, the payload itself is always in a state where it could be run outside a task, and so its external communication is fairly generic: artifacts, logging, any necessary downloads.

We do already process the live log and produce timing information for each phase, so it's possible to link these up afterwards. For example, for task AUmp7fbpRmWB_Rniw0s5eA we have the following components and subcomponents discovered:

```
component,subComponent
task,total
taskcluster,setup
taskcluster,task
vcs,update
mozharness,get-secrets,step
mozharness,build,step
build_metrics,configure
build_metrics,pre-export
build_metrics,export
build_metrics,compile
build_metrics,misc
build_metrics,libs
build_metrics,tools
build_metrics,package-generated-sources
build_metrics,package
build_metrics,upload
taskcluster,teardown
```

@srfraser (Author)

> We may want to have a worker setting to disable this on hardware performance test pools.

It's certainly an option. The existing mixin-based resource monitoring is already running for these tasks as it doesn't differentiate, but that doesn't mean it's always a good thing.

@djmitche (Contributor) left a comment

I like this proposal! A few big-picture thoughts to add:

  • I agree that trying to include specific moments in the task execution in this event stream is difficult, and this probably isn't the right place to do it. The successful approach for this kind of measurement seems to be to record the data as simply as possible and then do all of the correlation and analysis after the fact. That correlation could then be done by something that reads both the task log and the monitoring time-series and extrapolates phases from the task log.

  • @imbstack raised the concern that, by baking this functionality into workers, we'll end up doing a lot of iteration on workers to add features that different users need. For example, a game company might want to gather stats on GPU usage or metrics from a device emulator. So, I think that building a nice interface between the thing gathering the metrics and the worker is a good idea. A process boundary seems like a good choice, too. Then we can create an executable that does the commonly-useful stuff you've highlighted here, with the possibility of users substituting other binaries that meet the same interface.

    • Then the path to the executable can be in the worker config, and any additional configuration for it (how often to sample, etc.) can be in the task payload (see the sketch after this list).

    • With that arrangement, this RFC need not cover the details of what is gathered (which will probably take some time to agree on) but can focus on the interface with the data-gathering process and how its results are made available to users (which is probably easier to agree on).
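
For illustration, a rough sketch of that process boundary from the worker's side: the worker launches a sampler executable whose path comes from worker config, passes options from the task payload, and reads newline-delimited JSON samples from its stdout. The binary path, flag, and output format here are hypothetical, not an existing interface.

```
import json
import subprocess

def run_sampler(sampler_path, payload_options):
    """Launch a hypothetical external sampler and collect its JSON samples.

    sampler_path would come from worker config; the sampling interval from
    the task payload. The flag and line-oriented JSON output are assumed.
    """
    interval = payload_options.get("interval", 1)
    proc = subprocess.Popen(
        [sampler_path, "--interval", str(interval)],
        stdout=subprocess.PIPE,
        text=True,
    )
    samples = [json.loads(line) for line in proc.stdout]  # one object per line
    proc.wait()
    return samples
```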

Outline:

1. Add code to the existing generic-worker and docker-worker
2. Every 1 second, a sample will be taken:

Firefox-CI runs about 2.5 million hours of compute per month, and if you figure 0.5k per sample, that's 4GB per week (approximate figures obviously). So the data storage for this is surprisingly small. Even so, I agree that this should be something that can be configured, probably both on/off and sampling frequency. Maybe it could be configured both at the worker-config level and at the task level?

@sciurus Jun 12, 2020

Are you sure about that math? Unless I'm counting wrong, 2.5 million hours is 9 billion seconds. At 0.5 kilobytes of storage per second, that's 4.5 terabytes per month.

At these data volumes it might be preferable to summarize the samples client-side and store that in BigQuery, rather than storing the raw data.

Haha, I double-checked that and still got it wrong. You are correct, it is 4 billion kilobytes :)
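
For reference, the corrected arithmetic, taking the figures above at face value (2.5 million compute-hours per month, one roughly 0.5 KB sample per second):

```
# Back-of-the-envelope storage estimate using the figures quoted above.
hours_per_month = 2.5e6
seconds_per_month = hours_per_month * 3600   # 9.0e9 seconds of compute
bytes_per_sample = 512                       # ~0.5 KB per 1-second sample
total_bytes = seconds_per_month * bytes_per_sample
print(f"{total_bytes / 1e12:.1f} TB per month")  # ≈ 4.6 TB per month
```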


Cons:

1. Each worker will depend on a separate binary being deployed.

This is already the case -- we package things like tc-proxy and livelog for use from both docker-worker and generic-worker. So perhaps this con isn't as bad as it seems. However, communication and management of those processes has been a source of a lot of bugs and friction.

Also, monitoring docker-worker tasks is going to be a very different exercise than monitoring generic-worker tasks. So it might make sense to use different utilities in those two cases, even if they are both in Go.

```
Schema({
    Required('start'): int,  # Epoch time of earliest sample
    Required('end'): int,    # Epoch time of last sample
```

A top-level version flag would be good, too.
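
For illustration, a minimal sketch of what that could look like, assuming the excerpt above uses voluptuous-style `Schema`/`Required`; the `version` field is the suggestion, and the other fields mirror the excerpt:

```
from voluptuous import Required, Schema

# Sketch: the excerpted fields plus a top-level format version, so the
# monitoring artifact format can evolve without breaking consumers.
monitoring_schema = Schema({
    Required('version'): int,  # format version of the monitoring artifact
    Required('start'): int,    # Epoch time of earliest sample
    Required('end'): int,      # Epoch time of last sample
})
```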

@srfraser (Author)

From my understanding so far:

  1. We're leaning towards a separate tool that the worker understands how to interact with, partly for cross-worker issues and partly for ease of third-party taskcluster deployments. This also answers 'how to turn it off on some worker types' since we just don't deploy the binary.
  2. Some output format changes to do with version and field types
  3. Some questions about what would be monitored differently in docker-worker need more discussion
  4. Data retention and storage costs need to be addressed; if the artifact is compressed, this is mitigated

What have I missed?

@catlee commented Jun 17, 2020

I really doubt we need to store 1s resolution resource data. Probably every 15s or 30s is sufficient for most of our use cases.

@petemoore (Member)

Simon, Mihai and I met today, and on reflection we think there may be more flexibility afforded and fewer barriers to delivery if indeed the metrics are collected as part of the run-task machinery.

Simon had the great idea of writing the tool in Go and shipping it as a standalone, statically linked executable per platform. This could be mounted in place by run-task, like it mounts toolchains, and the Go source code could live in-tree. If that source code changes, it would automatically be rebuilt, and run-task would automatically get the latest built version of the tool.

This offers some advantages:

  • changes to the profiling tool can be made with a try push
  • there is full transparency of the tool's existence and function in the task
  • there are no barriers to delivery: no deployment operations are required, i.e. no person needs to wait on the actions of another person to roll this out (except code reviews, of course, outside of try)
  • the tool can be easily enabled/disabled per platform/project/branch/task type/worker type etc from in-tree configs
  • the (go) tool can be shared by other projects

My feeling is that the advantages of making it part of the platform would apply if there was custom behaviour highly tied to the cloud environment that the task runs in (e.g. different behaviour required for gcp/aws/azure workers, etc.), or if the metrics collection needed to run with higher privileges than the task, or if the tool would be difficult to share between projects but was generic enough that we'd want to have it running everywhere, without a lot of customisation, indefinitely, with no anticipated change to the metrics collection over time. I don't think any of these conditions apply.

So my vote is that we do this in-task, set up the Go tool as a toolchain that gets built by the existing toolchain-building mechanics we have in Firefox, put all the code in mozilla-central, and make it transparent and self-serve for devs.

@glandium

There's a chicken-and-egg problem with that approach though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case. The other problem is that docker-worker doesn't support mounts, so any change to the Go program would need to be baked into docker images, alongside run-task... which is another chicken-and-egg problem, because the docker image that is used to build toolchains will itself need to have it baked in. Well, I guess we could build the Go program with an external docker image...
(If docker-worker could support mounts, that would be even better.)

@srfraser (Author)

> There's a chicken-and-egg problem with that approach though: toolchain tasks use run-task. I guess we could just say that we don't care about the stats in that case.

Hm, a circular dependency on the monitoring, but yes, as you suggest it may not have a huge resource impact.

> The other problem is that docker-worker doesn't support mounts, so any change to the Go program would need to be baked into docker images, alongside run-task... which is another chicken-and-egg problem, because the docker image that is used to build toolchains will itself need to have it baked in. Well, I guess we could build the Go program with an external docker image...
> (If docker-worker could support mounts, that would be even better.)

Using fetches ensures run-task just downloads the required artifacts using HTTPS, so that could be a reasonable alternative to mounts.

@glandium

But then you don't do resource monitoring for fetches and everything that precedes it. (I think we clone before fetches)

@petemoore petemoore removed their request for review July 7, 2020 10:23
@ccooper ccooper changed the base branch from master to main July 28, 2020 22:08
@djmitche (Contributor)

closing as stale, but still here for reference / revival

@djmitche djmitche closed this Apr 16, 2021