Cgroup support? #19
Comments
Hi @sargun - what kind of metrics would you like to see on a per-cgroup basis?
Some of the perf metrics would be very valuable to see per cgroup. For example, it'd be valuable to say that "all cgroups under dir X" should have the perf events cycles and instructions monitored.
Per-cgroup performance counters are quite expensive in practice. While it is possible to collect at this level, some careful thought would need to go into strategies to mitigate the performance overhead. I do feel this would fit into the project, but I don't have any plans to work on it soon.
Do you have an opinion on how they should be exported, or how the API / config should look?
I think for regular exposition, it'd be best to gather the metrics under paths like: Ideally, for Prometheus exposition, we would probably use Configuration gets a little tricky: we need an optional list of cgroups to instrument, and we need to indicate which samplers should collect per-cgroup telemetry. For instance, both More importantly, we need to consider the performance impact. Even a single hardware event, such as CPU instructions, instrumented with perf on a per-cgroup basis can impose a significant performance penalty on the running services. This becomes apparent when multiple cgroups are used for isolation. Ideally, Rezolus should not cause measurable impact to running services. We want to keep the overhead low both within the project and in terms of impact across the system. A strategy needs to be developed to mitigate the performance penalty. Based on prior experiments with this, I don't believe this work is trivial. This might not make a good first issue.
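As a rough illustration of the configuration shape described above (an optional list of cgroups to instrument, plus a per-sampler opt-in), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `CgroupConfig`, `roots`, `samplers`, and `discover_cgroups` are hypothetical and are not Rezolus's actual configuration or API. It only shows how "all cgroups under dir X" could be expanded into concrete paths; it does not open any perf events.

```python
import os
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CgroupConfig:
    # Hypothetical option names, not Rezolus's real configuration keys.
    # Directories whose child cgroups should all be instrumented.
    roots: List[str] = field(default_factory=list)
    # Names of samplers that opt in to per-cgroup telemetry.
    samplers: Set[str] = field(default_factory=set)

    def per_cgroup(self, sampler: str) -> bool:
        """Per-cgroup collection is on only if roots are listed and the sampler opted in."""
        return bool(self.roots) and sampler in self.samplers


def discover_cgroups(root: str) -> List[str]:
    """Return every directory strictly below `root`, as sorted paths relative to it."""
    found = []
    for dirpath, dirnames, _files in os.walk(root):
        for name in dirnames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)
```

With an empty `roots` list, `per_cgroup` returns `False` for every sampler, so per-cgroup collection (and its overhead) stays off unless explicitly configured.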
How do you feel about supporting this for non-hardware events? For example, context-switches, and syscalls?
Same. There's measurable performance impact to sensitive workloads when gathering even SW events per-cgroup when there are several cgroups on the system. I'll give some thought to this and see if there's some way to enable this work to happen or if it'll be possible for me to take this on myself.
Is there any intent to support gathering metrics per cgroup?