scx_layered: Add support for OpenMetrics format #101

dschatzberg · 2024-01-24T21:05:39Z

Currently scx_layered outputs statistics periodically as info! logs. The format of this is largely unstructured and mostly suitable for running scx_layered interactively (e.g. observing its behavior on the command line or via logs after the fact).

In order to run scx_layered at larger scale, it's desirable to have statistics output in some format that is amenable to being ingested into monitoring databases (e.g. Prometheus). This allows collection of stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to stderr in OpenMetrics format instead of the normal log mechanism. OpenMetrics has a public format specification (https://github.com/OpenObservability/OpenMetrics) and is in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does induce some changes. Primarily, metrics need to be pre-registered (see OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered

With -o passed, the output is in OpenMetrics format:

19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
 # HELP total Total scheduling events in the period.
 # TYPE total gauge
total 8489
 # HELP local % that got scheduled directly into an idle CPU.
 # TYPE local gauge
local 86.45305689716104
 # HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
 # TYPE open_idle gauge
open_idle 0.0
 # HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
 # TYPE affn_viol gauge
affn_viol 2.332430203793144
 # HELP tctx_err Failures to free task contexts.
 # TYPE tctx_err gauge
tctx_err 0
 # HELP proc_ms CPU time this binary has consumed during the period.
 # TYPE proc_ms gauge
proc_ms 20
 # HELP busy CPU busy % (100% means all CPUs were fully occupied).
 # TYPE busy gauge
busy 0.5294061026085283
 # HELP util CPU utilization % (100% means one CPU was fully occupied).
 # TYPE util gauge
util 27.37195512782239
 # HELP load Sum of weight * duty_cycle for all tasks.
 # TYPE load gauge
load 81.55024768702126
 # HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
 # TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
 # HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
 # TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
 # HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
 # TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
 # HELP layer_load_frac Fraction of total load consumed by the layer.
 # TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
 # HELP layer_tasks Number of tasks in the layer.
 # TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
 # HELP layer_total Number of scheduling events in the layer.
 # TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
 # HELP layer_local % of scheduling events directly into an idle CPU.
 # TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
 # HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
 # TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
 # HELP layer_preempt % of scheduling events that preempted other tasks. #
 # TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
 # HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
 # TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
 # HELP layer_cur_nr_cpus Current  # of CPUs assigned to the layer.
 # TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
 # HELP layer_min_nr_cpus Minimum  # of CPUs assigned to the layer.
 # TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
 # HELP layer_max_nr_cpus Maximum  # of CPUs assigned to the layer.
 # TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
 # EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
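
The pre-registration requirement mentioned in the description can be illustrated with a small standard-library-only sketch. `Registry`, `Gauge`, `register`, `set`, and `encode` are hypothetical names invented for this illustration; they are not the API of the library used in the PR and not the actual `OpenMetricsStats` code. The sketch only shows the general shape: each gauge is declared with its help text before any value is recorded, and a snapshot is encoded as `# HELP` / `# TYPE` / sample lines terminated by `# EOF`.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// One pre-registered gauge family: help text plus one value per layer.
struct Gauge {
    help: String,
    // Keyed by layer name; an empty key means "no label".
    values: BTreeMap<String, f64>,
}

/// Hypothetical stand-in for a metrics registry (not a real library API).
#[derive(Default)]
struct Registry {
    gauges: BTreeMap<String, Gauge>,
}

impl Registry {
    /// Declare a gauge up front, before any value is recorded.
    fn register(&mut self, name: &str, help: &str) {
        let gauge = Gauge { help: help.to_string(), values: BTreeMap::new() };
        self.gauges.insert(name.to_string(), gauge);
    }

    /// Record a value; returns false if the gauge was never registered.
    fn set(&mut self, name: &str, layer: Option<&str>, value: f64) -> bool {
        match self.gauges.get_mut(name) {
            Some(gauge) => {
                gauge.values.insert(layer.unwrap_or("").to_string(), value);
                true
            }
            None => false,
        }
    }

    /// Encode all gauges as text, terminated by "# EOF".
    /// Families come out in name order here, not registration order.
    fn encode(&self) -> String {
        let mut out = String::new();
        for (name, gauge) in &self.gauges {
            writeln!(out, "# HELP {} {}", name, gauge.help).unwrap();
            writeln!(out, "# TYPE {} gauge", name).unwrap();
            for (layer, value) in &gauge.values {
                if layer.is_empty() {
                    writeln!(out, "{} {}", name, value).unwrap();
                } else {
                    writeln!(out, "{}{{layer_name=\"{}\"}} {}", name, layer, value).unwrap();
                }
            }
        }
        out.push_str("# EOF\n");
        out
    }
}
```

Calling `set` on a name that was never passed to `register` is rejected, which is the practical consequence of pre-registration: the set of metric names is fixed at startup rather than discovered while formatting output.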

htejun

Generally looks fine to me. Left mostly code formatting comments. A question about output fd tho. Why stderr? Wouldn't that conflict with warning / error logging that we'd want to store elsewhere? Would it make more sense to output to stdout or a user provided fd?

scheds/rust/scx_layered/src/main.rs

dschatzberg · 2024-01-25T15:34:36Z

I debated having a separate socket or something to write to. It just didn't feel necessary - it's easy to parse the OM output - The beginning of every output will be a line beginning with # HELP and it will end with EOF. So it can pretty easily be disambiguated from other log lines.

I'm not opposed to making it work as you suggest though.

htejun · 2024-01-25T15:50:46Z

> I debated having a separate socket or something to write to. It just didn't feel necessary - it's easy to parse the OM output - The beginning of every output will be a line beginning with # HELP and it will end with EOF. So it can pretty easily be disambiguated from other log lines.

So, I was thinking more about the other direction. Even when sending stats to e.g. ods, we'd still want the warnings and errors to be logged normally so that we can determine what happened when something goes wrong. If the OM outputs are mixed with warnings and errors, wouldn't it be difficult to log the warnings and errors?

dschatzberg · 2024-01-25T16:24:43Z

Yeah, we want both outputs. My point is that the structure of the OM output is such that we can disambiguate OM output from log output even within the same stream - we can just consume stderr and filter out the OM outputs (and send them where we want) and let the warning and error logs go elsewhere.
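
The filtering described in this comment could look roughly like the following sketch. `split_stream` is a hypothetical helper, not part of scx_layered. It assumes each OpenMetrics snapshot is written contiguously, starts at a line beginning with `# HELP`, and ends at a `# EOF` line; anything outside such a block is treated as an ordinary log line.

```rust
/// Split a mixed stream into (OpenMetrics blocks, ordinary log lines).
/// Assumption: a block is contiguous, starts with "# HELP" and ends with "# EOF".
fn split_stream(input: &str) -> (Vec<String>, Vec<String>) {
    let mut blocks = Vec::new();
    let mut logs = Vec::new();
    let mut current = String::new();
    let mut in_block = false;

    for line in input.lines() {
        // Tolerate leading whitespace in front of the comment marker.
        let trimmed = line.trim_start();
        if in_block || trimmed.starts_with("# HELP") {
            in_block = true;
            current.push_str(trimmed);
            current.push('\n');
            if trimmed == "# EOF" {
                // Block complete: hand it over and reset the buffer.
                blocks.push(std::mem::take(&mut current));
                in_block = false;
            }
        } else {
            logs.push(line.to_string());
        }
    }
    // A block without a closing "# EOF" is silently dropped in this sketch.
    (blocks, logs)
}
```

The weak point is the contiguity assumption: if a warning is written in the middle of a snapshot on the same stream, this sketch folds it into the metrics block.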

htejun · 2024-01-25T16:46:08Z

> Yeah, we want both outputs. My point is that the structure of the OM output is such that we can disambiguate OM output from log output even within the same stream - we can just consume stderr and filter out the OM outputs (and send them where we want) and let the warning and error logs go elsewhere.

I was a bit confused. info! is logged to stdout and warn! and error! stderr. So, they're gonna get mixed but I think it'd be easier if we output OM outputs to stdout instead of stderr. We can then at least log stderr to system log directly and most likely can throw away !OM stdout logs.

dschatzberg · 2024-01-25T17:40:50Z

Ok, I'll make the change so OM goes to stdout
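
One way to keep the destination easy to change is to write the encoded snapshot through a generic writer, as in this minimal sketch. `emit` and `emit_stdout` are hypothetical names used only for illustration; the actual change in the PR may be structured differently.

```rust
use std::io::{self, Write};

/// Write one already-encoded metrics snapshot to any writer and flush it.
fn emit<W: Write>(mut out: W, snapshot: &str) -> io::Result<()> {
    out.write_all(snapshot.as_bytes())?;
    out.flush()
}

/// Send the snapshot to stdout, leaving stderr free for warnings and errors.
fn emit_stdout(snapshot: &str) -> io::Result<()> {
    // Holding the lock keeps the snapshot from interleaving with other
    // stdout writers in the same process.
    emit(io::stdout().lock(), snapshot)
}
```

With this shape, switching to stderr or to a user-provided file only changes the writer passed to `emit`.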

Currently scx_layered outputs statistics periodically as info! logs. The format of this is largely unstructured and mostly suitable for running scx_layered interactively (e.g. observing its behavior on the command line or via logs after the fact).

In order to run scx_layered at larger scale, it's desirable to have statistics output in some format that is amenable to being ingested into monitoring databases (e.g. Prometheus). This allows collection of stats across many machines.

This commit adds a command line flag (-o) that outputs statistics to stdout in OpenMetrics format instead of the normal log mechanism. OpenMetrics has a public format specification (https://github.com/OpenObservability/OpenMetrics) and is in use by many projects.

The library for producing OpenMetrics metrics is lightweight but does induce some changes. Primarily, metrics need to be pre-registered (see OpenMetricsStats::new()).

Without -o, the output looks as before, for example:

```
19:39:54 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:39:54 [INFO] Layered Scheduler Attached
19:39:56 [INFO] tot=   9912 local=76.71 open_idle= 0.00 affn_viol= 2.63 tctx_err=0 proc=21ms
19:39:56 [INFO] busy=  1.3 util=   65.2 load=    263.4 fallback_cpu=  1
19:39:56 [INFO]   batch    : util/frac=   49.7/ 76.3 load/frac=    252.0: 95.7 tasks=   458
19:39:56 [INFO]              tot=   2842 local=45.04 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus=  2 [  0,  2] 04000001 00000000
19:39:56 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:56 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:56 [INFO]   normal   : util/frac=   15.4/ 23.7 load/frac=     11.4:  4.3 tasks=   556
19:39:56 [INFO]              tot=   7070 local=89.43 open_idle= 0.00 preempt= 0.00 affn_viol= 3.69
19:39:56 [INFO]              cpus= 50 [  0, 50] fbfffffe 000fffff
19:39:58 [INFO] tot=   7091 local=84.91 open_idle= 0.00 affn_viol= 2.64 tctx_err=0 proc=21ms
19:39:58 [INFO] busy=  0.6 util=   31.2 load=    107.1 fallback_cpu=  1
19:39:58 [INFO]   batch    : util/frac=   18.3/ 58.5 load/frac=     93.9: 87.7 tasks=   589
19:39:58 [INFO]              tot=   2011 local=60.67 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus=  2 [  2,  2] 04000001 00000000
19:39:58 [INFO]   immediate: util/frac=    0.0/  0.0 load/frac=      0.0:  0.0 tasks=     0
19:39:58 [INFO]              tot=      0 local= 0.00 open_idle= 0.00 preempt= 0.00 affn_viol= 0.00
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
19:39:58 [INFO]   normal   : util/frac=   13.0/ 41.5 load/frac=     13.2: 12.3 tasks=   650
19:39:58 [INFO]              tot=   5080 local=94.51 open_idle= 0.00 preempt= 0.00 affn_viol= 3.68
19:39:58 [INFO]              cpus= 50 [ 50, 50] fbfffffe 000fffff
^C19:39:59 [INFO] EXIT: BPF scheduler unregistered
```

With -o passed, the output is in OpenMetrics format:

```
19:40:08 [INFO] CPUs: online/possible=52/52 nr_cores=26
19:40:08 [INFO] Layered Scheduler Attached
# HELP total Total scheduling events in the period.
# TYPE total gauge
total 8489
# HELP local % that got scheduled directly into an idle CPU.
# TYPE local gauge
local 86.45305689716104
# HELP open_idle % of open layer tasks scheduled into occupied idle CPUs.
# TYPE open_idle gauge
open_idle 0.0
# HELP affn_viol % which violated configured policies due to CPU affinity restrictions.
# TYPE affn_viol gauge
affn_viol 2.332430203793144
# HELP tctx_err Failures to free task contexts.
# TYPE tctx_err gauge
tctx_err 0
# HELP proc_ms CPU time this binary has consumed during the period.
# TYPE proc_ms gauge
proc_ms 20
# HELP busy CPU busy % (100% means all CPUs were fully occupied).
# TYPE busy gauge
busy 0.5294061026085283
# HELP util CPU utilization % (100% means one CPU was fully occupied).
# TYPE util gauge
util 27.37195512782239
# HELP load Sum of weight * duty_cycle for all tasks.
# TYPE load gauge
load 81.55024768702126
# HELP layer_util CPU utilization of the layer (100% means one CPU was fully occupied).
# TYPE layer_util gauge
layer_util{layer_name="immediate"} 0.0
layer_util{layer_name="normal"} 19.340849995024997
layer_util{layer_name="batch"} 8.031105132797393
# HELP layer_util_frac Fraction of total CPU utilization consumed by the layer.
# TYPE layer_util_frac gauge
layer_util_frac{layer_name="batch"} 29.34063385422595
layer_util_frac{layer_name="immediate"} 0.0
layer_util_frac{layer_name="normal"} 70.65936614577405
# HELP layer_load Sum of weight * duty_cycle for tasks in the layer.
# TYPE layer_load gauge
layer_load{layer_name="immediate"} 0.0
layer_load{layer_name="normal"} 11.14363313258934
layer_load{layer_name="batch"} 70.40661455443191
# HELP layer_load_frac Fraction of total load consumed by the layer.
# TYPE layer_load_frac gauge
layer_load_frac{layer_name="normal"} 13.664744680306903
layer_load_frac{layer_name="immediate"} 0.0
layer_load_frac{layer_name="batch"} 86.33525531969309
# HELP layer_tasks Number of tasks in the layer.
# TYPE layer_tasks gauge
layer_tasks{layer_name="immediate"} 0
layer_tasks{layer_name="normal"} 490
layer_tasks{layer_name="batch"} 343
# HELP layer_total Number of scheduling events in the layer.
# TYPE layer_total gauge
layer_total{layer_name="normal"} 6711
layer_total{layer_name="batch"} 1778
layer_total{layer_name="immediate"} 0
# HELP layer_local % of scheduling events directly into an idle CPU.
# TYPE layer_local gauge
layer_local{layer_name="batch"} 69.79752530933632
layer_local{layer_name="immediate"} 0.0
layer_local{layer_name="normal"} 90.86574281031143
# HELP layer_open_idle % of scheduling events into idle CPUs occupied by other layers.
# TYPE layer_open_idle gauge
layer_open_idle{layer_name="immediate"} 0.0
layer_open_idle{layer_name="batch"} 0.0
layer_open_idle{layer_name="normal"} 0.0
# HELP layer_preempt % of scheduling events that preempted other tasks. #
# TYPE layer_preempt gauge
layer_preempt{layer_name="normal"} 0.0
layer_preempt{layer_name="batch"} 0.0
layer_preempt{layer_name="immediate"} 0.0
# HELP layer_affn_viol % of scheduling events that violated configured policies due to CPU affinity restrictions.
# TYPE layer_affn_viol gauge
layer_affn_viol{layer_name="normal"} 2.950379973178364
layer_affn_viol{layer_name="batch"} 0.0
layer_affn_viol{layer_name="immediate"} 0.0
# HELP layer_cur_nr_cpus Current # of CPUs assigned to the layer.
# TYPE layer_cur_nr_cpus gauge
layer_cur_nr_cpus{layer_name="normal"} 50
layer_cur_nr_cpus{layer_name="batch"} 2
layer_cur_nr_cpus{layer_name="immediate"} 50
# HELP layer_min_nr_cpus Minimum # of CPUs assigned to the layer.
# TYPE layer_min_nr_cpus gauge
layer_min_nr_cpus{layer_name="normal"} 0
layer_min_nr_cpus{layer_name="batch"} 0
layer_min_nr_cpus{layer_name="immediate"} 0
# HELP layer_max_nr_cpus Maximum # of CPUs assigned to the layer.
# TYPE layer_max_nr_cpus gauge
layer_max_nr_cpus{layer_name="immediate"} 50
layer_max_nr_cpus{layer_name="normal"} 50
layer_max_nr_cpus{layer_name="batch"} 2
# EOF
^C19:40:11 [INFO] EXIT: BPF scheduler unregistered
```

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>

htejun requested changes Jan 24, 2024

dschatzberg force-pushed the openmetrics branch from 82bc900 to 7f9548e Compare January 25, 2024 18:00

dschatzberg requested a review from htejun January 25, 2024 18:01

htejun approved these changes Jan 25, 2024

htejun merged commit eb997a6 into sched-ext:main Jan 25, 2024
