Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exemplar support #395

Closed
wants to merge 15 commits into from
Closed

Add exemplar support #395

wants to merge 15 commits into from

Conversation

clux
Copy link

@clux clux commented Feb 26, 2021

This is a quick draft go at adding exemplar support from #393.

Is this something you would be willing to accept if it was cleaned up? Note that it doesn't compile yet.

Have added them to counters atm, but looks like there are a lot of layers between the Counter and the encoding bits so haven't gotten that far. If this is welcome, then any pointers on how to best propagate the data to the encoders would be appreciated.

@clux
Copy link
Author

clux commented Mar 14, 2021

Well, I've done the propagation..one way. It was hard to retrofit the whole cold/hot shard stuff with an extra optional struct which I don't know how to do atomic style for.

Also not cared for protobuf beyond what was necessary.

But can now do:

    let mut exemplar_labels = HashMap::new();
    exemplar_labels.insert("traceID".into(), trace_id);
    let ex = Exemplar::new_with_labels(duration, exemplar_labels);
    reconcile_duration_histogram
        .with_label_values(&[])
        .observe_with_exemplar(duration, ex);

and text encoder outputs:

# HELP foo_controller_reconcile_duration_seconds The duration of reconcile to complete in seconds
# TYPE foo_controller_reconcile_duration_seconds histogram
foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 7 # {traceID="53b9b11e6aa06934c5f1bcedc772130d"} 37
foo_controller_reconcile_duration_seconds_bucket{le="0.1"} 8
foo_controller_reconcile_duration_seconds_bucket{le="0.25"} 8

@clux clux marked this pull request as ready for review March 14, 2021 18:24
@clux
Copy link
Author

clux commented Mar 16, 2021

have had to subvert the histogramcore sharding a bit, wasn't sure it made any sense to do a hot/cold path for exemplars when you have to sync write the whole struct, but maybe i am wrong.

for now it's sitting straight as a vector of optional exemplars (same length as bucket len) and writes to them are sync, which probably makes the existing sync mechanism useless. not sure how to make this better. could feature flag exemplars?

@wperron
Copy link

wperron commented Jun 9, 2021

Is there anything left to do on this PR? Anything I can do to help?

@clux
Copy link
Author

clux commented Jun 12, 2021

Well, I guess the main thing is that no matter how much I wrangle prometheus, it does not accept my output, even though it looks like it should conform to the spec. I have a demo app that is outputting:

foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 # {trace_id="6c1011d34f17c4ca1177b40ddfb5184b"} 0.006
foo_controller_reconcile_duration_seconds_bucket{le="0.1"} 1
foo_controller_reconcile_duration_seconds_bucket{le="0.25"} 1

in its /metrics

but prometheus fails to parse it with a:

expected timestamp or new record, got "MNAME"

I have a demo implementation in kube-rs/controller-rs#12, and it's definitely the exemplar output that is failing, so not really sure what to do :-(

@bobrik
Copy link
Contributor

bobrik commented Jun 12, 2021

Exemplars must have the timestamp after the example value, just like the error says.

@clux
Copy link
Author

clux commented Jun 13, 2021

@bobrik : The error is asking for a histogram timestamp or a new record, not an exemplar timestamp (which is optional by the spec). These two are both valid:

$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1' | promtool check metrics
foo_controller_reconcile_duration_seconds_bucket no help text
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027' | promtool check metrics
foo_controller_reconcile_duration_seconds_bucket no help text

But none of these are:

$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006"' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # {trace_id=\"69d991d2ef39b15e71e2f42eb02a9b55\"} 0.006\""
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006 1623574027"' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # {trace_id=\"69d991d2ef39b15e71e2f42eb02a9b55\"} 0.006 1623574027\""
$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 1623574027 # ' | promtool check metrics
error while linting: text format parsing error in line 1: spurious string after timestamp: " # "

and without both optional timestamps (orig error):

$ echo 'foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 1 # {trace_id="69d991d2ef39b15e71e2f42eb02a9b55"} 0.006"' | promtool check metrics
error while linting: text format parsing error in line 1: expected integer as timestamp, got "#"

@roidelapluie
Copy link

Prometheus has 2 metrics format: Prometheus text format, and OpenMetrics.

Only OpenMetrics supports examplars. There are a few other differences between the Text Format and OpenMetrics.

this is probably a backwards incompatible change that requres the new
openmetrics content-type header.
@clux
Copy link
Author

clux commented Jun 13, 2021

Thanks a lot @roidelapluie ! That solves the mystery.

I can get this client to work with openmetrics format, but it required a different exposition ending (# EOF\n), and forces a user requirement on the new content-type header, and possibly a different timestamp format (which I started tackling in here, but haven't changed to/verified anywhere).

So while this does seem parseable by prometheus if we skip optional exemplar timestamps and include the header, I am starting to fear that this PR might be a big one to merge if it requires changing the entire content format :(

Edit for anyone is visiting this in the future. I have stopped working on it. This PoC works for histograms, but it needs a lot of work in the library to support two metric formats.

@roidelapluie
Copy link

We are not evolving the text format anymore, we want to promote OpenMetrics. If you plan for adding full openmetrics support for Rust, do not hesitate to reach for help. There are a few differences between both formats, and the library ideally would handle both (via negotiation in the headers).

@clux clux closed this Aug 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants