Support for millisecond granularity on latency buckets? #667
Comments
Thank you for opening this issue. It is correct that Pyrra interprets the latency buckets as seconds. In that sense, the ingested metrics are wrong... That being said, I think Pyrra does the right thing; however, I can totally see that this might not be what helps you. Not sure how we can improve the experience. The SLO config somehow seems like the wrong place too...
I had the same problem with a Keycloak metric (`keycloak_request_duration_bucket`, which reports milliseconds). We worked around it with recording rules:

```yaml
groups:
  # Relabel keycloak_request_duration_bucket to seconds instead of ms,
  # as this is the unit expected by Pyrra
  - name: keycloak-request-duration-seconds
    rules:
      # +Inf
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "+Inf"
        expr: keycloak_request_duration_bucket{le="+Inf"}
      # 30000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "30.0"
        expr: keycloak_request_duration_bucket{le="30000.0"}
      # 10000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "10.0"
        expr: keycloak_request_duration_bucket{le="10000.0"}
      # 2000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "2.0"
        expr: keycloak_request_duration_bucket{le="2000.0"}
      # 1000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "1.0"
        expr: keycloak_request_duration_bucket{le="1000.0"}
      # 500.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.5"
        expr: keycloak_request_duration_bucket{le="500.0"}
      # 250.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.25"
        expr: keycloak_request_duration_bucket{le="250.0"}
      # 100.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.1"
        expr: keycloak_request_duration_bucket{le="100.0"}
      # 50.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.05"
        expr: keycloak_request_duration_bucket{le="50.0"}
```
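Writing one rule per bucket by hand is error-prone, and PromQL cannot rewrite `le` label values numerically in a single expression, so a small script can generate the per-bucket rules instead. This is a hedged sketch (the metric name and bucket list are taken from the example above; the function names are made up):

```python
def seconds_label(le_ms: str) -> str:
    # Convert a millisecond bucket boundary to its seconds equivalent,
    # leaving the special "+Inf" boundary untouched.
    if le_ms == "+Inf":
        return "+Inf"
    return str(float(le_ms) / 1000.0)

def make_rules(metric: str, buckets_ms: list) -> list:
    # One recording rule per bucket; Prometheus recording rules can attach
    # static labels, which is how the new `le` value gets set.
    rules = []
    for le in buckets_ms:
        rules.append({
            "record": f"{metric}:in_seconds",
            "labels": {"le": seconds_label(le)},
            "expr": f'{metric}{{le="{le}"}}',
        })
    return rules

rules = make_rules(
    "keycloak_request_duration_bucket",
    ["+Inf", "30000.0", "10000.0", "2000.0",
     "1000.0", "500.0", "250.0", "100.0", "50.0"],
)
print(rules[5]["labels"]["le"])  # -> 0.5
```

Dumping `{"groups": [{"name": ..., "rules": rules}]}` through a YAML serializer yields a rule file equivalent to the one above.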
Wow. That's cool and awesome that you share the example here! 💯
Thanks everyone for this info! I think we'll proceed with the recording rule 💯 I'm also asking the Istio folks for clarity on why this metric is in milliseconds. Interestingly, the metric used to be in seconds: https://istio.io/v1.6/docs/reference/config/policy-and-telemetry/metrics/
Hi! Have you checked out the comment here: istio/istio#44168 (comment)? It seems the OpenTelemetry standard requires milliseconds. Any chance we can reopen this issue?
@rvalkenaers Thanks for the nudge! I filed that Istio issue a while back and forgot to follow up here. There seems to be a discrepancy between Prometheus' base units (requiring seconds) and the semantic conventions in the HTTP metrics by OpenTelemetry (requiring milliseconds). Note that the OpenTelemetry convention is still experimental. I'm not sure which one is more correct - would love to hear what other folks think. One of the Istio maintainers expressed that seconds is too large a unit for requests. Perhaps Pyrra should support both? We're happy to do the work, but would like a maintainer's opinion. (@metalmatze) Or perhaps this is something the Prometheus and OTel conventions should resolve between them? I'll post an update if I find more info on why they are conflicting.
Milliseconds are still supported by Prometheus as labels. For example, the histogram `le` label is often used with sub-second values such as `0.25` or `0.005`. While I don't have more OTel context, at least for Prometheus it's not an issue.
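To see why the bucket math itself is unit-agnostic, here is a hedged sketch of Prometheus-style histogram quantile interpolation (mirroring `histogram_quantile()`'s linear interpolation, not its actual code): the same cumulative counts give answers that differ only by the unit scale.

```python
import math

def quantile(q: float, buckets: list) -> float:
    # buckets: sorted (upper_bound, cumulative_count) pairs ending in +Inf.
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

ms = [(250.0, 90.0), (500.0, 99.0), (math.inf, 100.0)]  # boundaries in ms
s  = [(0.25, 90.0), (0.5, 99.0), (math.inf, 100.0)]     # same data in seconds

# Identical shape: the result only differs by the factor of 1000 between units.
print(quantile(0.99, ms), quantile(0.99, s))  # -> 500.0 0.5
```

The computation is correct either way; what breaks is only the *interpretation* of the number, e.g. when a UI assumes it is seconds.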
That makes sense - thank you for the info 👍
For future reference, OTel is now moving to seconds for duration metrics.
By default, it's good that Pyrra follows the Prometheus best practice of using seconds. But wouldn't it be better to have the flexibility to configure it for all the applications that don't currently report the value in seconds?
It's excellent that the OpenTelemetry specification has adopted seconds as a standard, but it will take a long time for all the client libraries to be updated. Also, I don't know if Envoy will change from milliseconds to seconds either. Working around this with a relabeling rule feels even worse than being able to define the unit on the SLO itself, in my opinion.
Heads up, the recording rule trick might not work well with HA pairs of Prometheus and Thanos query deduplication. There are a couple of threads about this in the #thanos and #thanos-dev Slack channels, but it's not currently documented. Given how long it can take for OpenTelemetry libraries to be updated to use seconds, maybe it's worth supporting both units as Brad suggested? @metalmatze would you be OK with this being contributed by someone else?
Just so I'm clear: this issue only affects the Pyrra web UI. If you are using Grafana dashboards, this shouldn't be an issue.
Repro:
We created the following Pyrra SLO:

Note how we're using the `istio_request_duration_milliseconds_count` metric with a `le="5000"` label. This is an Istio standard metric: https://istio.io/latest/docs/reference/config/metrics/

Expected
Pyrra should interpret the latency buckets in millisecond granularity.
Actual
Pyrra assumes the latency buckets are in second granularity (see screenshot below)
The Requests and Errors sections look correct, but the Duration section says p99 must be faster than 1h23m. The `le="5000"` seems to be interpreted as 5000 seconds (i.e. 1h23m) instead of 5000 milliseconds.

Next steps
Is there a known workaround for this issue?
I'm also interested in submitting a patch to fix this on Pyrra's end, so any tips on that approach would also be appreciated!
From a brief investigation, it seems like the renderLatencyTarget function assumes that the latency is always in seconds: pyrra/ui/src/App.tsx, line 84 in ce95654.
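To illustrate the misreading: a seconds-based duration formatter (a hedged sketch of what a UI like this might do, not Pyrra's actual renderLatencyTarget code) turns the raw bucket boundary 5000 into roughly 1h23m:

```python
def format_duration(seconds: float) -> str:
    # Render a duration the way a seconds-based UI would, e.g. 5000 -> "1h23m20s".
    # Illustrative sketch only; not Pyrra's actual rendering code.
    parts = []
    remaining = int(seconds)
    for suffix, size in (("h", 3600), ("m", 60), ("s", 1)):
        value, remaining = divmod(remaining, size)
        if value:
            parts.append(f"{value}{suffix}")
    return "".join(parts) or "0s"

# le="5000" misread as seconds becomes over an hour:
print(format_duration(5000))         # -> 1h23m20s
# Read correctly as milliseconds (i.e. 5 seconds), it would be:
print(format_duration(5000 / 1000))  # -> 5s
```

This matches the "p99 must be faster than 1h23m" output described above, which is why a fix likely needs to account for the metric's unit rather than the formatting alone.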