
Support for millisecond granularity on latency buckets? #667

Closed
modulitos opened this issue Mar 24, 2023 · 12 comments

Comments

@modulitos

modulitos commented Mar 24, 2023

Repro:

We created the following pyrra SLO:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: my-slo
  labels:
    role: alert-rules
    pyrra.dev/team: my-team
spec:
  alerting:
    disabled: false
  target: "99"
  window: 4w
  description: "My SLO"
  indicator:
    latency:
      success:
        metric: istio_request_duration_milliseconds_bucket{destination_workload="my-workload", reporter="destination", response_code=~"2..", le="5000"}
      total:
        metric: istio_request_duration_milliseconds_count{destination_workload="my-workload", reporter="destination"}

Note how we're using the istio_request_duration_milliseconds_bucket metric with an le="5000" label (and the matching _count metric as the total). These are Istio standard metrics: https://istio.io/latest/docs/reference/config/metrics/

Expected

Pyrra should interpret the latency buckets in millisecond granularity.

Actual

Pyrra assumes the latency buckets are in second granularity (see screenshot below)

The Requests and Errors sections look correct, but the Duration section says p99 must be faster than 1h23m. The le="5000" seems to be interpreted as 5000 seconds (i.e. 1h23m) instead of 5000 milliseconds.

Next steps

Is there a known workaround for this issue?

I'm also interested in submitting a patch to fix this on Pyrra's end, so any tips on that approach would also be appreciated!

From a brief investigation, it seems like this renderLatencyTarget function assumes that the latency is always in seconds:

return formatDuration(latency * 1000)

[screenshot: the Duration section reads "p99 must be faster than 1h23m"]
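As a sketch of one possible fix, the rendering could take the bucket unit into account. This is only an illustration, not Pyrra's actual code: formatDuration below is a simplified stand-in for Pyrra's real helper, and the unit parameter is hypothetical:

```typescript
// Sketch only: a simplified stand-in for Pyrra's formatDuration helper,
// which turns a millisecond count into a human-readable duration string.
const formatDuration = (ms: number): string => {
  const totalSeconds = Math.floor(ms / 1000)
  const h = Math.floor(totalSeconds / 3600)
  const m = Math.floor((totalSeconds % 3600) / 60)
  const s = totalSeconds % 60
  return [
    h > 0 ? `${h}h` : '',
    m > 0 ? `${m}m` : '',
    s > 0 || totalSeconds === 0 ? `${s}s` : '',
  ]
    .filter(Boolean)
    .join('')
}

// Hypothetical unit-aware variant: interpret `latency` (the le value) as
// seconds by default, or as milliseconds when unit === 'ms'.
const renderLatencyTarget = (latency: number, unit: 's' | 'ms' = 's'): string =>
  unit === 'ms' ? formatDuration(latency) : formatDuration(latency * 1000)

console.log(renderLatencyTarget(5000))       // le="5000" read as seconds -> "1h23m20s"
console.log(renderLatencyTarget(5000, 'ms')) // le="5000" read as milliseconds -> "5s"
```

The default preserves today's behavior, so existing SLOs that follow the Prometheus seconds convention would render unchanged.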

@metalmatze
Member

Thank you for opening this issue. It is correct that Pyrra interprets the le as seconds.
Pyrra is following Prometheus' guide for base units: https://prometheus.io/docs/practices/naming/#base-units

In that sense, the ingested metrics are wrong...
I'm not sure how easy it would be for you to change those metrics, either where they get created or upon ingestion (I'm not sure relabelling can convert the milliseconds into seconds...)

That being said, I think Pyrra does the right thing; however, I can totally see that this might not be what helps you. Not sure how we can improve the experience. The SLO config somehow seems like the wrong place too...

@s-diez

s-diez commented Mar 27, 2023

I had the same problem with a Keycloak metric (keycloak_request_duration_bucket). My solution was to write a new Prometheus recording rule which effectively just maps millisecond buckets to second buckets.

groups:
  # Relabel keycloak_request_duration_bucket to seconds instead of ms as this is the unit expected by pyrra
  - name: keycloak-request-duration-seconds
    rules:
      # +Inf
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "+Inf"
        expr: keycloak_request_duration_bucket{le="+Inf"}
      # 30000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "30.0"
        expr: keycloak_request_duration_bucket{le="30000.0"}
      # 10000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "10.0"
        expr: keycloak_request_duration_bucket{le="10000.0"}
      # 2000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "2.0"
        expr: keycloak_request_duration_bucket{le="2000.0"}
      # 1000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "1.0"
        expr: keycloak_request_duration_bucket{le="1000.0"}
      # 500.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.5"
        expr: keycloak_request_duration_bucket{le="500.0"}
      # 250
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.25"
        expr: keycloak_request_duration_bucket{le="250.0"}
      # 100.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.1"
        expr: keycloak_request_duration_bucket{le="100.0"}
      # 50.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.05"
        expr: keycloak_request_duration_bucket{le="50.0"}
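
With a rule like this in place, an SLO can point at the recorded second-based series instead of the raw metric. A hypothetical sketch (the le value must match one of the buckets recorded above; keycloak_request_duration_count is assumed to exist alongside the bucket metric):

  indicator:
    latency:
      success:
        metric: keycloak_request_duration_bucket:in_seconds{le="2.0"}
      total:
        metric: keycloak_request_duration_count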

@metalmatze
Member

Wow. That's cool, and it's awesome that you share the example here! 💯
Thanks @s-diez, this is why I like open source so much! 🥳

@modulitos
Author

Thanks everyone for this info! I think we'll proceed with the recording rule 💯

I'm also asking the Istio folks for clarity on why this metric is in milliseconds. Interestingly, the metric used to be in seconds: https://istio.io/v1.6/docs/reference/config/policy-and-telemetry/metrics/ (istio_request_duration_seconds), but it was changed to milliseconds during the telemetry v2 upgrade (Istio 1.7). I'll post an update if I find more context on why it was renamed, and/or if it's worth changing back at some point...

@rvalkenaers

rvalkenaers commented Apr 5, 2023

Hi!

Have you checked out the comment here: istio/istio#44168 (comment) ?

It seems the OpenTelemetry standard requires milliseconds.

Any chance we can reopen this issue?

@modulitos
Author

modulitos commented Apr 5, 2023

@rvalkenaers Thanks for the nudge! I filed that Istio issue a while back and forgot to follow up here.

There seems to be a discrepancy between Prometheus' base units (requiring seconds), and the semantic conventions in the http metrics by OpenTelemetry (requiring milliseconds). Note that the OpenTelemetry convention is still experimental.

I'm not sure which one is more correct - would love to hear what other folks think. One of the Istio maintainers expressed that seconds is too large a unit for requests. Perhaps Pyrra should support both? We're happy to do the work, but would like a maintainer's opinion. (@metalmatze)

Or perhaps this is something the Prometheus and OTel conventions should resolve between them? I'll post an update if I find more info on why they are conflicting.

@modulitos modulitos reopened this Apr 5, 2023
@metalmatze
Member

Milliseconds are still representable in Prometheus via labels. For example, the histogram le label is often used with values like 0.5 (for 500ms) or even 0.001 (for just 1ms).

While I don't have more OTel context, at least for Prometheus this isn't an issue.
Therefore I would still prefer to follow Prometheus conventions and see how we can get data into Prometheus correctly in the first place.

@modulitos
Author

That makes sense - thank you for the info 👍

@gouthamve

For future reference, OTel is now moving to seconds as the unit: open-telemetry/opentelemetry-specification#2977

@BradErz
Contributor

BradErz commented May 10, 2023

By default, it's good that Pyrra follows the Prometheus best practice of using seconds.

But wouldn't it be better to have the flexibility to support all the applications which don't currently report the value in seconds?

  indicator:
    latency:
      unit: s (default) | ms
      success:
        metric: istio_request_duration_milliseconds_bucket{destination_workload="my-workload", reporter="destination", response_code=~"2..", le="5000"}
      total:
        metric: istio_request_duration_milliseconds_count{destination_workload="my-workload", reporter="destination"}

It's excellent that the OpenTelemetry specification has adopted seconds as a standard, but it will take a long time for all the client libs to be updated. Also, I don't know if Envoy will ever change from milliseconds to seconds.

Working around this by using a relabeling rule feels even worse than being able to define the unit on the SLO itself, in my opinion.

@PabloPie

Heads up: the recording rule trick might not work well with HA pairs of Prometheus and Thanos query deduplication. There are a couple of threads about this in the #thanos and #thanos-dev Slack channels, but it's not currently documented.

Given how long it can take for OpenTelemetry libraries to get updated to use seconds, maybe it's interesting to support both units as Brad suggested? @metalmatze would you be ok with it if this was contributed by someone else?

@sepulworld
Contributor

sepulworld commented Jan 30, 2024

Just so I'm clear: this issue only affects the Pyrra web UI. If you are using Grafana dashboards, this shouldn't be an issue.
