
Support for millisecond granularity on latency buckets? #667

Closed
modulitos opened this issue Mar 24, 2023 · 12 comments

Comments

@modulitos

modulitos commented Mar 24, 2023

Repro:

We created the following pyrra SLO:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: my-slo
  labels:
    role: alert-rules
    pyrra.dev/team: my-team
spec:
  alerting:
    disabled: false
  target: "99"
  window: 4w
  description: "My SLO"
  indicator:
    latency:
      success:
        metric: istio_request_duration_milliseconds_bucket{destination_workload="my-workload", reporter="destination", response_code=~"2..", le="5000"}
      total:
        metric: istio_request_duration_milliseconds_count{destination_workload="my-workload", reporter="destination"}

Note how we're using the istio_request_duration_milliseconds_bucket metric with an le="5000" label (and the matching _count metric as the total). These are Istio standard metrics: https://istio.io/latest/docs/reference/config/metrics/

Expected

Pyrra should interpret the latency buckets in millisecond granularity.

Actual

Pyrra assumes the latency buckets are in second granularity (see screenshot below)

The Requests and Errors sections look correct, but the Duration section says p99 must be faster than 1h23m. The le="5000" seems to be interpreted as 5000 seconds (i.e. 1h23m) instead of 5000 milliseconds.

Next steps

Is there a known workaround for this issue?

I'm also interested in submitting a patch to fix this on Pyrra's end, so any tips on that approach would also be appreciated!

From a brief investigation, it seems like this renderLatencyTarget function assumes that the latency is always in seconds:

return formatDuration(latency * 1000)

[screenshot: the Duration section reads "p99 must be faster than 1h23m"]
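As a sketch of one possible fix, the rendering could take the bucket unit into account. This is only an illustration, not Pyrra's actual code: formatDuration below is a simplified stand-in for Pyrra's real helper, and the unit parameter is hypothetical:

```typescript
// Sketch only: a simplified stand-in for Pyrra's formatDuration helper,
// which turns a millisecond count into a human-readable duration string.
const formatDuration = (ms: number): string => {
  const totalSeconds = Math.floor(ms / 1000)
  const h = Math.floor(totalSeconds / 3600)
  const m = Math.floor((totalSeconds % 3600) / 60)
  const s = totalSeconds % 60
  return [
    h > 0 ? `${h}h` : '',
    m > 0 ? `${m}m` : '',
    s > 0 || totalSeconds === 0 ? `${s}s` : '',
  ]
    .filter(Boolean)
    .join('')
}

// Hypothetical unit-aware variant: interpret `latency` (the le value) as
// seconds by default, or as milliseconds when unit === 'ms'.
const renderLatencyTarget = (latency: number, unit: 's' | 'ms' = 's'): string =>
  unit === 'ms' ? formatDuration(latency) : formatDuration(latency * 1000)

console.log(renderLatencyTarget(5000))       // le="5000" read as seconds -> "1h23m20s"
console.log(renderLatencyTarget(5000, 'ms')) // le="5000" read as milliseconds -> "5s"
```

The default preserves today's behavior, so existing SLOs that follow the Prometheus seconds convention would render unchanged.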

@metalmatze
Member

Thank you for opening this issue. It is correct that Pyrra interprets the le as seconds.
Pyrra is following Prometheus' guide for base units: https://prometheus.io/docs/practices/naming/#base-units

In that sense, the ingested metrics are wrong...
I'm not sure how easy it would be for you to change those metrics, either where they get created or upon ingestion (I'm not sure relabelling can convert the milliseconds into seconds...)

That being said, I think Pyrra does the right thing; however, I can totally see that this might not be what helps you. Not sure how we can improve the experience. The SLO config somehow seems like the wrong place too...

@s-diez

s-diez commented Mar 27, 2023

I had the same problem with a Keycloak metric (keycloak_request_duration_bucket). My solution was to write a new Prometheus recording rule which effectively just maps millisecond buckets to second buckets.

groups:
  # Relabel keycloak_request_duration_bucket to seconds instead of ms as this is the unit expected by pyrra
  - name: keycloak-request-duration-seconds
    rules:
      # +Inf
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "+Inf"
        expr: keycloak_request_duration_bucket{le="+Inf"}
      # 30000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "30.0"
        expr: keycloak_request_duration_bucket{le="30000.0"}
      # 10000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "10.0"
        expr: keycloak_request_duration_bucket{le="10000.0"}
      # 2000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "2.0"
        expr: keycloak_request_duration_bucket{le="2000.0"}
      # 1000.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "1.0"
        expr: keycloak_request_duration_bucket{le="1000.0"}
      # 500.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.5"
        expr: keycloak_request_duration_bucket{le="500.0"}
      # 250
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.25"
        expr: keycloak_request_duration_bucket{le="250.0"}
      # 100.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.1"
        expr: keycloak_request_duration_bucket{le="100.0"}
      # 50.0
      - record: keycloak_request_duration_bucket:in_seconds
        labels:
          le: "0.05"
        expr: keycloak_request_duration_bucket{le="50.0"}
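
With a rule like this in place, an SLO can point at the recorded second-based series instead of the raw metric. A hypothetical sketch (the le value must match one of the buckets recorded above; keycloak_request_duration_count is assumed to exist alongside the bucket metric):

  indicator:
    latency:
      success:
        metric: keycloak_request_duration_bucket:in_seconds{le="2.0"}
      total:
        metric: keycloak_request_duration_count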

@metalmatze
Member

Wow. That's cool, and it's awesome that you share the example here! 💯
Thanks @s-diez, this is why I like open source so much! 🥳

@modulitos
Author

Thanks everyone for this info! I think we'll proceed with the recording rule 💯

I'm also asking the Istio folks for clarity on why this metric is in milliseconds. Interestingly, the metric used to be in seconds: https://istio.io/v1.6/docs/reference/config/policy-and-telemetry/metrics/ (istio_request_duration_seconds), but it was changed to milliseconds during the telemetry v2 upgrade (Istio 1.7). I'll post an update if I find more context on why it was renamed, and/or if it's worth changing back at some point...

@rvalkenaers

rvalkenaers commented Apr 5, 2023

Hi!

Have you checked out the comment here: istio/istio#44168 (comment) ?

It seems the OpenTelemetry standard requires milliseconds.

Any chance we can reopen this issue?

@modulitos
Author

modulitos commented Apr 5, 2023

@rvalkenaers Thanks for the nudge! I filed that Istio issue a while back and forgot to follow up here.

There seems to be a discrepancy between Prometheus' base units (requiring seconds), and the semantic conventions in the http metrics by OpenTelemetry (requiring milliseconds). Note that the OpenTelemetry convention is still experimental.

I'm not sure which one is more correct - would love to hear what other folks think. One of the Istio maintainers expressed that seconds is too large a unit for requests. Perhaps Pyrra should support both? We're happy to do the work, but would like a maintainer's opinion. (@metalmatze)

Or perhaps this is something the Prometheus and OTel conventions should resolve between them? I'll post an update if I find more info on why they are conflicting.

@modulitos modulitos reopened this Apr 5, 2023
@metalmatze
Member

Milliseconds are still representable in Prometheus via labels. For example, the histogram le label is often used with values like 0.5 (for 500ms) or even 0.001 (for just 1ms).

While I don't have more OTel context, at least for Prometheus this isn't an issue.
Therefore I would still prefer to follow Prometheus conventions and see how we can get data into Prometheus correctly in the first place.

@modulitos
Author

That makes sense - thank you for the info 👍

@gouthamve

For future reference, OTel is now moving to seconds as the unit: open-telemetry/opentelemetry-specification#2977

@BradErz
Contributor

BradErz commented May 10, 2023

By default, it's good that Pyrra follows the Prometheus best practice of using seconds.

But wouldn't it be better to have the flexibility to support all the applications which don't currently report the value in seconds?

  indicator:
    latency:
      unit: s (default) | ms
      success:
        metric: istio_request_duration_milliseconds_bucket{destination_workload="my-workload", reporter="destination", response_code=~"2..", le="5000"}
      total:
        metric: istio_request_duration_milliseconds_count{destination_workload="my-workload", reporter="destination"}

It's excellent that the OpenTelemetry specification has adopted seconds as a standard, but it will take a long time for all the client libs to be updated. Also, I don't know if Envoy will ever change from milliseconds to seconds.

Working around this by using a relabeling rule feels even worse than being able to define the unit on the SLO itself, in my opinion.

@PabloPie

Heads up: the recording rule trick might not work well with HA pairs of Prometheus and Thanos query deduplication. There are a couple of threads about this in the #thanos and #thanos-dev Slack channels, but it's not currently documented.

Given how long it can take for OpenTelemetry libraries to get updated to use seconds, maybe it's interesting to support both units as Brad suggested? @metalmatze would you be ok with it if this was contributed by someone else?

@sepulworld
Contributor

sepulworld commented Jan 30, 2024

Just so I'm clear: this issue only affects the Pyrra web UI. If you are using Grafana dashboards, this shouldn't be an issue.
