
Reported request latency telemetry doesn't cover all the overhead of proxy #479

Closed
briansmith opened this issue Feb 28, 2018 · 12 comments

@briansmith
Contributor

In #437 (comment), @seanmonstar wrote:

I don't have any leads yet on why the latency is increasing, but I think I do know why Conduit isn't reporting it: the http Event::StreamRequestOpen is triggered just before we serialize the request, so any time spent since originally accepting the request (waiting on DNS, connecting, and waiting in queue) is not included in the latency.

In #437 (comment), @seanmonstar wrote:

So, we do collect telemetry on both client and server socket opens and closes, but it looks like those events don't get forwarded into Prometheus, and so cannot be shown in the conduit dashboard.

In summary, we're mostly measuring the service's latency, but we're not measuring the additional latency that Conduit adds. Users looking at the Conduit Web UI and/or conduit stat output may therefore be getting inaccurate, overly optimistic numbers.
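To make the gap concrete, here is a minimal, self-contained sketch (invented timings, not the actual proxy code) of why a timer that starts at StreamRequestOpen under-reports the latency a client observes:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    let accepted_at = Instant::now();     // proxy accepts the request
    sleep(Duration::from_millis(5));      // stand-in for DNS, connect, and queueing (unmeasured today)
    let stream_open_at = Instant::now();  // roughly where Event::StreamRequestOpen fires
    sleep(Duration::from_millis(10));     // stand-in for the application handling the request
    println!("reported latency:        {:?}", stream_open_at.elapsed()); // ~10 ms
    println!("client-observed latency: {:?}", accepted_at.elapsed());    // ~15 ms
}
```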

This matches the behavior initially reported in #437, where @bourquep reported that he could measure the latency increasing while Conduit's UI showed it as not increasing.

Assigning to @olix0r so he can set the prioritization for the 0.3.1 or 0.4 release and resource it.

/cc @adleong

@olix0r
Member

olix0r commented Feb 28, 2018

@briansmith I think this should be a P1 for 0.4.0 (bundled with other telemetry reporting changes). Though this could also be a P2 on 0.3.1, since I don't think it has any prerequisite changes.

@olix0r olix0r added the priority/P1 Planned for Release label Feb 28, 2018
@olix0r olix0r removed their assignment Feb 28, 2018
@briansmith
Contributor Author

@briansmith I think this should be a P1 for 0.4.0 (bundled with other telemetry reporting changes). Though this could also be a P2 on 0.3.1, since I don't think it has any prerequisite changes.

If I'm understanding correctly, all the latency metrics in the web UI and conduit stat are misleading, at best. If so, I think this should probably block the 0.4 release, if not the 0.3.1 release.

@olix0r
Member

olix0r commented Feb 28, 2018

Currently, the metrics most closely reflect the time spent in the application, which is how they are intended to be recorded. When we introduce client-side metrics, we'll have a more holistic view of latency. I don't consider this a bug, really: if we moved the recording to include proxy latency, we couldn't accurately measure application performance.

@briansmith
Contributor Author

I don't consider this a bug, really: if we moved the recording to include proxy latency, we couldn't accurately measure application performance.

OK, that's what I was missing. I didn't realize this was working as intended. It definitely isn't urgent.

I disagree that it is right to show only the latency without the proxy overhead even in the current inbound case. At least two people (@bourquep and myself) expected the proxy overhead to be included in the numbers currently being displayed, so we might need to improve the UX here.

@olix0r
Member

olix0r commented Feb 28, 2018

@briansmith that's fair. I think that ideally we should start recording request latencies as soon as a request is received by the proxy (i.e. not when it's sent, as it is today); but we should additionally record a handle time that explicitly tracks the time-in-proxy.

Handle time should in almost all cases be negligible, but with these two latency distributions it should be possible to tease out application latency from total latency.
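A rough sketch of the two measurements being proposed, with assumed structure and made-up timings rather than the proxy's actual instrumentation:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    let received_at = Instant::now();    // request received by the proxy
    sleep(Duration::from_micros(300));   // proxy work before forwarding (routing, serialization, ...)
    let forwarded_at = Instant::now();   // request sent toward the application
    sleep(Duration::from_millis(12));    // application produces the response
    let responded_at = Instant::now();

    let total_latency = responded_at - received_at; // recorded from when the request is received
    let handle_time = forwarded_at - received_at;   // time-in-proxy, expected to be negligible
    println!("total latency: {:?}, handle time: {:?}", total_latency, handle_time);
}
```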

@olix0r olix0r added this to To do in Telemetry Mar 2, 2018
@seanmonstar seanmonstar removed the bug label Mar 3, 2018
@adleong adleong moved this from Blocked to Ready in Telemetry Mar 3, 2018
@hawkw
Member

hawkw commented Mar 19, 2018

A couple of updates on this:

  1. I've opened a separate issue, Implement request duration metric in the proxy (#570), to track adding request latency metrics.
  2. Regarding adding handle time metrics, before we can move forward we'll want to answer the questions posed by @adleong in Add proxy metrics documentation (#536, comment):
  • What units would handle time be reported in? In ms it would almost always be zero, but mixing units (ms for latency, µs for handle time) is confusing.
  • Do the latency and duration metrics include the handle time?
  • If we add handle time, we probably also want to add total time (total time = handle time + latency) or, vice versa, service latency (service latency = latency - handle time) to discourage users from doing the wrong thing of trying to add or subtract percentiles. Because math (see the sketch below).
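A small worked example of the percentile problem referenced above (the samples are made up purely for illustration): percentiles of two separate histograms cannot be subtracted to recover the percentile of the per-request difference, which is why exporting a derived metric is preferable.

```rust
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
    sorted[idx]
}

fn main() {
    // Hypothetical per-request samples: (total time, handle time) in milliseconds.
    let samples = [(10.0, 1.0), (12.0, 0.5), (50.0, 0.2), (11.0, 9.0), (13.0, 8.0)];

    let mut total: Vec<f64> = samples.iter().map(|s| s.0).collect();
    let mut handle: Vec<f64> = samples.iter().map(|s| s.1).collect();
    let mut service: Vec<f64> = samples.iter().map(|s| s.0 - s.1).collect();
    total.sort_by(|a, b| a.partial_cmp(b).unwrap());
    handle.sort_by(|a, b| a.partial_cmp(b).unwrap());
    service.sort_by(|a, b| a.partial_cmp(b).unwrap());

    // p90(total) - p90(handle) = 50 - 9 = 41, but the true p90 of
    // per-request service latency (total - handle) is 49.8.
    println!("p90(total) - p90(handle) = {}", percentile(&total, 0.9) - percentile(&handle, 0.9));
    println!("p90(total - handle)      = {}", percentile(&service, 0.9));
}
```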

@siggy siggy modified the milestones: 0.4.0, 0.5.0 Apr 10, 2018
@olix0r olix0r modified the milestones: 0.5.0, 0.4.0 Apr 10, 2018
@hawkw
Member

hawkw commented Apr 10, 2018

In order to make the request latency metrics more accurate, I'm going to modify the proxy so that the request start timestamp is taken as soon as the request context is created, which should be the earliest point in the request lifecycle where this is possible. This means the request latency metric will include the majority of the proxy's handle time.

Eventually, we should also add a separate histogram stat for handle time. I've opened #730 to track this as a follow-up issue.

@hawkw
Member

hawkw commented Apr 11, 2018

Okay, I spent some time this morning looking at the code, and have some new information:

The timestamp we use for calculating request latency is taken in sensors::Http::call, immediately after we construct a request context for that request:

https://github.com/runconduit/conduit/blob/9321932918bde9cbc96a9bee076f4bfa58415165/proxy/src/telemetry/sensor/http.rs#L205-L215

This is fairly early in the request pipeline. We could probably move the timestamp to the very beginning of the function, but I'm not sure if that would make a significant difference in accuracy.

We also collect a timestamp immediately upon opening a connection (in transparency::Server::serve) here:
https://github.com/runconduit/conduit/blob/02c6887020010ead730d537bea765ef9774a0f2a/proxy/src/transparency/server.rs#L96-L97

Hypothetically, the connection opened_at timestamp could be used for the start time of HTTP/1 requests. However, this wouldn't make sense for HTTP/2, as multiple requests may be initiated on a connection that has already been accepted. Furthermore, I don't think that we actually want to use the connection accepted timestamp, since the design doc (doc/proxy_metrics.md) says that response latency is measured from when the request headers are received.

@olix0r
Member

olix0r commented Apr 11, 2018

Hypothetically, the connection opened_at timestamp could be used for the start time of HTTP/1 requests. However, this wouldn't make sense for HTTP/2, as multiple requests may be initiated on a connection that has already been accepted.

FWIW, HTTP/1.1 connections can serve multiple requests, so this value could only be used for the first request on a connection, and this seems a bit complex to do generally. Furthermore, the proxy may eagerly establish connections to HTTP endpoints, so the connection time may be totally divorced from the request time.

The timestamp we use for calculating request latency is taken in sensors::Http::call, immediately after we construct a request context for that request [...] This is fairly early in the request pipeline.

Can you go into more detail on this? Perhaps the sensor module needs to be instrumented earlier in the stack. For instance, I'd suspect that this module should be installed above the router (and not in the routed stack)...

@hawkw
Member

hawkw commented Apr 11, 2018

@olix0r

Perhaps the sensor module needs to be instrumented earlier in the stack. For instance, I'd suspect that this module should be installed above the router (and not in the routed stack)...

Ah, that's a good point, thanks --- I think we can probably move the sensors::Http middleware much higher in the stack without too much effort.

@olix0r
Member

olix0r commented Apr 11, 2018

The module is currently installed in bind.rs:

https://github.com/runconduit/conduit/blob/9321932918bde9cbc96a9bee076f4bfa58415165/proxy/src/bind.rs#L201-L205

But I also don't think that this module has to be moved. The important thing is that we record the request's timestamp as early as possible; the sensors module can reference that timestamp if the request is annotated properly.

@hawkw
Member

hawkw commented Apr 11, 2018

Yeah, that makes sense. After looking at the code, it looks like the sensors::Http module is installed under Bind because Bind owns the request IDs. However, I think we can add an extension to the request carrying the request-open timestamp, and have the sensors use that rather than taking a timestamp when they're called.
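A minimal sketch of that approach (invented type and function names, not the actual Conduit code; it only assumes the http crate's request extensions): stamp the request with its open time as early as possible, and let the telemetry sensor read the stamp instead of taking its own timestamp.

```rust
// Requires the `http` crate as a dependency.
use std::time::Instant;

#[derive(Clone, Copy, Debug)]
struct RequestOpen(Instant);

// Applied at the earliest point in the stack, before routing/binding.
fn stamp_request<B>(mut req: http::Request<B>) -> http::Request<B> {
    req.extensions_mut().insert(RequestOpen(Instant::now()));
    req
}

// Called later by the telemetry sensor; falls back to "now" if the extension
// is missing, which degrades to today's behavior.
fn request_open_at<B>(req: &http::Request<B>) -> Instant {
    req.extensions()
        .get::<RequestOpen>()
        .map(|open| open.0)
        .unwrap_or_else(Instant::now)
}

fn main() {
    let req = stamp_request(http::Request::new(()));
    println!("time since request open: {:?}", request_open_at(&req).elapsed());
}
```

With something like this, the stamp can be applied above the router while the existing sensors::Http middleware keeps its current position in the stack.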

@hawkw hawkw added the review label Apr 11, 2018
@olix0r olix0r modified the milestones: 0.4.0, 0.4.1 Apr 16, 2018
Telemetry automation moved this from Ready to Done Apr 17, 2018
@hawkw hawkw removed the review label Apr 17, 2018
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021