
Reported request latency telemetry doesn't cover all the overhead of proxy #479

Closed
briansmith opened this issue Feb 28, 2018 · 12 comments

@briansmith
Contributor

In #437 (comment), @seanmonstar wrote:

I don't have any leads yet on why the latency is increasing, but I think I do know why Conduit isn't reporting it: the http Event::StreamRequestOpen is triggered just before we serialize the request, so any time spent since originally accepting the request (waiting on DNS, connecting, and waiting in queue) is not included in the latency.

In #437 (comment), @seanmonstar wrote:

So, we do collect telemetry on both client and server socket opens and closes, but it looks like those events don't get forwarded into Prometheus, and so cannot be shown in the conduit dashboard.

In summary, we're mostly measuring the service's latency, but we're not measuring the additional latency that Conduit adds. Users looking at the Conduit Web UI and/or conduit stat output may therefore be getting inaccurate, overly optimistic numbers.
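To make the gap concrete, here is a minimal, self-contained sketch (invented timings, not the actual proxy code) of why a timer that starts at StreamRequestOpen under-reports the latency a client observes:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    let accepted_at = Instant::now();     // proxy accepts the request
    sleep(Duration::from_millis(5));      // stand-in for DNS, connect, and queueing (unmeasured today)
    let stream_open_at = Instant::now();  // roughly where Event::StreamRequestOpen fires
    sleep(Duration::from_millis(10));     // stand-in for the application handling the request
    println!("reported latency:        {:?}", stream_open_at.elapsed()); // ~10 ms
    println!("client-observed latency: {:?}", accepted_at.elapsed());    // ~15 ms
}
```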

This matches the behavior initially reported in #437, where @bourquep reported that he could measure the latency increasing while Conduit's UI showed it as not increasing.

Assigning to @olix0r so he can set the prioritization for the 0.3.1 or 0.4 release and resource it.

/cc @adleong

@olix0r
Member

olix0r commented Feb 28, 2018

@briansmith I think this should be a P1 for 0.4.0 (bundled with other telemetry reporting changes). Though this could also be a P2 on 0.3.1, since I don't think it has any prerequisite changes.

@olix0r olix0r added the priority/P1 Planned for Release label Feb 28, 2018
@olix0r olix0r removed their assignment Feb 28, 2018
@briansmith
Contributor Author

@briansmith I think this should be a P1 for 0.4.0 (bundled with other telemetry reporting changes). Though this could also be a P2 on 0.3.1, since I don't think it has any prerequisite changes.

If I'm understanding correctly, all the latency metrics in the web UI and conduit stat are misleading, at best. If so, I think this should probably block the 0.4 release, if not the 0.3.1 release.

@olix0r
Member

olix0r commented Feb 28, 2018

Currently, the metrics most closely reflect the time spent in the application, which is how they are intended to be recorded. When we introduce client-side metrics, we'll have a more holistic view of latency. I don't consider this a bug, really: if we moved the recording to include proxy latency, we couldn't accurately measure application performance.

@briansmith
Contributor Author

I don't consider this a bug, really: if we moved the recording to include proxy latency, we couldn't accurately measure application performance.

OK, that's what I was missing. I didn't realize this was working as intended. It definitely isn't urgent.

I disagree that it is right to show only the latency without the proxy overhead even in the current inbound case. At least two people (@bourquep and myself) expected the proxy overhead to be included in the numbers currently being displayed, so we might need to improve the UX here.

@olix0r
Member

olix0r commented Feb 28, 2018

@briansmith that's fair. I think that ideally we should start recording request latencies as soon as a request is received by the proxy (i.e. not when it's sent, as it is today); but we should additionally record a handle time that explicitly tracks the time-in-proxy.

Handle time should in almost all cases be negligible, but with these two latency distributions it should be possible to tease out application latency from total latency.
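A rough sketch of the two measurements being proposed, with assumed structure and made-up timings rather than the proxy's actual instrumentation:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    let received_at = Instant::now();    // request received by the proxy
    sleep(Duration::from_micros(300));   // proxy work before forwarding (routing, serialization, ...)
    let forwarded_at = Instant::now();   // request sent toward the application
    sleep(Duration::from_millis(12));    // application produces the response
    let responded_at = Instant::now();

    let total_latency = responded_at - received_at; // recorded from when the request is received
    let handle_time = forwarded_at - received_at;   // time-in-proxy, expected to be negligible
    println!("total latency: {:?}, handle time: {:?}", total_latency, handle_time);
}
```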

@olix0r olix0r added this to To do in Telemetry Mar 2, 2018
@seanmonstar seanmonstar removed the bug label Mar 3, 2018
@adleong adleong moved this from Blocked to Ready in Telemetry Mar 3, 2018
@hawkw
Member

hawkw commented Mar 19, 2018

A couple of updates on this:

  1. I've opened a separate issue, Implement request duration metric in the proxy (#570), to track adding request latency metrics.
  2. Regarding adding handle time metrics, before we can move forward we'll want to answer the questions posed by @adleong in Add proxy metrics documentation (#536, comment):
  • What units would handle time be reported in? In ms it would almost always be zero, but mixing units (ms for latency, µs for handle time) is confusing.
  • Do the latency and duration metrics include the handle time?
  • If we add handle time, we probably also want to add total time (total time = handle time + latency) or, vice versa, service latency (service latency = latency - handle time) to discourage users from doing the wrong thing of trying to add or subtract percentiles. Because math (see the sketch below).
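A small worked example of the percentile problem referenced above (the samples are made up purely for illustration): percentiles of two separate histograms cannot be subtracted to recover the percentile of the per-request difference, which is why exporting a derived metric is preferable.

```rust
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let idx = ((sorted.len() - 1) as f64 * p).round() as usize;
    sorted[idx]
}

fn main() {
    // Hypothetical per-request samples: (total time, handle time) in milliseconds.
    let samples = [(10.0, 1.0), (12.0, 0.5), (50.0, 0.2), (11.0, 9.0), (13.0, 8.0)];

    let mut total: Vec<f64> = samples.iter().map(|s| s.0).collect();
    let mut handle: Vec<f64> = samples.iter().map(|s| s.1).collect();
    let mut service: Vec<f64> = samples.iter().map(|s| s.0 - s.1).collect();
    total.sort_by(|a, b| a.partial_cmp(b).unwrap());
    handle.sort_by(|a, b| a.partial_cmp(b).unwrap());
    service.sort_by(|a, b| a.partial_cmp(b).unwrap());

    // p90(total) - p90(handle) = 50 - 9 = 41, but the true p90 of
    // per-request service latency (total - handle) is 49.8.
    println!("p90(total) - p90(handle) = {}", percentile(&total, 0.9) - percentile(&handle, 0.9));
    println!("p90(total - handle)      = {}", percentile(&service, 0.9));
}
```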

@siggy siggy modified the milestones: 0.4.0, 0.5.0 Apr 10, 2018
@olix0r olix0r modified the milestones: 0.5.0, 0.4.0 Apr 10, 2018
@hawkw
Member

hawkw commented Apr 10, 2018

In order to make the request latency metrics more accurate, I'm going to modify the proxy so that the request start timestamp is taken as soon as the request context is created, which should be the earliest point in the request lifecycle where this is possible. This means the request latency metric will include the majority of the proxy's handle time.

Eventually, we should also add a separate histogram stat for handle time. I've opened #730 to track this as a follow-up issue.

@hawkw
Member

hawkw commented Apr 11, 2018

Okay, I spent some time this morning looking at the code, and have some new information:

The timestamp we use for calculating request latency is taken in sensors::Http::call, immediately after we construct a request context for that request:

https://github.com/runconduit/conduit/blob/9321932918bde9cbc96a9bee076f4bfa58415165/proxy/src/telemetry/sensor/http.rs#L205-L215

This is fairly early in the request pipeline. We could probably move the timestamp to the very beginning of the function, but I'm not sure if that would make a significant difference in accuracy.

We also collect a timestamp immediately upon opening a connection (in transparency::Server::serve) here:
https://github.com/runconduit/conduit/blob/02c6887020010ead730d537bea765ef9774a0f2a/proxy/src/transparency/server.rs#L96-L97

Hypothetically, the connection opened_at timestamp could be used for the start time of HTTP/1 requests. However, this wouldn't make sense for HTTP/2, as multiple requests may be initiated on a connection that has already been accepted. Furthermore, I don't think that we actually want to use the connection accepted timestamp, since the design doc (doc/proxy_metrics.md) says that response latency is measured from when the request headers are received.

@olix0r
Member

olix0r commented Apr 11, 2018

Hypothetically, the connection opened_at timestamp could be used for the start time of HTTP/1 requests. However, this wouldn't make sense for HTTP/2, as multiple requests may be initiated on a connection that has already been accepted.

FWIW, HTTP/1.1 connections can serve multiple requests, so this value could only be used for the first request on a connection, and this seems a bit complex to do generally. Furthermore, the proxy may eagerly establish connections to HTTP endpoints, so the connection time may be totally divorced from the request time.

The timestamp we use for calculating request latency is taken in sensors::Http::call, immediately after we construct a request context for that request [...] This is fairly early in the request pipeline.

Can you go into more detail on this? Perhaps the sensor module needs to be instrumented earlier in the stack. For instance, I'd suspect that this module should be installed above the router (and not in the routed stack)...

@hawkw
Member

hawkw commented Apr 11, 2018

@olix0r

Perhaps the sensor module needs to be instrumented earlier in the stack. For instance, I'd suspect that this module should be installed above the router (and not in the routed stack)...

Ah, that's a good point, thanks --- I think we can probably move the sensors::Http middleware much higher in the stack without too much effort.

@olix0r
Member

olix0r commented Apr 11, 2018

The module is currently installed in bind.rs:

https://github.com/runconduit/conduit/blob/9321932918bde9cbc96a9bee076f4bfa58415165/proxy/src/bind.rs#L201-L205

But I also don't think that this module has to be moved. The important thing is that we record the request's timestamp as early as possible; the sensors module can reference that timestamp if the request is annotated properly.

@hawkw
Member

hawkw commented Apr 11, 2018

Yeah, that makes sense. After looking at the code, it looks like the sensors::Http module is installed under Bind because Bind owns the request IDs. However, I think we can add an extension to the request carrying the request-open timestamp, and have the sensors use that rather than taking a timestamp when they're called.
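A minimal sketch of that approach (invented type and function names, not the actual Conduit code; it only assumes the http crate's request extensions): stamp the request with its open time as early as possible, and let the telemetry sensor read the stamp instead of taking its own timestamp.

```rust
// Requires the `http` crate as a dependency.
use std::time::Instant;

#[derive(Clone, Copy, Debug)]
struct RequestOpen(Instant);

// Applied at the earliest point in the stack, before routing/binding.
fn stamp_request<B>(mut req: http::Request<B>) -> http::Request<B> {
    req.extensions_mut().insert(RequestOpen(Instant::now()));
    req
}

// Called later by the telemetry sensor; falls back to "now" if the extension
// is missing, which degrades to today's behavior.
fn request_open_at<B>(req: &http::Request<B>) -> Instant {
    req.extensions()
        .get::<RequestOpen>()
        .map(|open| open.0)
        .unwrap_or_else(Instant::now)
}

fn main() {
    let req = stamp_request(http::Request::new(()));
    println!("time since request open: {:?}", request_open_at(&req).elapsed());
}
```

With something like this, the stamp can be applied above the router while the existing sensors::Http middleware keeps its current position in the stack.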

@hawkw hawkw added the review label Apr 11, 2018
@olix0r olix0r modified the milestones: 0.4.0, 0.4.1 Apr 16, 2018
Telemetry automation moved this from Ready to Done Apr 17, 2018
@hawkw hawkw removed the review label Apr 17, 2018
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021