Reported request latency telemetry doesn't cover all the overhead of proxy #479
@briansmith I think this should be a P1 for 0.4.0 (bundled with other telemetry reporting changes). Though it could also be a P2 on 0.3.1, since I don't think it has any prerequisite changes.
If I'm understanding correctly, all the latency metrics in the web UI and …
Currently, the metrics most closely reflect the time spent in the application, which is how they are intended to be recorded. When we introduce client-side metrics, we'll have a more holistic view of latency. I don't really consider this a bug: if we moved the recording to include proxy latency, we couldn't accurately measure application performance.
OK, that's what I was missing. I didn't realize this was working as intended, so it definitely isn't urgent. Still, I disagree that it's right to show only the latency without the proxy overhead, even in the current inbound case. At least two people (@bourquep and myself) expected the proxy overhead to be included in the numbers currently displayed, so we may need to improve the UX here.
@briansmith that's fair. Ideally, we should start recording request latencies as soon as a request is received by the proxy (i.e. not when it's sent, as it is today); but we should additionally record a handle time that explicitly tracks the time-in-proxy. Handle time should in almost all cases be negligible, but with these two latency distributions it should be possible to tease out application latency from total latency.
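The two-distribution idea above can be sketched roughly as follows. All the struct and method names here are hypothetical, chosen for illustration; they are not the proxy's actual types.

```rust
use std::time::{Duration, Instant};

// Hypothetical timestamps for one request's lifecycle (illustrative
// names, not Conduit's actual API).
struct RequestTimestamps {
    received_by_proxy: Instant,  // proxy accepts the request
    sent_to_app: Instant,        // proxy forwards it to the application
    response_complete: Instant,  // response fully streamed back
}

impl RequestTimestamps {
    // Total latency: from proxy receive to response completion.
    fn total_latency(&self) -> Duration {
        self.response_complete - self.received_by_proxy
    }

    // Handle time: the slice of total latency spent inside the proxy
    // before the request reached the application.
    fn handle_time(&self) -> Duration {
        self.sent_to_app - self.received_by_proxy
    }

    // With both distributions recorded, application latency can be
    // teased out of total latency.
    fn application_latency(&self) -> Duration {
        self.total_latency() - self.handle_time()
    }
}
```

In practice the two quantities would be recorded as separate histograms rather than computed per-request, but the subtraction shows how the distributions relate.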
A couple of updates on this:
To make the request latency metrics more accurate, I'm going to modify the proxy so that the request start timestamp is taken as soon as the request context is created, which should be the earliest point in the request lifecycle where this is possible. This means the request latency metric will include the majority of the proxy's handle time. Eventually, we should also add a separate histogram stat for handle time; I've opened #730 to track this as a follow-up issue.
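A minimal sketch of the change described above, assuming a per-request context type (`RequestCtx` is a stand-in name, not the proxy's actual type): the timestamp is captured in the constructor, so everything downstream measures from the earliest possible point.

```rust
use std::time::{Duration, Instant};

// Stand-in for the proxy's per-request context (hypothetical name).
struct RequestCtx {
    created_at: Instant,
}

impl RequestCtx {
    fn new() -> Self {
        // Taking the timestamp at context creation means the latency
        // measurement covers most of the proxy's handle time, rather
        // than starting when the request is later dispatched.
        RequestCtx { created_at: Instant::now() }
    }

    // Latency measured from the earliest point in the lifecycle.
    fn latency_so_far(&self) -> Duration {
        self.created_at.elapsed()
    }
}
```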
Okay, I spent some time this morning looking at the code, and have some new information. The timestamp we use for calculating request latency is taken in …; this is fairly early in the request pipeline. We could probably move the timestamp to the very beginning of the function, but I'm not sure that would make a significant difference in accuracy. We also collect a timestamp immediately upon opening a connection (in …). Hypothetically, the connection `opened_at` timestamp could be used as the start time of HTTP/1 requests. However, this wouldn't make sense for HTTP/2, where multiple requests may be initiated on a connection that has already been accepted. Furthermore, I don't think we actually want to use the connection-accepted timestamp, since the design doc (…)
FWIW, HTTP/1.1 connections can serve multiple requests, so this value could only be used for the first request on a connection--and this seems a bit complex to do generally. Furthermore, the proxy may eagerly establish connections to HTTP endpoints, so the connection time may be totally divorced from the request time.
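The problem with reusing the connection-open timestamp can be illustrated concretely. This is a hypothetical sketch (the names are not the proxy's): any request after the first on a keep-alive or multiplexed connection would have its latency overstated by the idle time between connection open and request start.

```rust
use std::time::{Duration, Instant};

// Illustrative types only: on a keep-alive (HTTP/1.1) or multiplexed
// (HTTP/2) connection, many requests share one `opened_at`, so the
// connection-open time is only a valid start timestamp for the very
// first request.
struct Connection {
    opened_at: Instant,
}

struct Request {
    started_at: Instant,
}

// How much a latency measurement would be inflated if it were based
// on the connection's `opened_at` instead of the request's own start.
fn overstatement(conn: &Connection, req: &Request) -> Duration {
    req.started_at - conn.opened_at
}
```

For a second request that arrives five seconds into the connection's lifetime, the measurement would be inflated by those five seconds, which is why per-request timestamps are needed.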
Can you go into more detail on this? Perhaps the sensor module needs to be instrumented earlier in the stack. For instance, I'd suspect that this module should be installed above the router (and not in the routed stack)...
Ah, that's a good point, thanks --- I think we can probably move the …
The module is currently installed in bind.rs. But I also don't think that this module has to be moved. The important thing is that we record the request's timestamp as early as possible; the sensors module can reference that timestamp if the request is annotated properly.
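The annotation approach can be sketched as follows, under the assumption that the request carries the timestamp with it down the stack (in the real proxy this would presumably live in the request's extensions; all names here are hypothetical).

```rust
use std::time::{Duration, Instant};

// Hypothetical annotated request: the earliest layer stamps it, and
// the sensor module reads that stamp rather than taking its own,
// later, timestamp.
struct AnnotatedRequest {
    body: String,
    received_at: Instant, // recorded as early as possible
}

// The accept path annotates the request on arrival.
fn accept(body: String) -> AnnotatedRequest {
    AnnotatedRequest { body, received_at: Instant::now() }
}

// The sensor, installed anywhere later in the stack, measures latency
// relative to the annotated timestamp, not its own clock read.
fn sensor_latency(req: &AnnotatedRequest) -> Duration {
    req.received_at.elapsed()
}
```

This is why the sensor module itself doesn't have to move: as long as the annotation happens early, the sensor's position in the stack no longer determines where the measurement starts.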
Yeah, that makes sense. After looking at the code, it looks like the …
In #437 (comment), @seanmonstar wrote:
One summary of this is that we're measuring mostly the service's latency but not all the additional latency that Conduit adds. Thus users looking at the Conduit Web UI and/or `conduit stat` output may be getting inaccurate, overly-optimistic numbers. This matches the behavior initially reported in #437, where @bourquep reported that he could measure the latency increasing while Conduit's UI reported that it was not.
Assigning to @olix0r so he can set the prioritization for the 0.3.1 or 0.4 release and resource it.
/cc @adleong