-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add latency and availability metrics to gateway #838
Conversation
3f0797d
to
5410878
Compare
m.recordLatency(middlewareResponseLatencyTag, start, req) | ||
|
||
//for error metrics only emit when there is gateway error and not request error | ||
if res.pendingStatusCode >= 500 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4xx is a success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the wrong place to verify statusCode. statusCode would be available only after we make call to downstream client - ctx = m.handle(ctx, req, res)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4xx is a success?
Removed the success part as it can be taken from total_requests.Not taking 4xx as error is because the availability defined for gateway should not be dependent on the user's request and rather on the gateway internal errors, so that is why only 5xx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the wrong place to verify statusCode. statusCode would be available only after we make call to downstream client - ctx = m.handle(ctx, req, res)
zanzibar/runtime/server_http_response.go
Line 271 in 24bea8e
res.pendingStatusCode = statusCode |
We are setting the pendingStatuscode whenever we call the res.WriteJson from the middlewares,so this will always be populated in case of errors.
5410878
to
e1c6d2d
Compare
m.recordLatency(middlewareResponseLatencyTag, start, req) | ||
|
||
//for error metrics only emit when there is gateway error and not request error | ||
if res.pendingStatusCode >= 500 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the wrong place to verify statusCode. statusCode would be available only after we make call to downstream client - ctx = m.handle(ctx, req, res)
e1c6d2d
to
f126aa7
Compare
addressed comments
f126aa7
to
c61eb74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Please add the test plan and a dump of the new metrics from a sample Zanzibar example gateway for both successful / unsuccessful runs. |
@@ -183,7 +184,7 @@ func TestMiddlewareRequestAbort(t *testing.T) { | |||
if !assert.NoError(t, err) { | |||
return | |||
} | |||
assert.Equal(t, resp.StatusCode, http.StatusOK) | |||
assert.Equal(t, resp.StatusCode, http.StatusInternalServerError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was changed because if we are reaching here means that the middleware request has aborted with an error and we will return from here without calling the downstream .So sending a 200 ok in this case is not apt.Additionally , it increased the code coverage as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave some comments
for j := i; j >= 0; j-- { | ||
m.middlewares[j].HandleResponse(ctx, res, shared) | ||
} | ||
//record latency for middlewares responses in unsuccesful case | ||
m.recordLatency(middlewareResponseLatencyTag, middlewareResponseStartTime, req.scope) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we tracking req/resp latency for each middleware or all of the edge middlewares?
If it's the former, I don't think the code does what it is supposed to do
If there are 4 middlewares and handling fails for the 4th one, then we need to run the decrement loop of handling th responses for the 3 that executed.
In this case, the middleware response latency recorded for the 3rd would be incorrect as it will include the latency of the first 2 middlewares's responses.
for j := i; j >= 0; j-- {
m.middlewares[j].HandleResponse(ctx, res, shared)
}
m.recordLatency(middlewareResponseLatencyTag, middlewareResponseStartTime, req.scope)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we tracking req/resp latency for each middleware or all of the edge middlewares?
Yes , it is for all the middlewares that a request goes through , since these metrics are tagged with endpointID we will apply the filtering . So we are not measuring for one middleware but rather all the middleware that a request goes through and filter on the basis of endpointID's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested clarification
Added metrics dump https://code.uberinternal.com/P341907 |
Added metrics that needs to be sent for latency ( in case of middleware request and response ) and availability metrics for middlewares requests