
add latency and availability metrics to gateway #838

Merged: 5 commits from zeeshan/emitGatewayMetrics into master, Oct 31, 2022

Conversation

zeeshan-mohd (Contributor):

Added latency metrics (covering middleware request and response handling) and availability metrics for middleware requests.
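
For readers skimming the diff, here is a minimal sketch of the kind of emission this adds, using a tally-style scope (which Zanzibar uses for metrics). The helper and metric names below are illustrative assumptions, not the exact code in this PR:

```go
package sketch

import (
	"time"

	"github.com/uber-go/tally"
)

// recordMiddlewareMetrics is a hypothetical helper, not the PR's actual code.
func recordMiddlewareMetrics(scope tally.Scope, start time.Time, statusCode int) {
	// Latency for the middleware request/response path.
	scope.Timer("middleware.latency").Record(time.Since(start))

	// Availability inputs: count every request, but only count 5xx as gateway
	// errors; 4xx reflects the caller's request, not gateway health.
	scope.Counter("middleware.requests").Inc(1)
	if statusCode >= 500 {
		scope.Counter("middleware.errors").Inc(1)
	}
}
```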

@CLAassistant commented Oct 3, 2022:

CLA assistant check
All committers have signed the CLA.

@zeeshan-mohd force-pushed the zeeshan/emitGatewayMetrics branch 3 times, most recently from 3f0797d to 5410878 on October 3, 2022 at 16:58
m.recordLatency(middlewareResponseLatencyTag, start, req)

// for error metrics, only emit when there is a gateway error and not a request error
if res.pendingStatusCode >= 500 {
Contributor:

4xx is a success?

Contributor:

I think it's the wrong place to verify statusCode. statusCode would be available only after we make a call to the downstream client - ctx = m.handle(ctx, req, res)

zeeshan-mohd (Author):

> 4xx is a success?

Removed the success part, as it can be derived from total_requests. 4xx is not counted as an error because the availability defined for the gateway should not depend on the user's request, but rather on the gateway's internal errors; that is why only 5xx is counted.
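
In other words, availability can be derived afterwards from the emitted counters. A minimal sketch of that derivation (the counter semantics are assumptions, not code from this PR):

```go
package sketch

// availability is an illustrative calculation:
// availability = 1 - gateway_errors / total_requests, where gateway_errors
// counts only 5xx responses, so client-caused 4xx does not lower availability.
func availability(totalRequests, gatewayErrors float64) float64 {
	if totalRequests == 0 {
		return 1.0 // no traffic: treat as fully available
	}
	return 1.0 - gatewayErrors/totalRequests
}
```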

zeeshan-mohd (Author):

> I think it's the wrong place to verify statusCode. statusCode would be available only after we make a call to the downstream client - ctx = m.handle(ctx, req, res)

res.pendingStatusCode = statusCode

We set pendingStatusCode whenever we call res.WriteJson from the middlewares, so it will always be populated in case of errors.
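
A rough sketch of that pattern, with simplified stand-in types rather than Zanzibar's real response object (the field and method shapes here are assumptions):

```go
package sketch

import (
	"encoding/json"
	"net/http"
)

// serverResponse is a simplified stand-in for the gateway's response type.
type serverResponse struct {
	pendingStatusCode int
	w                 http.ResponseWriter
}

// WriteJSON records the status code before writing, so later metrics code can
// read pendingStatusCode even though the reply has already been sent.
func (res *serverResponse) WriteJSON(statusCode int, body interface{}) {
	res.pendingStatusCode = statusCode
	res.w.Header().Set("Content-Type", "application/json")
	res.w.WriteHeader(statusCode)
	_ = json.NewEncoder(res.w).Encode(body)
}
```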

runtime/middlewares.go: 4 review threads marked outdated and resolved

@coveralls commented Oct 25, 2022:

Coverage increased (+0.03%) to 69.997% when pulling eb1ccc0 on zeeshan/emitGatewayMetrics into 1966351 on master.

@bishnuag (Contributor) left a comment:

LGTM

@isopropylcyanide (Contributor):

Please add the test plan and a dump of the new metrics from a sample Zanzibar example gateway for both successful / unsuccessful runs.

@@ -183,7 +184,7 @@ func TestMiddlewareRequestAbort(t *testing.T) {
 	if !assert.NoError(t, err) {
 		return
 	}
-	assert.Equal(t, resp.StatusCode, http.StatusOK)
+	assert.Equal(t, resp.StatusCode, http.StatusInternalServerError)
Contributor:

Why is this changed?

zeeshan-mohd (Author):

It was changed because reaching this point means the middleware request was aborted with an error and we return from here without calling the downstream, so sending a 200 OK in this case is not appropriate. Additionally, it increased the code coverage as well.
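
A minimal sketch of the control flow the updated assertion reflects, using plain net/http as a stand-in for the gateway's types (an illustration, not the gateway's actual abort code):

```go
package sketch

import "net/http"

// abortIfMiddlewareFailed illustrates the behavior under test: when a
// middleware aborts the request, the gateway answers with a 5xx itself and the
// downstream handler is never invoked, so the test cannot expect 200 OK.
func abortIfMiddlewareFailed(ok bool, w http.ResponseWriter) bool {
	if !ok {
		http.Error(w, "middleware aborted the request", http.StatusInternalServerError)
		return false // caller skips the downstream call
	}
	return true
}
```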

@isopropylcyanide (Contributor) left a comment:

Gave some comments

for j := i; j >= 0; j-- {
	m.middlewares[j].HandleResponse(ctx, res, shared)
}
// record latency for middleware responses in the unsuccessful case
m.recordLatency(middlewareResponseLatencyTag, middlewareResponseStartTime, req.scope)
Contributor:

Are we tracking req/resp latency for each middleware or for all of the edge middlewares?

If it's the former, I don't think the code does what it is supposed to do.

If there are 4 middlewares and handling fails for the 4th one, then we need to run the decrement loop, handling the responses for the 3 that executed. In this case, the middleware response latency recorded for the 3rd would be incorrect, as it will include the latency of the first 2 middlewares' responses.
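
For contrast, a sketch of what per-middleware response latency would look like, with assumed stand-in types and a tally-style scope (this is not what the PR implements):

```go
package sketch

import (
	"time"

	"github.com/uber-go/tally"
)

// responseMiddleware is a simplified stand-in for the gateway's middleware handle.
type responseMiddleware interface {
	Name() string
	HandleResponse()
}

// handlePerMiddlewareLatency restarts the timer for every middleware, so each
// one is measured individually instead of recording one aggregate duration
// after the whole decrement loop.
func handlePerMiddlewareLatency(middlewares []responseMiddleware, i int, scope tally.Scope) {
	for j := i; j >= 0; j-- {
		start := time.Now()
		middlewares[j].HandleResponse()
		scope.Tagged(map[string]string{"middleware": middlewares[j].Name()}).
			Timer("middleware.response.latency").
			Record(time.Since(start))
	}
}
```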


zeeshan-mohd (Author):

> Are we tracking req/resp latency for each middleware or for all of the edge middlewares?

Yes, it is for all the middlewares that a request goes through. Since these metrics are tagged with endpointID, we will apply filtering there. So we are not measuring a single middleware, but rather all the middlewares a request goes through, and we filter on the basis of endpointIDs.
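
A short sketch of that tagging, with an assumed tag key and metric name rather than the PR's exact ones:

```go
package sketch

import (
	"time"

	"github.com/uber-go/tally"
)

// recordEndpointTaggedLatency illustrates aggregate middleware latency tagged
// by endpoint ID, so dashboards can filter per endpoint even though the
// measurement spans every middleware the request passes through.
func recordEndpointTaggedLatency(root tally.Scope, endpointID string, start time.Time) {
	root.Tagged(map[string]string{"endpointid": endpointID}).
		Timer("middleware.response.latency").
		Record(time.Since(start))
}
```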

@isopropylcyanide (Contributor) left a comment:

Requested clarification

@sachsingh (Contributor):

> Please add the test plan and a dump of the new metrics from a sample Zanzibar example gateway for both successful / unsuccessful runs.

Added metrics dump https://code.uberinternal.com/P341907

@sachsingh closed this Oct 31, 2022
@sachsingh reopened this Oct 31, 2022
@sachsingh merged commit d9a8b55 into master Oct 31, 2022