chore: gateway_response_time buckets #3554
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #3554      +/-   ##
==========================================
- Coverage   67.54%   67.50%   -0.04%
==========================================
  Files         321      321
  Lines       52027    52027
==========================================
- Hits        35141    35122      -19
- Misses      14560    14578      +18
- Partials     2326     2327       +1

☔ View full report in Codecov by Sentry.
(force-pushed from ca1e0e3 to 9d0cce2)
runner/buckets.go (Outdated)

var customBuckets = map[string][]float64{
	"gateway.response_time": {
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60,
		300, /* 5 mins */
We have a timeout of 10 sec, so 300 might not really be required.
Why do we have the histogram showing requests taking much more than that then? 🤔
In the PR description I posted a screenshot depicting that.
The HTTP server is configured with a WriteTimeout of 10 sec.
Even if we don't honour this at the handler and take more than 10 sec to prepare a response to a client's request, my understanding is that the client will never receive that response; they will receive a proxy error instead.
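A minimal stdlib-only sketch of this behaviour (durations here are illustrative, scaled down from the real 10 sec timeout; slowServerClientErr is a hypothetical helper, not code from the repo): once WriteTimeout elapses before the handler writes anything, the client's request fails instead of receiving the late response.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowServerClientErr starts a test server whose handler takes longer than
// the server's WriteTimeout, then returns the error the client observes.
func slowServerClientErr() error {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(300 * time.Millisecond) // handler exceeds the write deadline
		fmt.Fprintln(w, "too late")        // this write happens after the deadline
	})

	srv := httptest.NewUnstartedServer(h)
	srv.Config.WriteTimeout = 100 * time.Millisecond // stand-in for the real 10s
	srv.Start()
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return err // typically EOF: the connection was closed before any response
	}
	resp.Body.Close()
	return nil
}

func main() {
	fmt.Println("client saw:", slowServerClientErr())
}
```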
If that is the case then yeah, anything >10s is actionable from a metrics perspective. I can remove 300, but I'll leave 60 just in case, because the WriteTimeout
is configurable after all; perhaps some customer wants to increase the timeout. In that case at least we will have it covered up to one minute.
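To make the discussion concrete, here is a small stdlib-only sketch (bucketFor is a hypothetical helper, not part of the repo) of how Prometheus-style cumulative le buckets classify observations under the proposed layout, with 60 as the last finite bound:

```go
package main

import (
	"fmt"
	"sort"
)

// bucketFor returns the upper bound (le) of the smallest bucket that an
// observation falls into. Prometheus buckets are cumulative, so a sample
// counts toward every bucket whose bound is >= the value; the smallest such
// bound is what determines where the sample first appears. ok=false means the
// value lands only in the implicit +Inf bucket.
func bucketFor(buckets []float64, v float64) (le float64, ok bool) {
	i := sort.SearchFloat64s(buckets, v) // first index with buckets[i] >= v
	if i == len(buckets) {
		return 0, false
	}
	return buckets[i], true
}

func main() {
	// Proposed gateway.response_time bounds after dropping 300s (seconds).
	buckets := []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60}

	for _, v := range []float64{0.003, 0.3, 12, 120} {
		if le, ok := bucketFor(buckets, v); ok {
			fmt.Printf("%gs -> le=%g\n", v, le)
		} else {
			fmt.Printf("%gs -> le=+Inf\n", v)
		}
	}
}
```

Note how a 12s observation (over the 10s WriteTimeout but under a raised one) still lands in the 60 bucket, while anything above 60 is only distinguishable as +Inf.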
runner/runner.go (Outdated)

@@ -131,11 +131,15 @@ func (r *Runner) Run(ctx context.Context, args []string) int {
	(!config.IsSet("WORKSPACE_NAMESPACE") || strings.Contains(config.GetString("WORKSPACE_NAMESPACE", ""), "free")) {
	config.Set("statsExcludedTags", []string{"workspaceId", "sourceID", "destId"})
}
stats.Default = stats.NewStats(config.Default, logger.Default, svcMetric.Instance,
statsOptions := []stats.Option{
	stats.WithServiceName(r.appType),
	stats.WithServiceVersion(r.releaseInfo.Version),
	stats.WithDefaultHistogramBuckets(defaultHistogramBuckets),
For defaultHistogramBuckets we might need 2 more: 0.002 for really fast operations, and 20 for slower ones that aren't as slow as 1 min.
I'm trying to decrease the buckets if possible 😅
What do you think are the use cases for the 0.002 and 20 buckets?
@atzoum I've introduced those, removed a few on the higher end, and created custom boundaries just for the warehouse. It should work fine, please have a look.
@achettyiitr can you please review and let me know if the buckets for warehouse make sense, or if we can remove some? My guess is that the ones at microsecond precision aren't going to be useful, right? Possibly others as well, or do you reckon we might have timers that measure in that range anyway?
Microsecond precision is not required.
(force-pushed from b71c03d to 87d6551)
Co-authored-by: Aris Tzoumas <atzoumas@rudderstack.com>
Description
I don't think we need as many buckets as we have by default for the gateway_response_time metric. This task is to keep only the relevant buckets.

I propose we remove the boundaries for requests that take longer than 5 minutes. Data-wise, knowing that a request takes even longer than that isn't helpful in my opinion; we should try to avoid anything that goes above 1 minute and possibly look into anything that goes above 5.
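A sketch of the proposed trimming (trimBuckets is a hypothetical helper for illustration; the actual PR just edits the bucket literals directly):

```go
package main

import "fmt"

// trimBuckets keeps only the boundaries at or below maxBound, dropping the
// tail (e.g. the 300s / 5-minute bucket) while leaving the rest untouched.
func trimBuckets(buckets []float64, maxBound float64) []float64 {
	out := make([]float64, 0, len(buckets))
	for _, b := range buckets {
		if b <= maxBound {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	original := []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60, 300}
	fmt.Println(trimBuckets(original, 60)) // drops 300, keeps everything up to 60
}
```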
Screenshot from prousmt:

Tested locally:
Let me know if the boundaries make sense to you and if you think we should have different ones.
Also, if you think we should do the same with other metrics we might as well add them here.
Notion Ticket
< Notion Link >
Security