
chore: gateway_response_time buckets #3554

Merged: 5 commits into master from chore.gwRespTimeBuckets on Jun 30, 2023

Conversation

@fracasula (Collaborator) commented Jun 27, 2023

Description

I don't think we need as many buckets as we do by default for the gateway_response_time metric. This task is to keep only the relevant buckets.

I propose we remove the boundaries for requests that take longer than 5 minutes. Data-wise, knowing that a request takes even longer than that isn't helpful in my opinion, and we should try to avoid anything that goes above 1 minute and possibly look into anything that goes above 5.
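For reference, a minimal sketch of what explicit boundaries capped at 5 minutes look like with the plain Prometheus Go client; the registration below is illustrative only, since the actual change goes through the repo's stats package (see the diff further down).

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative only: explicit boundaries from 5ms up to 5 minutes instead of
// the client defaults (prometheus.DefBuckets stops at 10s).
var gatewayResponseTime = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "gateway_response_time",
	Help:    "Gateway response time in seconds",
	Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60, 300},
})

func main() {
	prometheus.MustRegister(gatewayResponseTime)
	start := time.Now()
	// ... handle a gateway request here ...
	gatewayResponseTime.Observe(time.Since(start).Seconds())
}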

Screenshot from prousmt: [histogram screenshot]

Tested locally: [histogram screenshot]

Let me know if the boundaries make sense to you and if you think we should have different ones.
Also, if you think we should do the same with other metrics we might as well add them here.

Notion Ticket

< Notion Link >

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov (codecov bot) commented Jun 27, 2023

Codecov Report

Patch coverage: 84.61%; project coverage change: -0.04% ⚠️

Comparison is base (8975dcc) 67.54% compared to head (0a325f0) 67.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3554      +/-   ##
==========================================
- Coverage   67.54%   67.50%   -0.04%     
==========================================
  Files         321      321              
  Lines       52027    52027              
==========================================
- Hits        35141    35122      -19     
- Misses      14560    14578      +18     
- Partials     2326     2327       +1     
Impacted Files Coverage Δ
enterprise/replay/setup.go 0.00% <0.00%> (ø)
runner/runner.go 70.58% <100.00%> (+0.95%) ⬆️

... and 14 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@fracasula fracasula marked this pull request as ready for review June 27, 2023 13:08
@fracasula fracasula requested review from atzoum and cisse21 June 27, 2023 13:45
var customBuckets = map[string][]float64{
	"gateway.response_time": {
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60,
		300, /* 5 mins */
	},
}
Contributor

We have a timeout of 10 sec, so 300 might not really be required.

Collaborator Author

Why do we have the histogram showing requests taking much more than that then? 🤔
In the PR description I posted a screenshot depicting that.

Contributor

The HTTP server is configured with a WriteTimeout of 10 sec.
Even if we don't honour this at the handler and take more than 10 sec to prepare a response for a client's request, my understanding is that the client will never receive this response, but will receive a proxy error instead.
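For context, a minimal sketch (not the gateway's actual server setup) of how WriteTimeout behaves on Go's net/http server: the handler keeps running past the deadline, but the connection is closed, so a late response never reaches the client.

package main

import (
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second)   // deliberately exceeds the WriteTimeout
		_, _ = w.Write([]byte("late")) // handler still runs, but the write is discarded
	})

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		WriteTimeout: 10 * time.Second, // connection is closed ~10s after the request is read
	}
	_ = srv.ListenAndServe()
}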

Collaborator Author

If that is the case then yeah, anything >10s is actionable from a metrics perspective. I can remove 300 but I'll leave 60 in just in case, because the WriteTimeout is configurable after all and perhaps some customer wants to increase the timeout. In that case at least we will have it covered up to one minute.
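That would presumably leave the custom boundaries looking like this (sketch of the trimmed slice, with 300 dropped and 60 kept as a safety margin for deployments that raise the WriteTimeout):

var customBuckets = map[string][]float64{
	"gateway.response_time": {
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60,
	},
}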

runner/runner.go Outdated
@@ -131,11 +131,15 @@ func (r *Runner) Run(ctx context.Context, args []string) int {
 		(!config.IsSet("WORKSPACE_NAMESPACE") || strings.Contains(config.GetString("WORKSPACE_NAMESPACE", ""), "free")) {
 		config.Set("statsExcludedTags", []string{"workspaceId", "sourceID", "destId"})
 	}
-	stats.Default = stats.NewStats(config.Default, logger.Default, svcMetric.Instance,
+	statsOptions := []stats.Option{
+		stats.WithServiceName(r.appType),
+		stats.WithServiceVersion(r.releaseInfo.Version),
+		stats.WithDefaultHistogramBuckets(defaultHistogramBuckets),
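Presumably the rest of the hunk expands the collected options into the constructor that replaces the old positional call; that tail isn't visible in the excerpt above, so the lines below are an assumption:

	// Assumed continuation (not shown in the diff excerpt): the per-metric
	// customBuckets are wired in through some option, and the slice is then
	// passed variadically to the constructor.
	stats.Default = stats.NewStats(config.Default, logger.Default, svcMetric.Instance, statsOptions...)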
@atzoum (Contributor) commented Jun 28, 2023

For defaultHistogramBuckets we might need 2 more: 0.002 for really fast operations and 20 for slower ones that are not as slow as 1 min.

Collaborator Author

I'm trying to decrease the number of buckets if possible 😅
What do you think are the use cases for the 0.002 and 20 buckets?


Collaborator Author

@atzoum I've introduced those and removed a few on the higher end, then created custom boundaries just for the warehouse. It should work fine, please have a look.

@fracasula (Collaborator Author) commented

@achettyiitr can you please review and let me know if the buckets for the warehouse make sense or if we can remove some? My guess is that the ones at microsecond precision aren't going to be useful, right? Possibly others as well, or do you reckon we might have timers that measure in that range anyway?

@achettyiitr (Member) commented

Microsecond precision is not required.

@fracasula fracasula merged commit bed100d into master Jun 30, 2023
38 checks passed
@fracasula fracasula deleted the chore.gwRespTimeBuckets branch June 30, 2023 11:39