
chore: gateway_response_time buckets #3554

Merged: 5 commits into master from chore.gwRespTimeBuckets on Jun 30, 2023

Conversation

@fracasula (Collaborator) commented Jun 27, 2023

Description

I don't think we need as many buckets as we do by default for the gateway_response_time metric. This task is to keep only the relevant buckets.

I propose we remove the boundaries for requests that take longer than 5 minutes. Data-wise, knowing that a request takes even longer than that isn't helpful in my opinion, and we should try to avoid anything that goes above 1 minute and possibly look into anything that goes above 5.
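For reference, a minimal sketch of what explicit boundaries capped at 5 minutes look like with the plain Prometheus Go client; the registration below is illustrative only, since the actual change goes through the repo's stats package (see the diff further down).

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative only: explicit boundaries from 5ms up to 5 minutes instead of
// the client defaults (prometheus.DefBuckets stops at 10s).
var gatewayResponseTime = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "gateway_response_time",
	Help:    "Gateway response time in seconds",
	Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60, 300},
})

func main() {
	prometheus.MustRegister(gatewayResponseTime)
	start := time.Now()
	// ... handle a gateway request here ...
	gatewayResponseTime.Observe(time.Since(start).Seconds())
}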

Screenshot from prousmt: [histogram screenshot]

Tested locally: [histogram screenshot]

Let me know if the boundaries make sense to you and if you think we should have different ones.
Also, if you think we should do the same with other metrics we might as well add them here.

Notion Ticket

< Notion Link >

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov (codecov bot) commented Jun 27, 2023

Codecov Report

Patch coverage: 84.61%; project coverage change: -0.04% ⚠️

Comparison is base (8975dcc) 67.54% compared to head (0a325f0) 67.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3554      +/-   ##
==========================================
- Coverage   67.54%   67.50%   -0.04%     
==========================================
  Files         321      321              
  Lines       52027    52027              
==========================================
- Hits        35141    35122      -19     
- Misses      14560    14578      +18     
- Partials     2326     2327       +1     
Impacted Files Coverage Δ
enterprise/replay/setup.go 0.00% <0.00%> (ø)
runner/runner.go 70.58% <100.00%> (+0.95%) ⬆️

... and 14 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@fracasula fracasula marked this pull request as ready for review June 27, 2023 13:08
@fracasula fracasula requested review from atzoum and cisse21 June 27, 2023 13:45
var customBuckets = map[string][]float64{
	"gateway.response_time": {
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60,
		300, /* 5 mins */
	},
}
Contributor

We have a timeout of 10 sec, so 300 might not really be required.

Collaborator Author

Why do we have the histogram showing requests taking much more than that then? 🤔
In the PR description I posted a screenshot depicting that.

Contributor

The HTTP server is configured with a WriteTimeout of 10 sec.
Even if we don't honour this at the handler and take more than 10 sec to prepare a response for a client's request, my understanding is that the client will never receive this response, but will receive a proxy error instead.
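For context, a minimal sketch (not the gateway's actual server setup) of how WriteTimeout behaves on Go's net/http server: the handler keeps running past the deadline, but the connection is closed, so a late response never reaches the client.

package main

import (
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second)   // deliberately exceeds the WriteTimeout
		_, _ = w.Write([]byte("late")) // handler still runs, but the write is discarded
	})

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		WriteTimeout: 10 * time.Second, // connection is closed ~10s after the request is read
	}
	_ = srv.ListenAndServe()
}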

Collaborator Author

If that is the case then yeah, anything >10s is actionable from a metrics perspective. I can remove 300 but I'll leave 60 in just in case, because the WriteTimeout is configurable after all and perhaps some customer wants to increase the timeout. In that case at least we will have it covered up to one minute.
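That would presumably leave the custom boundaries looking like this (sketch of the trimmed slice, with 300 dropped and 60 kept as a safety margin for deployments that raise the WriteTimeout):

var customBuckets = map[string][]float64{
	"gateway.response_time": {
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 60,
	},
}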

runner/runner.go Outdated
@@ -131,11 +131,15 @@ func (r *Runner) Run(ctx context.Context, args []string) int {
 		(!config.IsSet("WORKSPACE_NAMESPACE") || strings.Contains(config.GetString("WORKSPACE_NAMESPACE", ""), "free")) {
 		config.Set("statsExcludedTags", []string{"workspaceId", "sourceID", "destId"})
 	}
-	stats.Default = stats.NewStats(config.Default, logger.Default, svcMetric.Instance,
+	statsOptions := []stats.Option{
+		stats.WithServiceName(r.appType),
+		stats.WithServiceVersion(r.releaseInfo.Version),
+		stats.WithDefaultHistogramBuckets(defaultHistogramBuckets),
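Presumably the rest of the hunk expands the collected options into the constructor that replaces the old positional call; that tail isn't visible in the excerpt above, so the lines below are an assumption:

	// Assumed continuation (not shown in the diff excerpt): the per-metric
	// customBuckets are wired in through some option, and the slice is then
	// passed variadically to the constructor.
	stats.Default = stats.NewStats(config.Default, logger.Default, svcMetric.Instance, statsOptions...)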
@atzoum (Contributor) commented Jun 28, 2023

For defaultHistogramBuckets we might need 2 more: 0.002 for really fast operations and 20 for slower ones that are not as slow as 1 min.

Collaborator Author

I'm trying to decrease the number of buckets if possible 😅
What do you think are the use cases for the 0.002 and 20 buckets?


Collaborator Author

@atzoum I've introduced those and removed a few on the higher end, then created custom boundaries just for the warehouse. It should work fine, please have a look.

@fracasula (Collaborator Author) commented

@achettyiitr can you please review and let me know if the buckets for the warehouse make sense or if we can remove some? My guess is that the ones at microsecond precision aren't going to be useful, right? Possibly others as well, or do you reckon we might have timers that measure in that range anyway?

@achettyiitr (Member) commented

Microsecond precision is not required.

@fracasula fracasula merged commit bed100d into master Jun 30, 2023
38 checks passed
@fracasula fracasula deleted the chore.gwRespTimeBuckets branch June 30, 2023 11:39