
fix(router): preserve event order while draining a previously failed job #2546

Merged

Merged 4 commits into master from fix.drainRouterOrder on Oct 12, 2022

Conversation

@atzoum (Contributor) commented Oct 10, 2022

Description

When a job needs to be drained from the router, we cannot drain it early; instead, we must push it to the worker's queue so that all other (waiting) jobs for the same userID have time to flush from the pipeline's buffers before the job is actually drained.
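The ordering requirement above can be sketched as follows. This is a minimal, hypothetical illustration (the `job` and `worker` types below are invented for the example, not the router's actual types): every job, including one marked for draining, goes through the same per-user worker queue, so the drain happens only in queue order.

```go
package main

import "fmt"

// job is a hypothetical stand-in for the router's job type.
type job struct {
	id     int
	userID string
	drain  bool // marked to be drained instead of delivered
}

// worker holds a per-user ordered queue; in the real router this would be
// a channel consumed by a worker goroutine.
type worker struct {
	queue []job
}

// enqueue pushes every job, including drainable ones, onto the worker's
// queue instead of draining early, preserving per-user ordering.
func (w *worker) enqueue(j job) { w.queue = append(w.queue, j) }

// process handles jobs strictly in queue order: a drained job is aborted
// only when its turn comes, never before earlier jobs have flushed.
func (w *worker) process() []string {
	var log []string
	for _, j := range w.queue {
		if j.drain {
			log = append(log, fmt.Sprintf("job %d: drained", j.id))
		} else {
			log = append(log, fmt.Sprintf("job %d: delivered", j.id))
		}
	}
	return log
}

func main() {
	w := &worker{}
	w.enqueue(job{id: 1, userID: "u1"})
	w.enqueue(job{id: 2, userID: "u1", drain: true})
	w.enqueue(job{id: 3, userID: "u1"})
	fmt.Println(w.process())
}
```

Draining early, by contrast, would let job 2 disappear while job 1 might still be in flight, breaking per-user event order.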

Notion Ticket

Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov bot commented Oct 11, 2022

Codecov Report

Base: 43.90% // Head: 43.83% // Decreases project coverage by 0.06% ⚠️

Coverage data is based on head (ff8e0d0) compared to base (ce69ae3).
Patch coverage: 89.70% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2546      +/-   ##
==========================================
- Coverage   43.90%   43.83%   -0.07%     
==========================================
  Files         187      187              
  Lines       39161    39075      -86     
==========================================
- Hits        17195    17130      -65     
+ Misses      20877    20861      -16     
+ Partials     1089     1084       -5     
Impacted Files Coverage Δ
warehouse/api.go 70.41% <ø> (ø)
warehouse/jobs/runner.go 0.00% <ø> (ø)
router/router.go 66.76% <85.00%> (-1.05%) ⬇️
router/internal/eventorder/eventorder.go 96.20% <96.42%> (+0.22%) ⬆️
enterprise/reporting/setup.go 38.09% <0.00%> (-14.29%) ⬇️
config/backend-config/namespace_config.go 70.83% <0.00%> (-3.13%) ⬇️
enterprise/reporting/reporting.go 8.33% <0.00%> (-1.44%) ⬇️
processor/processor.go 71.91% <0.00%> (+0.82%) ⬆️
utils/httputil/server.go 92.30% <0.00%> (+11.53%) ⬆️


@@ -1271,7 +1311,7 @@ func (rt *HandleT) findWorker(job *jobsdb.JobT, throttledAtTime time.Time) (toSe
 	userID := job.UserID
 	// checking if the user is in throttledMap. If yes, returning nil.
 	// this check is done to maintain order.
-	if _, ok := rt.throttledUserMap[userID]; ok {
+	if _, ok := rt.throttledUserMap[userID]; ok && rt.guaranteeUserEventOrder {
@atzoum (Contributor, author) commented on this change:

found this was missing

Comment on lines +1331 to +1329
defer func() {
	if toSendWorker == nil {
		rt.throttler.Dec(parameters.DestinationID, userID, 1, throttledAtTime, throttler.ALL_LEVELS)
	}
}()
@atzoum (Contributor, author) commented on this change:

Added this since we need to decrement the throttler if no worker is found for the job.

As a general note, the current throttling logic (incrementing/decrementing) appears to be fragile.

cc @fracasula

@atzoum atzoum marked this pull request as ready for review October 11, 2022 12:42
@atzoum atzoum changed the title [WIP] fix: preserve event ordering while draining previously failed jobs fix(router): preserve event order while draining a previously failed job Oct 11, 2022
@atzoum atzoum force-pushed the fix.drainRouterOrder branch 4 times, most recently from 4c3a44e to 1660322 Compare October 12, 2022 05:00
@chandumlg (Member) left a comment:

Looks good.

Only comment is, to avoid draining in two places (by checking canDrainEarly), we can drain all kinds of jobs in the worker itself. True?

@atzoum (Contributor, author) commented Oct 12, 2022

> Looks good.
>
> Only comment is, to avoid draining in two places (by checking canDrainEarly), we can drain all kinds of jobs in the worker itself. True?

@chandumlg the initial intention was to leave it there as an optimization, i.e. sparing the extra loop for expired jobs that won't be picked up anyway due to another previously failed job. But you are correct: this minor optimization doesn't justify all the extra code. I will simplify it by using a single draining strategy. Thanks!
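The "single draining strategy" agreed on above can be sketched as follows. This is a hypothetical simplification, not the actual router code: instead of draining expired jobs both in the generator loop (via a `canDrainEarly`-style check) and in the worker, all drain decisions move to one place inside the worker's processing loop.

```go
package main

import "fmt"

// job is a hypothetical stand-in; expired marks a job that must be drained.
type job struct {
	id      int
	expired bool
}

// processInWorker is the single drain point: every job reaches the worker,
// which either drains it or delivers it. No early-drain path exists, so
// the ordering guarantee holds for free and there is only one code path
// to maintain.
func processInWorker(jobs []job) (drained, delivered int) {
	for _, j := range jobs {
		if j.expired {
			drained++ // drain happens here and only here
			continue
		}
		delivered++
	}
	return
}

func main() {
	d, s := processInWorker([]job{{1, false}, {2, true}, {3, false}})
	fmt.Println(d, s)
}
```

The trade-off named in the comment above applies: expired jobs now take the extra trip through the worker queue, but the duplicated drain logic disappears.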

@atzoum atzoum requested a review from chandumlg October 12, 2022 08:49
@cisse21 (Member) commented Oct 12, 2022

LGTM

@cisse21 cisse21 merged commit f0654b0 into master Oct 12, 2022
@cisse21 cisse21 deleted the fix.drainRouterOrder branch October 12, 2022 09:38