
fix(router): preserve event order while draining a previously failed job #2546

Merged

Merged 4 commits into master from fix.drainRouterOrder on Oct 12, 2022

Conversation

@atzoum (Contributor) commented Oct 10, 2022

Description

When a job needs to be drained from the router, we cannot drain it early; instead, we must push it to the worker's queue so that all other (waiting) jobs for the same userID have time to flush from the pipeline's buffers before the job is actually drained.
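The ordering requirement above can be sketched as follows. This is a minimal, hypothetical illustration (the `job` and `worker` types below are invented for the example, not the router's actual types): every job, including one marked for draining, goes through the same per-user worker queue, so the drain happens only in queue order.

```go
package main

import "fmt"

// job is a hypothetical stand-in for the router's job type.
type job struct {
	id     int
	userID string
	drain  bool // marked to be drained instead of delivered
}

// worker holds a per-user ordered queue; in the real router this would be
// a channel consumed by a worker goroutine.
type worker struct {
	queue []job
}

// enqueue pushes every job, including drainable ones, onto the worker's
// queue instead of draining early, preserving per-user ordering.
func (w *worker) enqueue(j job) { w.queue = append(w.queue, j) }

// process handles jobs strictly in queue order: a drained job is aborted
// only when its turn comes, never before earlier jobs have flushed.
func (w *worker) process() []string {
	var log []string
	for _, j := range w.queue {
		if j.drain {
			log = append(log, fmt.Sprintf("job %d: drained", j.id))
		} else {
			log = append(log, fmt.Sprintf("job %d: delivered", j.id))
		}
	}
	return log
}

func main() {
	w := &worker{}
	w.enqueue(job{id: 1, userID: "u1"})
	w.enqueue(job{id: 2, userID: "u1", drain: true})
	w.enqueue(job{id: 3, userID: "u1"})
	fmt.Println(w.process())
}
```

Draining early, by contrast, would let job 2 disappear while job 1 might still be in flight, breaking per-user event order.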

Notion Ticket

Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov bot commented Oct 11, 2022

Codecov Report

Base: 43.90% // Head: 43.83% // Decreases project coverage by 0.06% ⚠️

Coverage data is based on head (ff8e0d0) compared to base (ce69ae3).
Patch coverage: 89.70% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2546      +/-   ##
==========================================
- Coverage   43.90%   43.83%   -0.07%     
==========================================
  Files         187      187              
  Lines       39161    39075      -86     
==========================================
- Hits        17195    17130      -65     
+ Misses      20877    20861      -16     
+ Partials     1089     1084       -5     
Impacted Files Coverage Δ
warehouse/api.go 70.41% <ø> (ø)
warehouse/jobs/runner.go 0.00% <ø> (ø)
router/router.go 66.76% <85.00%> (-1.05%) ⬇️
router/internal/eventorder/eventorder.go 96.20% <96.42%> (+0.22%) ⬆️
enterprise/reporting/setup.go 38.09% <0.00%> (-14.29%) ⬇️
config/backend-config/namespace_config.go 70.83% <0.00%> (-3.13%) ⬇️
enterprise/reporting/reporting.go 8.33% <0.00%> (-1.44%) ⬇️
processor/processor.go 71.91% <0.00%> (+0.82%) ⬆️
utils/httputil/server.go 92.30% <0.00%> (+11.53%) ⬆️


@@ -1271,7 +1311,7 @@ func (rt *HandleT) findWorker(job *jobsdb.JobT, throttledAtTime time.Time) (toSe
 	userID := job.UserID
 	// checking if the user is in throttledMap. If yes, returning nil.
 	// this check is done to maintain order.
-	if _, ok := rt.throttledUserMap[userID]; ok {
+	if _, ok := rt.throttledUserMap[userID]; ok && rt.guaranteeUserEventOrder {
@atzoum (Contributor, author) commented on this change:

found this was missing

Comment on lines +1331 to +1329
defer func() {
	if toSendWorker == nil {
		rt.throttler.Dec(parameters.DestinationID, userID, 1, throttledAtTime, throttler.ALL_LEVELS)
	}
}()
@atzoum (Contributor, author) commented on this change:

Added this since we need to decrement the throttler if no worker is found for the job.

As a general note, the current throttling logic (incrementing/decrementing) appears to be fragile.

cc @fracasula

@atzoum atzoum marked this pull request as ready for review October 11, 2022 12:42
@atzoum atzoum changed the title [WIP] fix: preserve event ordering while draining previously failed jobs fix(router): preserve event order while draining a previously failed job Oct 11, 2022
@atzoum atzoum force-pushed the fix.drainRouterOrder branch 4 times, most recently from 4c3a44e to 1660322 Compare October 12, 2022 05:00
@chandumlg (Member) left a comment:

Looks good.

Only comment is, to avoid draining in two places (by checking canDrainEarly), we can drain all kinds of jobs in the worker itself. True?

@atzoum (Contributor, author) commented Oct 12, 2022

> Looks good.
>
> Only comment is, to avoid draining in two places (by checking canDrainEarly), we can drain all kinds of jobs in the worker itself. True?

@chandumlg the initial intention was to leave it there as an optimization, i.e. sparing the extra loop for expired jobs that won't be picked up anyway due to another previously failed job. But you are correct: this minor optimization doesn't justify all the extra code. I will simplify it by using a single draining strategy. Thanks!
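The "single draining strategy" agreed on above can be sketched as follows. This is a hypothetical simplification, not the actual router code: instead of draining expired jobs both in the generator loop (via a `canDrainEarly`-style check) and in the worker, all drain decisions move to one place inside the worker's processing loop.

```go
package main

import "fmt"

// job is a hypothetical stand-in; expired marks a job that must be drained.
type job struct {
	id      int
	expired bool
}

// processInWorker is the single drain point: every job reaches the worker,
// which either drains it or delivers it. No early-drain path exists, so
// the ordering guarantee holds for free and there is only one code path
// to maintain.
func processInWorker(jobs []job) (drained, delivered int) {
	for _, j := range jobs {
		if j.expired {
			drained++ // drain happens here and only here
			continue
		}
		delivered++
	}
	return
}

func main() {
	d, s := processInWorker([]job{{1, false}, {2, true}, {3, false}})
	fmt.Println(d, s)
}
```

The trade-off named in the comment above applies: expired jobs now take the extra trip through the worker queue, but the duplicated drain logic disappears.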

@atzoum atzoum requested a review from chandumlg October 12, 2022 08:49
@cisse21 (Member) commented Oct 12, 2022

LGTM

@cisse21 cisse21 merged commit f0654b0 into master Oct 12, 2022
@cisse21 cisse21 deleted the fix.drainRouterOrder branch October 12, 2022 09:38