
fix: inconsistent and leaky retry delay logic in router #3002

Merged: 2 commits into master from fix.routerRetryDelay on Feb 21, 2023

Conversation

atzoum
Contributor

@atzoum atzoum commented Feb 20, 2023

Description

Addressing router job retry backoff inconsistencies, which may also cause a memory leak:

  • We don't delete entries from the map when jobs are drained or marked as aborted anywhere other than postStatusOnResponseQ.
  • When event ordering is disabled, the worker-specific map we currently use isn't really useful, since there is no guarantee that the same job will always be assigned to the same worker.

To fix these issues we no longer use an in-memory map per worker; instead, the job status' retry_time column stores the calculated backoff time. JobsDB queries for retrieving jobs deliberately don't apply any conditions on the retry_time column: the router's event-ordering algorithm keeps its state only in memory and doesn't persist it, so filtering on retry_time would cause jobs to be picked out of order after a server restart.
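
Below is a minimal sketch of the idea in Go. Everything here is illustrative: the struct, the helper names, and the 10s/5m bounds are assumptions rather than the PR's actual identifiers or defaults; only the retry_time column comes from the description above.

package router

import "time"

// JobStatus mirrors the relevant part of a job status row.
type JobStatus struct {
	JobID      int64
	AttemptNum int
	RetryTime  time.Time // persisted to the retry_time column
}

// nextRetryTime computes an exponential backoff capped at maxBackoff.
func nextRetryTime(attempt int, minBackoff, maxBackoff time.Duration) time.Time {
	shift := attempt
	if shift > 10 {
		shift = 10 // clamp the exponent to avoid overflow
	}
	delay := minBackoff << uint(shift)
	if delay > maxBackoff {
		delay = maxBackoff
	}
	return time.Now().Add(delay)
}

// markFailed stamps the backoff on the status row itself. Because the delay
// lives in the database rather than in a per-worker map, nothing leaks when a
// job is drained or aborted elsewhere, and it doesn't matter which worker
// picks the job up next.
func markFailed(status *JobStatus) {
	status.AttemptNum++
	status.RetryTime = nextRetryTime(status.AttemptNum, 10*time.Second, 5*time.Minute)
}

Presumably readiness is then checked against retry_time after jobs are fetched rather than filtered at query level, which is what keeps ordering intact across restarts.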

Notion Ticket

Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@codecov

codecov bot commented Feb 20, 2023

Codecov Report

Base: 53.01% // Head: 53.00% // Decreases project coverage by 0.01% ⚠️

Coverage data is based on head (8457cc5) compared to base (4b613ad).
Patch coverage: 92.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3002      +/-   ##
==========================================
- Coverage   53.01%   53.00%   -0.01%     
==========================================
  Files         334      334              
  Lines       51941    51885      -56     
==========================================
- Hits        27534    27504      -30     
+ Misses      22796    22779      -17     
+ Partials     1611     1602       -9     
Impacted Files                 Coverage            Δ
jobsdb/jobsdb_utils.go         77.48% <ø>          (-2.04%) ⬇️
router/router.go               78.29% <81.81%>     (+0.49%) ⬆️
jobsdb/jobsdb.go               73.91% <100.00%>    (-0.13%) ⬇️
jobsdb/unionQuery.go           87.98% <100.00%>    (+3.85%) ⬆️
processor/worker.go            82.81% <0.00%>      (-2.35%) ⬇️
services/rsources/handler.go   75.20% <0.00%>      (+1.37%) ⬆️
testhelper/log/log.go          11.53% <0.00%>      (+3.84%) ⬆️


@atzoum atzoum force-pushed the fix.routerRetryDelay branch 4 times, most recently from 5b1656c to 0af362b Compare February 20, 2023 15:01
@atzoum atzoum changed the title from "[WIP] fix: inconsistent retry delay logic in router" to "fix: inconsistent retry delay logic in router" Feb 20, 2023
@atzoum atzoum changed the title from "fix: inconsistent retry delay logic in router" to "fix: inconsistent and leaky retry delay logic in router" Feb 20, 2023
@atzoum atzoum marked this pull request as ready for review February 21, 2023 07:15
Two review threads on router/router.go (outdated, resolved)
@Sidddddarth
Member

Another shameless lo reference could be useful here:

if !rt.guaranteeUserEventOrder {
	// when guaranteeUserEventOrder is false, assign a worker randomly and return
	if rt.shouldThrottle(job, parameters, throttledUserMap) {
		return
	}
	toSendWorker = rt.workers[rand.Intn(rt.noOfWorkers)] // skipcq: GSC-G404
	return
}
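
If the "lo" hint refers to the samber/lo generics library (an assumption on my part), the random worker pick could shrink to a one-liner. A sketch, assuming rt.workers is a plain slice:

toSendWorker = lo.Sample(rt.workers) // github.com/samber/lo: returns a random element; same skipcq caveat applies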
