Open
Labels
bug (Something isn't working)
Description
What version are you using?
Latest code from the main branch
What did you do?
1. Run parallel catchup v2.
2. Some jobs caused the catchup worker pods to run out of memory (OOM). The jobs didn't finish and were left in the in-progress queue.
3. Eventually the job monitor saw jobs in the in-progress queue while all workers were down.
4. The retry logic moved the OOMed jobs from the in-progress queue back to the job queue.
5. The jobs were picked up by the catchup worker pods and moved to the in-progress queue.
6. The jobs OOMed again and did not finish.
7. We're back to step 3, in an infinite loop.
What did you expect to see?
Perhaps we should fail the mission after a certain number of failures?
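A minimal sketch of what such a retry cap could look like, written in Python for illustration only. The names `requeue_stalled_jobs`, `MissionFailed`, and `MAX_ATTEMPTS`, and the use of in-memory deques for the two queues, are all assumptions and not the project's actual code or queue implementation.

```python
from collections import Counter, deque

MAX_ATTEMPTS = 3  # hypothetical limit, not a value taken from the project


class MissionFailed(Exception):
    """Raised when a job has stalled too many times."""


def requeue_stalled_jobs(in_progress, job_queue, attempts, max_attempts=MAX_ATTEMPTS):
    """Move stalled jobs back to the job queue, counting one failure per stall.

    Raises MissionFailed as soon as any job reaches max_attempts failures,
    instead of re-queuing it forever.
    """
    while in_progress:
        job = in_progress.popleft()
        attempts[job] += 1  # this job stalled once more
        if attempts[job] >= max_attempts:
            raise MissionFailed(f"job {job!r} failed {attempts[job]} times")
        job_queue.append(job)


# Usage: the monitor would call this each time it sees all workers down.
in_progress = deque(["range-1"])
job_queue = deque()
attempts = Counter()
requeue_stalled_jobs(in_progress, job_queue, attempts)  # first failure: re-queued
```

With a cap like this, the loop described in the steps above would end after a bounded number of retries rather than continuing indefinitely.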
We could also check the catchup worker pods for OOM events. If those occur, some ranges may never finish.
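As a rough illustration of the OOM check, the sketch below inspects a pod object in the JSON shape returned by `kubectl get pod -o json`, where Kubernetes records an out-of-memory kill as a terminated container state with reason `OOMKilled`. The function name `was_oom_killed` is hypothetical, and how the job monitor would obtain the pod object is left out.

```python
def was_oom_killed(pod):
    """Return True if any container in the pod was terminated with reason OOMKilled.

    `pod` is a dict in the shape of a Kubernetes Pod JSON object. Both the
    current state and the last state are checked, because a restarted
    container reports the kill under `lastState`.
    """
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for container_status in statuses:
        for key in ("state", "lastState"):
            terminated = (container_status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return True
    return False


# Usage with a hand-written example pod (not real output from a cluster):
example_pod = {
    "status": {
        "containerStatuses": [
            {"name": "worker", "lastState": {"terminated": {"reason": "OOMKilled"}}}
        ]
    }
}
oom_detected = was_oom_killed(example_pod)
```

A monitor could combine a check like this with the retry count, so that a range whose worker was OOM-killed is reported as a failure rather than silently re-queued.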
What did you see instead?
The mission was stuck in a retry loop.