Parallel catchup V2 retry logic can lead to stuck mission #334

@jacekn

What version are you using?

Latest code from the main branch

What did you do?

  1. Run parallel catchup V2.
  2. Some jobs caused the catchup worker pods to be OOM-killed. The jobs did not finish and were left in the in-progress queue.
  3. Eventually the job monitor saw jobs in the in-progress queue while all workers were down.
  4. The retry logic moved the OOM-killed jobs from the in-progress queue back to the job queue.
  5. The catchup worker pods picked the jobs up and moved them to the in-progress queue.
  6. The jobs hit OOM again and did not finish.
  7. We are back at step 3, in an infinite loop.

What did you expect to see?

Perhaps we should fail the mission after a certain number of failures?
We could also check the catchup worker pods for OOM events. If those occur, some ranges may never finish.
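
To make the two suggestions above concrete, here is a minimal Python sketch of a bounded retry. It is not the actual job monitor code: `RetryState`, `requeue_stale_jobs`, `was_oom_killed`, and the `MAX_ATTEMPTS` value are all hypothetical names chosen for illustration, and the queues are modeled as in-memory structures. The only real-world detail assumed is that Kubernetes reports an OOM-killed container with `reason: OOMKilled` in the container's `terminated` status.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # hypothetical cap on how often one job may fail


@dataclass
class RetryState:
    """In-memory stand-in for the job queue and the in-progress queue."""
    job_queue: deque = field(default_factory=deque)
    in_progress: set = field(default_factory=set)
    failed: list = field(default_factory=list)
    attempts: dict = field(default_factory=dict)


def requeue_stale_jobs(state, max_attempts=MAX_ATTEMPTS):
    """Handle jobs left in progress while all workers are down.

    Each stale job counts as one failed attempt. Jobs below the cap go
    back to the job queue; jobs at the cap are recorded as failed.
    Returns True when at least one job is exhausted, which the caller
    can treat as "fail the mission" instead of looping forever.
    """
    for job in sorted(state.in_progress):
        state.attempts[job] = state.attempts.get(job, 0) + 1
        if state.attempts[job] >= max_attempts:
            state.failed.append(job)
        else:
            state.job_queue.append(job)
    state.in_progress.clear()
    return bool(state.failed)


def was_oom_killed(pod_status):
    """Return True if any container in a pod status dict was OOM-killed.

    Expects the dict form of a Kubernetes pod `status` object.
    """
    for cs in pod_status.get("containerStatuses", []):
        for key in ("lastState", "state"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                return True
    return False
```

With a cap of 3, a job that is OOM-killed on every run is requeued twice and marked failed on the third failure, so the loop in steps 3 to 7 terminates. The OOM check could be used to fail faster or to surface a clearer error, since retrying a range that exceeds the memory limit is unlikely to succeed.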

What did you see instead?

The mission was stuck in a retry loop.

Labels: bug
