Parallel catchup (v2) false alarm on condition "no worker should be running while queues are empty" #243

@jayz22

Description

Here is the assertion.

Apparently it can produce a false alarm (such as the one we saw recently).

One possible reason is that a worker might be temporarily unreachable (memory starved, or hit by coreDNS/ingress/network issues) without having failed or crashed. Because of the retry logic, another worker might start working on the same job. Eventually one of them finishes and the mission declares all jobs successful. However, the other worker is still running, so the sanity check fails.

I think the correct fix is to loosen that check (we still want the retry logic as an extra safety layer) and to provide better diagnostics when such scenarios happen.
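A minimal sketch of what a loosened check could look like, written in Python purely for illustration. It is not the existing assertion: `WorkerStatus`, `check_idle_invariant`, and the `succeeded_jobs` set are hypothetical names, and it assumes the monitor can see which job each worker reports. A worker still running a job that has already succeeded is treated as a retry duplicate and only logged; anything else is returned so the caller can still fail.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional, Set

logger = logging.getLogger("catchup_sanity")


@dataclass
class WorkerStatus:
    # Hypothetical shape of a worker status report.
    name: str
    job: Optional[str]  # job the worker reports it is running, None if idle


def check_idle_invariant(
    queues_empty: bool,
    workers: List[WorkerStatus],
    succeeded_jobs: Set[str],
) -> List[WorkerStatus]:
    """Return busy workers that cannot be explained as retry duplicates."""
    if not queues_empty:
        # The invariant only applies once every queue has drained.
        return []
    unexplained = []
    for w in workers:
        if w.job is None:
            continue
        if w.job in succeeded_jobs:
            # Another worker already finished this job: log it, do not fail.
            logger.warning(
                "worker %s still running job %s, which already succeeded "
                "(likely a retry duplicate)", w.name, w.job)
        else:
            unexplained.append(w)
    return unexplained
```

The caller would fail the mission only when the returned list is non-empty, and could include the worker names and jobs in the error message as diagnostics.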

Labels: bug
