Parallel catchup (v2) false alarm on condition "no worker should be running while queues are empty"

Here is the [assertion](https://github.com/stellar/supercluster/blob/main/src/FSLibrary/MissionHistoryPubnetParallelCatchupV2.fs#L235-L236). 

It can produce a false alarm, such as the one we saw [recently](https://buildmeister-v3.stellar-ops.com/job/Core/job/stellar-supercluster/1130/).

One possible reason is that a worker becomes temporarily unreachable (memory starvation, or coreDNS/ingress/network issues) without having failed or crashed. Because of the retry logic, another worker may then start working on the same job. Eventually one of them finishes and the mission declares all jobs successful. However, the other worker is still running, so the sanity check fails.

I think the correct fix is to just loosen that check (we still want the retry logic as an extra safety layer) and provide better diagnostics for when such scenarios happen. 
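
To make the proposal concrete, here is a rough sketch of what a loosened check could look like. It is written in Python for brevity rather than in the mission's F#, and everything in it is hypothetical: `WorkerStatus`, `check_queues_drained`, and the assumption that the mission can tell which job a still-running worker is on are all made up for illustration, not taken from the existing code. The idea is that a worker still busy with a job already recorded as successful is treated as a leftover from a retry and only reported, while a worker running an unaccounted job still fails the check.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class WorkerStatus:
    # Hypothetical snapshot of one worker; not the mission's real type.
    name: str
    running_job: Optional[str]  # None means the worker is idle


def check_queues_drained(
    pending: List[str],
    in_progress: List[str],
    succeeded: Set[str],
    workers: List[WorkerStatus],
) -> Tuple[bool, List[str]]:
    """Return (ok, diagnostics) for the 'queues are empty' sanity check."""
    diagnostics: List[str] = []
    if pending or in_progress:
        # Queues are not empty yet, so the check does not apply.
        return True, diagnostics

    ok = True
    for w in workers:
        if w.running_job is None:
            continue
        if w.running_job in succeeded:
            # Likely a duplicate started by the retry logic: report, do not fail.
            diagnostics.append(
                f"worker {w.name} still running job {w.running_job}, "
                "which already succeeded elsewhere (likely a retry duplicate)"
            )
        else:
            # A running job nobody accounted for is still a real error.
            ok = False
            diagnostics.append(
                f"worker {w.name} running unaccounted job {w.running_job}"
            )
    return ok, diagnostics
```

With this shape, the scenario described above (one worker finishing a retried job while the original worker is still busy with it) would pass and leave a diagnostic line behind, instead of failing the whole mission.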


Parallel catchup (v2) false alarm on condition "no worker should be running while queues are empty" #243
