Here is the assertion.
Apparently it can produce false alarms (such as the one we saw recently).
One possible reason is that a worker can be temporarily unreachable (memory starved, or hit by coreDNS/ingress/network issues) without having failed or crashed. Because of the retry logic, another worker might then start working on the same job. Eventually one of them finishes and the mission declares all jobs successful. However, the other worker is still running, so the sanity check fails.
I think the correct fix is to just loosen that check (we still want the retry logic as an extra safety layer) and provide better diagnostics for when such scenarios happen.
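To make the idea concrete, here is a rough Python sketch of what a loosened check could look like. It is not the actual code: `WorkerState`, `check_no_unexpected_running`, the status strings, and the logger name are all hypothetical, and it assumes the check can see which job each worker was assigned. A still-running worker is tolerated and logged when another worker already succeeded on the same job; anything else is still reported.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("mission")  # hypothetical logger name


@dataclass
class WorkerState:
    # Hypothetical record of one worker's attempt at a job.
    worker_id: str
    job_id: str
    status: str  # assumed values: "running", "succeeded", "failed"


def check_no_unexpected_running(workers):
    """Loosened sanity check, meant to run after success is declared.

    Returns the workers that are still running a job that no other worker
    has completed. Running duplicates of an already-succeeded job are only
    logged, since they are the expected leftover of a retry.
    """
    succeeded_jobs = {w.job_id for w in workers if w.status == "succeeded"}
    violations = []
    for w in workers:
        if w.status != "running":
            continue
        if w.job_id in succeeded_jobs:
            # Diagnostic for the retry-duplicate scenario instead of a failure.
            logger.warning(
                "worker %s still running job %s, which already succeeded "
                "on another worker (likely a retry duplicate)",
                w.worker_id,
                w.job_id,
            )
        else:
            violations.append(w)
    return violations
```

The caller would then fail only if the returned list is non-empty, so the retry logic stays in place and the duplicate case shows up in the logs rather than as an assertion failure.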