Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had a separate goroutine that managed the work queue, with results fed into errCh to be collected by the main RunGraph goroutine. It didn't matter if that work queue goroutine hung; as long as all the worker goroutines exited, RunGraph would collect their errors from errCh and return. In 55ef3d3, I removed the queue goroutine and moved queue management into the main RunGraph goroutine. With that change, we became exposed to the following race: 1. Main goroutine pushes work into workCh. 2. Context canceled. 3. Workers exit via the "Canceled worker..." case, so they don't pick the work out of workCh. 4. Main goroutine deadlocks because there is work in flight, but nothing in resultCh, and no longer any workers to feed resultCh. In logs, this looks like "sync-worker goroutine has gone to sleep, and is no longer synchronizing manifests" [1]. With this commit, we drain results when they are available, but we also respect the context to allow the resultCh read to be canceled. When we have been canceled with work in flight, we also attempt a non-blocking read from workCh to drain out anything there that has not yet been picked up by a worker. Because 'done' will not be set true, we'll call getNextNode again and come in with a fresh pass through the for loop. ctx.Err() will no longer be nil, but if the workCh drain worked, we may now have inflight == 0, and we'll end up in the case that sets 'done' true, and break out of the for loop on that round. The unit test sets up two parallel nodes: a and b. We configure one worker, which picks up node a. Node b doesn't block on node a, so it gets pushed into workCh while the worker grinds through node a. On its second task in node a, the worker cancels the run. Because the sleeps do not have select-ctx.Done guards, the worker finishes off that second task, notices the cancel as they enter their third task, and exits with the "context canceled" error. This leaves node b stuck in workCh, and we need the fix from this commit to avoid deadlocking on that in-flight node. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
- Loading branch information