pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].
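
For illustration, here is a minimal, self-contained Go sketch of that
race (hypothetical channel and variable names, not the actual RunGraph
code); the timeout only makes the hang visible:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        workCh := make(chan string, 1)  // queued work
        resultCh := make(chan error, 1) // worker results

        // One worker, busy with "node a" when the cancel lands.
        go func() {
            time.Sleep(50 * time.Millisecond)
            if ctx.Err() != nil { // the "Canceled worker..." exit: never drains workCh
                return
            }
            resultCh <- fmt.Errorf("finished %s", <-workCh)
        }()

        workCh <- "b" // 1. main goroutine pushes work
        cancel()      // 2. context canceled
        // 3. the worker exits via its cancellation check without reading "b"
        // 4. a bare <-resultCh here would block forever
        select {
        case err := <-resultCh:
            fmt.Println("result:", err)
        case <-time.After(time.Second):
            fmt.Println("stuck: work in flight, but no workers left to feed resultCh")
        }
    }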

With this commit, we drain results when they are available, but we
also respect the context, so the resultCh read can be canceled.  When
we have been canceled with work still in flight, we also attempt a
non-blocking read from workCh to drain out anything that has not yet
been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and take a fresh pass through the for
loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, so we land in the case that
sets 'done' true and break out of the for loop on that pass.
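
Isolated from RunGraph's bookkeeping, the fixed collection step has
roughly the following shape (same hypothetical setup as the sketch
above; the real change is in the diff below):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        workCh := make(chan string, 1)
        resultCh := make(chan error, 1)

        // Same worker as before: busy when the cancel lands, exits without draining workCh.
        go func() {
            time.Sleep(50 * time.Millisecond)
            if ctx.Err() != nil {
                return
            }
            resultCh <- fmt.Errorf("finished %s", <-workCh)
        }()

        workCh <- "b"
        cancel()

        inflight := 1
        for inflight > 0 {
            select {
            case err := <-resultCh: // a worker reported back; collect it
                fmt.Println("result:", err)
                inflight--
            case <-ctx.Done():
                select {
                case w := <-workCh: // workers have quit, so reclaim queued work ourselves
                    fmt.Println("drained unstarted work:", w)
                    inflight--
                default: // nothing queued right now; take another pass
                }
            }
        }
        fmt.Println("nothing in flight; done")
    }

Once the queued work has been reclaimed, inflight drops to zero and
the loop exits instead of blocking on resultCh.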

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as it enters its third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
wking committed Sep 16, 2020
1 parent 6d56c65 commit 622e04f
Showing 2 changed files with 32 additions and 3 deletions.
15 changes: 12 additions & 3 deletions pkg/payload/task_graph.go
@@ -507,9 +507,18 @@ func RunGraph(ctx context.Context, graph *TaskGraph, maxParallelism int, fn func
 		case <-ctx.Done():
 		}
 	case inflight > 0: // no work available to push; collect results
-		result := <-resultCh
-		results[result.index] = &result
-		inflight--
+		select {
+		case result := <-resultCh:
+			results[result.index] = &result
+			inflight--
+		case <-ctx.Done():
+			select {
+			case runTask := <-workCh: // workers canceled, so remove any work from the queue ourselves
+				inflight--
+				submitted[runTask.index] = false
+			default:
+			}
+		}
 	default: // no work to push and nothing in flight. We're done
 		done = true
 	}

20 changes: 20 additions & 0 deletions pkg/payload/task_graph_test.go
@@ -831,6 +831,26 @@ func TestRunGraph(t *testing.T) {
 				}
 			},
 		},
+		{
+			name: "mid-task cancellation with work in queue does not deadlock",
+			nodes: []*TaskNode{
+				{Tasks: tasks("a1", "a2", "a3")},
+				{Tasks: tasks("b")},
+			},
+			sleep:    time.Millisecond,
+			parallel: 1,
+			errorOn: func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
+				if err := ctx.Err(); err != nil {
+					return err
+				}
+				if name == "a2" {
+					cancelFn()
+				}
+				return nil
+			},
+			want:     []string{"a1", "a2"},
+			wantErrs: []string{"context canceled"},
+		},
 		{
 			name: "task errors in parallel nodes both reported",
 			nodes: []*TaskNode{
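
Assuming the repository's usual layout, something like
'go test ./pkg/payload -run TestRunGraph' should exercise this table,
including the new cancellation case.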
