pkg/payload/task_graph: Avoid deadlocking on cancel with workCh queue
Before 55ef3d3 (pkg/payload/task_graph: Handle node pushing and
result collection without a goroutine, 2019-10-21, openshift#264), RunGraph had
a separate goroutine that managed the work queue, with results fed
into errCh to be collected by the main RunGraph goroutine.  It didn't
matter if that work queue goroutine hung; as long as all the worker
goroutines exited, RunGraph would collect their errors from errCh and
return.  In 55ef3d3, I removed the queue goroutine and moved queue
management into the main RunGraph goroutine.  With that change, we
became exposed to the following race:

1. Main goroutine pushes work into workCh.
2. Context canceled.
3. Workers exit via the "Canceled worker..." case, so they don't pick
   the work out of workCh.
4. Main goroutine deadlocks because there is work in flight, but
   nothing in resultCh, and no longer any workers to feed resultCh.

In logs, this looks like "sync-worker goroutine has gone to sleep, and
is no longer synchronizing manifests" [1].
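
For illustration, here is a minimal, self-contained Go sketch of that
race (hypothetical channel and variable names, not the actual RunGraph
code); the timeout only makes the hang visible:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        workCh := make(chan string, 1)  // queued work
        resultCh := make(chan error, 1) // worker results

        // One worker, busy with "node a" when the cancel lands.
        go func() {
            time.Sleep(50 * time.Millisecond)
            if ctx.Err() != nil { // the "Canceled worker..." exit: never drains workCh
                return
            }
            resultCh <- fmt.Errorf("finished %s", <-workCh)
        }()

        workCh <- "b" // 1. main goroutine pushes work
        cancel()      // 2. context canceled
        // 3. the worker exits via its cancellation check without reading "b"
        // 4. a bare <-resultCh here would block forever
        select {
        case err := <-resultCh:
            fmt.Println("result:", err)
        case <-time.After(time.Second):
            fmt.Println("stuck: work in flight, but no workers left to feed resultCh")
        }
    }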

With this commit, we drain results when they are available, but we
also respect the context, so the resultCh read can be canceled.  When
we have been canceled with work still in flight, we also attempt a
non-blocking read from workCh to drain out anything that has not yet
been picked up by a worker.  Because 'done' will not be set true,
we'll call getNextNode again and take a fresh pass through the for
loop.  ctx.Err() will no longer be nil, but if the workCh drain
worked, we may now have inflight == 0, so we land in the case that
sets 'done' true and break out of the for loop on that pass.
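
Isolated from RunGraph's bookkeeping, the fixed collection step has
roughly the following shape (same hypothetical setup as the sketch
above; the real change is in the diff below):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        workCh := make(chan string, 1)
        resultCh := make(chan error, 1)

        // Same worker as before: busy when the cancel lands, exits without draining workCh.
        go func() {
            time.Sleep(50 * time.Millisecond)
            if ctx.Err() != nil {
                return
            }
            resultCh <- fmt.Errorf("finished %s", <-workCh)
        }()

        workCh <- "b"
        cancel()

        inflight := 1
        for inflight > 0 {
            select {
            case err := <-resultCh: // a worker reported back; collect it
                fmt.Println("result:", err)
                inflight--
            case <-ctx.Done():
                select {
                case w := <-workCh: // workers have quit, so reclaim queued work ourselves
                    fmt.Println("drained unstarted work:", w)
                    inflight--
                default: // nothing queued right now; take another pass
                }
            }
        }
        fmt.Println("nothing in flight; done")
    }

Once the queued work has been reclaimed, inflight drops to zero and
the loop exits instead of blocking on resultCh.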

The unit test sets up two parallel nodes: a and b.  We configure one
worker, which picks up node a.  Node b doesn't block on node a, so it
gets pushed into workCh while the worker grinds through node a.  On
its second task in node a, the worker cancels the run.  Because the
sleeps do not have select-ctx.Done guards, the worker finishes off
that second task, notices the cancel as it enters its third task,
and exits with the "context canceled" error.  This leaves node b stuck
in workCh, and we need the fix from this commit to avoid deadlocking
on that in-flight node.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900
wking committed Sep 16, 2020
1 parent 6d56c65 commit 622e04f
Showing 2 changed files with 32 additions and 3 deletions.
15 changes: 12 additions & 3 deletions pkg/payload/task_graph.go
@@ -507,9 +507,18 @@ func RunGraph(ctx context.Context, graph *TaskGraph, maxParallelism int, fn func
 		case <-ctx.Done():
 		}
 	case inflight > 0: // no work available to push; collect results
-		result := <-resultCh
-		results[result.index] = &result
-		inflight--
+		select {
+		case result := <-resultCh:
+			results[result.index] = &result
+			inflight--
+		case <-ctx.Done():
+			select {
+			case runTask := <-workCh: // workers canceled, so remove any work from the queue ourselves
+				inflight--
+				submitted[runTask.index] = false
+			default:
+			}
+		}
 	default: // no work to push and nothing in flight. We're done
 		done = true
 	}

20 changes: 20 additions & 0 deletions pkg/payload/task_graph_test.go
@@ -831,6 +831,26 @@ func TestRunGraph(t *testing.T) {
 				}
 			},
 		},
+		{
+			name: "mid-task cancellation with work in queue does not deadlock",
+			nodes: []*TaskNode{
+				{Tasks: tasks("a1", "a2", "a3")},
+				{Tasks: tasks("b")},
+			},
+			sleep:    time.Millisecond,
+			parallel: 1,
+			errorOn: func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
+				if err := ctx.Err(); err != nil {
+					return err
+				}
+				if name == "a2" {
+					cancelFn()
+				}
+				return nil
+			},
+			want:     []string{"a1", "a2"},
+			wantErrs: []string{"context canceled"},
+		},
 		{
 			name: "task errors in parallel nodes both reported",
 			nodes: []*TaskNode{
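
Assuming the repository's usual layout, something like
'go test ./pkg/payload -run TestRunGraph' should exercise this table,
including the new cancellation case.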
