🐛 Fix(manager): Prevent goroutine leak on shutdown timeout #3247

Open
jingyih wants to merge 2 commits into main

Conversation

@jingyih commented Jun 26, 2025

Fixes #3218

A race condition during manager shutdown can cause a goroutine leak if the graceful shutdown period is exceeded.

When the gracefulShutdownTimeout is reached, the manager stops waiting for runnables and terminates the goroutine responsible for draining errors from runnables. If a slow runnable subsequently fails, its attempt to send an error on the shared channel (errChan) will block its goroutine forever, causing a leak.

This change prevents the leak by replacing the blocking error send in the runnableGroup with a non-blocking select statement. If the context is canceled, indicating a shutdown is in progress, the error is logged and dropped. This allows the runnable's goroutine to terminate cleanly instead of blocking forever.
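For illustration, the pattern the fix relies on can be shown with a minimal, self-contained sketch (hypothetical names such as runWorker; this is not the actual runnableGroup code):

```go
// Minimal sketch of the fix: a context-aware send prevents the worker
// goroutine from blocking forever once nothing is draining errChan.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func runWorker(ctx context.Context, errChan chan error) {
	// Simulate a runnable that fails only after the graceful shutdown
	// period has already expired.
	time.Sleep(200 * time.Millisecond)
	err := errors.New("slow runnable failed")

	// Previously this was a plain blocking send (errChan <- err), which
	// leaks this goroutine if the receiver has already gone away.
	select {
	case errChan <- err:
		// Error delivered while someone is still draining errChan.
	case <-ctx.Done():
		// Shutdown is in progress and nobody is reading errChan anymore;
		// log and drop the error so this goroutine can exit.
		fmt.Println("dropping error during shutdown:", err)
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	errChan := make(chan error) // unbuffered, like the manager's errChan

	go runWorker(ctx, errChan)

	// The "manager" gives up waiting and stops draining errChan.
	cancel()
	time.Sleep(300 * time.Millisecond)
}
```

The actual change applies the same context-aware send inside (*runnableGroup).reconcile, as shown in the review discussion below.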

Example test failure without this fix:

$ go test ./pkg/manager -v -ginkgo.focus="should not leak goroutines when a runnable returns error slowly after being signaled to stop"
=== RUN   TestSource
Running Suite: Manager Suite - /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager
==============================================================================================================================
Random Seed: 1750977926

Will run 1 of 98 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
• [FAILED] [1.712 seconds]
manger.Manager [It] should not leak goroutines when a runnable returns error slowly after being signaled to stop
/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/manager_test.go:1889

  Timeline >>
  2025-06-26T22:45:31Z  INFO    Stopping and waiting for non leader election runnables
  2025-06-26T22:45:31Z  INFO    Stopping and waiting for leader election runnables
  2025-06-26T22:45:31Z  INFO    Stopping and waiting for caches
  2025-06-26T22:45:31Z  INFO    Stopping and waiting for webhooks
  2025-06-26T22:45:31Z  INFO    Stopping and waiting for HTTP servers
  2025-06-26T22:45:31Z  INFO    Wait completed, proceeding to shutdown the manager
  [FAILED] in [It] - /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/manager_test.go:1934 @ 06/26/25 22:45:32.755
  << Timeline

  [FAILED] Timed out after 1.360s.
  Expected success, but got an error:
      <*errors.errorString | 0xc000194450>: 
      found unexpected goroutines:
      [Goroutine 65 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
      sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002f4840)
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:227 +0xe5
      created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 262
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:210 +0x191
       Goroutine 274 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:
      sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002f4860)
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:227 +0xe5
      created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 262
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:210 +0x191
       Goroutine 269 in state sync.WaitGroup.Wait, with sync.runtime_SemacquireWaitGroup on top of the stack:
      sync.runtime_SemacquireWaitGroup(0xc0002e9f98?)
        /usr/local/google/home/jingyih/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/runtime/sema.go:110 +0x25
      sync.(*WaitGroup).Wait(0xc0002e9fd0?)
        /usr/local/google/home/jingyih/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/sync/waitgroup.go:118 +0x48
      sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).StopAndWait.func1.2()
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:307 +0x4e
      created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).StopAndWait.func1 in goroutine 267
        /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:304 +0x118
      ]
      {
          s: "found unexpected goroutines:\n[Goroutine 65 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002f4840)\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:227 +0xe5\ncreated by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 262\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:210 +0x191\n Goroutine 274 in state chan send, with sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 on top of the stack:\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002f4860)\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:227 +0xe5\ncreated by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 262\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:210 +0x191\n Goroutine 269 in state sync.WaitGroup.Wait, with sync.runtime_SemacquireWaitGroup on top of the stack:\nsync.runtime_SemacquireWaitGroup(0xc0002e9f98?)\n\t/usr/local/google/home/jingyih/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/runtime/sema.go:110 +0x25\nsync.(*WaitGroup).Wait(0xc0002e9fd0?)\n\t/usr/local/google/home/jingyih/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.0.linux-amd64/src/sync/waitgroup.go:118 +0x48\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).StopAndWait.func1.2()\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:307 +0x4e\ncreated by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).StopAndWait.func1 in goroutine 267\n\t/usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/runnable_group.go:304 +0x118\n]",
      }
  In [It] at: /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/manager_test.go:1934 @ 06/26/25 22:45:32.755
------------------------------
SSSSSSSSSSSSS

Summarizing 1 Failure:
  [FAIL] manger.Manager [It] should not leak goroutines when a runnable returns error slowly after being signaled to stop
  /usr/local/google/home/jingyih/go/src/github.com/kubernetes-sigs/controller-runtime/pkg/manager/manager_test.go:1934

Ran 1 of 98 Specs in 7.215 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 97 Skipped
--- FAIL: TestSource (7.22s)
FAIL
FAIL    sigs.k8s.io/controller-runtime/pkg/manager      7.279s
FAIL

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 26, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jingyih
Once this PR has been reviewed and has the lgtm label, please assign vincepri for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 26, 2025
@k8s-ci-robot (Contributor)

Hi @jingyih. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 26, 2025
@alvaroaleman (Member) left a comment


/ok-to-test

select {
case r.errChan <- err:
// Error sent successfully
case <-r.ctx.Done():
Member

Shutdown is triggered by cancelling the context, and select is not ordered, i.e. if both clauses are valid, it is not deterministic which one it will use. Shouldn't this be a default instead, to ensure we try sending on the errChan first?
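One way to read that suggestion is a prioritized send: first attempt a non-blocking send on errChan, and only fall back to waiting on the context if no receiver is ready. A sketch of that idea (illustrative only, not part of this PR's diff):

```go
// Prefer delivering the error when a receiver is ready, even if the
// context has already been cancelled (sketch, not the PR's diff).
select {
case r.errChan <- err:
	// Error sent successfully.
default:
	// No receiver ready right now; wait for either a receiver or shutdown.
	select {
	case r.errChan <- err:
	case <-r.ctx.Done():
		// Shutdown in progress and nobody is draining errChan; drop the
		// error so the goroutine can exit instead of leaking.
	}
}
```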

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2025
@k8s-ci-robot (Contributor)

@jingyih: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-controller-runtime-test
Commit: ff4325f
Required: true
Rerun command: /test pull-controller-runtime-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Context cancellation causes a goroutine leak (possibly intermittent)
3 participants