proxy_store: improve error handling #107

Merged

Conversation

Contributor

@moio moio commented Jun 6, 2023

This draft PR aims to prove a theory about a goroutine leak observed by one Rancher user after the 2.7.1 -> 2.7.2 update.

This is one of the potential causes of SURE-6256 aka rancher/rancher#41225 currently under investigation.

The main symptom is many goroutines parked at returnErr:

      flat  flat%   sum%        cum   cum%
    244377   100%   100%     244377   100%  runtime.gopark
         0     0%   100%     119986 49.10%  runtime.goparkunlock (inline)
         0     0%   100%     113097 46.28%  runtime.chansend
         0     0%   100%     113097 46.28%  runtime.chansend1
         0     0%   100%     112277 45.94%  golang.org/x/sync/errgroup.(*Group).Go.func1
         0     0%   100%     111938 45.80%  runtime.semacquire1
         0     0%   100%     111938 45.80%  sync.(*WaitGroup).Wait
         0     0%   100%     111938 45.80%  sync.runtime_Semacquire
         0     0%   100%     111937 45.80%  golang.org/x/sync/errgroup.(*Group).Wait
         0     0%   100%     111935 45.80%  github.com/rancher/steve/pkg/stores/proxy.(*Store).listAndWatch
         0     0%   100%     111935 45.80%  github.com/rancher/steve/pkg/stores/proxy.(*Store).listAndWatch.func3
         0     0%   100%     111935 45.80%  github.com/rancher/steve/pkg/stores/proxy.(*Store).watch.func1
         0     0%   100%     111837 45.76%  github.com/rancher/steve/pkg/stores/proxy.returnErr (inline)
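
(For context: a goroutine profile like the one above can be collected from any Go process that exposes the standard net/http/pprof handlers. The sketch below is a generic illustration, not part of this PR or of Rancher's actual setup.)

```go
// Generic sketch: exposing pprof and fetching a goroutine profile.
// Importing net/http/pprof registers /debug/pprof/* on the default mux;
// a profile like the one above can then be obtained with e.g.
//   go tool pprof http://localhost:6060/debug/pprof/goroutine
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```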

Theory:

  1. returnErr waits on c <- because nobody is ever reading from c

     func returnErr(err error, c chan watch.Event) {
         c <- watch.Event{

  2. c is, in most cases, read by a tight for loop, which guarantees that returnErr will eventually return. But in one case, an error condition can make that loop return before the channel c is closed:

     for item := range c {
         m, err := meta.Accessor(item.Object)
         if err != nil {
             return

(note: WatchNames gets c from watch, which creates it and passes it to listAndWatch, which in turn calls returnErr. Hence, WatchNames’s for loop reads from the channel returnErr writes to)

  3. In principle, if some item.Object coming out of c is a runtime.Object but not a metav1.Object, then WatchNames’s for loop quits early, nobody reads from c any longer, and the next call to returnErr blocks until c is closed. But c is never closed: c is closed by watch, which waits for listAndWatch to terminate, which in turn waits for returnErr to terminate, and that never happens because nobody reads from c. This deadlock causes a goroutine leak, and possibly an associated memory leak (see the self-contained sketch after this list).
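
Below is a minimal, self-contained sketch of this pattern (names are illustrative, not the actual steve code): the reader returns early, the next send into the unbuffered channel parks forever, and the channel is never closed because closing it waits for the writer to finish.

```go
// Minimal sketch of the suspected leak: a writer goroutine parked on a
// channel send after its reader has returned early.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func writer(c chan int, done chan struct{}) {
	defer close(done) // never runs: the goroutine parks on the second send
	c <- 1            // consumed by the reader
	c <- 2            // the reader has already returned: parked forever
}

func reader(c chan int) {
	for item := range c {
		if item == 1 { // analogous to the early `return` in WatchNames
			return
		}
		fmt.Println(item)
	}
}

func main() {
	c := make(chan int)
	done := make(chan struct{})
	go writer(c, done)
	reader(c)

	time.Sleep(100 * time.Millisecond)
	// The writer goroutine is still alive, blocked at `c <- 2`, just like the
	// goroutines parked at returnErr in the profile above.
	fmt.Println("goroutines:", runtime.NumGoroutine())
	// Waiting on done here would block forever, like watch waiting on listAndWatch.
}
```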

Note that the suspect if statement that terminates the loop was indeed introduced with 2.7.2, specifically:

60d234d#diff-82b79befb09b4c7baa9a34e0412389dc2be544a3e56086726419bbf939a9e9b9R347-R350

CC @rmweir

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@rmweir rmweir changed the base branch from master to v2.7-debug-41225-1 June 6, 2023 21:55
@rmweir rmweir marked this pull request as ready for review June 6, 2023 21:56
@rmweir rmweir self-requested a review June 6, 2023 21:56
Contributor

@rmweir rmweir left a comment


LGTM for debugging purposes

@rmweir rmweir merged commit 776b32d into rancher:v2.7-debug-41225-1 Jun 6, 2023
1 check failed
    if status, ok := item.Object.(*metav1.Status); ok {
        logrus.Debugf("WatchNames received error: %s", status.Message)
    } else {
        logrus.Debugf("WatchNames received error: %s", item)

I don't think watch.Event has a String(), so this likely would have failed go vet and will result in an improperly formatted log message. Did you want %v instead of %s?
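
(Illustration with a hypothetical type, not steve code: %s on a value whose fields don't implement fmt.Stringer produces "%!s(...)" noise, while %v uses the default formatting.)

```go
// Why %v is preferable here: %s on a struct with non-string, non-Stringer
// fields produces "%!s(...)" output; %v prints the default representation.
package main

import "fmt"

type event struct { // hypothetical stand-in for watch.Event
	Code   int
	Status string
}

func main() {
	e := event{Code: 42, Status: "down"}
	fmt.Printf("%s\n", e) // {%!s(int=42) down}  (go vet flags this line)
	fmt.Printf("%v\n", e) // {42 down}
}
```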


It looks like the CI validate step is broken at the moment?

Contributor Author


Right on both counts. Fixed in #109 and #110 for master; v2.7-debug-41225-1 was fixed manually.

moio added a commit to moio/steve that referenced this pull request Jun 9, 2023
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a
`metav1.Object`, as in that case `WatchNames`’s `for` loop will quit early and
subsequent calls to `returnErr` will remain parked forever.

This helps with rancher/rancher#41225
Fuller explanation in rancher#107
moio added a commit to moio/steve that referenced this pull request Jun 9, 2023
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a
`metav1.Object`, as in that case `WatchNames`’s `for` loop will quit early and
subsequent calls to `returnErr` will remain parked forever.

This helps with rancher/rancher#41225
Fuller explanation in rancher#107
rmweir pushed a commit to rmweir/steve that referenced this pull request Jun 28, 2023
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a
`metav1.Object`, as in that case `WatchNames`’s `for` loop will quit early and
subsequent calls to `returnErr` will remain parked forever.

This helps with rancher/rancher#41225
Fuller explanation in rancher#107