Fix mem leak on UDP connections #6815

Merged
merged 1 commit into from Jun 4, 2020

ddtmachado
Contributor

What does this PR do?

Fix a memory/goroutine leak on UDP connections, caused specifically by allocating a new time.Ticker in the Conn struct.

No expert here, but a heap profile taken during a stress test that created lots of new UDP connections showed that we were leaking memory, and probably goroutines (runtime.malg), as we allocated a time.Ticker in udp.(*Listener).newConn. This is also consistent with the profile sent on #6761.

[image: traefik_2 1 1_leak (heap profile before the fix)]

It could be that the Ticker, which was referenced by the Conn, was not being collected by the GC, which would explain the dead goroutines and the memory leak.

Changing the implementation to allocate and stop the ticker in the main readLoop seems to solve the memory and goroutine problems, as demonstrated in the new profile taken after running the same stress test:

[image: traefik_2 1 1_leak_fix (heap profile after the fix)]

Motivation

Issue #6761

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Comment on lines +207 to +208
ticker := time.NewTicker(c.timeout)
defer ticker.Stop()
Collaborator

@mpl mpl May 19, 2020

I don't understand at all why this would fix anything for now.
If we suppose (even though I don't know why or how) that the problem is that somehow ticker.Stop was not called, then I understand why L208 (VS L284) might make a difference.

But then why is L207 needed at all? Why can't L208 just be:

    defer c.ticker.Stop()

?

Collaborator

Also, if we keep on assuming that the problem comes from ticker.Stop not being called, that probably means that c.Close was not always called, which is still a big problem in itself. Because, among other things, that would mean c.listener.conns is not being cleaned up either (which would also be a memleak).

This is all very puzzling.

Contributor Author

I confirmed c.Close was called (actually multiple times once the timeout was reached), so c.ticker.Stop() was being called as well, but for some reason the goroutine and objects were not GCed. I'm puzzled by this as well. My best guess was that the conn stored in c.listener.conns was not being cleaned up either, but I confirmed it gets removed from the map.

So you are right that there might be more to it than just the ticker. Actually, I don't see memory getting back to its starting level after a burst of new connections; maybe it's the conn that is not being GCed for some reason.
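
The kind of check described above (Close being called more than once, and the conn being removed from the listener's map) could look something like this sketch. The `listener` and `conn` types here are hypothetical and much simpler than the real ones; `sync.Once` is one possible way to make repeated Close calls harmless, not necessarily what the Traefik code does.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical stand-in that tracks live conns by remote address.
type listener struct {
	mu    sync.Mutex
	conns map[string]*conn
}

// conn is a hypothetical stand-in for a tracked UDP session.
type conn struct {
	key       string
	listener  *listener
	closeOnce sync.Once
}

func newListener() *listener {
	return &listener{conns: make(map[string]*conn)}
}

// add registers a new conn under the given remote address.
func (l *listener) add(key string) *conn {
	c := &conn{key: key, listener: l}
	l.mu.Lock()
	l.conns[key] = c
	l.mu.Unlock()
	return c
}

// count returns the number of conns still tracked by the listener.
func (l *listener) count() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.conns)
}

// Close removes the conn from the listener's map. It is safe to call
// multiple times; only the first call does any work.
func (c *conn) Close() {
	c.closeOnce.Do(func() {
		c.listener.mu.Lock()
		delete(c.listener.conns, c.key)
		c.listener.mu.Unlock()
	})
}

func main() {
	l := newListener()
	c := l.add("192.0.2.1:5000")
	c.Close()
	c.Close() // second call is a no-op
	fmt.Println("conns still tracked:", l.count())
}
```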

Collaborator

@mpl mpl left a comment

I think the initial code was correct, and I think the new code is also correct, and since it seems to fix the problem, I think we should merge ASAP.

However, we should take the time at some point to properly understand what was going on.
One coworker's hypothesis is that there actually is no memleak, and that we're simply allocating too aggressively for the GC to cope. This PR would then have the side effect of moving the memory footprint from the heap to the stack, which would avoid said allocation. We could test that hypothesis by e.g. running the initial code and problematic repro with the GC set to be much more aggressive.

Contributor

@dtomcej dtomcej left a comment

LGTM
:shipit:

@ldez ldez added the kind/bug/fix a bug fix label Jun 2, 2020
@ldez ldez added this to To review in v2 via automation Jun 2, 2020
Member

@juliens juliens left a comment

LGTM

@traefiker traefiker merged commit 48c73d6 into traefik:v2.2 Jun 4, 2020
v2 automation moved this from To review to Done Jun 4, 2020
@jbdoumenjou jbdoumenjou mentioned this pull request Jul 22, 2020
@ddtmachado ddtmachado deleted the GH6761-udp-leak branch September 8, 2022 23:40