
fix memory leak #1156

Closed
wants to merge 1 commit into from

Conversation

melekes
Contributor

@melekes melekes commented Jan 25, 2018


Refs cosmos/gaia#108

Test case:

```go
func TestSwitchMemoryLeak(t *testing.T) {
        sw := MakeSwitch(config, 1, "testing", "123.123.123", initSwitchFunc)
        err := sw.Start()
        if err != nil {
                t.Error(err)
        }
        defer sw.Stop()

        // simulate remote peer
        rp := &remotePeer{PrivKey: crypto.GenPrivKeyEd25519().Wrap(), Config: DefaultPeerConfig()}
        rp.Start()
        defer rp.Stop()

        p, err := newOutboundPeer(rp.Addr(), sw.reactorsByCh, sw.chDescs, sw.StopPeerForError, sw.nodeKey.PrivKey, DefaultPeerConfig(), true)
        require.Nil(t, err)
        id := p.ID()
        err = sw.addPeer(p)
        require.Nil(t, err)

        f, err := os.Create("/tmp/mem1.mprof")
        if err != nil {
                t.Fatal(err)
        }
        pprof.WriteHeapProfile(f)
        f.Close()

        fmt.Println(runtime.NumGoroutine())

        // simulate failure by closing connection
        p.CloseConn()

        npeers := sw.Peers().Size()
        for i := 0; i < 100; i++ {
                time.Sleep(250 * time.Millisecond)
                npeers = sw.Peers().Size()
                if npeers > 0 {
                        // simulate failure by closing connection
                        sw.Peers().Get(id).(*peer).CloseConn()
                }
        }

        runtime.GC()

        f, err = os.Create("/tmp/mem2.mprof")
        if err != nil {
                t.Fatal(err)
        }
        pprof.WriteHeapProfile(f)
        f.Close()

        fmt.Println(runtime.NumGoroutine())
}
```
@ebuchman
Contributor

Data races

@melekes
Contributor Author

melekes commented Jan 25, 2018

Yeah, sorry. Hm... looks like we're using the peer after we stopped it... duh

@melekes
Contributor Author

melekes commented Jan 25, 2018

Needs more ❤️

@melekes melekes closed this Jan 25, 2018
@melekes
Contributor Author

melekes commented Jan 26, 2018

Running goroutines:

goroutine profile: total 8
2 @ 0x42f15c 0x43f3e9 0x6e9dbc 0x45e381
#       0x6e9dbb        github.com/tendermint/tendermint/vendor/github.com/tendermint/tmlibs/common.(*RepeatTimer).fireRoutine+0x12b  /home/vagrant/go/src/github.com/tendermint/tendermint/vendor/github.com/tendermint/tmlibs/common/repeat_timer.go:158

1 @ 0x4114a2 0x444eb6 0x6de352 0x45e381
#       0x444eb5        os/signal.signal_recv+0xa5      /usr/lib/go-1.9/src/runtime/sigqueue.go:131
#       0x6de351        os/signal.loop+0x21             /usr/lib/go-1.9/src/os/signal/signal_unix.go:22

1 @ 0x42f15c 0x42a02a 0x429627 0x486cee 0x486d6d 0x487b0a 0x556b72 0x56a48d 0x4682c6 0x468438 0x729aba 0x51aeab 0x4682c6 0x468438 0x6f55fb 0x6edf4e 0x726aee 0x45e381
#       0x429626        internal/poll.runtime_pollWait+0x56                          /usr/lib/go-1.9/src/runtime/netpoll.go:173
#       0x486ced        internal/poll.(*pollDesc).wait+0xad                          /usr/lib/go-1.9/src/internal/poll/fd_poll_runtime.go:85
#       0x486d6c        internal/poll.(*pollDesc).waitRead+0x3c                      /usr/lib/go-1.9/src/internal/poll/fd_poll_runtime.go:90
#       0x487b09        internal/poll.(*FD).Read+0x189                               /usr/lib/go-1.9/src/internal/poll/fd_unix.go:126
#       0x556b71        net.(*netFD).Read+0x51                                       /usr/lib/go-1.9/src/net/fd_unix.go:202
#       0x56a48c        net.(*conn).Read+0x6c                                        /usr/lib/go-1.9/src/net/net.go:176
#       0x4682c5        io.ReadAtLeast+0x85                                          /usr/lib/go-1.9/src/io/io.go:309
#       0x468437        io.ReadFull+0x57                                             /usr/lib/go-1.9/src/io/io.go:327
#       0x729ab9        github.com/tendermint/tendermint/p2p/conn.(*SecretConnection).Read+0x1c9              /home/vagrant/go/src/github.com/tendermint/tendermint/p2p/conn/secret_connection.go:155
#       0x51aeaa        bufio.(*Reader).Read+0x30a                                   /usr/lib/go-1.9/src/bufio/bufio.go:213
#       0x4682c5        io.ReadAtLeast+0x85                                          /usr/lib/go-1.9/src/io/io.go:309
#       0x468437        io.ReadFull+0x57                                             /usr/lib/go-1.9/src/io/io.go:327
#       0x6f55fa        github.com/tendermint/tendermint/vendor/github.com/tendermint/go-wire.ReadFull+0x6a   /home/vagrant/go/src/github.com/tendermint/tendermint/vendor/github.com/tendermint/go-wire/wire.go:168
#       0x6edf4d        github.com/tendermint/tendermint/vendor/github.com/tendermint/go-wire.ReadByte+0x7d   /home/vagrant/go/src/github.com/tendermint/tendermint/vendor/github.com/tendermint/go-wire/int.go:62
#       0x726aed        github.com/tendermint/tendermint/p2p/conn.(*MConnection).recvRoutine+0x11d            /home/vagrant/go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:437

1 @ 0x42f15c 0x42a02a 0x429627 0x486cee 0x486d6d 0x4891c2 0x557592 0x5750de 0x5736f9 0x7af869 0x45e381
#       0x429626        internal/poll.runtime_pollWait+0x56                          /usr/lib/go-1.9/src/runtime/netpoll.go:173
#       0x486ced        internal/poll.(*pollDesc).wait+0xad                          /usr/lib/go-1.9/src/internal/poll/fd_poll_runtime.go:85
#       0x486d6c        internal/poll.(*pollDesc).waitRead+0x3c                      /usr/lib/go-1.9/src/internal/poll/fd_poll_runtime.go:90
#       0x4891c1        internal/poll.(*FD).Accept+0x1e1                             /usr/lib/go-1.9/src/internal/poll/fd_unix.go:335
#       0x557591        net.(*netFD).accept+0x41                                     /usr/lib/go-1.9/src/net/fd_unix.go:238
#       0x5750dd        net.(*TCPListener).accept+0x2d                               /usr/lib/go-1.9/src/net/tcpsock_posix.go:136
#       0x5736f8        net.(*TCPListener).Accept+0x48                               /usr/lib/go-1.9/src/net/tcpsock.go:247
#       0x7af868        github.com/tendermint/tendermint/p2p.(*remotePeer).accept+0x58               /home/vagrant/go/src/github.com/tendermint/tendermint/p2p/peer_test.go:140

1 @ 0x42f15c 0x42f24e 0x406424 0x4060cb 0x4efe1c 0x4f37b4 0x4efac0 0x4f11f8 0x4f01c1 0x7b882b 0x42eca6 0x45e381
#       0x4efe1b        testing.(*T).Run+0x2fb          /usr/lib/go-1.9/src/testing/testing.go:790
#       0x4f37b3        testing.runTests.func1+0x63     /usr/lib/go-1.9/src/testing/testing.go:1004
#       0x4efabf        testing.tRunner+0xcf            /usr/lib/go-1.9/src/testing/testing.go:746
#       0x4f11f7        testing.runTests+0x2d7          /usr/lib/go-1.9/src/testing/testing.go:1002
#       0x4f01c0        testing.(*M).Run+0x110          /usr/lib/go-1.9/src/testing/testing.go:921
#       0x7b882a        main.main+0xda                  github.com/tendermint/tendermint/p2p/_test/_testmain.go:90
#       0x42eca5        runtime.main+0x225              /usr/lib/go-1.9/src/runtime/proc.go:195

1 @ 0x42f15c 0x43f3e9 0x7260f0 0x45e381
#       0x7260ef        github.com/tendermint/tendermint/p2p/conn.(*MConnection).sendRoutine+0x29f    /home/vagrant/go/src/github.com/tendermint/tendermint/p2p/conn/connection.go:307

1 @ 0x538b32 0x538930 0x53541b 0x7b3df7 0x4efac0 0x45e381
#       0x538b31        runtime/pprof.writeRuntimeProfile+0xa1                       /usr/lib/go-1.9/src/runtime/pprof/pprof.go:637
#       0x53892f        runtime/pprof.writeGoroutine+0x9f                            /usr/lib/go-1.9/src/runtime/pprof/pprof.go:599
#       0x53541a        runtime/pprof.(*Profile).WriteTo+0x3aa                       /usr/lib/go-1.9/src/runtime/pprof/pprof.go:310
#       0x7b3df6        github.com/tendermint/tendermint/p2p.TestSwitchReconnectsToPersistentPeer2+0x986      /home/vagrant/go/src/github.com/tendermint/tendermint/p2p/switch_test.go:344
#       0x4efabf        testing.tRunner+0xcf                                         /usr/lib/go-1.9/src/testing/testing.go:746

@melekes
Contributor Author

melekes commented Jan 29, 2018

OK, so some consensus reactor is definitely holding a reference to a peer; that's why it's not garbage collected. See also https://stackoverflow.com/questions/48499573/is-there-a-way-to-know-who-holds-a-reference-to-an-object-in-go

@melekes
Contributor Author

melekes commented Jan 29, 2018

mempool reactor

@melekes
Contributor Author

melekes commented Jan 29, 2018

114 @ 0x42f2bc 0x42f3ae 0x440794 0x4403b9 0x468002 0x9fe32d 0x9ff78f 0xa025ed 0x45e571
#       0x4403b8        sync.runtime_Semacquire+0x38                                 /usr/lib/go-1.9/src/runtime/sema.go:56
#       0x468001        sync.(*WaitGroup).Wait+0x71                                  /usr/lib/go-1.9/src/sync/waitgroup.go:131
#       0x9fe32c        github.com/tendermint/tendermint/vendor/github.com/tendermint/tmlibs/clist.(*CList).FrontWait+0x2c    /home/vagrant/go/src/github.com/tendermint/tendermint/vendor/github.com/tendermint/tmlibs/clist/clist.go:233
#       0x9ff78e        github.com/tendermint/tendermint/mempool.(*Mempool).TxsFrontWait+0x2e                                 /home/vagrant/go/src/github.com/tendermint/tendermint/mempool/mempool.go:184
#       0xa025ec        github.com/tendermint/tendermint/mempool.(*MempoolReactor).broadcastTxRoutine+0x25c                   /home/vagrant/go/src/github.com/tendermint/tendermint/mempool/reactor.go:120

melekes added a commit that referenced this pull request Jan 30, 2018
Leaking goroutine:
```
114 @ 0x42f2bc 0x42f3ae 0x440794 0x4403b9 0x468002 0x9fe32d 0x9ff78f 0xa025ed 0x45e571
```

Explanation:
It blocks on an empty clist forever, so unless there are txs coming in,
this goroutine will just sit there, holding onto the peer too.
If we're constantly reconnecting to some peer, old instances are never
garbage collected, leading to a memory leak.

Fixes cosmos/gaia#108
Previous attempt #1156
melekes added a commit that referenced this pull request Feb 5, 2018