
bug: possible memory leak #1662

Closed
jm-clius opened this issue Apr 11, 2023 · 24 comments
Labels
bug Something isn't working critical This issue needs critical attention

Comments

@jm-clius
Contributor

Problem

A possible memory leak was introduced or activated on the wakuv2.test fleet, roughly between 30 March and 5 April.

Impact

Critical. Unbounded memory growth causes crashes and instability.

Screenshots/logs

image

nwaku version/commit hash

At least c26dcb2 and later

Additional context

This memory leak is not visible on status.test, which runs the same version of nwaku
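For context (not part of the original report), a minimal way to confirm unbounded growth on a fleet node is to sample the container's memory usage over time; the container name and interval below are assumptions:

    # Append a timestamped memory sample every 5 minutes; plot the CSV later to see the trend.
    # "nim-waku-v2" is assumed to be the container name used on the fleet hosts.
    while true; do
      echo "$(date -u +%FT%TZ),$(docker stats --no-stream --format '{{.MemUsage}}' nim-waku-v2)" >> mem_usage.csv
      sleep 300
    done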

@jm-clius jm-clius added bug Something isn't working critical This issue needs critical attention track:maintenance labels Apr 11, 2023
@alrevuelta
Contributor

I just deployed latest master (statusteam/nim-waku:9e1432c9) in my minifleet with just relay enabled, in an attempt to point to which protocol may be causing issues.

@Ivansete-status Ivansete-status self-assigned this Apr 11, 2023
@alrevuelta
Contributor

I just deployed latest master (statusteam/nim-waku:9e1432c9) in my minifleet with just relay enabled, in an attempt to point to which protocol may be causing issues.

Can't reproduce the memory leak with the above setup: 20 peers with just relay and discovery enabled, 100 msg/second and 2 kB msg size (actually more traffic than we have in the fleet). Container mem usage is <50 MB over almost 1 day.

So I would say the mem leak is not related to the relay protocol.
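For reference, a hedged sketch of what such a relay-only node could look like; the image tag is taken from the comment above, but the flag names are assumptions that depend on the wakunode2 version:

    # Relay + discovery only: store, filter and lightpush disabled (flag names assumed).
    # (ports, keys and discv5 bootstrap nodes omitted for brevity)
    docker run -d --name waku-relay-only \
      statusteam/nim-waku:9e1432c9 \
      --relay=true \
      --discv5-discovery=true \
      --store=false \
      --filter=false \
      --lightpush=false \
      --websocket-support=false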

@jm-clius
Contributor Author

We've run overnight on wakuv2.test:
commit c26dcb2 on ams3
and the latest commit (73cbafa) on hong-kong, restarting the containers simultaneously.
Currently ams3 is at ~400MB memory and hong-kong is over 1 GB, so it seems likely that the leak was introduced somewhere in between.

I'm also running several other versions on a sandbox machine, but so far the leak is only observed on the fleets.

Next step is to try to pin the specific commit that caused the leak. Since we suspect that one of the submodule updates may be to blame, I'll deploy the commit just after the two submodule updates next.
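A quick way to enumerate the suspect range, and the submodule bumps in it, from the nwaku repo (assuming the submodules live under vendor/ as usual):

    # All commits between the known-good and the suspected-bad deployment...
    git log --oneline c26dcb2..73cbafa
    # ...and only those touching the vendored submodules, the prime suspects here.
    git log --oneline c26dcb2..73cbafa -- vendor/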

@Ivansete-status
Collaborator

  • ams3 is at 1.09GB - c26dcb2

  • us-central1-a is at 1.09GB - 73cbafa

  • The hong-kong node crashed with 862MB running 7c229ec

    Log file: /var/log/docker/nim-waku-v2/docker.20230414_0000.log.gz (node-01.ac-cn-hongkong-c.wakuv2.test)

    2023-04-13T19:02:56.472588+00:00 docker/nim-waku-v2[793]: /app/apps/wakunode2/wakunode2.nim(826) wakunode2
    2023-04-13T19:02:56.472624+00:00 docker/nim-waku-v2[793]: /app/vendor/nim-chronos/chronos/asyncloop.nim(1245) runForever
    ...
    2023-04-13T19:02:56.472882+00:00 docker/nim-waku-v2[793]: Error: unhandled exception: over- or underflow [OverflowDefect]
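Not in the original thread, but to check whether other nodes hit the same OverflowDefect, the rotated logs can be scanned, e.g.:

    # Count occurrences of unhandled exceptions across the rotated docker logs.
    zgrep -h "unhandled exception" /var/log/docker/nim-waku-v2/docker.*.log.gz | sort | uniq -c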
    

@Ivansete-status
Collaborator

@jm-clius checked the current memory status. It seems like it was present already in the v0.16.0 release - no noticeable difference between the release and later versions; all of them are currently using more than 1.2 GB of memory. Restarting ams3 with v0.15.0.

@Ivansete-status
Collaborator

  • HK has 2.51GB at 2ffd2f8. It didn't crash and clearly has a mem leak.
  • US central now has 1.08GB at v0.16.0. It crashed when it had 1.56GB.
  • Ams3 has 537MB at v0.15.0. Although it has crashed before, its memory gets properly GC'ed every now and then.

@jm-clius
Contributor Author

Thanks, @Ivansete-status! I'm starting HK with 43b3f1c - more or less halfway between the two releases, so we can narrow down the offending change.

@Ivansete-status
Collaborator

Thanks @jm-clius!
With 43b3f1c, the HK node gets correctly GC'ed at 1.17GB.

@Ivansete-status
Collaborator

In 67db35e the HK node crashed at 1.06 GB

@Ivansete-status
Collaborator

After analysing the heaptrack reports, we saw the following points as possible mem-leak candidates (a sketch of how such reports can be recorded follows the list):

  1. Chronos - one - https://github.com/status-im/nim-chronos/blob/e05d2f8e9648e17b9a71bf5e909a79241c067c8b/chronos/streams/tlsstream.nim#L366

image

  2. Chronos - tlsWriteApp - https://github.com/status-im/nim-chronos/blob/e05d2f8e9648e17b9a71bf5e909a79241c067c8b/chronos/streams/tlsstream.nim#L339

image

  3. LibP2P - sendEncoded - https://github.com/status-im/nim-libp2p/blob/a5666789b0a80e810d55844ccbbbc6c18928b462/libp2p/protocols/pubsub/pubsubpeer.nim#L262

image
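For reference, a hedged sketch of how reports like the above can be recorded with heaptrack; attaching to the running process avoids a restart (binary name, pid handling and output file name are assumptions):

    # Attach heaptrack to the running node and record allocations for a while,
    # then stop it with Ctrl+C; it writes a heaptrack.wakunode2.<pid>.gz/.zst trace file.
    heaptrack -p "$(pidof wakunode2)"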

@jm-clius
Contributor Author

Thanks for this! What is the meaning of the two values next to each other? Is that allocs and deallocs?

@Ivansete-status
Collaborator

Ivansete-status commented Apr 20, 2023

Thanks for this! What is the meaning of the two values next to each other? Is that allocs and deallocs?

Ah sorry, I missed that! It's the Peak & Leaked values, according to heaptrack.
image
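A hedged sketch of where those two numbers come from; the trace file name below is hypothetical:

    # Text summary: the report includes peak heap consumption and total leaked bytes per call site.
    heaptrack_print heaptrack.wakunode2.12345.gz | less
    # Or open the same trace in the GUI analysis.
    heaptrack --analyze heaptrack.wakunode2.12345.gz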

@Ivansete-status
Collaborator

In 67db35e the HK node is at 1.29 GB. It has a mem leak.

@jm-clius
Contributor Author

While we're investigating the ways in which these allocations could leak and if it's due to nwaku or submodules, I've done the following:

  • restarted HK with the same confirmed leaky version, but with websockets disabled - secure websockets could be the cause of the TLSStream leaks (a hedged sketch of the flag change and a quick check follows below)
  • installed ef7ce67 on AMS3 and restarted
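The flag names below are assumed from wakunode2's CLI at the time, and the websocket port (8000) is taken from later comments in this thread; treat this as a sketch, not the exact command used:

    # Relaunch with (secure) websockets off; other flags unchanged (placeholder below).
    wakunode2 <existing flags...> --websocket-support=false --websocket-secure-support=false
    # Confirm nothing is listening on the websocket port any more.
    ss -ltnp | grep ':8000' || echo "no websocket listener"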

@jm-clius
Contributor Author

jm-clius commented Apr 20, 2023

image

Websockets are looking more and more likely as the source of the leak: the green line is the HK node, running with websockets disabled after 17:15; the yellow line is AMS3, running with websockets enabled. Note, however, that ams3 has more than 10 times the number of connected peers.

@jm-clius
Contributor Author

image

14 hours of running with and without websockets. Green is without websockets.

@alrevuelta
Contributor

We can indeed see some correlation between websockets being enabled or disabled and the mem leak. Wondering where we are getting that many wss connections.

image

@alrevuelta
Contributor

Wondering also if it's fair to assume it's a websockets issue, since disabling websockets significantly reduces the number of peers.

image

@alrevuelta
Contributor

A quick summary of the findings:

  • We verified with the existing data from the fleet that disabling websockets indeed stops the leak (see).
  • But wondering if this is really the root cause of the issue. Disabling websockets also drops the connections from hundreds to just 6-8 (see), so perhaps it's not fair to say it's websockets' fault, since having a low number of peers also has implications that could be related.
  • We deployed 80 nodes in the sandbox machine, totally isolated from the internet, with the same exact configuration as in the fleets, but couldn't reproduce the leak (a rough sketch of that sandbox setup follows this list).
  • We suspect that it's not only about a given setup (release, configuration parameters, etc.) but also some external action on the fleet (e.g. filter/lp/store requests, connections (tcp or wss), etc.). Some of these external actions disappear when we test in our local sandbox.
  • We just deployed one node in the sandbox machine with the same configuration as the fleets, hoping to reproduce the leak here. The only difference is that the local database for messages is almost empty (since the node just started). If we can't reproduce the issue with this, I would disable the store protocol as the next step, as it would be the only remaining difference.
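A rough, hypothetical sketch of the isolated 80-node sandbox mentioned above; the image tag comes from earlier in the thread, while the flags are assumptions and bootstrap details are omitted:

    # Internal-only network so the nodes can see each other but nothing external reaches them.
    docker network create --internal waku-sandbox
    for i in $(seq 1 80); do
      docker run -d --name waku-sandbox-$i --network waku-sandbox \
        statusteam/nim-waku:9e1432c9 \
        --relay=true --store=true --websocket-support=true
      # (keys, ports and static/discovery bootstrap flags omitted for brevity)
    done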

@Ivansete-status
Collaborator

Thanks for the summary @alrevuelta !

Regarding the correlation between libp2p peers, the mem leak and the wss option: in the image below we can see that over the last two hours the US node has had wss disabled but still ~70 peers, and the memory stays low:

us_without_mem_leak_without_wss

@alrevuelta
Contributor

@Ivansete-status Ah I see, my bad. I thought wss had been enabled again at the start.

image

@Ivansete-status
Collaborator

The following issue was reported in nim-websock: status-im/nim-websock#138

@Ivansete-status
Collaborator

We now have an additional problem: there are many peers connecting through wss.
The following table shows the most representative peers for node-01.do-ams3.wakuv2.test (271 wss ESTABLISHED connections).

# of connections    Remote peer address
28                  103.50.33.82
51                  45.128.199.251
45                  cped4e2cb4f2b0a-cmd4e2cb4f2b08.cpe.net.cable.rogers.com
...                 ...
47                  gen-070-125-105-243.biz.spectrum.com

(netstat -ap | grep ESTABLISHED | grep 8000 | awk '{print $5}' | cut -d':' -f 1 | sort | uniq -c from within the docker container)

I checked that all connections with 45.128.199.251 are actually in use by running the following cmd for a few minutes:
tcpdump dst 45.128.199.251 | awk '{print $5}' | cut -d'.' -f 5
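Building on the one-liners above, a small loop to watch whether specific remotes keep accumulating wss connections over time:

    # Print the top remote addresses by established connections on the ws port, once a minute.
    while true; do
      date -u +%FT%TZ
      netstat -ap 2>/dev/null | grep ESTABLISHED | grep 8000 \
        | awk '{print $5}' | cut -d':' -f1 | sort | uniq -c | sort -rn | head
      sleep 60
    done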

@jm-clius
Contributor Author

Several items are being fixed here, but #1800 provides some memory improvement, coupled with better peer management techniques + gossipsub scoring.

Monitoring memory usage.

@SionoiS SionoiS unpinned this issue Jul 15, 2023