setsockopt out of memory causes babeld failure #24
On a fresh boot of the psychz exit node, home nodes dig tunnels, babel babels, and everybody's routing tables get filled with mesh routes. But...
Over time (after about 24-48 hours), routes start to slowly disappear from the routing table, and they don't return until
This appears to be due to a memory leak in babeld. When the exit node is in the bad state, looking at
i.e. babeld tries to join the socket to its IPv6 multicast group on the interface, and the setsockopt call fails with a memory allocation error.
When the exit node is in a healthy state, no such errors get logged to
It looks like there's a socket option memory leak in babeld. I think we're only seeing this bug now in the last month because someone happens to be running a weird node that disconnects and reconnects its tunnel every 5 minutes. You can see this behavior by watching
Every time the rogue node destroys and recreates a tunnel, the tunneldigger up and down hooks are run, the old tunnel interface is removed from babeld (
It seems that removing an interface from babeld does not properly clean up all used memory, and eventually babeld is unable to
Look into socket option memory allocation? Halp!
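For anyone digging into this, here is a minimal Python sketch of the kind of call that is failing: a `setsockopt` that joins an IPv6 multicast group on a specific interface. It is not babeld's code. `build_mreq` and `join_group` are hypothetical helper names, `ff02::1:6` is the Babel multicast group from the Babel RFC, and treating both `ENOMEM` and `ENOBUFS` as "out of memory" is an assumption about which errno the kernel returns here.

```python
import errno
import socket
import struct

# Babel's link-local multicast group (per the Babel RFC).
BABEL_GROUP = "ff02::1:6"


def build_mreq(group: str, ifindex: int) -> bytes:
    """Pack a struct ipv6_mreq: 16-byte group address, then interface index."""
    return socket.inet_pton(socket.AF_INET6, group) + struct.pack("@I", ifindex)


def join_group(sock, ifindex: int, group: str = BABEL_GROUP) -> bool:
    """Try to join `group` on the interface with index `ifindex`.

    Returns True on success and False if the kernel refuses for lack of
    memory (assumed here to surface as ENOMEM or ENOBUFS). Any other
    error is re-raised unchanged.
    """
    try:
        sock.setsockopt(
            socket.IPPROTO_IPV6,
            socket.IPV6_JOIN_GROUP,
            build_mreq(group, ifindex),
        )
    except OSError as exc:
        if exc.errno in (errno.ENOMEM, errno.ENOBUFS):
            return False
        raise
    return True
```

With a real socket this would be called as `join_group(sock, socket.if_nametoindex("eth0"))`, where `eth0` is only a placeholder interface name. On Linux the memory a socket may use for options is capped by the `net.core.optmem_max` sysctl, so if group memberships pile up on one long-lived socket, that cap is one plausible place to look.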
I have installed a babeld-monitor on both the HE and Psychz exit nodes to detect and work around the issue reported in #24. A systemd timer, babeld-monitor.timer, scans the babeld log for the specific memory error every 10 minutes. If the error is detected, babeld is restarted and all active tunnel interfaces are re-added to babeld. All of this can be observed in the systemd logs. The monitor is now also installed when using create_exitnode from the sudomesh/exitnode repo. Please see https://github.com/sudomesh/exitnode/tree/master/src/opt/babeld-monitor and https://github.com/sudomesh/exitnode/tree/master/src/etc/systemd/system if you'd like to learn more about this.
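The scan-then-recover logic described above can be sketched roughly as follows. This is not the actual babeld-monitor script (see the links above for that). `MEMORY_ERROR` is a hypothetical pattern because the exact log line is not quoted in this thread, the systemd unit name `babeld` is an assumption, and `readd_command` is a caller-supplied placeholder for however an interface gets re-added to babeld.

```python
import re
from typing import Callable, Iterable, List

Command = List[str]

# Hypothetical pattern: adjust it to the real error line in the babeld log.
MEMORY_ERROR = re.compile(
    r"setsockopt.*(cannot allocate memory|out of memory)", re.IGNORECASE
)


def needs_restart(log_lines: Iterable[str], pattern=MEMORY_ERROR) -> bool:
    """Return True if any log line matches the memory-error pattern."""
    return any(pattern.search(line) for line in log_lines)


def recovery_plan(
    log_lines: Iterable[str],
    tunnel_interfaces: Iterable[str],
    readd_command: Callable[[str], Command],
) -> List[Command]:
    """Build the commands to run: restart babeld, then re-add each tunnel.

    Returns an empty list when the log shows no memory error.
    """
    if not needs_restart(log_lines):
        return []
    plan = [["systemctl", "restart", "babeld"]]  # unit name is an assumption
    plan.extend(readd_command(iface) for iface in tunnel_interfaces)
    return plan
```

Keeping the decision separate from the execution means the plan can be logged or dry-run before each command is handed to something like `subprocess.run(cmd, check=True)`.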
Tried to write a dead-simple stress test today at the software working group with @eenblam and @squeeesh, but we were unable to reproduce the bug. I think our test did not go quite deep enough--an
Probably a better test would involve creating fresh network interfaces and adding to babeld instead of adding/removing my computer's default interface over and over again :-P I'm not sure what's a good way to create a bunch of functional network interfaces...
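One possible way to get fresh interfaces is the Linux `dummy` link type from iproute2. Below is a rough sketch that only generates the command sequence for such a churn test. `churn_commands` is a hypothetical helper, the `stress` name prefix is arbitrary, and `add_to_babeld` / `remove_from_babeld` are caller-supplied placeholders, since how interfaces are added to and removed from babeld depends on the setup. Whether dummy interfaces exercise the same code path as tunneldigger tunnels is untested.

```python
from typing import Callable, Iterator, List

Command = List[str]


def churn_commands(
    rounds: int,
    add_to_babeld: Callable[[str], Command],
    remove_from_babeld: Callable[[str], Command],
    prefix: str = "stress",
) -> Iterator[Command]:
    """Yield commands that create, register, unregister and delete
    one fresh dummy interface per round."""
    for i in range(rounds):
        name = f"{prefix}{i}"  # must stay within the 15-character ifname limit
        yield ["ip", "link", "add", name, "type", "dummy"]
        yield ["ip", "link", "set", name, "up"]
        yield add_to_babeld(name)
        yield remove_from_babeld(name)
        yield ["ip", "link", "delete", name]
```

Each yielded command would be run as root, for example with `subprocess.run(cmd, check=True)`, while watching the babeld log for the setsockopt error.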
Also, @eenblam noticed that in the re6stnet commit, they seem to suggest that their fix was to clean up their tunnels less aggressively. So: maybe babeld needs to
Oh, and if it /is/ a matter of giving babeld a chance to
(Hook scripts are executed in their own processes.)