setsockopt out of memory causes babeld failure #24

Open
bennlich opened this issue Mar 27, 2018 · 7 comments

Comments

Collaborator

@bennlich bennlich commented Mar 27, 2018

Thanks to https://peoplesopen.net/monitor it is now easier to track this. See #8 for early observations.

The Bug

On a fresh boot of the psychz exit node, home nodes dig tunnels, babel babels, and everybody's routing tables get filled with mesh routes. But...

Over time (after about 24-48 hours), routes start to slowly disappear from the routing table, and they don't return until babeld and tunneldigger-broker are restarted on the exit node.

Debugging

This appears to be due to a memory leak in babeld. When the exit node is in the bad state, looking at /var/log/babeld.log during a tunnel connect shows:

Warning: cannot save old configuration for l2tp4061.
setsockopt(IPV6_JOIN_GROUP): Cannot allocate memory
setsockopt(IPV6_LEAVE_GROUP): Cannot assign requested address
Warning: cannot restore old configuration for l2tp4061.

i.e. babeld tries to join its IPv6 multicast group on the new tunnel interface, and the setsockopt call fails with a memory allocation error.

When the exit node is in a healthy state, no such errors get logged to /var/log/babeld.log, and the mesh routes get added to the routing table as expected.
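For anyone who wants to check a log for this signature, here is a minimal sketch. It assumes the "cannot save old configuration" warning always sits directly above the `IPV6_JOIN_GROUP` error, which is only what the single excerpt above shows; the function name `interfaces_hit_by_enomem` is made up for illustration.

```python
import re

# Error text copied from the babeld.log excerpt above.
JOIN_ENOMEM = "setsockopt(IPV6_JOIN_GROUP): Cannot allocate memory"
# Warning format copied from the same excerpt; the capture group is the interface name.
SAVE_WARNING = re.compile(r"^Warning: cannot save old configuration for (\S+?)\.$")


def interfaces_hit_by_enomem(log_lines):
    """Return interfaces whose 'cannot save' warning is directly followed by the ENOMEM line.

    Assumption: the two lines are always adjacent, as in the one excerpt we have.
    """
    hit = []
    previous_iface = None
    for raw in log_lines:
        line = raw.strip()
        if line == JOIN_ENOMEM and previous_iface is not None:
            hit.append(previous_iface)
        match = SAVE_WARNING.match(line)
        previous_iface = match.group(1) if match else None
    return hit
```

Feeding it the four log lines above should report `l2tp4061`; a healthy log should give an empty list.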

Conclusion

It looks like there's a socket option memory leak in babeld. I think we've only started seeing this bug in the last month because someone happens to be running a weird node that disconnects and reconnects its tunnel every 5 minutes. You can see this behavior by watching /var/log/syslog on the psychz node for 5 minutes.

Every time the rogue node destroys and recreates a tunnel, the tunneldigger up and down hooks are run, the old tunnel interface is removed from babeld (babeld -x $ifname) and the new tunnel interface is added (babeld -a $ifname).

It seems that removing an interface from babeld does not properly clean up all used memory, and eventually babeld is unable to setsockopt on new sockets.
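To make the suspected failure mode concrete, here is a toy model (not babeld or kernel code). It assumes, as my understanding of Linux rather than something confirmed on the exit node, that each multicast membership is charged against a per-socket option-memory budget, and that a membership which is never released keeps occupying that budget. `ToySocket`, the 20480-byte budget, and the 64-byte cost per membership are all hypothetical.

```python
import errno


class ToySocket:
    """Toy model of one long-lived socket with a fixed option-memory budget.

    The budget and per-membership cost are made-up numbers for illustration,
    not values measured on the exit node.
    """

    def __init__(self, budget_bytes=20480, cost_per_membership=64):
        self.budget_bytes = budget_bytes
        self.cost = cost_per_membership
        self.memberships = set()

    def join_group(self, ifname):
        used = len(self.memberships) * self.cost
        if used + self.cost > self.budget_bytes:
            raise OSError(errno.ENOMEM, "Cannot allocate memory")
        self.memberships.add(ifname)

    def leave_group(self, ifname):
        self.memberships.discard(ifname)


def reconnects_until_failure(leave_on_remove, max_reconnects=10000):
    """Count tunnel reconnects until a join fails; None if it never fails."""
    sock = ToySocket()
    for n in range(max_reconnects):
        ifname = f"l2tp{n}"
        try:
            sock.join_group(ifname)
        except OSError:
            return n
        if leave_on_remove:
            sock.leave_group(ifname)
    return None
```

With `leave_on_remove=False` the toy socket eventually refuses new joins; with `True` it never does. The real numbers would depend on the kernel and on what babeld actually holds onto.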

Todo

Look into socket option memory allocation? Halp!

Contributor

@jhpoelen jhpoelen commented Mar 27, 2018

@bennlich awesome! Perhaps we can hack on reproducing this in an isolated babeld stress test so we can easily know when future fixes are resolving the issue. Happy to help with this, although @Juul and others might have more experience with this.

jhpoelen added a commit to sudomesh/exitnode that referenced this issue Apr 1, 2018

Contributor

@jhpoelen jhpoelen commented Apr 2, 2018

I have installed a babeld-monitor on both the HE and Psychz exit nodes to detect the issue reported in #24 and apply a workaround. Using a systemd timer, babeld-monitor.timer, the babeld log is scanned for the specific memory error every 10 minutes. If it is detected, babeld is restarted and all active tunnel interfaces are re-added to babeld. All of this can be observed in the systemd logs, and it is now also set up when using create_exitnode via the sudomesh/exitnode repo. Please see https://github.com/sudomesh/exitnode/tree/master/src/opt/babeld-monitor and https://github.com/sudomesh/exitnode/tree/master/src/etc/systemd/system if you'd like to learn more about this.
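The decision logic of such a monitor can be sketched roughly as below. This is not the actual babeld-monitor script from the exitnode repo; `plan_workaround` is a hypothetical helper, and the `systemctl restart babeld` command assumes babeld runs as a systemd unit with that name. The `babeld -a` form is the one quoted earlier in this thread.

```python
# Error text as it appears in /var/log/babeld.log in this issue.
ERROR_TEXT = "setsockopt(IPV6_JOIN_GROUP): Cannot allocate memory"


def plan_workaround(log_text, active_interfaces):
    """Return the commands a monitor would run, or [] when the log looks healthy.

    Commands are returned as argument lists and are not executed here.
    """
    if ERROR_TEXT not in log_text:
        return []
    # Assumption: babeld is managed by a systemd unit called "babeld".
    commands = [["systemctl", "restart", "babeld"]]
    # Re-add every tunnel interface that is still up, as described above.
    commands += [["babeld", "-a", ifname] for ifname in active_interfaces]
    return commands
```

A real monitor would also need to remember which log lines it has already acted on; otherwise one old error would trigger a restart on every timer run.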

Contributor

@jhpoelen jhpoelen commented Apr 2, 2018

I hope we can remove this hack once the root cause of the babeld error can be found and fixed.

Collaborator Author

@bennlich bennlich commented Apr 5, 2018

@jhpoelen nice haxxx! I'm reading up on systemd now... Would love to figure out the root cause too. Raw socket land seems like a daunting land tho. Maybe need to use a phone-a-friend.

@bennlich bennlich changed the title from "Over time, routes disappear from the exit node routing table" to "setsockopt out of memory causes babeld failure" on Sep 27, 2018

Collaborator Author

@bennlich bennlich commented Sep 28, 2018

Tried to write a dead-simple stress test today at the software working group with @eenblam and @squeeesh, but we were unable to reproduce the bug. I think our test did not go quite deep enough--an strace of babeld showed that babeld was rarely calling setsockopt(IPV6_JOIN_GROUP).

Probably a better test would involve creating fresh network interfaces and adding to babeld instead of adding/removing my computer's default interface over and over again :-P I'm not sure what's a good way to create a bunch of functional network interfaces...
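One untested idea: Linux dummy interfaces (`ip link add ... type dummy`) are cheap to create and destroy. The sketch below only builds the command list for one add/remove cycle and does not run anything. Whether babeld calls `setsockopt(IPV6_JOIN_GROUP)` on a dummy interface the way it does on an l2tp tunnel is an assumption that would need checking with strace. `stress_cycle_commands` and the `stress` prefix are made-up names; `babeld -a` / `babeld -x` are the forms quoted earlier in this issue.

```python
def stress_cycle_commands(index, delete_first=False, prefix="stress"):
    """Build the commands for one create/add/remove/destroy cycle.

    delete_first=True destroys the interface before telling babeld to drop it,
    which mimics the ordering suspected of causing the leak.
    """
    ifname = f"{prefix}{index}"
    create = [
        ["ip", "link", "add", ifname, "type", "dummy"],
        ["ip", "link", "set", ifname, "up"],
        ["babeld", "-a", ifname],
    ]
    remove_from_babeld = ["babeld", "-x", ifname]
    destroy = ["ip", "link", "delete", ifname]
    if delete_first:
        return create + [destroy, remove_from_babeld]
    return create + [remove_from_babeld, destroy]
```

Running the resulting commands needs root, e.g. via `subprocess.run(cmd, check=False)` in a loop over a few hundred indices while watching /var/log/babeld.log.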

Also, @eenblam noticed that in the re6stnet commit, they seem to suggest that their fix was to clean up their tunnels less aggressively. So: maybe babeld needs to setsockopt(IPV6_LEAVE_GROUP) /before/ tunneldigger obliterates the network interface. This would make some sense, as setsockopt(IPV6_LEAVE_GROUP) does expect to be passed an interface index (see https://tools.ietf.org/html/rfc3493#section-5.2).
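To illustrate why the interface index matters, here is a small Python sketch of the `ipv6_mreq` argument from RFC 3493 section 5.2 (a 16-byte group address followed by an unsigned interface index). This is not babeld's code; `pack_ipv6_mreq` and `leave_group` are hypothetical helpers, and `ff02::1:6` is the Babel multicast group from the Babel RFCs.

```python
import socket
import struct

BABEL_GROUP = "ff02::1:6"  # Babel's IPv6 multicast group per the Babel RFCs


def pack_ipv6_mreq(group, ifindex):
    """Build a struct ipv6_mreq payload: 16-byte group address + unsigned int ifindex."""
    return socket.inet_pton(socket.AF_INET6, group) + struct.pack("@I", ifindex)


def leave_group(sock, ifname):
    """Leave the Babel group on ifname; needs the interface to still exist."""
    # if_nametoindex raises OSError once the interface has been destroyed,
    # so in this sketch the lookup has to happen before the tunnel goes away.
    ifindex = socket.if_nametoindex(ifname)
    sock.setsockopt(
        socket.IPPROTO_IPV6,
        socket.IPV6_LEAVE_GROUP,
        pack_ipv6_mreq(BABEL_GROUP, ifindex),
    )
```

If the index is looked up (or used) after tunneldigger has already removed the interface, the leave cannot name the right interface, which would fit the "Cannot assign requested address" line in the log.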

And @squeeesh found this cool and terrifying network stress test lib https://github.com/dtaht/rtod.

Collaborator Author

@bennlich bennlich commented Sep 28, 2018

Oh, and if it /is/ a matter of giving babeld a chance to LEAVE_GROUP before tunneldigger destroys the interface, tunneldigger's pre-down hook seems promising, except for the fact that:

the pre-down hook is not guaranteed to complete before the tunnel is shut down.

from https://github.com/wlanslovenija/tunneldigger/blob/master/HISTORY.rst.

(Hook scripts are executed in their own processes.)

jkilpatr added a commit to jkilpatr/althea_rs that referenced this issue Oct 16, 2019
This resolves the issue described here sudomesh/bugs#24
Where babel will be unable to free its resources for the interface and run out of
memory
jkilpatr added a commit to althea-net/althea_rs that referenced this issue Feb 7, 2020
This resolves the issue described here sudomesh/bugs#24
Where babel will be unable to free its resources for the interface and run out of
memory