Segfaults #174
Just driving by with some suggestions; a good start would probably be:
@linuskendall do you need any help getting a debug build?
Thanks so much for posting the issue! I've never experienced this myself. A debug build will definitely be a good start:
git clone https://github.com/tonarino/innernet
cargo build --bin innernet
sudo target/debug/innernet fetch [network]
And almost certainly, since it's a segfault, you'll need to combine that with a memory fault detector like @hansihe's ASAN suggestion or valgrind to help pinpoint the issue. I'm going to guess you're having this issue on Linux, because there's a lot more FFI there than in the macOS version (suspects are the wireguard-control-sys FFI and the netlink code). When you post an update, could you also include your OS details (Linux or macOS, kernel version, distro, etc.)? Also, if you don't have the time to do this yourself, post enough details that it can be reproduced on a VM and I'm happy to dig in myself.
Hi @mcginty, I made a debug build yesterday and ran it. When run through gdb I got this (looks like your assumption is correct):
A node that's partially added without a public key? I tried building with the ASAN suggestion above, but it seems like it's not working due to an undefined symbol. A bit of info about the system: Ubuntu 20.04.3 LTS
Valgrind stack trace:
A bit more data from the logs (not sure if helpful). Yesterday I did a reboot, and after the reboot innernet started up:
After this I get a number of restart attempts, each ending with SEGV/core-dump. If I run
Thank you so much @linuskendall! This is great debug info. Looks like an issue with the wireguard control FFI. Relevant bit from the valgrind log (the other statx issues seem to be rust-lang/rust#68979):
It looks like the problem is not a missing public key, but something else in how the data is read over the FFI. Not sure if this is a bug in the C wireguard helper library or in our assumptions about it.
It's interesting that the first run of innernet seems to at least bring the interface up. A couple of questions:
Sorry, this is an annoying one! I appreciate your help getting to the bottom of it.
Indeed,
No problem, looking forward to getting to the bottom of this :)
Does (disabling the systemd service if it's enabled to avoid that messing with the interface)
Also, I created a debug branch. While I doubt it's related, do you use IPv6 or IPv4 for your internal/external endpoints, and do you use any domain names as endpoints (requiring DNS resolution)?
running:
@Lusitaniae was this run also with the code in the debug branch?
yessir
@Lusitaniae If you have time, could you also try this sequence of commands and let me know if you still have a segfault? I just pushed a bit more diagnostic output to the debug branch.
Then give it another try. Greatly appreciate your help :). Annoying that I can't reproduce the issue myself!
Much appreciated for looking into this! There seems to be no segfault now with NAT disabled.
Removing the NAT flags
@Lusitaniae huh, and so you can't reproduce the segfault now even without the flags? This is an interesting one. So far, it seems like something about the NAT traversal code (updating endpoints?) puts the interface in a state that causes fetching information about the WireGuard interface to fail...
@mcginty That's interesting. That series of commands seems to have fixed/cleaned up the segfault. I ran the three commands (ip link del / inn --no... up / inn --no .. fetch) and got no segfaults. I then deleted the interface again and re-ran inn up / inn fetch with the straight innernet 1.5.1, and now I don't see any errors or segfaults (?).
Btw. @Lusitaniae and I are on the same team so we were seeing this on the same config/setup. |
Just tried this on another host, got the same output as @Lusitaniae:
Yeah, this is an interesting one, thanks so much for helping. It's going to be hard to track down further without a reproducible setup to test fixes on, but in the meantime I'll prioritize dropping this FFI altogether and putting effort into improving the WireGuard netlink Rust support here: little-dude/netlink#191
Looks like it's re-emerging: we just had a client where the problem came back. The three steps worked, but I guess it'll get back into that state soon enough. We've also got the innernet-server throwing segfaults right now.
@linuskendall and when running that debug branch now, when a segfault happens, are there no lines that start with the new debug output? FYI, I've started migrating off the FFI crate in a WIP branch: #177
same as above @mcginty
@linuskendall @Lusitaniae I just merged the netlink rewrite into the main branch, so the code that was causing the segfault no longer exists :). Please feel free to give the latest main a try and see how it goes.
Great news! Much appreciated for the quick turnaround, @mcginty. We'll start rolling out the new release and keep you posted.
Hi @mcginty. I work in the same team as @linuskendall and @Lusitaniae. I've compiled the innernet client and server from main successfully, but we're getting this error when attempting to run any command, e.g.:
Compiled with:
Ha! I love how big your network is, catching these edge cases. Looks like the number of peers in the network generates a netlink packet that exceeds the max buffer size, and I didn't write logic to break it into multiple parts in that case. Will write that (and tests) and let you know. Thanks so much for testing!
@dancamarg0 I've added code to break up large wireguard updates into multiple netlink requests and confirmed on my machine that it works with large peer updates. Could you give it another try? |
@mcginty I'm currently getting the following error on both
Looks like I'm hitting this commit: 4784a69. I wonder if you've committed all your changes to main already? This one seems to be the latest commit related to netlink (2 days ago). Let me know if we can help with anything.
@dancamarg0 a95fa1b should be the latest commit (the meat of the fix was added in 92b60f5). |
Cool, but yes, I have the latest code then. FYI, if I perform a simple command like
@dancamarg0 thanks for clarifying. I'll double-check my math and get back to you. Sorry for the number of iterations this is taking; it's tough when I can't reproduce this stuff locally!
@dancamarg0 I think I might have caught it and fixed it in e04bd26. Let me know if you get a chance to try it out. |
Appears to be working now, @mcginty, thanks so much!! Both innernet and innernet-server are running without errors or segfaulting. We'll be rolling this new version out slowly to our cluster over this week and will let you know if we find anything. Just a quick note about the innernet client: it seems this latest version is re-fetching all peers every time I run a fetch. This is the expected behavior (old binary):
Rather, it's generating a big output with all of our nodes, of the sort:
I don't think this is critical, as it generates very little overhead, but it would be cool if we can sort this out too.
YESSSSSSSSSSSSSSSSS FINALLY!! Thanks for sticking through that with me. I'll look into the large updates after every fetch: it's possible that the same fix applies in the inverse direction. WireGuard's kernel module is possibly sending multi-part netlink messages to report the state of the interface, and we need logic to join them back together. Like you said, it shouldn't be a blocker for you right now though, just not as efficient as it could be :). Going to close this as fixed now, and then make a new issue for checking the netlink code that retrieves interface state.
It looks like since a recent inn fetch, we've started getting segmentation faults on the innernet clients as well as the innernet server.
Tried version 1.5.1 as well as a downgraded version 1.4.0.
Not quite sure where to start reviewing this. My suspicion would be a corrupted database, but since it's happening on several hosts, it seems like something that propagates through inn fetch? It isn't consistent and doesn't happen on all hosts. I guess a debug build would be helpful to get some details on the segfault?