Segfaults #174

Closed
linuskendall opened this issue Nov 23, 2021 · 35 comments

@linuskendall
Contributor

linuskendall commented Nov 23, 2021

It looks like since a recent inn fetch, we've started getting segmentation faults on the innernet clients as well as the innernet server.

Tried version 1.5.1 as well as a downgraded version 1.4.0.

Not quite sure where to start reviewing this. My suspicion would be a corrupted database, but since it's happening on several hosts it seems like something that propagates through inn fetch? It isn't consistent and doesn't happen on all hosts. I guess a debug build would be helpful to get some details on the segfault?

@hansihe

hansihe commented Nov 23, 2021

Just driving by with some suggestions, a good start would probably be:

  • Debug build if feasible performance-wise.
  • Try running a build instrumented with ASAN (AddressSanitizer).

@bschwind
Member

Debug build if feasible performance-wise.

innernet doesn't have any particularly sensitive performance requirements, so this is probably the best way to go.

@linuskendall do you need any help getting a debug build?

@mcginty
Collaborator

mcginty commented Nov 23, 2021

Thanks so much for posting the issue! I've never experienced this myself.

Debug build will definitely be a good start:

git clone https://github.com/tonarino/innernet
cd innernet
cargo build --bin innernet
sudo target/debug/innernet fetch [network]

And almost certainly since it's a segfault you'll need to combine that with a memory fault detector like @hansihe's ASAN suggestion or valgrind to help pinpoint the issue. I'm going to guess you're having this issue on Linux because there's a lot more FFI there than on the macOS version (suspects are the wireguard-control-sys FFI and the netlink code).

When you post an update, could you also include your OS details (Linux or macOS, kernel version, distro, etc)?

Also, if you don't have the time to do this yourself, if you post enough details that it can be reproduced on a VM I'm happy to dig in myself.

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

Hi @mcginty ,

I made a debug build yesterday and ran it. When run through gdb I got this (looks like your assumption is correct):

Program received signal SIGSEGV, Segmentation fault.
wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_peer> for wireguard_control::device::PeerInfo>::from (raw=0x0) at wireguard-control/src/backends/kernel.rs:37
37      wireguard-control/src/backends/kernel.rs: No such file or directory.

A node that's partially added without a public key?

I tried building with the ASAN suggestion above, but it seems like it's not working due to an undefined symbol __asan_option_detect_stack_use_after_return in proc-macro-error-1.0.4/src/lib.rs. Checking what I can find as a solution for that.

A bit of info about the system:

Ubuntu 20.04.3 LTS
Linux mon-vmagent1 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@linuskendall
Contributor Author

Valgrind stack trace:

valgrind.txt

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

A bit more data from the logs (not sure if helpful). Yesterday I did a reboot and after reboot innernet started up:

-- Reboot --
Nov 23 05:13:54 mon-vmagent1 systemd[1]: Started innernet client daemon for ....
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] bringing up the interface.
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] fetching state from server.
... peer xxx .. was added
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] updating /etc/hosts with the latest peers.
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] updated interface ...
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] reporting 3 interface addresses as NAT traversal candidates...
Nov 23 05:13:55 mon-vmagent1 systemd[1]: innernet@....service: Main process exited, code=dumped, status=11/SEGV
Nov 23 05:13:55 mon-vmagent1 systemd[1]: innernet@....service: Failed with result 'core-dump'.
...

After this I get a number of restart attempts each ending with SEGV/core-dump.

If I run inn fetch <interface> I get:

[*] fetching state from server...
Segmentation fault (core dumped)

@mcginty
Collaborator

mcginty commented Nov 24, 2021

Thank you so much @linuskendall! This is great debug info. Looks like an issue with the wireguard control FFI. Relevant bit from the valgrind log (the other statx issues seem to be rust-lang/rust#68979):

==8133==  Access not within mapped region at address 0x4
==8133==    at 0x6CB85E: wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_peer> for wireguard_control::device::PeerInfo>::from (kernel.rs:37)
==8133==    by 0x6BBAFA: wireguard_control::backends::kernel::parse_peers (kernel.rs:110)
==8133==    by 0x6CBE92: wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_device> for wireguard_control::device::Device>::from (kernel.rs:90)
==8133==    by 0x6BD7CA: wireguard_control::backends::kernel::get_by_name (kernel.rs:397)
==8133==    by 0x6CC9FC: wireguard_control::device::Device::get (device.rs:250)
==8133==    by 0x265674: innernet::fetch (main.rs:541)
==8133==    by 0x276304: innernet::run (main.rs:1153)
==8133==    by 0x2747A7: innernet::main (main.rs:1117)

It looks like the problem is not a missing public key, but that the raw pointer is still pointing to 0x0 (what it was initialized to) even though the ret code is 0.

Not sure if this is a bug in the C wireguard helper library or in our assumption that we can use the device pointer whenever ret == 0. Checking it out now. It also still doesn't answer what causes such a state to exist in the first place...
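
For illustration, here is a minimal self-contained sketch of the kind of guard being discussed (not the actual wireguard-control code; the types are stand-ins): check the raw pointer for null even when the FFI return code reports success.

// Minimal sketch only -- RawPeer stands in for wireguard_control_sys::wg_peer.
#[repr(C)]
#[allow(dead_code)]
struct RawPeer {
    next_peer: *mut RawPeer,
}

#[derive(Debug)]
struct PeerInfo;

/// Convert a possibly-null raw peer pointer instead of assuming it is valid.
fn peer_from_raw(raw: *const RawPeer) -> Option<PeerInfo> {
    if raw.is_null() {
        // The ret code said "success", but the pointer was never filled in;
        // bail out instead of segfaulting on a null dereference.
        return None;
    }
    // SAFETY: checked for null above; anything beyond that still relies on
    // the C library having initialized the struct.
    let _raw = unsafe { &*raw };
    Some(PeerInfo)
}

fn main() {
    // Simulates the reported state: the FFI "succeeded" but the pointer is 0x0.
    let null_peer: *const RawPeer = std::ptr::null();
    assert!(peer_from_raw(null_peer).is_none());
    println!("null peer handled without a segfault");
}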

@mcginty
Collaborator

mcginty commented Nov 24, 2021

It's interesting that the first run of innernet seems to at least bring the interface up. A couple of questions:

  1. When innernet is in this segfault state, can you still run sudo wg and see the interface that innernet brought up normally?
  2. Given the crash in your systemd logs, it looks like it might crash only after reporting NAT information. Could you try:
    innernet down <interface>
    # or `sudo ip link del dev <interface>`
    
    innernet -vv up <interface> --no-nat-traversal --no-nat-candidates
    
    and see if a segfault or other crash still occurs there?

Sorry, this is an annoying one! I appreciate your help getting to the bottom of it.

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

Indeed,

  1. Yes, sudo wg shows the hosts configured as far as I can tell (there are 200+ hosts so I didn't quite check them all)
  2. innernet down <if> gave a segfault, but after ip link del dev <if> I can get innernet to come back up with the --no-nat-traversal and --no-nat-candidates options. It looks like it came up without any crashes and I can see the hosts in wg. inn show -t <..> and inn fetch still segfault though.

No problem, looking forward to getting to the bottom of this :)

@mcginty
Collaborator

mcginty commented Nov 25, 2021

Does inn fetch still segfault if also run with the same --no-nat-traversal --no-nat-candidates flags? Just to check whether the NAT code taints this somehow, does the following sequence cause a segfault?

(disable the systemd service first if it's enabled, to avoid it messing with the interface)

ip link del dev <if>
inn -vv up <if> --no-nat-traversal --no-nat-candidates
inn -vv fetch <if> --no-nat-traversal --no-nat-candidates

Also, I created a debug branch at ffi-segfault-fix (diff). Could you try reproducing the segfault with that and see if anything interesting is printed out?

While I doubt it's related, do you use IPv6 or IPv4 for your internal/external endpoints, and do you use any domain names as endpoints (requiring DNS resolution)?

@Lusitaniae

Lusitaniae commented Nov 26, 2021

running: sudo target/debug/innernet -vv up <if>

[*] reporting 1 interface address as NAT traversal candidates...
[D] [ureq::pool] pulling stream from pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54230, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request (reused connection) PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::unit] writing prelude: PUT /v1/user/candidates HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted]
Content-Type: application/json
Content-Length: 23
[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54230, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: [Endpoint { host: Ipv4(192.168.1.101), port: 34275 }]
Segmentation fault

@mcginty
Collaborator

mcginty commented Nov 28, 2021

@Lusitaniae was this run also with the code in the ffi-segfault-fix branch perchance?

@Lusitaniae

Lusitaniae commented Nov 28, 2021

yessir

git status
On branch ffi-segfault-fix
Your branch is up to date with 'origin/ffi-segfault-fix'.

nothing to commit, working tree clean

cargo build --bin innernet
    Finished dev [unoptimized + debuginfo] target(s) in 0.33s
sudo target/debug/innernet -vv up <if>  --no-nat-traversal --no-nat-candidates
[*] fetching state for <if> from server...
[D] [ureq::stream] connecting to 10.42.0.1:51820 at 10.42.0.1:51820
[D] [ureq::stream] created stream: TcpStream { addr: 10.42.64.9:54364, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request GET http://10.42.0.1:51820/v1/user/state
[D] [ureq::unit] writing prelude: GET /v1/user/state HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted][D] [ureq::unit] response 200 to GET http://10.42.0.1:51820/v1/user/state
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54364, peer: 10.42.0.1:51820, fd: 4 }
Segmentation fault

@mcginty
Collaborator

mcginty commented Nov 28, 2021

@Lusitaniae If you have time, could you also try this sequence of commands and let me know if you still get a segfault? I just pushed a bit more diagnostic output to the ffi-segfault-fix branch, so if you wouldn't mind fetching that as well.

ip link del dev <if>
inn -vv up <if> --no-nat-traversal --no-nat-candidates
inn -vv fetch <if> --no-nat-traversal --no-nat-candidates

And then if you could try up again without the NAT traversal flags and see whether the extra diagnostic information shows up alongside those segfaults, that would be very helpful.

Greatly appreciate your help :). Annoying that I can't reproduce the issue myself!

@Lusitaniae

Lusitaniae commented Nov 29, 2021

Much appreciated for looking into this!

Seems there's no segfault now with NAT disabled

[*] reporting 0 interface addresses as NAT traversal candidates...
[D] [ureq::pool] pulling stream from pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request (reused connection) PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::unit] writing prelude: PUT /v1/user/candidates HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted]
Content-Type: application/json
Content-Length: 2
[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: []
[D] NAT traversal explicitly disabled, not attempting.
[D] [ureq::stream] dropping stream: TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }

Removing the NAT flags, i.e. sudo target/debug/innernet -vv fetch <if> or sudo target/debug/innernet -vv up <if>:

[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54396, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: [Endpoint { host: Ipv4(192.168.1.101), port: 52231 }]
[T] NatTraverse::new()
[T] NatTraverse::refresh_remaining()
kernel get_by_name: FFI ret code was 0, &device is 0x563701677050, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
[T] NatTraverse: retrieved device info.
[T] new remaining: []
[D] [ureq::stream] dropping stream: TcpStream { addr: 10.42.64.9:54396, peer: 10.42.0.1:51820, fd: 4 }

@mcginty
Collaborator

mcginty commented Nov 29, 2021

@Lusitaniae huh, and so you can't reproduce the segfault now even without the flags?

This is an interesting one. So far, it seems like something about the NAT traversal code (updating endpoints?) puts the interface in a state that causes fetching information about the WireGuard interface to fail...

@linuskendall
Contributor Author

@mcginty That's interesting. That series of commands seems to have fixed/cleaned up the segfault. I ran the three commands (ip link del / inn --no... up / inn --no .. fetch) and got no segfaults.

I then deleted the interface again and re-ran inn up / inn fetch with the plain innernet 1.5.1, and now I don't see any errors or segfaults (?).

@linuskendall
Contributor Author

Btw, @Lusitaniae and I are on the same team, so we were seeing this on the same config/setup.

@linuskendall
Contributor Author

linuskendall commented Dec 1, 2021

Just tried this on another host, got the same output as @Lusitaniae:

(ansible2.9) linuskendall@workstation:~/work/innernet$ sudo  ./target/debug/innernet fetch rpcpool   --no-nat-traversal --no-nat-candidates
[*] fetching state for rpcpool from server...
kernel get_by_name: FFI ret code was 0, &device is 0x56278f214d90, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
[*] peers are already up to date
[*] reporting 0 interface addresses as NAT traversal candidates...

@mcginty
Collaborator

mcginty commented Dec 1, 2021

Yeah, this is an interesting one, thanks so much for helping. It's going to be hard to track down further without a reproducible setup to test fixes on, but in the meantime I'll prioritize dropping this FFI altogether and putting effort into improving the WireGuard netlink Rust support here: little-dude/netlink#191

@linuskendall
Contributor Author

Looks like it's re-emerging; I just had a client where the problem came back. The three steps worked, but I guess it'll get back into that state soon enough. We've got the innernet-server throwing segfaults right now.

@mcginty
Collaborator

mcginty commented Dec 3, 2021

@linuskendall and when running that debug branch now, when a segfault happens are there no lines that start with kernel get_by_name:? Curious if we can suss out any error codes that can help.

FYI, I've started migrating off the FFI crate in a WIP branch: #177

@Lusitaniae

same as above @mcginty

[*] reporting 1 interface address as NAT traversal candidates...
kernel get_by_name: FFI ret code was 0, &device is 0x558c820cd0a0, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
Segmentation fault

@mcginty
Collaborator

mcginty commented Jan 7, 2022

@linuskendall @Lusitaniae I just merged the netlink rewrite into the main branch, so the code that was causing the segfault no longer exists :). Please feel free to give the latest main a try and see how it goes.

@Lusitaniae

great news! much appreciated for the quick turn around @mcginty

we'll start rolling out the new release and keep you posted

@dancamarg0

dancamarg0 commented Jan 7, 2022

Hi @mcginty .

I work in the same team as @linuskendall and @Lusitaniae. I've compiled the innernet client and server from main successfully, but we're getting this error when attempting to run any command, e.g. innernet fetch <interface>:

thread 'main' panicked at 'range end index 43052 out of range for slice of length 4096', /home/daniel/.cargo/registry/src/github.com-1ecc6299db9ec823/netlink-packet-core-0.4.1/src/message.rs:149:27

Compiled with:

rustc 1.57.0 (f1edd0429 2021-11-29)
libclang-6.0-dev
libclang-common-6.0-dev

@mcginty
Collaborator

mcginty commented Jan 8, 2022

Ha! I love how your big network is catching these edge cases. Looks like the number of peers in the network generates a netlink packet that exceeds the max buffer size, and I didn't write logic to break it into multiple parts in that case.

Will write that (and tests) and let you know. Thanks so much for testing!
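
For reference, a rough illustrative sketch of the chunking idea (not the actual innernet patch; the types and sizes here are made up): greedily pack peer updates into batches so that each serialized request stays under the netlink buffer limit.

// Rough sketch only -- illustrates splitting a large peer update across
// several netlink requests so no single message exceeds the ~4096-byte buffer.
const MAX_NETLINK_PAYLOAD: usize = 4096;

#[derive(Debug, Clone)]
struct PeerUpdate {
    serialized_len: usize, // illustrative: bytes this peer adds to the message
}

/// Greedily pack peers into batches so that no batch exceeds the payload limit.
fn split_into_requests(peers: &[PeerUpdate], header_len: usize) -> Vec<Vec<PeerUpdate>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut current_len = header_len;

    for peer in peers {
        if current_len + peer.serialized_len > MAX_NETLINK_PAYLOAD && !current.is_empty() {
            // This peer would overflow the current message, so start a new one.
            batches.push(std::mem::take(&mut current));
            current_len = header_len;
        }
        current_len += peer.serialized_len;
        current.push(peer.clone());
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // ~200 peers at ~150 bytes each would blow past a single 4096-byte message.
    let peers = vec![PeerUpdate { serialized_len: 150 }; 200];
    let batches = split_into_requests(&peers, 64);
    println!("{} peers split into {} netlink requests", peers.len(), batches.len());
    assert!(batches
        .iter()
        .all(|b| 64 + b.iter().map(|p| p.serialized_len).sum::<usize>() <= MAX_NETLINK_PAYLOAD));
}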

@mcginty
Collaborator

mcginty commented Jan 10, 2022

@dancamarg0 I've added code to break up large wireguard updates into multiple netlink requests and confirmed on my machine that it works with large peer updates. Could you give it another try?

@dancamarg0

@mcginty I'm currently getting the following error on both innernet and innernet-server binaries:

Error: Serialized netlink packet larger than maximum size 4096

Looks like I'm hitting this commit: 4784a69

I wonder if you've committed all your changes to main already? This one seems to be the latest commit related to netlink (2 days ago).

Let me know if we can help with anything.

@mcginty
Collaborator

mcginty commented Jan 10, 2022

@dancamarg0 a95fa1b should be the latest commit (the meat of the fix was added in 92b60f5).

@dancamarg0

^ Cool

But yes, I have the latest code then. My libclang-dev lib and Rust are still the same as reported above. I wonder if there is any other dependency I don't have installed that's breaking it?

FYI, if I run a simple command like innernet fetch <interface> it works, but spills the error just at the end.

[E] rpcpool - Serialized netlink packet larger than maximum size 4096

@mcginty
Collaborator

mcginty commented Jan 11, 2022

@dancamarg0 thanks for clarifying - I'll double-check my math and get back to you. Sorry for the number of iterations this is taking, it's tough when I can't reproduce this stuff locally!

@mcginty
Collaborator

mcginty commented Jan 11, 2022

@dancamarg0 I think I might have caught it and fixed it in e04bd26. Let me know if you get a chance to try it out.

@dancamarg0

Appears to be working now @mcginty thanks so much!!

Both innernet and innernet-server are running without errors or segfaulting. We'll be rolling this new version out slowly to our cluster over this week and will let you know if we find anything.

Just a quick note about the innernet client, it seems this latest version is re-fetching all peers every time I run ./innernet fetch <interface> instead of caching results and just appending new nodes.

So, this is the expected behavior (old binary)

sudo inn fetch <interface>
[*] fetching state from server.
[*] peers are already up to date.
[*] reporting 1 interface address as NAT traversal candidates...

Rather, it's generating a big output with all of our nodes, of the sort [*] peer <peer> (<pubkey>...) was added. and at the end:

[*] updating /etc/hosts with the latest peers.
[*] updated interface <rpcpool>
[*] reporting 1 interface address as NAT traversal candidates...

I don't think this is critical, as it generates very little overhead, but it would be cool if we could sort this out too.

@mcginty
Collaborator

mcginty commented Jan 11, 2022

YESSSSSSSSSSSSSSSSS FINALLY!! Thanks for sticking through that with me.

I'll look into the large updates after every fetch - it's possible that the same fix will apply in the inverse direction: WireGuard's kernel module is probably sending multi-part netlink messages to report the state of the interface, and we need logic to join them together. Like you said, it shouldn't be a blocker for you right now though, just not as efficient as it could be :).
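
For reference, a rough illustrative sketch of the joining idea (not the actual netlink crate API; the types are made up): keep reading parts of a multi-part reply and accumulate them until the terminating message arrives.

// Illustrative sketch only -- joins a multi-part reply instead of treating
// the first part as the whole interface state.
#[derive(Debug)]
enum NetlinkResponse {
    /// One part of a multi-part reply carrying some peers.
    Part(Vec<String>),
    /// Terminates a multi-part reply (NLMSG_DONE in real netlink).
    Done,
}

/// Drain responses from some source until Done, joining all parts together.
fn collect_peers<I>(responses: I) -> Vec<String>
where
    I: IntoIterator<Item = NetlinkResponse>,
{
    let mut peers = Vec::new();
    for response in responses {
        match response {
            NetlinkResponse::Part(mut chunk) => peers.append(&mut chunk),
            NetlinkResponse::Done => break,
        }
    }
    peers
}

fn main() {
    // A 200+ peer interface would come back as several parts, not one.
    let reply = vec![
        NetlinkResponse::Part(vec!["peer-a".into(), "peer-b".into()]),
        NetlinkResponse::Part(vec!["peer-c".into()]),
        NetlinkResponse::Done,
    ];
    let peers = collect_peers(reply);
    assert_eq!(peers.len(), 3);
    println!("joined {} peers from a multi-part reply", peers.len());
}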

Going to close this as fixed now, and then make a new issue for checking the netlink code for retrieving interface state.
