Segfaults #174

Closed
linuskendall opened this issue Nov 23, 2021 · 35 comments

@linuskendall
Contributor

linuskendall commented Nov 23, 2021

It looks like since a recent inn fetch, we've started getting segmentation faults on the innernet clients as well as the innernet server.

Tried version 1.5.1 as well as a downgraded version 1.4.0.

Not quite sure where to start reviewing this. My suspicion would be a corrupted database, but since it's happening on several hosts it seems like something that propagates through inn fetch? It isn't consistent and doesn't happen on all hosts. I guess a debug build would be helpful to get some details on the segfault?

@hansihe

hansihe commented Nov 23, 2021

Just driving by with some suggestions, a good start would probably be:

  • Debug build if feasible performance-wise.
  • Try running a build instrumented with ASAN (AddressSanitizer).

@bschwind
Member

Debug build if feasible performance-wise.

innernet doesn't have any particularly sensitive performance requirements, so this is probably the best way to go.

@linuskendall do you need any help getting a debug build?

@mcginty
Collaborator

mcginty commented Nov 23, 2021

Thanks so much for posting the issue! I've never experienced this myself.

Debug build will definitely be a good start:

git clone https://github.com/tonarino/innernet
cd innernet
cargo build --bin innernet
sudo target/debug/innernet fetch [network]

And almost certainly since it's a segfault you'll need to combine that with a memory fault detector like @hansihe's ASAN suggestion or valgrind to help pinpoint the issue. I'm going to guess you're having this issue on Linux because there's a lot more FFI there than on the macOS version (suspects are the wireguard-control-sys FFI and the netlink code).

When you post an update, could you also include your OS details (Linux or macOS, kernel version, distro, etc)?

Also, if you don't have the time to do this yourself, if you post enough details that it can be reproduced on a VM I'm happy to dig in myself.

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

Hi @mcginty ,

I made a debug build yesterday and ran it. When run through gdb I got this (looks like your assumption is correct):

Program received signal SIGSEGV, Segmentation fault.
wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_peer> for wireguard_control::device::PeerInfo>::from (raw=0x0) at wireguard-control/src/backends/kernel.rs:37
37      wireguard-control/src/backends/kernel.rs: No such file or directory.

A node that's partially added without a public key?

I tried building with the ASAN suggestion above, but it seems like it's not working due to an undefined symbol __asan_option_detect_stack_use_after_return in proc-macro-error-1.0.4/src/lib.rs. Checking what I can find as a solution for that.

A bit of info about the system:

Ubuntu 20.04.3 LTS
Linux mon-vmagent1 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@linuskendall
Contributor Author

Valgrind stack trace:

valgrind.txt

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

A bit more data from the logs (not sure if helpful). Yesterday I did a reboot and after reboot innernet started up:

-- Reboot --
Nov 23 05:13:54 mon-vmagent1 systemd[1]: Started innernet client daemon for ....
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] bringing up the interface.
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] fetching state from server.
... peer xxx .. was added
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] updating /etc/hosts with the latest peers.
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] updated interface ...
Nov 23 05:13:55 mon-vmagent1 innernet[843]: [*] reporting 3 interface addresses as NAT traversal candidates...
Nov 23 05:13:55 mon-vmagent1 systemd[1]: innernet@....service: Main process exited, code=dumped, status=11/SEGV
Nov 23 05:13:55 mon-vmagent1 systemd[1]: innernet@....service: Failed with result 'core-dump'.
...

After this I get a number of restart attempts each ending with SEGV/core-dump.

If I run inn fetch <interface> I get:

[*] fetching state from server...
Segmentation fault (core dumped)

@mcginty
Collaborator

mcginty commented Nov 24, 2021

Thank you so much @linuskendall! This is great debug info. Looks like an issue with the wireguard control FFI. Relevant bit from the valgrind log (the other statx issues seem to be rust-lang/rust#68979):

==8133==  Access not within mapped region at address 0x4
==8133==    at 0x6CB85E: wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_peer> for wireguard_control::device::PeerInfo>::from (kernel.rs:37)
==8133==    by 0x6BBAFA: wireguard_control::backends::kernel::parse_peers (kernel.rs:110)
==8133==    by 0x6CBE92: wireguard_control::backends::kernel::<impl core::convert::From<&wireguard_control_sys::wg_device> for wireguard_control::device::Device>::from (kernel.rs:90)
==8133==    by 0x6BD7CA: wireguard_control::backends::kernel::get_by_name (kernel.rs:397)
==8133==    by 0x6CC9FC: wireguard_control::device::Device::get (device.rs:250)
==8133==    by 0x265674: innernet::fetch (main.rs:541)
==8133==    by 0x276304: innernet::run (main.rs:1153)
==8133==    by 0x2747A7: innernet::main (main.rs:1117)

It looks like the problem is not a missing public key, but that the raw pointer is still pointing to 0x0 (what it was initialized to) even though the ret code is 0.

Not sure if this is a bug in the C wireguard helper library or in our assumption that we can use the device pointer whenever ret == 0. Checking it out now. It also still doesn't answer what causes such a state to exist in the first place...
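
For illustration, here is a minimal self-contained sketch of the kind of guard being discussed (not the actual wireguard-control code; the types are stand-ins): check the raw pointer for null even when the FFI return code reports success.

// Minimal sketch only -- RawPeer stands in for wireguard_control_sys::wg_peer.
#[repr(C)]
#[allow(dead_code)]
struct RawPeer {
    next_peer: *mut RawPeer,
}

#[derive(Debug)]
struct PeerInfo;

/// Convert a possibly-null raw peer pointer instead of assuming it is valid.
fn peer_from_raw(raw: *const RawPeer) -> Option<PeerInfo> {
    if raw.is_null() {
        // The ret code said "success", but the pointer was never filled in;
        // bail out instead of segfaulting on a null dereference.
        return None;
    }
    // SAFETY: checked for null above; anything beyond that still relies on
    // the C library having initialized the struct.
    let _raw = unsafe { &*raw };
    Some(PeerInfo)
}

fn main() {
    // Simulates the reported state: the FFI "succeeded" but the pointer is 0x0.
    let null_peer: *const RawPeer = std::ptr::null();
    assert!(peer_from_raw(null_peer).is_none());
    println!("null peer handled without a segfault");
}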

@mcginty
Collaborator

mcginty commented Nov 24, 2021

It's interesting that the first run of innernet seems to at least bring the interface up. A couple of questions:

  1. When innernet is in this segfault state, can you still run sudo wg and see the interface that innernet brought up normally?
  2. Given the crash in your systemd logs, it looks like it might crash only after reporting NAT information. Could you try:
    innernet down <interface>
    # or `sudo ip link del dev <interface>`
    
    innernet -vv up <interface> --no-nat-traversal --no-nat-candidates
    
    and see if a segfault or other crash still occurs there?

Sorry, this is an annoying one! I appreciate your help getting to the bottom of it.

@linuskendall
Contributor Author

linuskendall commented Nov 24, 2021

Indeed,

  1. Yes, sudo wg shows the hosts configured as far as I can tell (there are 200+ hosts so I didn't quite check them all)
  2. innernet down <if> gave a segfault, but after ip link del dev <if> I can get innernet to come back up with the --no-nat-traversal and --no-nat-candidates options. It looks like it came up without any crashes and I can see the hosts in wg. inn show -t <..> and inn fetch still segfault though.

No problem, looking forward to getting to the bottom of this :)

@mcginty
Collaborator

mcginty commented Nov 25, 2021

Does inn fetch still segfault if also run with the same --no-nat-traversal --no-nat-candidates flags? Just to check whether the NAT code taints this somehow, does the following sequence cause a segfault?

(disable the systemd service first if it's enabled, to avoid it messing with the interface)

ip link del dev <if>
inn -vv up <if> --no-nat-traversal --no-nat-candidates
inn -vv fetch <if> --no-nat-traversal --no-nat-candidates

Also, I created a debug branch at ffi-segfault-fix (diff). Could you try reproducing the segfault with that and see if anything interesting is printed out?

While I doubt it's related, do you use IPv6 or IPv4 for your internal/external endpoints, and do you use any domain names as endpoints (requiring DNS resolution)?

@Lusitaniae

Lusitaniae commented Nov 26, 2021

running: sudo target/debug/innernet -vv up <if>

[*] reporting 1 interface address as NAT traversal candidates...
[D] [ureq::pool] pulling stream from pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54230, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request (reused connection) PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::unit] writing prelude: PUT /v1/user/candidates HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted]
Content-Type: application/json
Content-Length: 23
[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54230, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: [Endpoint { host: Ipv4(192.168.1.101), port: 34275 }]
Segmentation fault

@mcginty
Collaborator

mcginty commented Nov 28, 2021

@Lusitaniae was this run also with the code in the ffi-segfault-fix branch perchance?

@Lusitaniae

Lusitaniae commented Nov 28, 2021

yessir

git status
On branch ffi-segfault-fix
Your branch is up to date with 'origin/ffi-segfault-fix'.

nothing to commit, working tree clean

cargo build --bin innernet
    Finished dev [unoptimized + debuginfo] target(s) in 0.33s
sudo target/debug/innernet -vv up <if>  --no-nat-traversal --no-nat-candidates
[*] fetching state for <if> from server...
[D] [ureq::stream] connecting to 10.42.0.1:51820 at 10.42.0.1:51820
[D] [ureq::stream] created stream: TcpStream { addr: 10.42.64.9:54364, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request GET http://10.42.0.1:51820/v1/user/state
[D] [ureq::unit] writing prelude: GET /v1/user/state HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted][D] [ureq::unit] response 200 to GET http://10.42.0.1:51820/v1/user/state
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54364, peer: 10.42.0.1:51820, fd: 4 }
Segmentation fault

@mcginty
Collaborator

mcginty commented Nov 28, 2021

@Lusitaniae If you have time, could you also try this sequence of commands and let me know if you still get a segfault? I just pushed a bit more diagnostic output to the ffi-segfault-fix branch, so if you wouldn't mind fetching that as well.

ip link del dev <if>
inn -vv up <if> --no-nat-traversal --no-nat-candidates
inn -vv fetch <if> --no-nat-traversal --no-nat-candidates

And then if you could try up again without the NAT traversal flags and see whether the extra diagnostic information shows up alongside those segfaults, that would be very helpful.

Greatly appreciate your help :). Annoying that I can't reproduce the issue myself!

@Lusitaniae

Lusitaniae commented Nov 29, 2021

Much appreciated for looking into this!

Seems there's no segfault now with NAT disabled

[*] reporting 0 interface addresses as NAT traversal candidates...
[D] [ureq::pool] pulling stream from pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }
[D] [ureq::unit] sending request (reused connection) PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::unit] writing prelude: PUT /v1/user/candidates HTTP/1.1
Host: 10.42.0.1:51820
User-Agent: ureq/2.3.1
Accept: */*
X-Innernet-Server-Key: [redacted]
Content-Type: application/json
Content-Length: 2
[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: []
[D] NAT traversal explicitly disabled, not attempting.
[D] [ureq::stream] dropping stream: TcpStream { addr: 10.42.64.9:54394, peer: 10.42.0.1:51820, fd: 4 }

Removing the NAT flags, i.e. sudo target/debug/innernet -vv fetch <if> or sudo target/debug/innernet -vv up <if>:

[D] [ureq::unit] response 204 to PUT http://10.42.0.1:51820/v1/user/candidates
[D] [ureq::pool] adding stream to pool: http|10.42.0.1|51820 -> TcpStream { addr: 10.42.64.9:54396, peer: 10.42.0.1:51820, fd: 4 }
[D] reported candidates: [Endpoint { host: Ipv4(192.168.1.101), port: 52231 }]
[T] NatTraverse::new()
[T] NatTraverse::refresh_remaining()
kernel get_by_name: FFI ret code was 0, &device is 0x563701677050, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
[T] NatTraverse: retrieved device info.
[T] new remaining: []
[D] [ureq::stream] dropping stream: TcpStream { addr: 10.42.64.9:54396, peer: 10.42.0.1:51820, fd: 4 }

@mcginty
Collaborator

mcginty commented Nov 29, 2021

@Lusitaniae huh, and so you can't reproduce the segfault now even without the flags?

This is an interesting one. So far, it seems like something about the NAT traversal code (updating endpoints?) puts the interface in a state that causes fetching information about the WireGuard interface to fail...

@linuskendall
Contributor Author

@mcginty That's interesting. That series of commands seems to have fixed/cleaned up the segfault. I ran the three commands (ip link del / inn --no... up / inn --no .. fetch) and got no segfaults.

I then deleted the interface again and re-ran inn up / inn fetch with the plain innernet 1.5.1, and now I don't see any errors or segfaults (?).

@linuskendall
Contributor Author

Btw, @Lusitaniae and I are on the same team, so we were seeing this on the same config/setup.

@linuskendall
Contributor Author

linuskendall commented Dec 1, 2021

Just tried this on another host, got the same output as @Lusitaniae:

(ansible2.9) linuskendall@workstation:~/work/innernet$ sudo  ./target/debug/innernet fetch rpcpool   --no-nat-traversal --no-nat-candidates
[*] fetching state for rpcpool from server...
kernel get_by_name: FFI ret code was 0, &device is 0x56278f214d90, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
[*] peers are already up to date
[*] reporting 0 interface addresses as NAT traversal candidates...

@mcginty
Collaborator

mcginty commented Dec 1, 2021

Yeah, this is an interesting one, thanks so much for helping. It's going to be hard to track down further without a reproducible setup to test fixes on, but in the meantime I'll prioritize dropping this FFI altogether and putting effort into improving the WireGuard netlink Rust support here: little-dude/netlink#191

@linuskendall
Contributor Author

Looks like it's re-emerging; I just had a client where the problem came back. The three steps worked, but I guess it'll get back into that state soon enough. We've got the innernet-server throwing segfaults right now.

@mcginty
Collaborator

mcginty commented Dec 3, 2021

@linuskendall and when running that debug branch now, when a segfault happens are there no lines that start with kernel get_by_name:? Curious if we can suss out any error codes that can help.

FYI, I've started migrating off the FFI crate in a WIP branch: #177

@Lusitaniae

same as above @mcginty

[*] reporting 1 interface address as NAT traversal candidates...
kernel get_by_name: FFI ret code was 0, &device is 0x558c820cd0a0, last OS error: Some(0)
kernel get_by_name: wg FFI ret code was normal.
Segmentation fault

@mcginty
Collaborator

mcginty commented Jan 7, 2022

@linuskendall @Lusitaniae I just merged the netlink rewrite into the main branch, so the code that was causing the segfault no longer exists :). Please feel free to give the latest main a try and see how it goes.

@Lusitaniae

great news! much appreciated for the quick turn around @mcginty

we'll start rolling out the new release and keep you posted

@dancamarg0

dancamarg0 commented Jan 7, 2022

Hi @mcginty .

I work in the same team as @linuskendall and @Lusitaniae. I've compiled the innernet client and server from main successfully, but we're getting this error when attempting to run any command, e.g. innernet fetch <interface>:

thread 'main' panicked at 'range end index 43052 out of range for slice of length 4096', /home/daniel/.cargo/registry/src/github.com-1ecc6299db9ec823/netlink-packet-core-0.4.1/src/message.rs:149:27

Compiled with:

rustc 1.57.0 (f1edd0429 2021-11-29)
libclang-6.0-dev
libclang-common-6.0-dev

@mcginty
Collaborator

mcginty commented Jan 8, 2022

Ha! I love how your big network is catching these edge cases. Looks like the number of peers in the network generates a netlink packet that exceeds the max buffer size, and I didn't write logic to break it into multiple parts in that case.

Will write that (and tests) and let you know. Thanks so much for testing!
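
For reference, a rough illustrative sketch of the chunking idea (not the actual innernet patch; the types and sizes here are made up): greedily pack peer updates into batches so that each serialized request stays under the netlink buffer limit.

// Rough sketch only -- illustrates splitting a large peer update across
// several netlink requests so no single message exceeds the ~4096-byte buffer.
const MAX_NETLINK_PAYLOAD: usize = 4096;

#[derive(Debug, Clone)]
struct PeerUpdate {
    serialized_len: usize, // illustrative: bytes this peer adds to the message
}

/// Greedily pack peers into batches so that no batch exceeds the payload limit.
fn split_into_requests(peers: &[PeerUpdate], header_len: usize) -> Vec<Vec<PeerUpdate>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut current_len = header_len;

    for peer in peers {
        if current_len + peer.serialized_len > MAX_NETLINK_PAYLOAD && !current.is_empty() {
            // This peer would overflow the current message, so start a new one.
            batches.push(std::mem::take(&mut current));
            current_len = header_len;
        }
        current_len += peer.serialized_len;
        current.push(peer.clone());
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // ~200 peers at ~150 bytes each would blow past a single 4096-byte message.
    let peers = vec![PeerUpdate { serialized_len: 150 }; 200];
    let batches = split_into_requests(&peers, 64);
    println!("{} peers split into {} netlink requests", peers.len(), batches.len());
    assert!(batches
        .iter()
        .all(|b| 64 + b.iter().map(|p| p.serialized_len).sum::<usize>() <= MAX_NETLINK_PAYLOAD));
}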

@mcginty
Collaborator

mcginty commented Jan 10, 2022

@dancamarg0 I've added code to break up large wireguard updates into multiple netlink requests and confirmed on my machine that it works with large peer updates. Could you give it another try?

@dancamarg0

@mcginty I'm currently getting the following error on both innernet and innernet-server binaries:

Error: Serialized netlink packet larger than maximum size 4096

Looks like I'm hitting this commit: 4784a69

I wonder if you've committed all your changes to main already? This one seems to be the latest commit related to netlink (2 days ago).

Let me know if we can help with anything.

@mcginty
Collaborator

mcginty commented Jan 10, 2022

@dancamarg0 a95fa1b should be the latest commit (the meat of the fix was added in 92b60f5).

@dancamarg0

^ Cool

But yes, I have the latest code then. My libclang-dev lib and Rust are still the same as reported above. I wonder if there is any other dependency I don't have installed that's breaking it?

FYI, if I run a simple command like innernet fetch <interface> it works, but spills the error just at the end.

[E] rpcpool - Serialized netlink packet larger than maximum size 4096

@mcginty
Collaborator

mcginty commented Jan 11, 2022

@dancamarg0 thanks for clarifying - I'll double-check my math and get back to you. Sorry for the number of iterations this is taking, it's tough when I can't reproduce this stuff locally!

@mcginty
Collaborator

mcginty commented Jan 11, 2022

@dancamarg0 I think I might have caught it and fixed it in e04bd26. Let me know if you get a chance to try it out.

@dancamarg0

Appears to be working now @mcginty thanks so much!!

Both innernet and innernet-server are running without errors or segfaulting. We'll be rolling this new version out slowly to our cluster over this week and will let you know if we find anything.

Just a quick note about the innernet client, it seems this latest version is re-fetching all peers every time I run ./innernet fetch <interface> instead of caching results and just appending new nodes.

So, this is the expected behavior (old binary)

sudo inn fetch <interface>
[*] fetching state from server.
[*] peers are already up to date.
[*] reporting 1 interface address as NAT traversal candidates...

Rather, it's generating a big output with all of our nodes, of the sort [*] peer <peer> (<pubkey>...) was added. and at the end:

[*] updating /etc/hosts with the latest peers.
[*] updated interface <rpcpool>
[*] reporting 1 interface address as NAT traversal candidates...

I don't think this is critical, as it generates very little overhead, but it would be cool if we could sort this out too.

@mcginty
Collaborator

mcginty commented Jan 11, 2022

YESSSSSSSSSSSSSSSSS FINALLY!! Thanks for sticking through that with me.

I'll look into the large updates after every fetch - it's possible that the same fix will apply in the inverse direction: WireGuard's kernel module is probably sending multi-part netlink messages to report the state of the interface, and we need logic to join them together. Like you said, it shouldn't be a blocker for you right now though, just not as efficient as it could be :).
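
For reference, a rough illustrative sketch of the joining idea (not the actual netlink crate API; the types are made up): keep reading parts of a multi-part reply and accumulate them until the terminating message arrives.

// Illustrative sketch only -- joins a multi-part reply instead of treating
// the first part as the whole interface state.
#[derive(Debug)]
enum NetlinkResponse {
    /// One part of a multi-part reply carrying some peers.
    Part(Vec<String>),
    /// Terminates a multi-part reply (NLMSG_DONE in real netlink).
    Done,
}

/// Drain responses from some source until Done, joining all parts together.
fn collect_peers<I>(responses: I) -> Vec<String>
where
    I: IntoIterator<Item = NetlinkResponse>,
{
    let mut peers = Vec::new();
    for response in responses {
        match response {
            NetlinkResponse::Part(mut chunk) => peers.append(&mut chunk),
            NetlinkResponse::Done => break,
        }
    }
    peers
}

fn main() {
    // A 200+ peer interface would come back as several parts, not one.
    let reply = vec![
        NetlinkResponse::Part(vec!["peer-a".into(), "peer-b".into()]),
        NetlinkResponse::Part(vec!["peer-c".into()]),
        NetlinkResponse::Done,
    ];
    let peers = collect_peers(reply);
    assert_eq!(peers.len(), 3);
    println!("joined {} peers from a multi-part reply", peers.len());
}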

Going to close this as fixed now, and then make a new issue for checking the netlink code for retrieving interface state.
