Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netlink sockets #3198

Merged
merged 29 commits into from
Feb 21, 2024
Merged

Netlink sockets #3198

merged 29 commits into from
Feb 21, 2024

Conversation

ppopth
Copy link
Contributor

@ppopth ppopth commented Oct 3, 2023

Following #2980

This PR implements Netlink sockets. Please look at netlink(7) and rtnetlink(7) before reviewing the PR so that you can understand it better.

I use the ip addr command as a baseline. This PR implements anything that is required by that command: capget, catpset, setsockopt, and getsockname.

How it works

When the user calls sendmsg we check only the metadata, not the content of the message, like the destination address, the supported sendmsg flags. If it passes all the checks, we simply put the message into the buffer.

When the user calls recvmsg we get the message from the buffer, process the message, and return the response. Currently we support only RTM_GETLINK and RTM_GETADDR. For each message, we return a bunch of RTM_NEWLINKs and RTM_NEWADDRs respectively.

Commit messages

Please also look at the commit messages. I think it explains things in detail.

commit 26927f6
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Wed Oct 4 21:22:45 2023 +0700

Install libcap and ip to the CI

This commit installs libcap and ip to the CI because capget,capset
require libcap and ip is used to test the Netlink sockets.

In Debian/Ubuntu, the corresponding packages are libcap-dev and
iproute2.

In Fedora, the corresponding packages are libcap-devel and iproute.

commit b79a26c
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Mon Oct 2 14:41:30 2023 +0700

Test Netlink send buffer limit

The user can set the send buffer limit by setting SO_SNDBUF using
setsockopt. We test this for Netlink sockets by setting this option and
send a lot of messages to the socket.

commit f4c33f7
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Sun Oct 1 23:11:19 2023 +0700

Test Netlink with `ip addr`

Since `ip addr` uses a netlink socket to get the information of the
interfaces and addresses, we use it to test our Netlink implementation.

We check that the output of the `ip addr` command matches the
ip-addr.stdout file.

We mark the test as an extra because it requirs the iproute2 package.

commit 9778bd3
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Fri Sep 22 13:23:04 2023 +0700

Implement Netlink setsockopt syscall

Currently we handle only (SOL_SOCKET, SO_SNDBUF) and (SOL_SOCKET,
SO_RCVBUF). In SO_SNDBUF, we set the send buffer limit, while,
in SO_RCVBUF, we do nothing.

commit 0cd5678
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Sat Sep 30 17:38:50 2023 +0700

Test capget,capset

This commits adds both the linux tests and the shadow tests for both
capget and capset.

We assign them as extra tests because they require the local machine to
install libcap-dev.

commit ec4d843
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Wed Sep 20 14:59:19 2023 +0700

Implement capget,capset

In capget, we will always return the empty capability bitfield because
we don't allow any capability in the plugin.

In capset, we will allow to set with the empty capability bitfield.

commit 25223ea
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Wed Sep 20 16:59:05 2023 +0700

Implement Netlink getsockname syscall

Previously we have only the is_bound field in InitialState which is used
to indicate whether the socket is already bound or not. Now we want to
implement getsockname which requires us to return the bound address, so
we need to change from the boolean is_bound to the concrete bound_addr
field.

commit 65ea91a
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Fri Sep 22 12:55:43 2023 +0700

Handle RTM_GETLINK

Previously we handle only RTM_GETADDR, this commit now enables us to
handle RTM_GETLINK.

This commit implements the RTM_GETLINK handling part in a method
called handle_ifinfomsg.

In handle_ifinfomsg, it works as follows:
- Check that ifi_family of the message is either AF_UNSPEC or AF_INET
  because we support only IPv4 addresses
- Check that the rest of the fields (except ifi_change) are zeroes
  because they are unsupported
- Send the message containing all the interfaces with the following
  attributes: IFLA_IFNAME, IFLA_TXQLEN, and IFLA_MTU

We can also send the attributes IFLA_ADDRESS and IFLA_BROADCAST for the
MAC address, but we mark it as a todo instead.

The ifi_flags of the loopback interface is IFF_UP, IFF_LOOPBACK, and
IFF_RUNNING.

The ifi_flags of the ethernet interface is IFF_UP, IFF_BROADCAST,
IFF_RUNNING, and IFF_MULTICAST.

All the flags listed above are from what I observe in the `ip addr`
command.

commit 15a2073
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Tue Sep 26 00:01:25 2023 +0700

Support MSG_PEEK in Netlink recvmsg

Since we implemented peek_packet in SharedBuf already, we can now
support MSG_PEEK in the recvmsg syscall for Netlink sockets.

commit 7c8b20a
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Mon Sep 25 22:23:09 2023 +0700

Peek the packets in SharedBuf

This commit allows us to peek the packets stored inside the buffer
without removing them from the buffer.

commit 9d84b69
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Fri Sep 22 20:03:38 2023 +0700

Support Netlink msg without dst addresss

Previously we check that the message sent in sendmsg has the destination
address, but we won't check it anymore because we will just assume that
it will be sent to the kernel.

commit 8db9d96
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Tue Sep 19 11:11:16 2023 +0700

Handle unsupported Netlink syscalls

We currently don't support some syscalls for Netlink sockets: listen,
connect, getpeername, getsockname, and accept. This commit implements
dummy handlers for such syscalls, i.e. it just logs the warning and
returns the error.

commit 01385f3
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Sat Sep 16 23:43:10 2023 +0700

Implement Netlink close syscall

This commit adds the logic that calls add_reader to the shared buffer
when the socket is created and calls remove_reader to the shared buffer
when the socket is closed.

commit 70068ae
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Thu Sep 14 14:11:31 2023 +0700

Implement Netlink recvmsg syscall

This commit implements the recvmsg syscall.

The syscall handlers works as follows.
- Retreive the received message from the buffer
- Check that the message type is RTM_GETADDR because the only currently
  supported one is RTM_GETADDR
- Check that ifa_family of the message is either AF_UNSPEC or AF_INET
  because we support only IPv4 addresses
- Check that the rest of the fields are zeroes because they are
  unsupported
- Send the message containing all the interfaces

Each node in Shadow has only two interfaces: lo and eth0. The lo one has
the prefix of 8 and the eth0 one has the prefix of 24.

The detail of the interfaces is similar to the one in the getifaddrs
syscall handler.

commit 5a40f1c
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Mon Aug 14 20:03:57 2023 +0700

Implement Netlink sendmsg syscall

This commit implements the senmsg syscall.

Only the following parameters are supported:
- The destination pid must be kernel (pid=0), since the only entity that
  will reply the message is the kernel.
- There are no destination groups. Since the only destination is the
  kernel, multicasting does not make sense.
- The only supported flags are MSG_DONTWAIT, MSG_NOSIGNAL, and
  MSG_TRUNC.

Rate limiting is done by limiting the number of in-flight bytes by
NETLINK_SOCKET_DEFAULT_BUFFER_SIZE.

When the syscall is succesfully, the socket pushes the message into the
buffer and emits an event by using the StateEventSource instance.

commit 7464ab5
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Sat Aug 12 04:22:35 2023 +0700

Implement Netlink bind syscall

According to netlink(7), there are two issues worth mentioning.

If the pid of the address is zero, the kernel takes care of assigning it, but we will
leave it untouched at the moment. We can implement the assignment later when we want
to support it.

If the groups of the address is non-zero, it means that the socket wants to listen to
some groups. Since we don't support broadcasting to groups yet, we will emit the error
intead.

commit 7de08d3
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Fri Aug 11 13:24:02 2023 +0700

Socket boilerplate for Netlink

This commit adds Netlink to SocketRef and SocketRefMut, adds all the
boilerplate methods for NetlinkSocket, and adds the protocol states to
NetlinkSocket.

There are two protocol states in NetlinkSocket: Initial and Closed,
which is similar to the stateless Unix domain socket because, as
specified in netlink(7), Netlink is a datagram-oriented service.

commit 4825e9a
Author: Suphanat Chunhapanya haxx.pop@gmail.com
Date: Thu Aug 10 14:37:51 2023 +0700

Implement Netlink socket syscall

Parse the socket type and the netlink family from the socket syscall
parameters. The only supported socket types are SOCK_RAW and SOCK_DGRAM
and the only supported netlink family is NETLINK_ROUTE.

@github-actions github-actions bot added Component: Testing Unit and integration tests and frameworks Component: Main Composing the core Shadow executable Component: Build Build/install tools and dependencies labels Oct 3, 2023
@robgjansen
Copy link
Member

Hi @ppopth! Wow, thanks for submitting the PR!

It might take us a bit to review this, but before we do I think we will want you to rebase against the main branch so that (1) you can take care of fixing the conflicts, and (2) then we can run the shadow CI tests against it to help us during review.

Once you rebase, fix the conflicts, and have the tests passing locally, go ahead and force-push the update onto the same branch which should auto-update this PR. Then we'll find some time to take a look more closely at it.

Thanks again!

@github-actions github-actions bot added the Component: Libraries Support functions like LD_PRELOAD and logging label Oct 4, 2023
@ppopth
Copy link
Contributor Author

ppopth commented Oct 4, 2023

I rebased it to the main branch already.

@ppopth
Copy link
Contributor Author

ppopth commented Oct 5, 2023

After seeing the lint errors, I decided to change many things. It's better to close this and I will open it again soon.

@ppopth ppopth closed this Oct 5, 2023
@ppopth
Copy link
Contributor Author

ppopth commented Oct 5, 2023

Reopened this PR, since I have already changed what I think I need to. I think the code is now far better. It's now ready for review. Thanks for your patience.

The commits before 26927f6 will be untouched.

@ppopth ppopth reopened this Oct 5, 2023
@ppopth ppopth force-pushed the netlink branch 2 times, most recently from f0da39e to 26927f6 Compare October 5, 2023 19:29
@robgjansen
Copy link
Member

Hi @ppopth ! Thanks for the updates. I see that you added tests which is awesome!

We will do a more thorough review, but for now two quick points about the tests:

  • We are strongly preferring that new tests are written in Rust rather than C. I know we still have a lot of C tests sitting around, and that's probably what you modeled your tests after, but we have been slowly trying to rewrite those in Rust so that we can get rid of as much C code as possible. If you think it would be pretty simple, I think now would best time to rewrite the tests because you are the most familiar and the logic is fresh in your mind. Otherwise they will sit as C tests for a long time and someone else will have to relearn to convert to Rust.
  • There are two types of integration tests: one where we just test that a particular function works when called, and one where we try all possible combination of functionalities and input options to make sure we understand how it works on Linux, and then enable only the ones that pass when testing in Shadow. You have implemented the former type, but we strongly prefer that latter more rigorous type of test. Good examples of these are in test/socket/* and test/time/*. It is important to us to test all of the ways a particular functionality can be used because we noticed that otherwise it tends to cause many issues and bugs down the road when some other application tries to use that function but in a different way. Doing the more rigorous tests makes Shadow more stable and makes it much easier to tell why it fails if an application exercises some unsupported functionality.

This PR is already big enough, so I wouldn't change anything here. But would you be willing to work on the above two points in follow-up PRs to really help solidify NetLink as a more roundly supported and better-tested socket type?

@ppopth
Copy link
Contributor Author

ppopth commented Oct 10, 2023

  • that's probably what you modeled your tests after

Not actually true. What I thought is that it's good to write the test in C because all of the libc and POSIX syscall APIs are documented and fully supported in C, and all the examples in the man pages are in C as well.

If the tests are implemented in Rust, it's quite hard to see how the Rust code actually calls the syscall. In addition, sometimes we will probably find it hard to find a proper crate to do the syscall. While, in C, you can just use the POSIX syscall APIs provided with Linux.

I think it makes sense to implement everything that is supposed to be in Shadow and the shim in Rust, but I think that's not necessary for the tests for the reason described above.

Do you disagree with this? I'm quite curious why the tests should be all in Rust instead.

  • one where we try all possible combination of functionalities and input options to make sure we understand how it works on Linux, and then enable only the ones that pass when testing in Shadow. You have implemented the former type, but we strongly prefer that latter more rigorous type of test. Good examples of these are in test/socket/* and test/time/*

Yeah, it's a good point. I agree with that. I didn't know and didn't notice that previously. I will look at the code in test/socket/* and test/time/* more deeply and get back to you soon.

@sporksmith
Copy link
Contributor

What I thought is that it's good to write the test in C because all of the libc and POSIX syscall APIs are documented and fully supported in C, and all the examples in the man pages are in C as well.

In places where we're specifically trying to invoke a specific syscall or libc function under test, we usually do it through the libc crate for this reason. We still take advantage of Rust's higher-level APIs for other things, like setup and teardown of state, threads, etc

@robgjansen
Copy link
Member

Just building a bit on what @sporksmith wrote:

Rust's libc crate provides the sort of "direct syscall access" that is available to C code. So if you are writing a test that is designed to test say epoll_create, then in Rust you would use the libc crate to make that specific low-level syscall.

But there are many other places in a test where you just want some functionality and you don't necessarily care which syscall is being invoked, because that specific test case isn't trying to test the ancillary functionality. For that we found it much nicer (easier to write the code and reason about its correctness) when using the higher level Rust APIs.

We also found it much easier to write the more rigorous tests that I described above (let's call them fuzz tests) using Rust simply because the language has more ready-to-use constructs and crates that help make it easier. We already have some nice helper functions in rust that can make the fuzz-type testing easier. I wouldn't want us to have to write and maintain those helpers for every language.

@robgjansen
Copy link
Member

Oh also, thanks for starting this conversation about testing! ❤️ You are helping me realize that we don't have enough documentation about our expectations around testing, where to look for examples of good fuzz testing practices, etc. I will work on updating our documentation to be clear about our expectations to help clear this up for future work too :)

@stevenengler
Copy link
Contributor

stevenengler commented Oct 10, 2023

If the tests are implemented in Rust, it's quite hard to see how the Rust code actually calls the syscall. In addition, sometimes we will probably find it hard to find a proper crate to do the syscall. While, in C, you can just use the POSIX syscall APIs provided with Linux.

I'll also add that we usually only test the libc syscall wrappers (using the libc library as mentioned above), but sometimes it can be useful to test the actual kernel syscalls themselves. For example Go doesn't typically use the libc wrappers, so sometimes will make the syscalls in a different way than C code does with the libc wrappers. In this case you can use the linux-api library to test the actual syscall without any modifications that glibc makes. The linux-api library only implements a handful of syscalls, and we've been adding more as we need them.

We don't test the actual kernel syscalls very often (although maybe we should), so I would just test the libc wrapper unless the libc API is significantly different from the kernel API.

Copy link
Member

@robgjansen robgjansen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great!

I'm going to mark as "request changes" for now and then I'll re-review the updates.

I did not repeat my prior testing-related comments here, because I assume we'll handle Rust testing and fuzz testing in separate PRs.

I'm hoping we can get @sporksmith's thoughts to help clear up some of my uncertainty around my linux-api-related comments to make sure we are doing the right thing by preferring to add parts to linux-api rather than adding a dependency on the linux-raw-sys crate.

src/main/utility/sockaddr.rs Show resolved Hide resolved
src/main/utility/sockaddr.rs Show resolved Hide resolved
src/main/host/descriptor/socket/netlink.rs Outdated Show resolved Hide resolved
src/lib/pod/src/lib.rs Outdated Show resolved Hide resolved
ci/container_scripts/install_deps.sh Outdated Show resolved Hide resolved
src/main/host/descriptor/socket/netlink.rs Outdated Show resolved Hide resolved
src/main/host/descriptor/socket/netlink.rs Outdated Show resolved Hide resolved
src/main/host/descriptor/socket/netlink.rs Outdated Show resolved Hide resolved
src/main/host/descriptor/socket/netlink.rs Outdated Show resolved Hide resolved
src/main/host/syscall/handler/clone.rs Outdated Show resolved Hide resolved
@ppopth
Copy link
Contributor Author

ppopth commented Nov 27, 2023

But there are many other places in a test where you just want some functionality and you don't necessarily care which syscall is being invoked, because that specific test case isn't trying to test the ancillary functionality. For that we found it much nicer (easier to write the code and reason about its correctness) when using the higher level Rust APIs.

Does this mean that I should remove the tests for capset/capget? since they are only invoked by ip addr and we don't really care much about it.

In addition, it also seems hard for me to find a library in Rust or Go that invokes those syscalls directly.

@ppopth
Copy link
Contributor Author

ppopth commented Nov 28, 2023

@robgjansen @sporksmith @stevenengler All the comments have been addressed already. I haven't resolved the conflicts yet because I don't want to lose the history. I think I will do it after you review the changes. Thanks.

This commit allows us to peek the packets stored inside the buffer
without removing them from the buffer.
Since we implemented peek_packet in SharedBuf already, we can now
support MSG_PEEK in the recvmsg syscall for Netlink sockets.
Previously we handle only RTM_GETADDR, this commit now enables us to
handle RTM_GETLINK.

This commit implements the RTM_GETLINK handling part in a method
called handle_ifinfomsg.

In handle_ifinfomsg, it works as follows:
- Check that ifi_family of the message is either AF_UNSPEC or AF_INET
  because we support only IPv4 addresses
- Check that the rest of the fields (except ifi_change) are zeroes
  because they are unsupported
- Send the message containing all the interfaces with the following
  attributes: IFLA_IFNAME, IFLA_TXQLEN, and IFLA_MTU

We can also send the attributes IFLA_ADDRESS and IFLA_BROADCAST for the
MAC address, but we mark it as a todo instead.

The ifi_flags of the loopback interface is IFF_UP, IFF_LOOPBACK, and
IFF_RUNNING.

The ifi_flags of the ethernet interface is IFF_UP, IFF_BROADCAST,
IFF_RUNNING, and IFF_MULTICAST.

All the flags listed above are from what I observe in the `ip addr`
command.
Previously we have only the is_bound field in InitialState which is used
to indicate whether the socket is already bound or not. Now we want to
implement getsockname which requires us to return the bound address, so
we need to change from the boolean is_bound to the concrete bound_addr
field.
In capget, we will always return the empty capability bitfield because
we don't allow any capability in the plugin.

In capset, we will allow to set with the empty capability bitfield.
This commits adds both the linux tests and the shadow tests for both
capget and capset.

We assign them as extra tests because they require the local machine to
install libcap-dev.
Currently we handle only (SOL_SOCKET, SO_SNDBUF) and (SOL_SOCKET,
SO_RCVBUF). In SO_SNDBUF, we set the send buffer limit, while,
in SO_RCVBUF, we do nothing.
Since `ip addr` uses a netlink socket to get the information of the
interfaces and addresses, we use it to test our Netlink implementation.

We check that the output of the `ip addr` command matches the
ip-addr.stdout file.

We mark the test as an extra because it requirs the iproute2 package.
The user can set the send buffer limit by setting SO_SNDBUF using
setsockopt. We test this for Netlink sockets by setting this option and
send a lot of messages to the socket.
This commit installs libcap and ip to the CI because capget,capset
require libcap and ip is used to test the Netlink sockets.

In Debian/Ubuntu, the corresponding packages are libcap-dev and
iproute2.

In Fedora, the corresponding packages are libcap-devel and iproute.
Replace some log::warn! with warn_once_then_debug!
Define everything that is neeeded from linux-raw-sys by Netlink sockets
in linux-api instead.
Don't return an error if the socket is bound to the non-zero groups,
since it's not fatal and it's better allow it to do so.
If it tries to bind to Netlink groups other than RTMGRP_IPV4_IFADDR or
RTMGRP_IPV6_IFADDR, return an error.
@ppopth
Copy link
Contributor Author

ppopth commented Feb 20, 2024

@stevenengler Is it okay if I will squash all the commits in this PR into a single commit? since I think it's already hard to squash all the fix-up commits.

I'll let @robgjansen decide on that. (Since the PR is ~3,500 lines changed, my thinking is that if you could split it into 3 commits: one for the linux-api changes, one for the netlink socket implementation, and one for the tests.)

It turned out that it's not as hard as I thought. I squashed all the fix-up commits already. See this branch https://github.com/ppopth/shadow/tree/netlink2

Please let me know when you are ready to see it in this PR.

Copy link
Member

@robgjansen robgjansen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work and for sticking with us during the long review process!

@robgjansen
Copy link
Member

I think you can go ahead and rebase to squash the fixups now. If you want to go further and squash into just 3 commits as suggested by @stevenengler, I think that would make sense. The code from some of the earlier commits was removed in later commits, so squashing down to 3 commits would keep things cleaner. After that, force push the cleaned branch and we'll work on merging. Thanks!

@robgjansen robgjansen merged commit 53ebe80 into shadow:main Feb 21, 2024
24 checks passed
@ppopth ppopth deleted the netlink branch February 22, 2024 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Build Build/install tools and dependencies Component: Libraries Support functions like LD_PRELOAD and logging Component: Main Composing the core Shadow executable Component: Testing Unit and integration tests and frameworks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants