Boosting performance: UDP Offloading and GSO #2877

Closed
jshuhnow opened this issue Nov 15, 2020 · 24 comments · Fixed by #3808

Comments

@jshuhnow

Hello, I've been instrumenting numerous QUIC implementations for performance testing.

In particular, I see that [quiche] from Cloudflare implemented UDP offloading and GSO earlier this year. [link]
quic-go gave me 470.3 Mbps over loopback, so it seems to support neither acceleration yet. (I only briefly looked at the code.)

Would anyone confirm this for quic-go?
Maybe we could keep this issue open for a future implementation, too.

@marten-seemann
Member

marten-seemann commented Nov 15, 2020

Support for ReadBatch (which translates to recvmmsg) is being worked on in the batched-read branch. It's currently blocked on golang/mock#498, as I need that to generate a mock implementation to test the code I've written, as well as on golang/go#42444.

I'm not sure if Go allows us to use GSO. Do you have any information about that? I'd be very interested in adding support for it in quic-go.

@jshuhnow
Author

Thank you for the quick reply and reference.

Not that I'm aware of. Please, anybody, feel free to further the discussion.
I understand both are indispensable for achieving gigabit-per-second throughput.

@gedw99

gedw99 commented Nov 20, 2020

@marten-seemann
Member

@gedw99 Thanks for the link, that's very interesting. It's only for TCP though, isn't it?

@nanokatze
Contributor

Hi! GRO and GSO can be used in Go by setting appropriate socket options and then using ReadMsgUDP and WriteMsgUDP to get and specify segment size respectively. Feel free to use the following fragment as an example: https://github.com/nanokatze/quic-at-home/blob/main/internal/udp/gso_unix.go .

@bt90
Contributor

bt90 commented Apr 14, 2023

@marten-seemann
Member

https://pkg.go.dev/github.com/tredeske/u/unet#Socket.SetOptGso

This seems pretty simple, probably better to just copy-paste the code than to include this massive library: https://github.com/tredeske/u/blob/b09eea740bef/unet/socket.go#L305-L313

Do I understand correctly that one can use the cmsg to set the datagram size for every batch of packets separately? That would be necessary since the server just has a single socket that it uses to serve many clients, and runs DPLPMTUD on each connection separately.

@bt90
Contributor

bt90 commented Apr 14, 2023

Maybe @tredeske can answer that?

@tredeske

This is an interesting discussion.

GSO allows sending of a large (up to 64k) datagram, to a single destination. The datagram you send will be split into equal size actual datagrams by GSO. The performance gain is from having a single, large datagram transit the network stack instead of a bunch of smaller ones. It's a large gain in our experience.
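
To make the splitting concrete, the following small sketch computes the datagram sizes that result from one large send. It only models the arithmetic described above, not the kernel; segmentSizes is a hypothetical helper name.

```go
package main

import "fmt"

// segmentSizes returns the payload sizes of the datagrams that a single
// large send of total bytes is split into for a segment size of segSize.
// Every segment has segSize bytes except possibly the last one.
func segmentSizes(total, segSize int) []int {
	if total <= 0 || segSize <= 0 {
		return nil
	}
	sizes := make([]int, 0, (total+segSize-1)/segSize)
	for total > 0 {
		n := segSize
		if total < n {
			n = total
		}
		sizes = append(sizes, n)
		total -= n
	}
	return sizes
}

func main() {
	// 3000 bytes handed over in one call, with a 1200-byte segment size.
	fmt.Println(segmentSizes(3000, 1200)) // prints [1200 1200 600]
}
```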

http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

For best performance, using sendmmsg (note extra m) will allow you to send multiple large datagrams with a single system call, each of which will be sent on the wire as multiple MTU sized datagrams. Each large datagram can be sent to a different destination. The performance gain using this is also quite large.

The variability using cmsghdr is not necessary for our application, but I can see how that would be useful. The benchmark test for this interface shows how to do it in C.

https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/udpgso_bench_tx.c

@marten-seemann
Member

@tredeske Thank you for chiming in here!

> http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

The link doesn't seem to work.

> For best performance, using sendmmsg (note extra m) will allow you to send multiple large datagrams with a single system call, each of which will be sent on the wire as multiple MTU sized datagrams. Each large datagram can be sent to a different destination. The performance gain using this is also quite large.

Would you recommend combining these two? E.g. pass multiple (up to) 64k buffers to the kernel, if I'm sending to multiple remote addresses at the same time?

@marten-seemann
Member

Just stumbled upon Tailscale's blog post from yesterday, which is excellent (as always): https://tailscale.com/blog/more-throughput/

@bt90
Contributor

bt90 commented Apr 14, 2023

> The link doesn't seem to work.

Make sure that you're accessing it with http://

@tredeske

Huh. I just clicked that link in your comment and it brought up the pdf.

Yes. It's a big win. It lets you send multiple frames of the same size (except possibly the last one) at once to each destination. If you have variable-sized frames, then in practice you may not be able to fill up to 64 KB in each element of the msghdr array. Worst case, you can send two per element, as the last one in each can be a different (smaller) size.
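
The grouping rule described here can be sketched as a small pure function. This is only an illustration under stated assumptions: batchForGSO and its maxBytes parameter are hypothetical names, packets are represented by their sizes, and any kernel limit on the number of segments per send is not modeled.

```go
package main

import "fmt"

// batchForGSO groups consecutive packet sizes into batches that could each
// be handed over as one large buffer: every packet in a batch has the same
// size, except the last one, which may be smaller. A batch is not extended
// past maxBytes.
func batchForGSO(sizes []int, maxBytes int) [][]int {
	var batches [][]int
	var cur []int
	total := 0
	closed := false // true once a shorter, final packet has been added
	for _, s := range sizes {
		fits := len(cur) > 0 && !closed && s <= cur[0] && total+s <= maxBytes
		if !fits {
			if len(cur) > 0 {
				batches = append(batches, cur)
			}
			cur, total, closed = nil, 0, false
		}
		cur = append(cur, s)
		total += s
		if s < cur[0] {
			closed = true
		}
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// Mostly full-size packets with a few short ones in between.
	fmt.Println(batchForGSO([]int{1200, 1200, 1200, 800, 1200, 500, 300}, 65535))
	// Worst case: strictly shrinking sizes give two packets per batch.
	fmt.Println(batchForGSO([]int{1200, 1100, 1000, 900}, 65535))
}
```

With strictly shrinking sizes, each batch ends up holding exactly two packets, which matches the worst case mentioned above.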

@marten-seemann
Member

Looks like my Brave is aggressively upgrading HTTP to HTTPS, to the point where links fail. Sorry for the noise!

Now I'm trying to wrap my head around how I'd use that with the new Go API (#3563). API proposal here: golang/go#45886 (comment)

If I understand correctly, I'd just pass UDPMsgs with buffers of up to 64k to WriteUDPMsgs after setting the respective socket option. The cmsg that tells the kernel how large each UDP datagram (except the last one) will be would go in UDPMsg.Control.

Not sure I understand why UDPMsg.Buffer and UDPMsg.Control are [][]byte and not []byte, though. It seems like we'll only ever set the first element of those slices.

@marten-seemann
Member

I built a PoC, and it works! This is exciting! https://gist.github.com/marten-seemann/a549773b53f30960b966a9f4068b6e48

However, the Go standard library currently doesn't provide a function to serialize a Cmsghdr: golang/go#59653. This is pretty much a requirement for us to actually use this syscall.

@bt90
Contributor

bt90 commented Apr 16, 2023

Would it be feasible to implement it for a subset of prominent architectures with a fallback to non-GSO sending?

@marten-seemann
Member

> Would it be feasible to implement it for a subset of prominent architectures with a fallback to non-GSO sending?

Probably. From looking at x/sys/unix, the main difference between architectures seems to be the size of the Len field (is it a uint32 or a uint64). One could look at unix.SizeofCmsghdr. If it's 12, then the length field is probably a uint32 (or int32?). If it's 16, it's a uint64 (or int64?). I haven't checked if this works for all architectures, but it seems to cover most of them.

@bt90
Contributor

bt90 commented Apr 28, 2023

@marten-seemann you could take a look at how wireguard-go is handling this.

@marten-seemann
Member

I've been working on GSO support in the gso branch. Initial benchmarks look very promising, I'm seeing more than a doubling of the transfer speed on localhost, and that's without tuning of any parameters.

However, GSO support raises an interesting API question: Once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail with a sendto: invalid argument error. The kernel requires us to set the UDP_SEGMENT cmsg with every sendmsg call (which is what WriteTo does). That's not a problem inside of quic-go (we can keep track of setting the sockopt and act accordingly), but users might want to multiplex other protocols on top of the same connection (e.g. STUN). QUIC is explicitly designed to allow such demultiplexing. The user however has no indication if setting the UDP_SEGMENT sockopt succeeded (and when!).

I'm wondering if we should make the API more explicit: Instead of doing interface assertions on the net.PacketConn set on the Transport, we could introduce an explicit way:

import (
	"net"
	"syscall"
)

// OOBCapablePacketConn is a connection that allows the reading of ECN bits from the IP header.
// If the PacketConn passed to Dial or Listen satisfies this interface, quic-go will use it.
// In this case, ReadMsgUDP() will be used instead of ReadFrom() to read packets.
type OOBCapablePacketConn interface {
	net.PacketConn
	SyscallConn() (syscall.RawConn, error)
	ReadMsgUDP(b, oob []byte) (n, oobn, flags int, addr *net.UDPAddr, err error)
	WriteMsgUDP(b, oob []byte, addr *net.UDPAddr) (n, oobn int, err error)
}

// EnableOptimizations takes a connection and enables a range of optimizations that are crucial for QUIC performance:
// 1. increase UDP send and receive buffers
// 2. enable ECN support
// 3. enable GSO support
// 4. enable GRO support (in a future PR)
func EnableOptimizations(conn OOBCapablePacketConn) net.PacketConn {
	// crazy syscalls go here (not shown)
	return conn // placeholder so the sketch compiles; the real version would return a wrapper
}

If enabling GSO succeeds, the returned net.PacketConn would then wrap the WriteTo method, such that the UDP_SEGMENT message would be set.

quic-go users would need to call EnableOptimizations with the connection they're using for their QUIC server / client.

Does that make sense? Or is there a better solution?

@bt90
Contributor

bt90 commented May 3, 2023

Multiplexing support is absolutely crucial for syncthing. We're currently using https://github.com/AudriusButkevicius/pfilter/ to accomplish that. While out of scope for the GSO changes, it would be wholeheartedly welcome if pfilter could be replaced with a listener mechanism or something similar offered by quic-go.

While we already discussed that as part of #3727, it might make sense to tackle the issue if we need to restructure at this level.

(Pinging @calmh @AudriusButkevicius for comments)

@marten-seemann
Member

@bt90 What are your thoughts on the suggested EnableOptimizations API here? Would that work for your case, assuming that at some point we'll have a callback / function on the Transport that allows you to retrieve the non-QUIC packets?

@calmh

calmh commented May 3, 2023

I believe it would, yes.

@MarcoPolo
Collaborator

If I'm understanding this correctly, users could do something like this:

ln, err := quic.Listen(EnableOptimizations(serverUDPConn), getTLSConfig(), serverConfig)

Or probably more like:

serverUDPConn, err := net.ListenUDP("udp", addr)
// handle err
optimizedUDPConn := EnableOptimizations(serverUDPConn)
ln, err := quic.Listen(optimizedUDPConn, getTLSConfig(), serverConfig)

Since they may want the optimizedUDPConn to write to it themselves.

And I believe this wouldn't affect users who aren't multiplexing.

With my above understanding, this API makes sense to me 👍 . I was thinking of something similar from the problem statement "Once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail with a sendto: invalid argument error."

@marten-seemann marten-seemann mentioned this issue May 6, 2023
@bt90
Contributor

bt90 commented Jun 5, 2023

v0.36.0 milestone?
