Boosting performance: UDP Offloading and GSO #2877

jshuhnow · 2020-11-15T06:32:41Z

Hello, I've been instrumenting numerous QUIC implementations for performance testing.

In particular, I see that [quiche] from Cloudflare implemented UDP offloading and GSO earlier this year. [link]
quic-go gave me 470.3 Mbps over loopback, so it seems to support neither acceleration yet (I only briefly looked at the code).

Could anyone confirm this for quic-go?
Maybe we could also keep this issue open to track a future implementation.

marten-seemann · 2020-11-15T06:52:58Z

Support for ReadBatch (which translates to recvmmsg) is being worked on in the batched-read branch. It's currently blocked on golang/mock#498, as I need that to generate a mock implementation to test the code I've written, as well as on golang/go#42444.

I'm not sure if Go allows us to use GSO. Do you have any information about that? I'd be very interested in adding support for it in quic-go.

jshuhnow · 2020-11-17T04:52:15Z

Thank you for the quick reply and reference.

Not that I'm aware of. Anybody, please feel free to continue the discussion.
As I understand it, both are indispensable for reaching gigabit-per-second throughput.

gedw99 · 2020-11-20T18:29:36Z

https://godoc.org/github.com/google/netstack/tcpip/stack#GSO

marten-seemann · 2020-11-21T03:12:43Z

@gedw99 Thanks for the link, that's very interesting. It's only for TCP though, isn't it?

nanokatze · 2022-04-27T00:22:49Z

Hi! GRO and GSO can be used in Go by setting appropriate socket options and then using ReadMsgUDP and WriteMsgUDP to get and specify segment size respectively. Feel free to use the following fragment as an example: https://github.com/nanokatze/quic-at-home/blob/main/internal/udp/gso_unix.go .
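
To illustrate the receive side of this, here is a minimal sketch of what a reader might do after a GRO-enabled ReadMsgUDP call returns a coalesced buffer. The function name splitCoalesced is hypothetical, and the sketch assumes the segment size has already been extracted from the control message (that parsing step is not shown). If the kernel reports no segment size, the buffer would presumably be treated as a single datagram.

```go
package main

import "errors"

// splitCoalesced cuts a buffer returned by a GRO-enabled read into the
// individual datagrams it contains. segSize is the segment size reported by
// the kernel in the control message (assumed to be parsed elsewhere).
// Every datagram has segSize bytes, except the last, which may be shorter.
// The returned slices alias buf; no data is copied.
func splitCoalesced(buf []byte, segSize int) ([][]byte, error) {
	if segSize <= 0 {
		return nil, errors.New("segment size must be positive")
	}
	datagrams := make([][]byte, 0, (len(buf)+segSize-1)/segSize)
	for len(buf) > 0 {
		n := segSize
		if n > len(buf) {
			n = len(buf)
		}
		datagrams = append(datagrams, buf[:n])
		buf = buf[n:]
	}
	return datagrams, nil
}
```

For example, a 3000-byte buffer with a segment size of 1200 would be split into datagrams of 1200, 1200 and 600 bytes.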

bt90 · 2023-04-14T08:32:00Z

https://pkg.go.dev/github.com/tredeske/u/unet#Socket.SetOptGso

marten-seemann · 2023-04-14T08:54:30Z

> https://pkg.go.dev/github.com/tredeske/u/unet#Socket.SetOptGso

This seems pretty simple, probably better to just copy-paste the code than to include this massive library: https://github.com/tredeske/u/blob/b09eea740bef/unet/socket.go#L305-L313

Do I understand correctly that one can use the cmsg to set the datagram size for every batch of packets separately? That would be necessary, since the server uses a single socket to serve many clients and runs DPLPMTUD on each connection separately.

bt90 · 2023-04-14T08:58:10Z

Maybe @tredeske can answer that?

tredeske · 2023-04-14T09:26:53Z

This is an interesting discussion.

GSO allows sending of a large (up to 64k) datagram, to a single destination. The datagram you send will be split into equal size actual datagrams by GSO. The performance gain is from having a single, large datagram transit the network stack instead of a bunch of smaller ones. It's a large gain in our experience.

http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

For best performance, using sendmmsg (note extra m) will allow you to send multiple large datagrams with a single system call, each of which will be sent on the wire as multiple MTU sized datagrams. Each large datagram can be sent to a different destination. The performance gain using this is also quite large.

The variability using cmsghdr is not necessary for our application, but I can see how that would be useful. The benchmark test for this interface shows how to do it in c.

https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/udpgso_bench_tx.c

marten-seemann · 2023-04-14T12:38:59Z

@tredeske Thank you for chiming in here!

> http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

The link doesn't seem to work.

> For best performance, using sendmmsg (note extra m) will allow you to send multiple large datagrams with a single system call, each of which will be sent on the wire as multiple MTU sized datagrams. Each large datagram can be sent to a different destination. The performance gain using this is also quite large.

Would you recommend combining these two? E.g. pass multiple (up to) 64k buffers to the kernel, if I'm sending to multiple remote addresses at the same time?

marten-seemann · 2023-04-14T12:50:40Z

Just stumbled upon Tailscale's blog post from yesterday, which is excellent (as always): https://tailscale.com/blog/more-throughput/

bt90 · 2023-04-14T12:51:05Z

> The link doesn't seem to work.

Make sure that you're accessing it with http://

tredeske · 2023-04-14T12:52:23Z

Huh. I just clicked that link in your comment and it brought up the pdf.

Yes. It's a big win. It lets you send multiple frames of the same size (except possibly the last one) at once to each destination. If you have variable-sized frames, then in practice you may not be able to fill up to 64 KB in each element of the msghdr array. Worst case, you can send two per element, as the last one in each can be a different size.
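
The grouping rule described above can be sketched roughly as follows. This is only an illustration: groupForGSO and maxGSOBytes are hypothetical names, the sizes are assumed to belong to packets queued for a single destination in send order, and the sketch assumes that the final packet of a batch may be shorter than the others but not longer. Any limit on the number of segments per send is not modeled.

```go
package main

// maxGSOBytes is an assumed upper bound on the payload handed to the kernel
// in one message. The thread only says "up to 64k"; the usable limit is
// likely a little lower once headers are accounted for.
const maxGSOBytes = 64 * 1024

// groupForGSO partitions packet sizes into batches that could each be sent
// as one GSO buffer: within a batch, all packets have the size of the first
// one, except the last, which may be shorter.
func groupForGSO(sizes []int) [][]int {
	var batches [][]int
	var cur []int
	total := 0
	closed := false // true once cur ends with a shorter packet
	for _, s := range sizes {
		fits := len(cur) > 0 && !closed && s <= cur[0] && total+s <= maxGSOBytes
		if !fits {
			if len(cur) > 0 {
				batches = append(batches, cur)
			}
			cur, total, closed = nil, 0, false
		}
		cur = append(cur, s)
		total += s
		if s < cur[0] {
			closed = true // a shorter packet must be the last one in its batch
		}
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```

With alternating sizes such as 1200, 1000, 1200, 1000, this yields two batches of two packets each, which matches the "two per element" case mentioned above.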

marten-seemann · 2023-04-14T13:09:32Z

Looks like my Brave is aggressively upgrading HTTP to HTTPS, to the point where links fail. Sorry for the noise!

Now I'm trying to wrap my head around how I'd use that with the new Go API (#3563). API proposal here: golang/go#45886 (comment)

If I understand correctly, I'd just pass UDPMsgs with buffers of up to 64k to WriteUDPMsgs after setting the respective socket option. The cmsg that tells the kernel how large each UDP datagram (except the last one) will be would go in UDPMsg.Control.

Not sure I understand why UDPMsg.Buffer and UDPMsg.Control are [][]byte and not []byte though, it seems like we'll only ever set the first element of those slices.

marten-seemann · 2023-04-16T10:32:51Z

I built a PoC, and it works! This is exciting! https://gist.github.com/marten-seemann/a549773b53f30960b966a9f4068b6e48

However, the Go standard library currently doesn't provide a function to serialize a Cmsghdr: golang/go#59653. This is pretty much a requirement for us to actually use this syscall.

bt90 · 2023-04-16T11:37:43Z

Would it be feasible to implement it for a subset of prominent architectures with a fallback to non-GSO sending?

marten-seemann · 2023-04-16T12:56:26Z

> Would it be feasible to implement it for a subset of prominent architectures with a fallback to non-GSO sending?

Probably. From looking at x/sys/unix, the main difference between architectures seems to be the size of the Len field (is it a uint32 or a uint64?). One could look at unix.SizeofCmsghdr. If it's 12, then the length field is probably a uint32 (or int32?). If it's 16, it's a uint64 (or int64?). I haven't checked if this works for all architectures, but it seems to cover most of them.
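
As a rough sketch of that idea, the following serializes a UDP_SEGMENT control message by hand, choosing the layout from the header size. The function name marshalSegmentCmsg is hypothetical. The sketch assumes a little-endian machine, assumes that the alignment is 4 bytes for a 12-byte header and 8 bytes for a 16-byte header (which should hold on Linux, but is not verified for every architecture), and takes the header size as a parameter in place of unix.SizeofCmsghdr so that it has no dependency outside the standard library.

```go
package main

import (
	"encoding/binary"
	"errors"
)

const (
	solUDP     = 17  // IPPROTO_UDP, used as cmsg_level
	udpSegment = 103 // UDP_SEGMENT from <linux/udp.h>
)

// marshalSegmentCmsg builds the control message that tells the kernel the
// segment size for one GSO send. hdrSize stands in for unix.SizeofCmsghdr:
// 12 means a 4-byte length field, 16 means an 8-byte length field.
// Assumption: little-endian byte order (e.g. amd64, arm64).
func marshalSegmentCmsg(hdrSize int, segSize uint16) ([]byte, error) {
	var align int
	switch hdrSize {
	case 12:
		align = 4
	case 16:
		align = 8
	default:
		return nil, errors.New("unsupported cmsghdr size")
	}
	const dataLen = 2 // the payload is a single uint16
	// Equivalent of CMSG_SPACE: header plus payload rounded up to the alignment.
	space := hdrSize + (dataLen+align-1)/align*align
	b := make([]byte, space)
	le := binary.LittleEndian
	// Equivalent of CMSG_LEN: header plus unpadded payload.
	if hdrSize == 12 {
		le.PutUint32(b[0:], uint32(hdrSize+dataLen))
	} else {
		le.PutUint64(b[0:], uint64(hdrSize+dataLen))
	}
	le.PutUint32(b[hdrSize-8:], solUDP)     // cmsg_level
	le.PutUint32(b[hdrSize-4:], udpSegment) // cmsg_type
	le.PutUint16(b[hdrSize:], segSize)      // payload: segment size
	return b, nil
}
```

The resulting bytes would be passed as the oob argument of WriteMsgUDP. Any other header size returns an error, which a caller could use as the signal to fall back to non-GSO sending.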

bt90 · 2023-04-28T11:13:12Z

@marten-seemann you could take a look at how wireguard-go is handling this.

marten-seemann · 2023-05-03T12:13:35Z

I've been working on GSO support in the gso branch. Initial benchmarks look very promising, I'm seeing more than a doubling of the transfer speed on localhost, and that's without tuning of any parameters.

However, GSO support raises an interesting API question: Once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail with a sendto: invalid argument error. The kernel requires us to set the UDP_SEGMENT cmsg with every sendmsg call (which is what WriteTo does). That's not a problem inside of quic-go (we can keep track of setting the sockopt and act accordingly), but users might want to multiplex other protocols on top of the same connection (e.g. STUN). QUIC is explicitly designed to allow such demultiplexing. The user however has no indication if setting the UDP_SEGMENT sockopt succeeded (and when!).

I'm wondering if we should make the API more explicit: Instead of doing interface assertions on the net.PacketConn set on the Transport, we could introduce an explicit way:

// OOBCapablePacketConn is a connection that allows the reading of ECN bits from the IP header.
// If the PacketConn passed to Dial or Listen satisfies this interface, quic-go will use it.
// In this case, ReadMsgUDP() will be used instead of ReadFrom() to read packets.
type OOBCapablePacketConn interface {
	net.PacketConn
	SyscallConn() (syscall.RawConn, error)
	ReadMsgUDP(b, oob []byte) (n, oobn, flags int, addr *net.UDPAddr, err error)
	WriteMsgUDP(b, oob []byte, addr *net.UDPAddr) (n, oobn int, err error)
}

// EnableOptimizations takes a connection and enables a range of optimizations that are crucial for QUIC performance.
// 1. increase UDP send and receive buffers
// 2. enable ECN support
// 3. enable GSO support
// 4. enable GRO support (in a future PR)
func EnableOptimizations(conn OOBCapablePacketConn) net.PacketConn {
	// crazy syscalls
	return conn // placeholder: the real implementation would return a wrapper around conn
}

If enabling GSO succeeds, the returned net.PacketConn would then wrap the WriteTo method, such that the UDP_SEGMENT message would be set.

quic-go users would need to call EnableOptimizations with the connection they're using for their QUIC server / client.

Does that make sense? Or is there a better solution?

bt90 · 2023-05-03T12:45:22Z

Multiplexing support is absolutely crucial for syncthing. We're currently using https://github.com/AudriusButkevicius/pfilter/ to accomplish that. While out of scope for the GSO changes, it would be wholeheartedly welcome if pfilter could be replaced with a listener mechanism or something similar offered by quic-go.

While we already discussed that as part of #3727, it might make sense to tackle the issue if we need to restructure at this level.

(Pinging @calmh @AudriusButkevicius for comments)

marten-seemann · 2023-05-03T15:13:36Z

@bt90 What are your thoughts on the suggested EnableOptimizations API here? Would that work for your case, assuming that at some point we'll have a callback / function on the Transport that allows you to retrieve the non-QUIC packets?

calmh · 2023-05-03T15:15:52Z

I believe it would, yes.

MarcoPolo · 2023-05-03T19:01:29Z

If I'm understanding this correctly, users could do something like this:

ln, err := quic.Listen(EnableOptimizations(serverUDPConn), getTLSConfig(), serverConfig)

Or probably more like:

serverUDPConn, err := net.ListenUDP("udp", addr)
optimizedUDPConn := EnableOptimizations(serverUDPConn)
ln, err := quic.Listen(optimizedUDPConn, getTLSConfig(), serverConfig)

Since they may want to write to the optimizedUDPConn themselves.

And I believe this wouldn't affect users who aren't multiplexing.

With my above understanding, I think this API makes sense to me 👍 . I was thinking of something similar from the problem statement "Once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail with a sendto: invalid argument error."

bt90 · 2023-06-05T21:46:39Z

v0.36.0 milestone?

marten-seemann added the performance label Nov 17, 2020

bt90 mentioned this issue Aug 10, 2022

caddyhttp: Enable HTTP/3 by default caddyserver/caddy#4707

Merged

This was referenced Oct 11, 2022

Utilization of the congestion window #3585

Open

write packets in batches #3524

Closed

This was referenced Apr 14, 2023

Configurable maximum packet size #3385

Open

x/sys/unix: missing function to marshal a Cmsghdr golang/go#59653

Open

marten-seemann mentioned this issue May 6, 2023

use GSO #3808

Merged

mxinden mentioned this issue May 12, 2023

write udp [::]:51835->:0: sendto: invalid argument quic-go/perf#1

Closed

marten-seemann linked a pull request May 14, 2023 that will close this issue

use GSO #3808

Merged

marten-seemann closed this as completed in #3808 Jun 3, 2023
