Boosting performance: UDP Offloading and GSO #2877
I'm not sure if Go allows us to use GSO. Do you have any information about that? I'd be very interested in adding support for it in quic-go. |
Thank you for the quick reply and reference. Not that I'm aware of. Please, anybody, feel free to further the discussion. |
@gedw99 Thanks for the link, that's very interesting. It's only for TCP though, isn't it? |
Hi! GRO and GSO can be used in Go by setting appropriate socket options and then using ReadMsgUDP and WriteMsgUDP to get and specify segment size respectively. Feel free to use the following fragment as an example: https://github.com/nanokatze/quic-at-home/blob/main/internal/udp/gso_unix.go . |
This seems pretty simple, probably better to just copy-paste the code than to include this massive library: https://github.com/tredeske/u/blob/b09eea740bef/unet/socket.go#L305-L313 Do I understand correctly that one can use the cmsg to set the datagram size for every batch of packets separately? That would be necessary, since the server uses just a single socket to serve many clients, and runs DPLPMTUD on each connection separately. |
Maybe @tredeske can answer that? |
This is an interesting discussion. GSO allows sending a large (up to 64k) datagram to a single destination. The datagram you send will be split into equal-size actual datagrams by GSO. The performance gain comes from having a single large datagram transit the network stack instead of a bunch of smaller ones. It's a large gain in our experience. http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

For best performance, using sendmmsg (note the extra m) will allow you to send multiple large datagrams with a single system call, each of which will be sent on the wire as multiple MTU-sized datagrams. Each large datagram can be sent to a different destination. The performance gain from this is also quite large.

The variability using cmsghdr is not necessary for our application, but I can see how that would be useful. The benchmark test for this interface shows how to do it in C: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/udpgso_bench_tx.c |
@tredeske Thank you for chiming in here!
The link doesn't seem to work.
Would you recommend combining these two? E.g. pass multiple (up to) 64k buffers to the kernel, if I'm sending to multiple remote addresses at the same time? |
Just stumbled upon Tailscale's blog post from yesterday, which is excellent (as always): https://tailscale.com/blog/more-throughput/ |
Make sure that you're accessing it with plain HTTP. |
Huh. I just clicked that link in your comment and it brought up the PDF. Yes, it's a big win. It lets you send multiple frames of the same size (except possibly the last one) at once to each destination. If you have variable-sized frames, then in practice you may not be able to fill up to 64 KB in each element of the msghdr array. Worst case, you can send two per element, as the last one in each can be a different size. |
Looks like my Brave is aggressively upgrading HTTP to HTTPS, to the point where links fail. Sorry for the noise! Now I'm trying to wrap my head around how I'd use that with the new Go API (#3563). API proposal here: golang/go#45886 (comment) If I understand correctly, I'd just pass up to Not sure I understand why |
I built a PoC, and it works! This is exciting! https://gist.github.com/marten-seemann/a549773b53f30960b966a9f4068b6e48 However, the Go standard library currently doesn't provide a function to serialize a Cmsghdr. |
Would it be feasible to implement it for a subset of prominent architectures with a fallback to non-GSO sending? |
Probably. From looking at x/sys/unix, the main difference between architectures seems to be the size of the Cmsghdr struct. |
@marten-seemann you could take a look at how wireguard-go is handling this. |
I've been working on GSO support in the gso branch. Initial benchmarks look very promising: I'm seeing more than a doubling of the transfer speed on localhost, and that's without tuning any parameters. However, GSO support raises an interesting API question: once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail. I'm wondering if we should make the API more explicit: instead of doing interface assertions on the net.PacketConn passed to Dial and Listen, we could expose something like this:

// OOBCapablePacketConn is a connection that allows the reading of ECN bits from the IP header.
// If the PacketConn passed to Dial or Listen satisfies this interface, quic-go will use it.
// In this case, ReadMsgUDP() will be used instead of ReadFrom() to read packets.
type OOBCapablePacketConn interface {
net.PacketConn
SyscallConn() (syscall.RawConn, error)
ReadMsgUDP(b, oob []byte) (n, oobn, flags int, addr *net.UDPAddr, err error)
WriteMsgUDP(b, oob []byte, addr *net.UDPAddr) (n, oobn int, err error)
}
// EnableOptimizations takes a connection and enables a range of optimizations that are crucial for QUIC performance.
// 1. increase UDP send and receive buffers
// 2. enable ECN support
// 3. enable GSO support
// 4. enable GRO support (in a future PR)
func EnableOptimizations(conn OOBCapablePacketConn) net.PacketConn {
// crazy syscalls
} If enabling GSO succeeds, the returned quic-go users would need to call Does that make sense? Or is there a better solution? |
Multiplexing support is absolutely crucial for syncthing. We're currently using https://github.com/AudriusButkevicius/pfilter/ to accomplish that. While out of scope for the GSO changes, it would be wholeheartedly welcome if pfilter could be replaced with a listener mechanism or something similar offered by quic-go. While we already discussed that as part of #3727, it might make sense to tackle the issue if we need to restructure at this level. (Pinging @calmh @AudriusButkevicius for comments) |
@bt90 What are your thoughts on the suggested EnableOptimizations API? Would it work for syncthing's multiplexing setup? |
I believe it would, yes. |
If I'm understanding this correctly, users would wrap their connection with EnableOptimizations and pass the returned conn to quic-go. Or, probably more likely, they'd keep a reference to the optimizedUDPConn themselves, since they may want to use it for their own writes as well. And I believe this wouldn't affect users who aren't multiplexing. With my above understanding, I think this API makes sense to me 👍. I was thinking of something similar from the problem statement "Once GSO is enabled, calls to WriteTo on the original net.UDPConn will fail." |
v0.36.0 milestone? |
Hello, I've been instrumenting numerous QUIC implementations for performance testing. In particular, I see quiche from Cloudflare implemented UDP offloading and GSO earlier this year. [link] quic-go has given me 470.3 Mbps in a loopback setting, so it seems to have no support for either acceleration yet (I briefly looked at the code). Would anyone confirm this for quic-go? Maybe we could keep this issue open for future implementation, too. |