GSO fails on some platforms / interfaces #3911
GSO would have been my first guess as well. I suspect that for some reason we manage to enable GSO, but don't correctly recognize that it was successfully enabled. This then causes the failure you're seeing because once enabled, the kernel expects the segment length message for the Lines 70 to 83 in 21388c8
Can you try manually disabling GSO via the Lines 46 to 50 in 21388c8
This should make the failure go away. If it's actually due to GSO, can you confirm if the syscall to enable it is successful in Lines 52 to 57 in 21388c8
If my suspicion is correct, this would mean that we can't rely on the return value of the syscall that enables GSO. Maybe we need to instead read the value back using |
It is indeed GSO: With QUIC_GO_DISABLE_GSO on v0.36.0:
Without QUIC_GO_DISABLE_GSO on v0.36.0:
|
This is with debug printout (don't worry the version in the message, that's just my fork not having updated tags):
Like so:
var setErr error
if err := rawConn.Control(func(fd uintptr) {
setErr = unix.SetsockoptInt(int(fd), syscall.IPPROTO_UDP, UDP_SEGMENT, 1)
}); err != nil {
setErr = err
}
log.Printf("!! DEBUG !! setErr: %+v\n", setErr)
if setErr != nil {
log.Println("failed to enable GSO")
return false
} |
Like so:
var setErr error
if err := rawConn.Control(func(fd uintptr) {
setErr = unix.SetsockoptInt(int(fd), syscall.IPPROTO_UDP, UDP_SEGMENT, 1)
}); err != nil {
setErr = err
}
if setErr != nil {
log.Println("failed to enable GSO")
return false
}
rawConn.Control(func(fd uintptr) {
val, err := unix.GetsockoptInt(int(fd), syscall.IPPROTO_UDP, UDP_SEGMENT)
log.Printf("!! DEBUG !! GetsockoptInt: %+v, err: %+v\n", val, err)
}) |
Could this be the kernel not supporting GSO fully? On the client, they are running inside docker but on |
Thank you! I would’ve expected the SetSockoptInt to fail in that case, but maybe that’s not how it works? Any idea how we could determine if GSO is actually available? Seems like we can neither trust SetSockoptInt nor GetSockoptInt. And the „invalid argument“ error is also pretty unspecific, otherwise we could do some kind of error assertion. |
I think the argument might be incorrect: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/udpgso.c The value is not 1 or 0 but the size it seems |
We have the same issue with Chromebooks (Linux VM in a Chromebook). To test for GSO, quiche uses "1350" instead of "1", but I've tried changing this in quic-go and the error is still there. (see detect_gso definition and calls) |
The issue seems to be detecting whether GSO ALSO works with IPv6. |
@jfgiorgi Does the error occur immediately, or only when DPLPMTUD kicks in and attempts to send a packet larger than 1350 bytes (maybe minus IP / UDP header?). |
Immediately. Here is an example with h3ctx (https://github.com/kgersen/h3ctx) to reproduce:
I don't know if this can help but you might want to take a look at how msquic solved this: microsoft/msquic#1360 |
@marten-seemann can we recall the 0.36.0 release? Folks who don't know that GSO might blow up may have their systems break if they upgrade to 0.36.0 by mistake |
I'm considering it, unless we can figure out a patch within the next day or so. Alternatively, we could also flip the environment variable to cc @mholt @francislavoie, just so you're aware, you'll most likely receive another PR with a patch for Caddy soon.
@kgersen The fallback is actually not the big problem. I'm fine falling back to a simple
@zllovesuki Could you try setting the value to 1400 or so and see if that fixes the problem? Does anyone have any pointer for me how to reproduce this? GSO works fine both on my Ubuntu machine (via Parallels), and on all the Linode cloud servers I have up and running. Ideally some cloud service where I can quickly spin up a machine to play around with. |
Thanks for the ping. We'll hold the Caddy 2.7 stable release until a patch is ready. We're fine with either a fix or disabling it. |
@marten-seemann Oracle offers aarch64 in their free tier |
I wonder if this is in fact a kernel bug? Might be one of those aarch64 funzies that @marcan and the Asahi team keeps running into. |
So the fix they use is to actually test GSO before using it and not just believe what the kernel returns. |
Can you point me to where they test this? |
Here: https://github.com/microsoft/msquic/blob/main/src/platform/datapath_epoll.c#L473 (up to line 535). I've run the Linux kernel selftests on a machine with the issue and I can confirm that, in our case, GSO+IPv6 is the culprit:
With -4 (IPv4) all tests are OK. |
@jfgiorgi Any chance I can spin up a machine like this myself? Would be really helpful if I could debug it myself. |
You should be able to use the service for free 😉 |
Unfortunately it's not a VM in the cloud. It's a VM inside a physical device, a Chromebook. the specs: We only have shell access to the inner LXD container so I don't know which layer is causing the issue. |
Seems like it's tied to IPv6 and VMs/containers? I wonder if this patch is related? |
I tried, and it doesn't work. This user interface is a total mess. Apparently they're "out of capacity" for the aarch64 machines, and they don't let me use the 250 bucks of "free credit" they gave me to play around with beefier machines. And for some reason I'm stuck in the upgrade process of my account, which would allow me to add my credit card. What should I say, Oracle... Instead, I set up an aarch64 on AWS. However, everything works fine there, I can successfully enable GSO there. I'm not sure how to continue debugging this. By now I've probably tested on > 10 machines, and every machine I set up works fine, no matter which architecture and which distribution I select.
I don't think I'll be able to recreate this... ;) |
There isn't much to debug. it's clearly a kernel/hardware issue on some machines/configuration not related to quic-go or even Go. The safest is like msquic: detect gso but don't trust the kernel so perform a test or catch the very first error and retry once without gso. I'll take a look at Google's quiche and Chromium source later next week to see how they deal with this since our Chromebooks have the issue. |
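The "catch the very first error and retry once without GSO" idea could look roughly like the sketch below. This is not quic-go's or msquic's actual code: `sendFunc`, `gsoSender`, and `brokenGSO` are hypothetical names made up for illustration, and the sketch assumes that any EINVAL on the GSO path means GSO itself is unusable, which may be too coarse in practice.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// sendFunc is a hypothetical stand-in for the real send path. useGSO says
// whether the batch is handed to the kernel with a segment size attached.
type sendFunc func(batch []byte, segSize int, useGSO bool) error

// gsoSender remembers whether GSO is still believed to work.
type gsoSender struct {
	send       sendFunc
	gsoEnabled bool
}

// Write tries GSO first. On EINVAL it disables GSO for good (assumption:
// EINVAL here means GSO is broken) and resends the same batch as
// individual datagrams of at most segSize bytes.
func (s *gsoSender) Write(batch []byte, segSize int) error {
	if segSize <= 0 {
		return errors.New("segment size must be positive")
	}
	if s.gsoEnabled {
		err := s.send(batch, segSize, true)
		if !errors.Is(err, syscall.EINVAL) {
			return err // success, or an error unrelated to GSO
		}
		s.gsoEnabled = false
	}
	for len(batch) > 0 {
		n := segSize
		if n > len(batch) {
			n = len(batch)
		}
		if err := s.send(batch[:n], segSize, false); err != nil {
			return err
		}
		batch = batch[n:]
	}
	return nil
}

// brokenGSO simulates a platform where every GSO send fails with EINVAL.
// The sizes of successfully sent datagrams are appended to *sent.
func brokenGSO(sent *[]int) sendFunc {
	return func(b []byte, _ int, useGSO bool) error {
		if useGSO {
			return syscall.EINVAL
		}
		*sent = append(*sent, len(b))
		return nil
	}
}

func main() {
	var sent []int
	s := &gsoSender{send: brokenGSO(&sent), gsoEnabled: true}
	err := s.Write(make([]byte, 3000), 1200)
	fmt.Println(err, s.gsoEnabled, sent)
}
```

The fallback costs one failed syscall on affected machines and nothing on machines where GSO works, which is why it is attractive when the kernel's own answers can't be trusted.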
There's a few things I'd like to try out. As @zllovesuki noticed, we shouldn't set the option to 1, but to the actual packet size. Now what I'd like to experiment with is the following:
We could implement something similar: Send a packet filled with random bytes in |
Unfortunately, I don't have a Raspberry Pi with me at the moment. @zllovesuki, can you help us out here? |
You can get EINVAL when the split would result in more than 64 datagrams. I grepped for this value in quic-go and I don't see it being enforced, though I may have missed it. See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/udp.c?h=v6.2&id=c9c3395d5e3dcc6daee66c6908354d47bf98cb0c#n933 and the usage of https://github.com/WireGuard/wireguard-go/pull/75/files?diff=unified&w=0#diff-8fba84ef516815198fc29ef62a2486ad0c8b49e02257af3c3b96171a1cbfdc87R451 |
That check is implicit. The minimum size of packets we send is 1200 bytes, and we stop adding new packets to a batch as soon as a packet is packed that's smaller than that. Since the size is limited to 64k, we can never have more than 64 packets. |
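To spell out the arithmetic behind that implicit check: if every packet except the last is at least 1200 bytes and the whole batch fits in a 64k buffer, the packet count stays well under the 64-segment limit. The helper below is only an illustration; `maxPacketsInBatch` is a made-up name, and 65535 is assumed as the concrete value of "64k".

```go
package main

import "fmt"

// maxPacketsInBatch returns the largest number of packets that can fit in
// a batch of at most bufSize bytes when every packet except the last is at
// least minSize bytes (the last one may be as small as 1 byte).
func maxPacketsInBatch(bufSize, minSize int) int {
	if bufSize <= 0 || minSize <= 0 {
		return 0
	}
	// n-1 full packets plus one byte must fit:
	// (n-1)*minSize + 1 <= bufSize.
	return (bufSize-1)/minSize + 1
}

func main() {
	// 54 packets of 1200 bytes (64800 bytes) plus one short packet.
	fmt.Println(maxPacketsInBatch(65535, 1200)) // prints 55
}
```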
Maybe the way we construct the cmsg doesn't work on all platforms? Just a wild guess, but it might be worth checking that the cmsg looks the same when generated in C vs. what we generate here in Go: Lines 67 to 80 in 44a58dc
|
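For anyone who wants to compare against a C-generated cmsg, here is a minimal, Linux-only sketch that builds a UDP_SEGMENT control message with the standard `syscall` package and dumps the bytes. It is not the code quic-go uses; `gsoCmsg` and `decodeGSOCmsg` are names invented here, and the constant 103 is UDP_SEGMENT from `<linux/udp.h>`.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"unsafe"
)

const udpSegment = 103 // UDP_SEGMENT from <linux/udp.h>

// gsoCmsg builds one control message carrying a uint16 segment size,
// laid out as cmsghdr followed by the payload at offset CmsgLen(0).
func gsoCmsg(segSize uint16) []byte {
	b := make([]byte, syscall.CmsgSpace(2))
	h := (*syscall.Cmsghdr)(unsafe.Pointer(&b[0]))
	h.Level = syscall.IPPROTO_UDP // same value as SOL_UDP (17)
	h.Type = udpSegment
	h.SetLen(syscall.CmsgLen(2))
	*(*uint16)(unsafe.Pointer(&b[syscall.CmsgLen(0)])) = segSize
	return b
}

// decodeGSOCmsg parses the buffer back with the standard library parser,
// which is a cheap check that lengths and offsets are self-consistent.
func decodeGSOCmsg(b []byte) (level, typ int32, segSize uint16, err error) {
	msgs, err := syscall.ParseSocketControlMessage(b)
	if err != nil {
		return 0, 0, 0, err
	}
	if len(msgs) != 1 || len(msgs[0].Data) < 2 {
		return 0, 0, 0, errors.New("unexpected control message layout")
	}
	segSize = *(*uint16)(unsafe.Pointer(&msgs[0].Data[0]))
	return msgs[0].Header.Level, msgs[0].Header.Type, segSize, nil
}

func main() {
	b := gsoCmsg(1400)
	fmt.Printf("% x\n", b) // compare byte-for-byte with the C version
	fmt.Println(decodeGSOCmsg(b))
}
```

Printing the same buffer from a C program that uses `CMSG_FIRSTHDR` / `CMSG_DATA` on the same machine should give identical bytes if the layout is right.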
@marten-seemann do you just need SSH access to a raspberry pi? Send me an email (address on profile) or we can chat over Signal to coordinate |
@zllovesuki Thanks for offering. I might get back to you on this in a couple of days. |
i've updated this C code to support both IPv4 & IPv6 : https://github.com/jfgiorgi/cgso on our systems with the gso issue:
it works if: |
The UDP_SEGMENT sockopt is not a binary flag like UDP_GRO. It sets the split size for all datagrams without the need for an equivalent cmsg. So, setting it to 1 and mixing with cmsg seems unintentional if quic-go is still doing that. |
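As a rough illustration of that point, the sketch below sets UDP_SEGMENT to an actual segment size instead of 1 and reads the value back. It is Linux-only, uses the standard `syscall` package rather than `golang.org/x/sys/unix`, and the helper names (`setSegmentSize`, `probe`) are invented here. As noted earlier in the thread, a successful set and get still doesn't prove that sending with GSO will work.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

const udpSegment = 103 // UDP_SEGMENT from <linux/udp.h>

// setSegmentSize sets the socket-wide GSO segment size and returns the
// value the kernel reports back via getsockopt.
func setSegmentSize(conn *net.UDPConn, size int) (readBack int, err error) {
	rawConn, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var opErr error
	if err := rawConn.Control(func(fd uintptr) {
		opErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_UDP, udpSegment, size)
		if opErr != nil {
			return
		}
		readBack, opErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_UDP, udpSegment)
	}); err != nil {
		return 0, err
	}
	return readBack, opErr
}

// probe opens a throwaway loopback socket and applies setSegmentSize.
func probe(size int) (int, error) {
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	return setSegmentSize(conn, size)
}

func main() {
	v, err := probe(1400)
	fmt.Printf("read back: %d, err: %v\n", v, err)
}
```

Because the sockopt applies to every datagram sent on the socket, combining a socket-wide value with a per-message cmsg needs care; a per-message cmsg alone is enough to request segmentation for a single send.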
I'm still seeing issues with GSO on arm64 and amd64 platform:
amd64 is a linode vm, arm64 is an oracle vm |
Is there any container runtime involved? |
I wonder if the offset calculation is correct. The internal Wouldn't |
This isn't a super helpful comment but I forgot to gather more info and tore down my environment. Figured I would comment something now instead of waiting - I will try to make this more useful though. On a GCP c3 VM with completely out-of-the-box settings connecting to another C3 VM using the latest version of quic-go, I had to disable GSO or I got the
package main
import (
"context"
"crypto/rand"
"crypto/rsa"
"crypto/tls"
"crypto/x509"
"encoding/pem"
"flag"
"github.com/quic-go/quic-go"
syscall "golang.org/x/sys/unix"
"io"
"log"
"math/big"
"net"
"os"
"os/signal"
"sync"
"time"
)
var (
remote = flag.String("r", "127.0.0.1:8080", "remote addresses")
)
func main() {
flag.Parse()
go outbound()
go inbound()
WaitSignal()
}
func listenUDP(addr string) (*net.UDPConn, error) {
udpAddr, err := net.ResolveUDPAddr("udp4", addr)
if err != nil {
return nil, err
}
return net.ListenUDP("udp4", udpAddr)
}
func dialUDP(addr string, tlsConf *tls.Config, conf *quic.Config) (quic.Connection, error) {
udpConn, err := net.ListenUDP("udp4", nil)
if err != nil {
return nil, err
}
udpAddr, err := net.ResolveUDPAddr("udp4", addr)
if err != nil {
return nil, err
}
return quic.Dial(context.Background(), udpConn, udpAddr, tlsConf, conf)
}
func outbound() {
listener, err := net.Listen("tcp", "0.0.0.0:15001")
if err != nil {
panic(err)
}
for {
conn, err := listener.Accept()
if err != nil {
panic(err)
}
log.Println("accepted outbound connection")
go proxyConn(conn, *remote, true)
}
}
func inbound() {
pc, err := listenUDP("0.0.0.0:15006")
if err != nil {
panic(err)
}
listener, err := quic.Listen(pc, generateTLSConfig(), &quic.Config{
EnableDatagrams: true,
})
if err != nil {
panic(err)
}
for {
conn, err := listener.Accept(context.Background())
if err != nil {
panic(err)
}
log.Println("accepted inbound connection")
_, port, _ := net.SplitHostPort(*remote)
stream, err := conn.AcceptStream(context.Background())
if err != nil {
panic(err)
}
go proxyConn(stream, "127.0.0.1:"+port, false)
}
}
func WaitSignal() {
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
<-sigs
}
func proxyConn(incoming io.ReadWriteCloser, dstIP string, useTLS bool) {
var outgoing io.ReadWriteCloser
if useTLS {
c, err := dialUDP(dstIP, generateTLSConfig(), &quic.Config{EnableDatagrams: true})
if err != nil {
panic(err)
}
stream, err := c.OpenStream()
if err != nil {
panic(err)
}
outgoing = stream
} else {
c, err := net.Dial("tcp", dstIP)
if err != nil {
panic(err)
}
outgoing = c
}
log.Println("connected to upstream")
t0 := time.Now()
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
n, err := io.Copy(incoming, outgoing)
log.Printf("upstream complete, wrote=%v, err=%v", n, err)
incoming.Close()
outgoing.Close()
wg.Done()
}()
n, err := io.Copy(outgoing, incoming)
log.Printf("downstream complete, wrote=%v, err=%v", n, err)
incoming.Close()
outgoing.Close()
wg.Wait()
log.Println("connection closed in ", time.Since(t0))
}
func generateTLSConfig() *tls.Config {
key, err := rsa.GenerateKey(rand.Reader, 1024)
if err != nil {
panic(err)
}
template := x509.Certificate{SerialNumber: big.NewInt(1)}
certDER, err := x509.CreateCertificate(rand.Reader, &template, &template, &key.PublicKey, key)
if err != nil {
panic(err)
}
keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
tlsCert, err := tls.X509KeyPair(certPEM, keyPEM)
if err != nil {
panic(err)
}
return &tls.Config{
Certificates: []tls.Certificate{tlsCert},
InsecureSkipVerify: true,
}
} |
@howardjohn Not sure how to use your code, but I successfully tested quic-go on a GCP c3 machine, establishing a QUIC connection to google.com. |
When I use quic echo, the same error still occurs, the information is as follows:
[root@Aliyun quic]# QUIC_GO_LOG_LEVEL=DEBUG go run main.go
2023/11/10 22:14:36 Increased receive buffer size to 4096 kiB
2023/11/10 22:14:36 Increased send buffer size to 4096 kiB
2023/11/10 22:14:36 Setting DF for IPv4 and IPv6.
2023/11/10 22:14:36 Activating reading of ECN bits for IPv4 and IPv6.
2023/11/10 22:14:36 Activating reading of packet info for IPv4 and IPv6.
2023/11/10 22:14:36 client Starting new connection to localhost ([::]:47781 -> 127.0.0.1:4242), source connection ID (empty), destination connection ID dedb29535cb7debdda558e64e544722190fdd626, version v1
2023/11/10 22:14:36 Adding connection ID (empty).
2023/11/10 22:14:36 client Not doing 0-RTT. Has sealer: false, has params: false
2023/11/10 22:14:36 client -> Sending packet 0 (1252 bytes) for connection dedb29535cb7debdda558e64e544722190fdd626, Initial
2023/11/10 22:14:36 client Long Header{Type: Initial, DestConnectionID: dedb29535cb7debdda558e64e544722190fdd626, SrcConnectionID: (empty), Token: (empty), PacketNumber: 0, PacketNumberLen: 2, Length: 1222, Version: v1}
2023/11/10 22:14:36 client -> &wire.CryptoFrame{Offset: 0, Data length: 279, Offset + Data length: 279}
2023/11/10 22:14:36 client Destroying connection with error: write udp [::]:47781->127.0.0.1:4242: sendmsg: invalid argument
2023/11/10 22:14:36 Removing connection ID (empty).
2023/11/10 22:14:36 client Connection dedb29535cb7debdda558e64e544722190fdd626 closed.
panic: INTERNAL_ERROR (local): write udp [::]:47781->127.0.0.1:4242: sendmsg: invalid argument
goroutine 1 [running]:
main.main()
/root/quic/main.go:29 +0x45
exit status 2
I added
[root@Aliyun quic]# QUIC_GO_DISABLE_GSO=true QUIC_GO_LOG_LEVEL=DEBUG go run main.go
2023/11/10 22:14:27 Increased receive buffer size to 4096 kiB
2023/11/10 22:14:27 Increased send buffer size to 4096 kiB
2023/11/10 22:14:27 Setting DF for IPv4 and IPv6.
2023/11/10 22:14:27 Activating reading of ECN bits for IPv4 and IPv6.
2023/11/10 22:14:27 Activating reading of packet info for IPv4 and IPv6.
2023/11/10 22:14:27 client Starting new connection to localhost ([::]:52004 -> 127.0.0.1:4242), source connection ID (empty), destination connection ID 5ee5129391926792fbf5e57e0efbcd75, version v1
2023/11/10 22:14:27 Adding connection ID (empty).
2023/11/10 22:14:27 client Not doing 0-RTT. Has sealer: false, has params: false
2023/11/10 22:14:27 client -> Sending packet 0 (1252 bytes) for connection 5ee5129391926792fbf5e57e0efbcd75, Initial
2023/11/10 22:14:27 client Long Header{Type: Initial, DestConnectionID: 5ee5129391926792fbf5e57e0efbcd75, SrcConnectionID: (empty), Token: (empty), PacketNumber: 0, PacketNumberLen: 2, Length: 1226, Version: v1}
2023/11/10 22:14:27 client -> &wire.CryptoFrame{Offset: 0, Data length: 273, Offset + Data length: 273}
2023/11/10 22:14:27 client Destroying connection with error: write udp [::]:52004->127.0.0.1:4242: sendmsg: invalid argument
2023/11/10 22:14:27 Removing connection ID (empty).
2023/11/10 22:14:27 client Connection 5ee5129391926792fbf5e57e0efbcd75 closed.
panic: INTERNAL_ERROR (local): write udp [::]:52004->127.0.0.1:4242: sendmsg: invalid argument
goroutine 1 [running]:
main.main()
/root/quic/main.go:29 +0x45
exit status 2
about my Linux:
[root@Aliyun quic]# uname -a
Linux Aliyun 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@Aliyun quic]# go version
go version go1.20 linux/amd64
[root@Aliyun quic]# cat go.mod
module gquic
go 1.20
require github.com/quic-go/quic-go v0.40.0
require (
github.com/go-task/slim-sprig v0.0.0-20230315185526-52ccab3ef572 // indirect
github.com/google/pprof v0.0.0-20210407192527-94a9f03dee38 // indirect
github.com/onsi/ginkgo/v2 v2.9.5 // indirect
github.com/quic-go/qtls-go1-20 v0.4.1 // indirect
go.uber.org/mock v0.3.0 // indirect
golang.org/x/crypto v0.4.0 // indirect
golang.org/x/exp v0.0.0-20221205204356-47842c84f3db // indirect
golang.org/x/mod v0.11.0 // indirect
golang.org/x/net v0.10.0 // indirect
golang.org/x/sys v0.8.0 // indirect
golang.org/x/tools v0.9.1 // indirect
) |
UDP GSO support was introduced in 4.18: https://kernelnewbies.org/Linux_4.18#Networking |
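A quick way to rule out kernels that predate UDP GSO is to compare the `uname -r` string against 4.18. The helper below is a hypothetical sketch (`kernelAtLeast` is not part of quic-go), and it is only a necessary condition: vendor kernels may backport features, and as this thread shows, newer kernels can still fail at send time.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a `uname -r` style release string is at
// least major.minor. It returns false if the string cannot be parsed.
func kernelAtLeast(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor part may carry a suffix such as "18-rc1"; keep the digits.
	end := 0
	for end < len(parts[1]) && parts[1][end] >= '0' && parts[1][end] <= '9' {
		end++
	}
	mnr, err := strconv.Atoi(parts[1][:end])
	if err != nil {
		return false
	}
	return maj > major || (maj == major && mnr >= minor)
}

func main() {
	// The release string reported above.
	fmt.Println(kernelAtLeast("3.10.0-957.21.3.el7.x86_64", 4, 18)) // prints false
}
```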
Sorry, I didn't debug at first and came directly to this issue. I found that it was not just GSO that caused this problem. Using |
@marten-seemann do you think the |
That would be my suspicion of |
It may cause issues on some OS-es. See: quic-go/quic-go#3911
I'm seeing a lot of
INTERNAL_ERROR (local): write udp [::]:443->[REDACTED]:42830: sendmsg: invalid argument
on the server side, causing the connection to be disconnected. They only happen when the clients are aarch64
(Raspberry Pi). Could this be related to the GSO change?