Skip to content

runtime, x/sys/unix: Connectx is broken on darwin/amd64 #71302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
database64128 opened this issue Jan 17, 2025 · 20 comments
Closed

runtime, x/sys/unix: Connectx is broken on darwin/amd64 #71302

database64128 opened this issue Jan 17, 2025 · 20 comments
Labels
arch-amd64 BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Milestone

Comments

@database64128
Copy link
Contributor

database64128 commented Jan 17, 2025

Go version

go version go1.23.5 darwin/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/database64128/Library/Caches/go-build'
GOENV='/Users/database64128/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/database64128/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/database64128/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.23.5/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.5/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.5'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/database64128/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/xd/v25clzgj08d6_cbbp7lwszfr0000gn/T/go-build880517508=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

@wwqgtxx reports that x/sys/unix.Connectx is broken on darwin/amd64. The call succeeds, but does not return the correct number of bytes sent, and may cause memory corruption (fatal error: stack not a power of 2).

They were able to work around the issue by making the following minimal change:

diff --git a/sys/unix/zsyscall_darwin_amd64.go b/sys/unix/zsyscall_darwin_amd64.go
index 24b346e..f8cc117 100644
--- a/sys/unix/zsyscall_darwin_amd64.go
+++ b/sys/unix/zsyscall_darwin_amd64.go
@@ -848,7 +848,7 @@ func connectx(fd int, endpoints *SaEndpoints, associd SaeAssocID, flags uint32,
 	} else {
 		_p0 = unsafe.Pointer(&_zero)
 	}
-	_, _, e1 := syscall_syscall9(libc_connectx_trampoline_addr, uintptr(fd), uintptr(unsafe.Pointer(endpoints)), uintptr(associd), uintptr(flags), uintptr(_p0), uintptr(len(iov)), uintptr(unsafe.Pointer(n)), uintptr(unsafe.Pointer(connid)), 0)
+	_, _, e1 := Syscall9(SYS_CONNECTX, uintptr(fd), uintptr(unsafe.Pointer(endpoints)), uintptr(associd), uintptr(flags), uintptr(_p0), uintptr(len(iov)), uintptr(unsafe.Pointer(n)), uintptr(unsafe.Pointer(connid)), 0)
 	if e1 != 0 {
 		err = errnoErr(e1)
 	}

Connectx was added by me in golang/sys@59665e5 (CL 606155). It does not have any issues on darwin/arm64. We spent hours on this and were not able to pinpoint the exact cause.

What did you see happen?

Connectx is broken on darwin/amd64.

What did you expect to see?

Connectx works on darwin/amd64.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 17, 2025
@gopherbot gopherbot added this to the Unreleased milestone Jan 17, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Jan 17, 2025
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 17, 2025
@mknyszek
Copy link
Contributor

CC @golang/runtime

@mknyszek
Copy link
Contributor

Switching to the underlying syscall instead of libc suggests we're probably holding Darwin's libc wrong? Michael P notes maybe stack corruption from a mistake in how it's called? macOS requires us to call through libc for everything, so I'm not sure we should switch to making the underlying syscall directly.

@randall77
Copy link
Contributor

I think syscall9 is incorrect. The return values might not be plumbed back correctly. Try this patch:

diff --git a/src/runtime/sys_darwin.go b/src/runtime/sys_darwin.go
index 5c769a71ea..af10c53659 100644
--- a/src/runtime/sys_darwin.go
+++ b/src/runtime/sys_darwin.go
@@ -72,10 +72,11 @@ func syscall6()
 //go:nosplit
 //go:cgo_unsafe_args
 func syscall_syscall9(fn, a1, a2, a3, a4, a5, a6, a7, a8, a9 uintptr) (r1, r2, err uintptr) {
+       args := struct{ fn, a1, a2, a3, a4, a5, a6, a7, a8, a9, r1, r2, err uintptr }{fn, a1, a2, a3, a4, a5, a6, a7, a8, a9, 0, 0, 0}
        entersyscall()
-       libcCall(unsafe.Pointer(abi.FuncPCABI0(syscall9)), unsafe.Pointer(&fn))
+       libcCall(unsafe.Pointer(abi.FuncPCABI0(syscall9)), unsafe.Pointer(&args))
        exitsyscall()
-       return
+       return args.r1, args.r2, args.err
 }
 func syscall9()

@wwqgtxx
Copy link

wwqgtxx commented Feb 22, 2025

I tested the patch provided by @randall77 and the result was still the same. It seems that the problem may not be caused by incorrect parsing of the return value.

@ianlancetaylor
Copy link
Member

@randall77 I don't think there is a problem with syscall_syscall9. I think the problem you are pointing out is addressed by the use of //go:cgo_unsafe_args. I don't know why syscall_syscall6 doesn't use //go:cgo_unsafe_args, but I do see that syscall_syscall9 was added later (in https://go.dev/cl/446178).

What I wonder about instead is the assembler function runtime·syscall9 in runtime/sys_darwin_amd64.s. That function assumes that if we have 9 parameters, the last three will be passed in R10, R11, R12. But here we are using the calling convention for a C function, not the convention for a SYSCALL instruction. At least on Linux, when calling a C function with 9 parameters, the last three are not passed in R10, R11, R12; they are passed on the stack. For example, see page 20 of https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf.

As far as I know macOS is the same, though perhaps somebody could confirm that.

If that is the case then we need to change runtime·syscall9 to pass the last three arguments on the stack. We'll have to assume that they are all integer-shaped, which is likely for the cases where this is used.

Currently I believe that the only function that uses runtime·syscall9 other than Connectx is Getnameinfo in the internal/syscall/unix package. And the only argument that may be being passed incorrectly is the flags argument. That function is only called to do a reverse DNS name lookup. The function will mostly do the right thing regardless of the flag value. It may be that nobody has noticed any problem here.

@database64128
Copy link
Contributor Author

I don't know why syscall_syscall6 doesn't use //go:cgo_unsafe_args, but I do see that syscall_syscall9 was added later (in https://go.dev/cl/446178).

The use of //go:cgo_unsafe_args was removed in 57e3809 (CL 386719). CL 446178 was opened months after CL 386719 was merged, but somehow didn't pick up the changes.

@database64128
Copy link
Contributor Author

If that is the case then we need to change runtime·syscall9 to pass the last three arguments on the stack. We'll have to assume that they are all integer-shaped, which is likely for the cases where this is used.

Any chance we can get a fix for this? Thanks.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/665435 mentions this issue: runtime: fix 9-arg syscall on darwin/amd64

@randall77
Copy link
Contributor

Possible fixed listed above. Could someone try it?

@wwqgtxx
Copy link

wwqgtxx commented Apr 14, 2025

I tested https://go.googlesource.com/go/+/87973aae0be01e1be75ccfa60dc4aa1e539400c8 on golang1.24.2 and found that it still cannot be used normally. Moreover, it is different from the previous data error and panic. The current phenomenon seems that Connectx returns success but is not actually executed.

@randall77
Copy link
Contributor

I found a bug in my CL and pushed a fix. Could you try that?
(The bug was only on the error return path, so maybe there's something else wrong...)

@database64128
Copy link
Contributor Author

I found a bug in my CL and pushed a fix. Could you try that? (The bug was only on the error return path, so maybe there's something else wrong...)

Thank you for the quick fix! I think this is it. In the tests @wwqgtxx ran, connectx(2) has a high chance of returning -EINPROGRESS, which was not properly propagated back to the caller with the previous patch. This explains the -ENOTCONN error returned by subsequent writes.

Still, let's wait until @wwqgtxx has the new results, just to be sure.

@wwqgtxx
Copy link

wwqgtxx commented Apr 15, 2025

I found a bug in my CL and pushed a fix. Could you try that? (The bug was only on the error return path, so maybe there's something else wrong...)

Thanks for your help. After applying https://go.googlesource.com/go/+/a15345edabef8f40ae46e4e8117bd830ec0fb059, Connectx behaves normally and all tfo-go tests can pass on darwin amd64 platform.

@randall77
Copy link
Contributor

Ok, that's good. I'll get that CL in soon.

I think the remaining question is, should we backport?
In favor:

  1. This could cause really weird behavior and/or corruption on darwin/amd64. There is only one stdlib call affected (syscall/Getnameinfo), but several in x.sys (Connectx for freebsd, mmap for netbsd, dragonfly).
  2. darwin/amd64 is a first-class port.

In opposition:

  1. This has always been broken (since introduced in late 2022). It is not a regression from 1.24.
  2. Only the Connectx failure has been noticed.

I'm leaning towards backporting. The fix is pretty simple and the unfixed behavior is very hard to debug and/or work around.

@gopherbot please open backport issues

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #73379 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@database64128
Copy link
Contributor Author

database64128 commented Apr 15, 2025

Should we also open a CL based on #71302 (comment), not for this issue, but like 57e3809 (CL 386719) did for #51247?

@randall77
Copy link
Contributor

randall77 commented Apr 15, 2025

I was incorrect on that referenced comment.

Maybe we should fix syscall9 like we did in CL 386719, but I can't make it fail at the moment. This code

package main

import "net"

func main() {
	_, _ = net.LookupAddr("8.8.8.8")
}

Compiles just fine with -race and -gcflags=-N -l on darwin/arm64. So maybe that bug is not triggerable with just this.

It would be good to be consistent between the different syscall functions, though.

@cherrymui

@cherrymui
Copy link
Member

Yeah, perhaps this particular code path doesn't use as much stack space. It may be still good to be consistent. If we apply the patch as #71302 (comment) , we should also remove the cgo_unsafe_arg pragma.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/665695 mentions this issue: runtime: don't use cgo_unsafe_args for syscall9 wrapper

gopherbot pushed a commit that referenced this issue Apr 16, 2025
It uses less stack space this way.

Similar to CL 386719
Update #71302

Change-Id: I585bde5f681a90a6900cbd326994ab8a122fd148
Reviewed-on: https://go-review.googlesource.com/c/go/+/665695
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Apr 16, 2025
@prattmic prattmic added the Critical A critical problem that affects the availability or correctness of production systems built using Go label Apr 16, 2025
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 16, 2025
@dmitshur dmitshur changed the title x/sys/unix: Connectx is broken on darwin/amd64 runtime, x/sys/unix: Connectx is broken on darwin/amd64 Apr 16, 2025
@dmitshur dmitshur modified the milestones: Unreleased, Go1.25 Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-amd64 BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin
Projects
Development

No branches or pull requests

10 participants