Summary
internal/nccl/nccl.go is one of the last remaining CGo holdouts in the GPU bindings layer. Per project convention (zerfoo CLAUDE.md): "No CGo by default. GPU bindings use purego/dlopen. Build tags (cuda, rocm, opencl) are optional and only used for CGo-based alternative paths."
CUDA runtime, cuBLAS, cuDNN, and rocBLAS already have full purego implementations. NCCL should follow the same pattern so that distributed training builds work without CGo and without the //go:build cuda tag.
Current state
- internal/nccl/nccl.go (168 LOC): //go:build cuda, #cgo LDFLAGS: -lnccl, #include <nccl.h>. Pure C API surface: communicator lifecycle, AllReduce, Broadcast, UniqueID.
- An identical 168-line copy lives at github.com/zerfoo/zerfoo internal/nccl/nccl.go and should be removed in favor of importing from ztensor (or migrated in lockstep).
Why this is a clean win
- Pure C ABI — no C++, no inlined kernels, no header-only templates.
- Static surface area: ~10 functions (GetUniqueID, CommInitRank, CommDestroy, AllReduce, Broadcast, GroupStart, GroupEnd, GetErrorString, plus enums).
- Mirrors the existing cuBLAS purego pattern (internal/cublas/cublas_purego.go): function pointer cache resolved via dlsym at init.
- Removes the last reason distributed training requires -tags cuda.
Proposed approach
- Add internal/nccl/nccl_purego.go modeled on cublas_purego.go:
  - dlopen libnccl.so.2 at init (Linux), graceful "NCCL unavailable" error if missing.
  - dlsym each entry point into a typed function pointer.
  - Replace C.ncclXxx constants with hardcoded Go constants (NCCL ABI is stable; values match nccl.h).
  - Marshal ncclUniqueId as a fixed-size byte array (128 bytes per NCCL ABI).
- Rename existing nccl.go to nccl_cgo.go and gate it behind //go:build cuda && cgo as the legacy path (or delete outright once purego is verified).
- Update zerfoo to either re-export from ztensor or apply the same migration.
- Tests: nccl_test.go already exists; ensure it runs without the cuda build tag on hosts where libnccl is present, and skips cleanly when absent.
Acceptance
- go build ./... (no tags) compiles internal/nccl on Linux.
- nccl_test.go passes on a multi-GPU host (DGX Spark) without -tags cuda.
- Distributed training in zerfoo no longer requires the cuda build tag for the NCCL path.
- Both ztensor and zerfoo copies are reconciled (single source of truth).
Effort
~200 LOC, low risk. Pattern is well-established by cuBLAS/cuDNN/rocBLAS migrations.
Related
- See internal/cublas/cublas_purego.go for the reference pattern.
- Project convention: zerfoo CLAUDE.md ("No CGo by default").