Migrate internal/nccl from CGo to purego/dlopen #78

@dndungu

Summary

internal/nccl/nccl.go is one of the last remaining CGo holdouts in the GPU bindings layer. Per project convention (zerfoo CLAUDE.md): "No CGo by default. GPU bindings use purego/dlopen. Build tags (cuda, rocm, opencl) are optional and only used for CGo-based alternative paths."

CUDA runtime, cuBLAS, cuDNN, and rocBLAS already have full purego implementations. NCCL should follow the same pattern so that distributed training builds work without CGo and without the //go:build cuda tag.

Current state

  • internal/nccl/nccl.go (168 LOC): //go:build cuda, #cgo LDFLAGS: -lnccl, #include <nccl.h>. Pure C API surface — communicator lifecycle, AllReduce, Broadcast, UniqueID.
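  • An identical 168-line copy lives in github.com/zerfoo/zerfoo at internal/nccl/nccl.go. It should either be removed in favor of importing from ztensor (which means exposing the package outside internal/, since Go does not allow one module to import another module's internal packages) or be migrated in lockstep.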

Why this is a clean win

  • Pure C ABI — no C++, no inlined kernels, no header-only templates.
  • Static surface area: ~10 functions (GetUniqueID, CommInitRank, CommDestroy, AllReduce, Broadcast, GroupStart, GroupEnd, GetErrorString, plus enums).
  • Mirrors the existing cuBLAS purego pattern (internal/cublas/cublas_purego.go) — function pointer cache resolved via dlsym at init.
  • Removes the last reason distributed training requires -tags cuda.
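
As a rough illustration of the "function pointer cache resolved via dlsym at init" idea, here is a minimal sketch that uses only the standard library. It has not been checked against cublas_purego.go, so treat the shape as an assumption. The names dynLib, table, load, get, ErrUnavailable and IsUnavailable are hypothetical. In a real nccl_purego.go, Open would presumably wrap purego.Dlopen and Sym would be purego.Dlsym.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnavailable is a hypothetical sentinel meaning "libnccl could not be loaded".
var ErrUnavailable = errors.New("nccl: NCCL unavailable")

// IsUnavailable reports whether err wraps ErrUnavailable.
func IsUnavailable(err error) bool { return errors.Is(err, ErrUnavailable) }

// dynLib abstracts dlopen/dlsym so this sketch builds without purego.
// Assumption: the real code would plug purego.Dlopen and purego.Dlsym in here.
type dynLib struct {
	Open func(path string) (handle uintptr, err error)
	Sym  func(handle uintptr, name string) (addr uintptr, err error)
}

// entryPoints are the NCCL symbols named in this issue.
var entryPoints = []string{
	"ncclGetUniqueId", "ncclCommInitRank", "ncclCommDestroy",
	"ncclAllReduce", "ncclBroadcast",
	"ncclGroupStart", "ncclGroupEnd", "ncclGetErrorString",
}

// table caches the resolved address of every entry point.
type table struct {
	handle uintptr
	addrs  map[string]uintptr
}

// load opens the library and resolves every entry point. Any failure is
// reported as ErrUnavailable so callers can degrade gracefully.
func load(lib dynLib, path string) (*table, error) {
	h, err := lib.Open(path)
	if err != nil {
		return nil, fmt.Errorf("%w: dlopen %s: %v", ErrUnavailable, path, err)
	}
	t := &table{handle: h, addrs: make(map[string]uintptr, len(entryPoints))}
	for _, name := range entryPoints {
		addr, err := lib.Sym(h, name)
		if err != nil {
			return nil, fmt.Errorf("%w: dlsym %s: %v", ErrUnavailable, name, err)
		}
		t.addrs[name] = addr
	}
	return t, nil
}

var (
	loadOnce  sync.Once
	cached    *table
	cachedErr error
)

// get resolves the table on first use and returns the cached result afterwards.
func get(lib dynLib) (*table, error) {
	loadOnce.Do(func() { cached, cachedErr = load(lib, "libnccl.so.2") })
	return cached, cachedErr
}
```

Returning an error from get, rather than panicking in an init function, is one way to keep the package importable on hosts without NCCL installed.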

Proposed approach

  1. Add internal/nccl/nccl_purego.go modeled on cublas_purego.go:
    • dlopen libnccl.so.2 at init (Linux), graceful "NCCL unavailable" error if missing.
    • dlsym each entry point into a typed function pointer.
    • Replace C.ncclXxx constants with hardcoded Go constants (NCCL ABI is stable; values match nccl.h).
    • Marshal ncclUniqueId as a fixed-size byte array (128 bytes per NCCL ABI).
  2. Rename existing nccl.go to nccl_cgo.go and gate it behind //go:build cuda && cgo as the legacy path (or delete outright once purego is verified).
  3. Update zerfoo to either re-export from ztensor or apply the same migration.
  4. Tests: nccl_test.go already exists; ensure it runs without the cuda build tag on hosts where libnccl is present, and skips cleanly when absent.
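
The constant and ncclUniqueId points in step 1 could look roughly like the sketch below. The Go names (Result, RedOp, DataType, UniqueID, UniqueIDFromBytes) are hypothetical. The numeric values are the ones I understand nccl.h to define and should be checked against the header shipped with the targeted NCCL release. Newer releases add further values (for example an averaging reduction and bfloat16) that are left out here.

```go
package main

import "fmt"

// Result mirrors ncclResult_t (values assumed from nccl.h).
type Result int32

const (
	Success            Result = 0
	UnhandledCudaError Result = 1
	SystemError        Result = 2
	InternalError      Result = 3
	InvalidArgument    Result = 4
	InvalidUsage       Result = 5
)

// RedOp mirrors ncclRedOp_t.
type RedOp int32

const (
	Sum  RedOp = 0
	Prod RedOp = 1
	Max  RedOp = 2
	Min  RedOp = 3
)

// DataType mirrors ncclDataType_t.
type DataType int32

const (
	Int8    DataType = 0
	Uint8   DataType = 1
	Int32   DataType = 2
	Uint32  DataType = 3
	Int64   DataType = 4
	Uint64  DataType = 5
	Float16 DataType = 6
	Float32 DataType = 7
	Float64 DataType = 8
)

// UniqueIDBytes is the size of ncclUniqueId (NCCL_UNIQUE_ID_BYTES).
const UniqueIDBytes = 128

// UniqueID holds the opaque ncclUniqueId payload as a fixed-size array,
// so it has the same size as the C struct.
type UniqueID [UniqueIDBytes]byte

// UniqueIDFromBytes rebuilds an ID received from another rank, for example
// over a socket or a file, and rejects payloads of the wrong length.
func UniqueIDFromBytes(b []byte) (UniqueID, error) {
	var id UniqueID
	if len(b) != UniqueIDBytes {
		return id, fmt.Errorf("nccl: unique ID must be %d bytes, got %d", UniqueIDBytes, len(b))
	}
	copy(id[:], b)
	return id, nil
}

// Bytes returns a copy of the ID suitable for sending to other ranks.
func (id UniqueID) Bytes() []byte {
	out := make([]byte, UniqueIDBytes)
	copy(out, id[:])
	return out
}
```

Note that nccl.h declares ncclCommInitRank as taking the ncclUniqueId by value, so the binding has to pass all 128 bytes accordingly. That call is deliberately not shown here.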

Acceptance

  • go build ./... (no tags) compiles internal/nccl on Linux.
  • nccl_test.go passes on a multi-GPU host (DGX Spark) without -tags cuda.
  • Distributed training in zerfoo no longer requires the cuda build tag for the NCCL path.
  • Both ztensor and zerfoo copies are reconciled (single source of truth).
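
The test-related criteria above rely on nccl_test.go skipping, not failing, when libnccl is absent. One possible helper is sketched below. The names requireNCCL, testingT, recorder and ErrUnavailable are hypothetical, and the load argument stands in for whatever loader the purego file ends up exposing. testingT is a subset of the methods on *testing.T, so a real test can pass t directly. The recorder type exists only so the helper can be exercised outside go test.

```go
package main

import "errors"

// ErrUnavailable is a hypothetical sentinel returned when libnccl.so.2
// cannot be loaded.
var ErrUnavailable = errors.New("nccl: NCCL unavailable")

// testingT is the subset of *testing.T that the helper needs.
type testingT interface {
	Helper()
	Skipf(format string, args ...any)
	Fatalf(format string, args ...any)
}

// requireNCCL skips the calling test when NCCL is missing and fails it on
// any other load error, so a broken install is not silently ignored.
func requireNCCL(t testingT, load func() error) {
	t.Helper()
	err := load()
	switch {
	case err == nil:
		return
	case errors.Is(err, ErrUnavailable):
		t.Skipf("skipping NCCL test: %v", err)
	default:
		t.Fatalf("loading NCCL: %v", err)
	}
}

// recorder is a stand-in for *testing.T that records which path was taken.
type recorder struct{ skipped, failed bool }

func (r *recorder) Helper()               {}
func (r *recorder) Skipf(string, ...any)  { r.skipped = true }
func (r *recorder) Fatalf(string, ...any) { r.failed = true }
```

In a real test, calling requireNCCL(t, load) as the first statement keeps `go test ./...` green on hosts without NCCL while still running the collective tests on GPU hosts.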

Effort

~200 LOC, low risk. Pattern is well-established by cuBLAS/cuDNN/rocBLAS migrations.

Related

  • See internal/cublas/cublas_purego.go for the reference pattern.
  • Project convention: zerfoo CLAUDE.md — "No CGo by default".
