Migrate internal/nccl from CGo to purego/dlopen

## Summary

`internal/nccl/nccl.go` is one of the last remaining CGo holdouts in the GPU bindings layer. Per project convention (zerfoo CLAUDE.md): *"No CGo by default. GPU bindings use purego/dlopen. Build tags (cuda, rocm, opencl) are optional and only used for CGo-based alternative paths."*

CUDA runtime, cuBLAS, cuDNN, and rocBLAS already have full purego implementations. NCCL should follow the same pattern so that distributed training builds work without CGo and without the `//go:build cuda` tag.

## Current state

- `internal/nccl/nccl.go` (168 LOC): `//go:build cuda`, `#cgo LDFLAGS: -lnccl`, `#include <nccl.h>`. Pure C API surface — communicator lifecycle, AllReduce, Broadcast, UniqueID.
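- An identical 168-line copy lives at `github.com/zerfoo/zerfoo` `internal/nccl/nccl.go`. It should be removed in favor of a single ztensor implementation, or migrated in lockstep. Go does not allow zerfoo to import a ztensor `internal/` package directly, so sharing the code means exposing it through a non-internal ztensor package.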

## Why this is a clean win

- Pure C ABI — no C++, no inlined kernels, no header-only templates.
- Static surface area: ~10 entry points (GetUniqueID, CommInitRank, CommDestroy, AllReduce, Broadcast, GroupStart, GroupEnd, GetErrorString), plus the enum constants they take.
- Mirrors the existing cuBLAS purego pattern (`internal/cublas/cublas_purego.go`) — function pointer cache resolved via dlsym at init.
- Removes the last reason distributed training requires `-tags cuda`.

## Proposed approach

1. Add `internal/nccl/nccl_purego.go` modeled on `cublas_purego.go`:
   - dlopen `libnccl.so.2` at init (Linux); return a graceful "NCCL unavailable" error if it is missing.
   - dlsym each entry point into a typed function pointer.
   - Replace `C.ncclXxx` constants with hardcoded Go constants (NCCL ABI is stable; values match `nccl.h`).
   - Marshal `ncclUniqueId` as a fixed-size byte array (128 bytes per NCCL ABI).
2. Rename existing `nccl.go` to `nccl_cgo.go` and gate it behind `//go:build cuda && cgo` as the legacy path (or delete outright once purego is verified).
3. Update zerfoo to either re-export from ztensor or apply the same migration.
4. Tests: `nccl_test.go` already exists; ensure it runs without the `cuda` build tag on hosts where libnccl is present, and skips cleanly when absent.
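
The constant and `ncclUniqueId` items in step 1 might look something like the sketch below. The type and function names (`UniqueID`, `UniqueIDFromBytes`, `DataType`, `RedOp`, `Result`) are illustrative assumptions, not the existing package API. The enum values are the ones I understand `nccl.h` to define and should be checked against the header for the targeted NCCL release before being relied on.

```go
package main

import "fmt"

// UniqueIDBytes mirrors NCCL_UNIQUE_ID_BYTES in nccl.h.
const UniqueIDBytes = 128

// UniqueID has the same size as ncclUniqueId, an opaque 128-byte blob.
type UniqueID [UniqueIDBytes]byte

// UniqueIDFromBytes rebuilds an ID received from another rank
// (for example over a socket), rejecting payloads of the wrong length.
func UniqueIDFromBytes(b []byte) (UniqueID, error) {
	var id UniqueID
	if len(b) != UniqueIDBytes {
		return id, fmt.Errorf("nccl: unique ID must be %d bytes, got %d", UniqueIDBytes, len(b))
	}
	copy(id[:], b)
	return id, nil
}

// Result mirrors ncclResult_t; only the success value is shown.
type Result int32

const Success Result = 0 // ncclSuccess

// DataType mirrors a subset of ncclDataType_t (assumed values, verify against nccl.h).
type DataType int32

const (
	Int32   DataType = 2 // ncclInt32
	Int64   DataType = 4 // ncclInt64
	Float16 DataType = 6 // ncclFloat16
	Float32 DataType = 7 // ncclFloat32
	Float64 DataType = 8 // ncclFloat64
)

// RedOp mirrors a subset of ncclRedOp_t (assumed values, verify against nccl.h).
type RedOp int32

const (
	Sum  RedOp = 0 // ncclSum
	Prod RedOp = 1 // ncclProd
	Max  RedOp = 2 // ncclMax
	Min  RedOp = 3 // ncclMin
)

func main() {
	raw := make([]byte, UniqueIDBytes)
	raw[0] = 0xAB
	id, err := UniqueIDFromBytes(raw)
	fmt.Println(len(id), id[0] == 0xAB, err)

	// A truncated payload is rejected rather than silently zero-padded.
	_, err = UniqueIDFromBytes(raw[:16])
	fmt.Println(err)
}
```

Using a fixed-size array rather than a slice keeps the Go value the same size as the C struct, so its address can be handed to the native call without copying into a separate buffer.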

## Acceptance

- `go build ./...` (no tags) compiles `internal/nccl` on Linux.
- `nccl_test.go` passes on a multi-GPU host (DGX Spark) without `-tags cuda`.
- Distributed training in zerfoo no longer requires the `cuda` build tag for the NCCL path.
- Both ztensor and zerfoo copies are reconciled (single source of truth).

## Effort

~200 LOC, low risk. Pattern is well-established by cuBLAS/cuDNN/rocBLAS migrations.

## Related

- See `internal/cublas/cublas_purego.go` for the reference pattern.
- Project convention: zerfoo `CLAUDE.md` — "No CGo by default".
