Skip to content

bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking #106

@dndungu

Description

@dndungu

Summary

GPUEngine.bulkUploadF32 (compute/gpu_engine.go) wedges the NVIDIA GB10
(sm_121, aarch64 unified-memory)
driver in an uninterruptible D-state when
asked to bulk-upload a large number of f32 tensors in one shot (~213k tensors
observed; historically reproduced >300k). The wedge is below the Go runtime: the
container's main OS thread is stuck in a CUDA driver ioctl that never returns and
cannot be SIGKILLed, which makes the whole container unkillable.

Reproduced 2026-06-05 from Wolf train-crossasset (CrossAsset multiscale model)
at the T4.2 sample pre-upload:
UploadWeights -> bulkUploadF32(213304 tensors + 50 params) never returns.

Environment

  • Device: NVIDIA GB10 (Grace-Blackwell, sm_121, aarch64, 128 GB unified memory)
  • ztensor: v1.8.0 (repro image built from tb1/upload-trace-instr @ 5cf9cc3,
    bulkUploadF32 logic identical to v1.8.0 release)
  • Go cgo CUDA path, ZERFOO_ENABLE_MANAGED_MEM unset -> e.managedMem=false
    -> the non-managed branch: single e.runtime.Malloc(total) + single
    e.runtime.Memcpy(devPtr, host, total, HostToDevice) (gpu_engine.go:456 / :480).

Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)

When the upload hangs:

  • podman exec into the container, log streaming, and podman rm / pod DELETE
    all wedge (HTTP-layer timeouts), while the orchestrator control plane stays
    fully responsive.
  • A Go futex/channel deadlock would NOT wedge podman exec — exec spawns a fresh
    process in the namespace, unaffected by a blocked goroutine.
  • The Wolf-side heartbeat goroutine keeps ticking; only the main goroutine (in the
    cgo CUDA call) is stuck.

=> main thread is in D-state in a CUDA driver ioctl. SIGKILL cannot reap
D-state, so the container is unkillable. (On our Spark/podman orchestrator this
surfaces as a permanent "leaked running pod" that needs a host-level restart.)

Root cause (current code)

bulkUploadF32 consolidates all eligible f32 tensors into one device
allocation of total bytes and uploads it in one Memcpy (or, with managed
memory enabled, one cudaMallocManaged(total) + one host-copy):

// gpu_engine.go ~448-486
devPtr, err = e.runtime.Malloc(total)            // one giant cudaMalloc
...
host := make([]byte, total)                      // one giant staging buffer
for _, en := range eligible { copy(host[...], src) }
e.runtime.Memcpy(devPtr, &host[0], total, HostToDevice)  // one giant H2D

At the CrossAsset sample-upload scale (hundreds of thousands of tensors -> a
multi-GB single buffer) this single large alloc/copy wedges the GB10 driver. There
is no upper bound on total or on the per-call tensor count beyond
bulkUploadF32MinTensors = 64 (a lower bound).

Proposed fix

Chunk the bulk upload so no single Malloc/Memcpy exceeds a bounded size:

  • Add a max-bytes-per-chunk (and/or max-tensors-per-chunk) cap.
  • Iterate eligible in chunks; per chunk: one bounded Malloc(chunkBytes) +
    staging copy + Memcpy, then SetStorage views with chunk-local offsets, and
    append each chunk's devPtr to bulkUploadBuffers.
  • Same for the managed-memory branch (bounded cudaMallocManaged per chunk).

This preserves the bulk-upload win (few large copies instead of per-tensor
uploads) while keeping every driver call under the GB10 wedge threshold. The
resulting GPU storage views are identical; existing bulk_upload_test.go coverage
should hold.

Questions for ztensor maintainers

  1. Is there a known GB10/sm_121 cudaMalloc/cudaMemcpy size threshold that
    wedges uninterruptibly under unified memory?
  2. Preferred cap: bytes-based (e.g. 256 MB) or tensor-count-based? Happy to send a
    PR implementing the chunking once you confirm the cap shape.

Cross-ref

Wolf devlog 2026-06-05 (T8.1), Wolf parity plan E8/T8.1. Wolf caller:
internal/crossasset/crossasset.go trainWithResult -> UploadWeights.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions