Summary
GPUEngine.bulkUploadF32 (compute/gpu_engine.go) wedges the NVIDIA GB10
(sm_121, aarch64 unified-memory) driver in an uninterruptible D-state when
asked to bulk-upload a large number of f32 tensors in one shot (~213k tensors
observed; historically reproduced >300k). The wedge is below the Go runtime: the
container's main OS thread is stuck in a CUDA driver ioctl that never returns and
cannot be SIGKILLed, which makes the whole container unkillable.
Reproduced 2026-06-05 from Wolf train-crossasset (CrossAsset multiscale model)
at the T4.2 sample pre-upload:
UploadWeights -> bulkUploadF32(213304 tensors + 50 params) never returns.
Environment
- Device: NVIDIA GB10 (Grace-Blackwell, sm_121, aarch64, 128 GB unified memory)
- ztensor: v1.8.0 (repro image built from
tb1/upload-trace-instr @ 5cf9cc3,
bulkUploadF32 logic identical to v1.8.0 release)
- Go cgo CUDA path,
ZERFOO_ENABLE_MANAGED_MEM unset -> e.managedMem=false
-> the non-managed branch: single e.runtime.Malloc(total) + single
e.runtime.Memcpy(devPtr, host, total, HostToDevice) (gpu_engine.go:456 / :480).
Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)
When the upload hangs:
podman exec into the container, log streaming, and podman rm / pod DELETE
all wedge (HTTP-layer timeouts), while the orchestrator control plane stays
fully responsive.
- A Go futex/channel deadlock would NOT wedge
podman exec — exec spawns a fresh
process in the namespace, unaffected by a blocked goroutine.
- The Wolf-side heartbeat goroutine keeps ticking; only the main goroutine (in the
cgo CUDA call) is stuck.
=> main thread is in D-state in a CUDA driver ioctl. SIGKILL cannot reap
D-state, so the container is unkillable. (On our Spark/podman orchestrator this
surfaces as a permanent "leaked running pod" that needs a host-level restart.)
Root cause (current code)
bulkUploadF32 consolidates all eligible f32 tensors into one device
allocation of total bytes and uploads it in one Memcpy (or, with managed
memory enabled, one cudaMallocManaged(total) + one host-copy):
// gpu_engine.go ~448-486
devPtr, err = e.runtime.Malloc(total) // one giant cudaMalloc
...
host := make([]byte, total) // one giant staging buffer
for _, en := range eligible { copy(host[...], src) }
e.runtime.Memcpy(devPtr, &host[0], total, HostToDevice) // one giant H2D
At the CrossAsset sample-upload scale (hundreds of thousands of tensors -> a
multi-GB single buffer) this single large alloc/copy wedges the GB10 driver. There
is no upper bound on total or on the per-call tensor count beyond
bulkUploadF32MinTensors = 64 (a lower bound).
Proposed fix
Chunk the bulk upload so no single Malloc/Memcpy exceeds a bounded size:
- Add a max-bytes-per-chunk (and/or max-tensors-per-chunk) cap.
- Iterate
eligible in chunks; per chunk: one bounded Malloc(chunkBytes) +
staging copy + Memcpy, then SetStorage views with chunk-local offsets, and
append each chunk's devPtr to bulkUploadBuffers.
- Same for the managed-memory branch (bounded
cudaMallocManaged per chunk).
This preserves the bulk-upload win (few large copies instead of per-tensor
uploads) while keeping every driver call under the GB10 wedge threshold. The
resulting GPU storage views are identical; existing bulk_upload_test.go coverage
should hold.
Questions for ztensor maintainers
- Is there a known GB10/sm_121
cudaMalloc/cudaMemcpy size threshold that
wedges uninterruptibly under unified memory?
- Preferred cap: bytes-based (e.g. 256 MB) or tensor-count-based? Happy to send a
PR implementing the chunking once you confirm the cap shape.
Cross-ref
Wolf devlog 2026-06-05 (T8.1), Wolf parity plan E8/T8.1. Wolf caller:
internal/crossasset/crossasset.go trainWithResult -> UploadWeights.
Summary
GPUEngine.bulkUploadF32(compute/gpu_engine.go) wedges the NVIDIA GB10(sm_121, aarch64 unified-memory) driver in an uninterruptible D-state when
asked to bulk-upload a large number of f32 tensors in one shot (~213k tensors
observed; historically reproduced >300k). The wedge is below the Go runtime: the
container's main OS thread is stuck in a CUDA driver ioctl that never returns and
cannot be SIGKILLed, which makes the whole container unkillable.
Reproduced 2026-06-05 from Wolf
train-crossasset(CrossAsset multiscale model)at the T4.2 sample pre-upload:
UploadWeights -> bulkUploadF32(213304 tensors + 50 params)never returns.Environment
tb1/upload-trace-instr@5cf9cc3,bulkUploadF32 logic identical to v1.8.0 release)
ZERFOO_ENABLE_MANAGED_MEMunset ->e.managedMem=false-> the non-managed branch: single
e.runtime.Malloc(total)+ singlee.runtime.Memcpy(devPtr, host, total, HostToDevice)(gpu_engine.go:456 / :480).Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)
When the upload hangs:
podman execinto the container, log streaming, andpodman rm/ pod DELETEall wedge (HTTP-layer timeouts), while the orchestrator control plane stays
fully responsive.
podman exec— exec spawns a freshprocess in the namespace, unaffected by a blocked goroutine.
cgo CUDA call) is stuck.
=> main thread is in D-state in a CUDA driver ioctl. SIGKILL cannot reap
D-state, so the container is unkillable. (On our Spark/podman orchestrator this
surfaces as a permanent "leaked running pod" that needs a host-level restart.)
Root cause (current code)
bulkUploadF32consolidates all eligible f32 tensors into one deviceallocation of
totalbytes and uploads it in oneMemcpy(or, with managedmemory enabled, one
cudaMallocManaged(total)+ one host-copy):At the CrossAsset sample-upload scale (hundreds of thousands of tensors -> a
multi-GB single buffer) this single large alloc/copy wedges the GB10 driver. There
is no upper bound on
totalor on the per-call tensor count beyondbulkUploadF32MinTensors = 64(a lower bound).Proposed fix
Chunk the bulk upload so no single
Malloc/Memcpyexceeds a bounded size:eligiblein chunks; per chunk: one boundedMalloc(chunkBytes)+staging copy +
Memcpy, thenSetStorageviews with chunk-local offsets, andappend each chunk's devPtr to
bulkUploadBuffers.cudaMallocManagedper chunk).This preserves the bulk-upload win (few large copies instead of per-tensor
uploads) while keeping every driver call under the GB10 wedge threshold. The
resulting GPU storage views are identical; existing
bulk_upload_test.gocoverageshould hold.
Questions for ztensor maintainers
cudaMalloc/cudaMemcpysize threshold thatwedges uninterruptibly under unified memory?
PR implementing the chunking once you confirm the cap shape.
Cross-ref
Wolf devlog 2026-06-05 (T8.1), Wolf parity plan E8/T8.1. Wolf caller:
internal/crossasset/crossasset.gotrainWithResult->UploadWeights.