bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking

## Summary

`GPUEngine.bulkUploadF32` (compute/gpu_engine.go) wedges the **NVIDIA GB10
(sm_121, aarch64 unified-memory)** driver in an **uninterruptible D-state** when
asked to bulk-upload a large number of f32 tensors in one shot (~213k tensors
observed; historically reproduced >300k). The wedge is below the Go runtime: the
container's main OS thread is stuck in a CUDA driver ioctl that never returns and
**cannot be SIGKILLed**, which makes the whole container unkillable.

Reproduced 2026-06-05 from Wolf `train-crossasset` (CrossAsset multiscale model)
at the T4.2 sample pre-upload:
`UploadWeights -> bulkUploadF32(213304 tensors + 50 params)` never returns.

## Environment

- Device: NVIDIA GB10 (Grace-Blackwell, sm_121, aarch64, 128 GB unified memory)
- ztensor: v1.8.0 (repro image built from `tb1/upload-trace-instr` @ `5cf9cc3`,
  bulkUploadF32 logic identical to v1.8.0 release)
- Go cgo CUDA path, `ZERFOO_ENABLE_MANAGED_MEM` **unset** -> `e.managedMem=false`
  -> the non-managed branch: single `e.runtime.Malloc(total)` + single
  `e.runtime.Memcpy(devPtr, host, total, HostToDevice)` (gpu_engine.go:456 / :480).

## Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)

When the upload hangs:
- `podman exec` into the container, log streaming, and `podman rm` / pod DELETE
  **all wedge** (HTTP-layer timeouts), while the orchestrator control plane stays
  fully responsive.
- A Go futex/channel deadlock would NOT wedge `podman exec` — exec spawns a fresh
  process in the namespace, unaffected by a blocked goroutine.
- The Wolf-side heartbeat goroutine keeps ticking; only the main goroutine (in the
  cgo CUDA call) is stuck.

=> main thread is in **D-state in a CUDA driver ioctl**. SIGKILL cannot reap
D-state, so the container is unkillable. (On our Spark/podman orchestrator this
surfaces as a permanent "leaked running pod" that needs a host-level restart.)

## Root cause (current code)

`bulkUploadF32` consolidates **all** eligible f32 tensors into **one** device
allocation of `total` bytes and uploads it in **one** `Memcpy` (or, with managed
memory enabled, one `cudaMallocManaged(total)` + one host-copy):

```
// gpu_engine.go ~448-486
devPtr, err = e.runtime.Malloc(total)            // one giant cudaMalloc
...
host := make([]byte, total)                      // one giant staging buffer
for _, en := range eligible { copy(host[...], src) }
e.runtime.Memcpy(devPtr, &host[0], total, HostToDevice)  // one giant H2D
```

At the CrossAsset sample-upload scale (hundreds of thousands of tensors -> a
multi-GB single buffer) this single large alloc/copy wedges the GB10 driver. There
is no upper bound on `total` or on the per-call tensor count beyond
`bulkUploadF32MinTensors = 64` (a *lower* bound).

## Proposed fix

Chunk the bulk upload so no single `Malloc`/`Memcpy` exceeds a bounded size:

- Add a max-bytes-per-chunk (and/or max-tensors-per-chunk) cap.
- Iterate `eligible` in chunks; per chunk: one bounded `Malloc(chunkBytes)` +
  staging copy + `Memcpy`, then `SetStorage` views with chunk-local offsets, and
  append each chunk's devPtr to `bulkUploadBuffers`.
- Same for the managed-memory branch (bounded `cudaMallocManaged` per chunk).

This preserves the bulk-upload win (few large copies instead of per-tensor
uploads) while keeping every driver call under the GB10 wedge threshold. The
resulting GPU storage views are identical; existing `bulk_upload_test.go` coverage
should hold.

## Questions for ztensor maintainers

1. Is there a known GB10/sm_121 `cudaMalloc`/`cudaMemcpy` size threshold that
   wedges uninterruptibly under unified memory?
2. Preferred cap: bytes-based (e.g. 256 MB) or tensor-count-based? Happy to send a
   PR implementing the chunking once you confirm the cap shape.

## Cross-ref

Wolf devlog 2026-06-05 (T8.1), Wolf parity plan E8/T8.1. Wolf caller:
`internal/crossasset/crossasset.go` `trainWithResult` -> `UploadWeights`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking #106

Summary

Environment

Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)

Root cause (current code)

Proposed fix

Questions for ztensor maintainers

Cross-ref

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

bulkUploadF32 wedges GB10 driver (uninterruptible D-state) on large one-shot uploads (~213k tensors); needs chunking #106

Description

Summary

Environment

Evidence that it is an uninterruptible CUDA driver call (not a Go deadlock)

Root cause (current code)

Proposed fix

Questions for ztensor maintainers

Cross-ref

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions