fix(compute): chunk bulkUploadF32 to stop wedging the GB10 driver (#106) #107
bulkUploadF32 consolidated ALL eligible f32 tensors into ONE device allocation + ONE H2D copy. At CrossAsset sample-upload scale (~213k tensors -> multi-GB) that single large cudaMalloc/cudaMemcpy wedges the GB10 (sm_121) driver in an uninterruptible ioctl: the worker thread stays in D-state, the container becomes unkillable, and podman rm / exec / logs all hang (this also drove the recurring orchestrator pod-leak).

Upload in bounded chunks instead: cap each device allocation + copy at bulkUploadF32MaxChunkBytes (64 MiB) / bulkUploadF32MaxChunkTensors (4096), appending each chunk buffer to bulkUploadBuffers. Preserves the few-round-trips win over the per-tensor path; GPU storage views are identical.

Chunk-boundary math is extracted to bulkUploadChunkRanges and unit-tested on CPU (tiling, both caps, lone-oversized tensor, and the 213k-count bound). Refs #106.
Built Wolf train-crossasset against this branch (4eaae4b, no vendoring; the chunked code is in the binary) and ran the matched repro (full COIN bars, 213,304-tensor pre-upload on GB10). It wedged identically: after reaching the upload, So capping each alloc/copy at 64 MiB / 4096 tensors did not prevent the wedge. The failure point is therefore not (only) the single large Holding this PR until the exact wedging cgo frame is captured (persisted /proc kernel-stack dump that doesn't depend on the wedged data-plane). The chunking is still a reasonable defensive bound, but it is not the fix on its own. Will update #106 with the pinned frame.
Add TestGPUEngine_UploadWeights_MultiChunk: uploads 256 MiB (256x1MiB tensors) so the bounded-chunk path issues 4 real 64 MiB device allocs + copies, proving a 64 MiB chunk does not wedge the GB10 driver and that cross-chunk GPUStorage views round-trip. Skips without CUDA. Refs #106.
TestGPUEngine_UploadWeights_MultiChunk PASSED on DGX GB10 (Spark pod ztensor-issue106-multichunk-guard-3c04539, exit-0 guard). 256 MiB uploaded as 4 bounded 64 MiB chunks, no driver wedge, cross-chunk views round-trip. Marks E2 done; adds the validation manifest. Refs #106.
What
Fixes #106.
bulkUploadF32 consolidated all eligible f32 tensors into one device
allocation + one H2D copy. At CrossAsset sample-upload scale (~213k
tensors → multi-GB) that single large cudaMalloc/cudaMemcpy wedges the
GB10 (sm_121) driver in an uninterruptible ioctl: the worker thread sits in
D-state, the container becomes unkillable, and podman exec/logs/rm all
hang (this also drove the recurring orchestrator pod-leak).
Change
Upload in bounded chunks: cap each device allocation + copy at
bulkUploadF32MaxChunkBytes (64 MiB) / bulkUploadF32MaxChunkTensors (4096),
appending each chunk buffer to bulkUploadBuffers. Preserves the
few-round-trips win over the per-tensor path; the resulting GPU storage views
are byte-identical to before.
Chunk-boundary math is extracted into a pure bulkUploadChunkRanges helper and
unit-tested on CPU (no GPU required): tiling (no gaps/overlaps), both caps, a
lone-oversized tensor getting its own range, and the 213k-count production case
splitting into bounded chunks.
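For reviewers who want the chunking rule in concrete form, here is a minimal sketch of a greedy range splitter with both caps. It is not the code in this PR: the names chunkRange and chunkRanges, the signature, and the half-open range convention are assumptions made for illustration, and the real bulkUploadChunkRanges may differ.

```go
package main

import "fmt"

// chunkRange is a half-open interval [Start, End) of tensor indices.
// Hypothetical type, not taken from the PR.
type chunkRange struct{ Start, End int }

// chunkRanges tiles sizes (bytes per tensor) into consecutive ranges so that
// each range holds at most maxTensors tensors and at most maxBytes bytes.
// A single tensor larger than maxBytes still gets a range of its own.
func chunkRanges(sizes []int64, maxBytes int64, maxTensors int) []chunkRange {
	var out []chunkRange
	start := 0
	var bytes int64
	for i, sz := range sizes {
		count := i - start
		// Close the current range before tensor i if adding it would
		// exceed either cap. Never close an empty range.
		if count > 0 && (bytes+sz > maxBytes || count >= maxTensors) {
			out = append(out, chunkRange{start, i})
			start, bytes = i, 0
		}
		bytes += sz
	}
	if start < len(sizes) {
		out = append(out, chunkRange{start, len(sizes)})
	}
	return out
}

func main() {
	const mib = 1 << 20
	// 256 tensors of 1 MiB each, as in the multi-chunk GPU test.
	sizes := make([]int64, 256)
	for i := range sizes {
		sizes[i] = mib
	}
	fmt.Println(len(chunkRanges(sizes, 64*mib, 4096))) // prints 4
}
```

Under this sketch, 213,304 small tensors are bounded by the tensor-count cap instead of the byte cap and split into 53 ranges of at most 4096 tensors each.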
Test
go test ./compute/ -run TestBulkUploadChunkRanges: green (CPU).
go build ./..., go vet ./compute/: clean.
TestGPUEngine_UploadWeights_BulkPath unchanged (small input still collapses to one chunk → one buffer).
is being run downstream; will post the result.
Notes
bulkUploadF32MaxChunkBytes is a var so a test (or caller) can force the
multi-chunk path with small inputs. Caps are conservative (a prior per-tensor
smoke at ~16k tensors worked; 64 MiB / 4096 leaves wide margin under the
observed wedge threshold).
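As an illustration of why the cap is a var, the sketch below shows one way a test could lower it temporarily and restore it afterwards. The names maxChunkBytes, withMaxChunkBytes, and countChunks are hypothetical stand-ins, not the PR's identifiers, and the splitter is simplified to the byte cap only.

```go
package main

import "fmt"

// maxChunkBytes stands in for a package-level cap such as
// bulkUploadF32MaxChunkBytes. Hypothetical name; default is 64 MiB.
var maxChunkBytes int64 = 64 << 20

// withMaxChunkBytes runs fn with the cap set to n, then restores the old
// value even if fn panics.
func withMaxChunkBytes(n int64, fn func()) {
	old := maxChunkBytes
	maxChunkBytes = n
	defer func() { maxChunkBytes = old }()
	fn()
}

// countChunks reports how many chunks a greedy byte-capped split produces
// for the given tensor sizes under the current maxChunkBytes.
func countChunks(sizes []int64) int {
	chunks := 0
	var bytes int64
	for _, sz := range sizes {
		if chunks == 0 || bytes+sz > maxChunkBytes {
			chunks++
			bytes = 0
		}
		bytes += sz
	}
	return chunks
}

func main() {
	sizes := []int64{1024, 1024, 1024, 1024} // four 1 KiB tensors

	fmt.Println(countChunks(sizes)) // prints 1 under the 64 MiB default

	withMaxChunkBytes(2048, func() {
		fmt.Println(countChunks(sizes)) // prints 2 with a 2 KiB cap
	})
}
```

Because the override mutates package state, tests that use a pattern like this should not run in parallel with other tests that read the same cap.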