feat(compute): bulk-upload F32 weights to one device buffer (#103) by dndungu · Pull Request #104 · zerfoo/ztensor

dndungu · 2026-04-29T03:45:18Z

Summary

Adds a bulk path in GPUEngine.UploadWeights that collapses many small per-tensor cudaMalloc + cudaMemcpy round-trips into a single allocation and a single H2D copy when the input slice exceeds 64 eligible F32 tensors.
The engine retains the bulk pointer in a new bulkUploadBuffers slice and frees it in Close. Per-tensor views are non-owning (NewGPUStorageViewFromPtr).
Capture-time uploads bypass the bulk path; the per-tensor route already records async ops as graph nodes, and capture-time inputs are small.

Fixes the GB10 driver wedge reported in #103 where Wolf's training pre-upload of ~106K sample tensors hangs the CUDA context (no progress, no error, nvidia-smi hangs, DELETE times out).

Test plan

go build ./compute/ clean
go vet ./compute/ clean
go test -count=1 ./... -- all packages green on the dev box (CUDA tests skip)
Run on DGX GB10 once Spark scheduler issue (High-priority GPU pod stays pending forever with zero events when another non-GPU pod is running feza-ai/spark#32) is unblocked: confirm Wolf training advances past trainWithResult: T4.2 pre-uploading 106652 sample tensors + 50 graph params to GPU
New tests: TestGPUEngine_UploadWeights_BulkPath (N=128, asserts single bulk buffer + GPU storage + data preservation) and TestGPUEngine_UploadWeights_BelowBulkThreshold (N=8, asserts per-tensor path retained)

Closes #103.

UploadWeights previously issued one cudaMalloc + cudaMemcpy per input tensor. Wolf's CrossAsset training pre-uploads ~106K sample tensors plus ~50 graph parameters in a single call, producing ~213K back-to-back synchronous driver round-trips. On GB10 Blackwell with unified memory, the driver's allocation-table lock contends with the default-stream queue and wedges the CUDA context: nvidia-smi hangs, podman cannot tear down the container, no error surfaces because cudaMalloc never returns. Add a bulk path that activates when >= 64 eligible F32 tensors are queued. It computes the total byte size, issues one allocWeight, copies all source data into the resulting buffer (one cudaMemcpy H2D, or one host-side memcpy on managed-memory hosts), and creates non-owning GPUStorage views per tensor. The engine retains the bulk pointer in bulkUploadBuffers and frees it in Close. Capture-time uploads bypass the bulk path: the per-tensor route already emits async ops as graph nodes, and capture-time tensor counts are small. Tests cover both the bulk activation (N=128) and the below-threshold fallback (N=8) on real CUDA. Both build and vet clean on aarch64 and amd64; the existing compute test suite is unchanged. Closes #103.

dndungu merged commit 9ca83f6 into main Apr 29, 2026
1 check passed

dndungu deleted the fix/issue103-bulk-upload branch April 29, 2026 03:55

dndungu mentioned this pull request Apr 29, 2026

GB10/Blackwell: training hangs at sample-tensor pre-upload (sample 0/N stalled, no progress) #103

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(compute): bulk-upload F32 weights to one device buffer (#103)#104

feat(compute): bulk-upload F32 weights to one device buffer (#103)#104
dndungu merged 1 commit intomainfrom
fix/issue103-bulk-upload

dndungu commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dndungu commented Apr 29, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant