Symptom
On NVIDIA GB10 Blackwell (aarch64, unified memory), the GPU training path hangs silently and permanently during the bulk sample-tensor + graph-parameter pre-upload step. The operator observes:
Pod remains phase=running in Spark indefinitely — no log output after the pre-upload init line (see Stall location below).
nvidia-smi hangs on the host.
DELETE /api/v1/pods/{name} via the Spark HTTP API times out (>180 s) — host-side podman is wedged.
No error is returned; no panic; no CUDA error code surfaced.
The issue is reproducible across 6 independent attempts spanning ztensor v1.6.0 and earlier revisions (see What we've tried).
Hardware / Software
| Field | Value |
| --- | --- |
| GPU | NVIDIA GB10 Blackwell |
| Architecture | aarch64 (linux/arm64) |
| Memory model | Unified memory (no discrete VRAM) |
| ztensor | v1.6.0 |
| Zerfoo | github.com/zerfoo/zerfoo v1.48.1-0.20260428224953-7539627aad66 (HEAD at time of hang) |
| Wolf commit | 35087d91 (training image localhost:5000/wolf-train-crossasset:t98-35087d91) |

Reproduction
Submit the following Spark Pod manifest to http://<dgx-host>:8080/api/v1/pods:
Single-shot submit command:
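A minimal sketch of the submit step, assuming the manifest is JSON stored in a placeholder file pod.json and that a plain HTTP POST is all the Spark API needs (neither detail is taken from the original command):

```go
// Sketch only: pod.json, dgx-host, and the JSON content type are
// placeholders/assumptions, not the actual manifest or submit command.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	manifest, err := os.ReadFile("pod.json") // the Wolf training Pod manifest
	if err != nil {
		panic(err)
	}

	client := &http.Client{Timeout: 30 * time.Second}

	// Single-shot submit: POST the manifest to the Spark pods endpoint.
	resp, err := client.Post("http://dgx-host:8080/api/v1/pods",
		"application/json", bytes.NewReader(manifest))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("submit status:", resp.Status)
}
```

Once the hang occurs, the same API is what wedges: the DELETE /api/v1/pods/{name} call against the stuck pod is the one that times out (>180 s).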
The caller's trainWithResult logic (Wolf internal/crossasset/crossasset.go) calls uploader.UploadWeights(f32) where uploader is the ztensor GPU engine, passing 106,652 sample tensors plus ~50 autograd graph parameter tensors. The call never returns.
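For orientation only, a sketch of the call shape described above; apart from the name UploadWeights and the tensor counts, everything below (the stub tensor and gpuEngine types, the error-returning signature) is an assumption for illustration, not Wolf or ztensor code:

```go
// Stand-in types so the sketch compiles; the real types and signatures live
// in ztensor/zerfoo and are not reproduced here.
package main

import "fmt"

type tensor struct{ data []float32 }

type gpuEngine struct{}

// UploadWeights mirrors only the call shape from the report, not ztensor's API.
func (gpuEngine) UploadWeights(ts []*tensor) error { return nil }

func main() {
	uploader := gpuEngine{}

	samples := make([]*tensor, 106_652) // bulk sample tensors
	params := make([]*tensor, 50)       // ~50 autograd graph parameter tensors

	f32 := append(append([]*tensor{}, samples...), params...)

	// On GB10 this single bulk call never returns: no error, no panic,
	// no further log output.
	if err := uploader.UploadWeights(f32); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```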
Stall location
Last log line printed before the hang (timestamp 23:36:13, no further output after 36+ minutes):
This is the log line immediately before uploader.UploadWeights(f32) is called. The call enters ztensor's GPUEngine.UploadWeights (ztensor/compute/gpu_engine.go) and never returns.
What we've tried (6 attempts, same symptom family)
| Attempt(s) | Setup | Result |
| --- | --- | --- |
| #1–#5 | Various configurations (ZERFOO_DISABLE_CUDA_GRAPH=1, different epoch/fold counts, different memory limits) | Stalled before trainWithResult even started |
| #6 | ztensor v1.6.0 (reverted from v1.7.0 via PR #141), ztensor#93 CUDA-graph-capture fix in place, T9.12 softmax fix applied | Stall advanced one stage deeper — now inside trainWithResult at the UploadWeights call — but same fingerprint: 0 GPU progress, no error, host wedged |

Prior to attempt #6, attempts #1–#5 stalled earlier (before trainWithResult) due to a separate CUDA graph capture deadlock resolved by ztensor#93. That fix was necessary but not sufficient.
Workaround in use
None on the GPU path. Plan B is CPUEngine[float64] (Wolf T9.22), which bypasses UploadWeights entirely.
What we'd need to investigate further
nsys trace — Per Wolf lore entry L-0008, nsys --trace=cuda itself deadlocks GB10 GPU training (CUPTI listener conflict with CUDA graph capture). If ztensor v1.6.0 CUDA graph capture is active during UploadWeights, nsys profiling may not be viable without first disabling capture for the upload phase.
CUDA driver / runtime versions on GB10 — exact libcuda.so version and CUDA runtime version on the aarch64 host would help narrow whether this is a driver-level unified-memory page-migration stall or a CUDA API hang (a small version-query sketch follows after this list).
GB10/Blackwell-specific UploadWeights constraints — any known restrictions on: maximum single-allocation size, tensor alignment requirements for unified-memory H2D bulk copies, or interaction between CUDA graph capture and large cudaMemcpy calls on GB10.
Does disabling CUDA graph capture for the upload phase unblock it? — If UploadWeights is called while the CUDA graph capture context is active, the memcpy may be recorded rather than executed, causing the hang. A flag or API to temporarily exit capture mode during bulk upload would be a minimal fix to test.
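A rough sketch of that capture-pause idea, assuming a cgo binding against the CUDA runtime; the package name, helper function, signature, and include/library paths are placeholders rather than ztensor's actual code:

```go
// Hypothetical illustration, not ztensor code: end an active CUDA stream
// capture before a bulk host-to-device copy so the memcpy executes eagerly
// instead of being recorded into the graph. The #cgo paths are assumptions
// for a typical CUDA install.
package capturepause

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <stddef.h>
#include <cuda_runtime.h>

static cudaError_t memcpy_h2d_outside_capture(void *dst, const void *src,
                                              size_t n, cudaStream_t stream) {
    enum cudaStreamCaptureStatus st = cudaStreamCaptureStatusNone;
    cudaError_t rc = cudaStreamIsCapturing(stream, &st);
    if (rc != cudaSuccess) return rc;
    if (st == cudaStreamCaptureStatusActive) {
        cudaGraph_t g = NULL;
        // End the capture so the copy below runs immediately; the caller
        // must re-begin capture (and re-record the graph) afterwards.
        rc = cudaStreamEndCapture(stream, &g);
        if (rc != cudaSuccess) return rc;
    }
    return cudaMemcpy(dst, src, n, cudaMemcpyHostToDevice);
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// UploadOutsideCapture copies n bytes host-to-device after making sure the
// given stream is not actively capturing a CUDA graph. Name and signature
// are illustrative only.
func UploadOutsideCapture(dst, src unsafe.Pointer, n uintptr, stream unsafe.Pointer) error {
	rc := C.memcpy_h2d_outside_capture(dst, src, C.size_t(n), C.cudaStream_t(stream))
	if rc != C.cudaSuccess {
		return fmt.Errorf("CUDA error %d during pre-upload copy", int(rc))
	}
	return nil
}
```

ztensor would still need to re-begin capture and re-record the graph after the bulk upload, which is why the item above frames this as a flag to test rather than a drop-in fix.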
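For the driver/runtime-version question above, a small hypothetical helper (not part of Wolf or ztensor) that queries both versions via the CUDA runtime API; the package name and #cgo paths are assumptions:

```go
// Hypothetical helper: report the CUDA driver and runtime versions from
// inside the training container. The #cgo paths are assumptions.
package cudaversions

/*
#cgo CFLAGS: -I/usr/local/cuda/include
#cgo LDFLAGS: -L/usr/local/cuda/lib64 -lcudart
#include <cuda_runtime.h>
*/
import "C"

import "fmt"

// Report returns the driver and runtime versions in CUDA's encoded form
// (major*1000 + minor*10, e.g. 12040 for 12.4), or an error if a query fails.
func Report() (driver, runtime int, err error) {
	var d, r C.int
	if rc := C.cudaDriverGetVersion(&d); rc != C.cudaSuccess {
		return 0, 0, fmt.Errorf("cudaDriverGetVersion: error %d", int(rc))
	}
	if rc := C.cudaRuntimeGetVersion(&r); rc != C.cudaSuccess {
		return 0, 0, fmt.Errorf("cudaRuntimeGetVersion: error %d", int(rc))
	}
	return int(d), int(r), nil
}
```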
Cross-references
nsys --trace=cuda deadlocks Wolf GPU training on GB10 (CUPTI/CUDA-graph conflict)