
GB10/Blackwell: training hangs at sample-tensor pre-upload (sample 0/N stalled, no progress) #103

@dndungu

Description


Symptom

On NVIDIA GB10 Blackwell (aarch64, unified memory), the GPU training path hangs silently and permanently during the bulk sample-tensor + graph-parameter pre-upload step. The operator observes:

  • Pod remains phase=running in Spark indefinitely — no log output after the pre-upload init line (see Stall location below).
  • nvidia-smi hangs on the host.
  • DELETE /api/v1/pods/{name} via the Spark HTTP API times out (>180 s); host-side podman is wedged. See the sample delete call after this list.
  • No error is returned; no panic; no CUDA error code surfaced.
  • The issue is reproducible across 6 independent attempts spanning ztensor v1.6.0 and earlier revisions (see What we've tried).
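
For reference, the delete call that wedges; the pod name comes from the manifest below, and --max-time makes curl itself give up rather than hang along with the host:

curl -X DELETE --max-time 180 \
  http://192.168.86.250:8080/api/v1/pods/wolf-train-crossasset-gpu-35087d91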

Hardware / Software

Field                Value
GPU                  NVIDIA GB10 Blackwell
Architecture         aarch64 (linux/arm64)
Memory model         Unified memory (no discrete VRAM)
ztensor              v1.6.0
Zerfoo               github.com/zerfoo/zerfoo v1.48.1-0.20260428224953-7539627aad66 (HEAD at time of hang)
Wolf                 commit 35087d91 (training image localhost:5000/wolf-train-crossasset:t98-35087d91)
CUDA graph capture   Enabled (default in ztensor v1.6.0; ztensor#93 resolved the prior capture hang)

Reproduction

Submit the following Spark Pod manifest to http://<dgx-host>:8080/api/v1/pods:

apiVersion: v1
kind: Pod
metadata:
  name: wolf-train-crossasset-gpu-35087d91
spec:
  priorityClassName: high
  restartPolicy: Never
  containers:
    - name: train
      image: localhost:5000/wolf-train-crossasset:t98-35087d91
      args:
        - -gpu
        - -bars=/data/experiments/coin-all-1m-bars.csv
        - -epochs=50
        - -folds=5
        - -output=/models/crossasset-gpu-results.json
        - -save=/models/coin-crossasset-gpu.zcam
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/cuda/lib64:/opt/zerfoo/lib
      resources:
        limits:
          nvidia.com/gpu: "1"
          cpu: "6"
          memory: "24Gi"
        requests:
          cpu: "6"
          memory: "16Gi"

Single-shot submit command:

curl -X POST http://192.168.86.250:8080/api/v1/pods \
  --data-binary @deploy/spark/train-crossasset-gpu.yaml \
  -H 'Content-Type: application/yaml'

The caller's trainWithResult logic (Wolf internal/crossasset/crossasset.go) calls uploader.UploadWeights(f32), where uploader is the ztensor GPU engine, passing 106,652 sample tensors plus ~50 autograd graph-parameter tensors. The call never returns.
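
Because the call blocks with no error, a watchdog around the upload would at least turn the silent hang into a visible one in pod logs. A minimal sketch in Go, assuming only the blocking upload call described above; every name here is illustrative, not Wolf's actual code:

package main

import (
	"log"
	"time"
)

// uploadWithWatchdog runs a blocking upload call in a goroutine and logs
// periodically while it has not returned. It cannot cancel a wedged CUDA
// call (nothing user-side can, per the symptom above), but it makes the
// stall visible in pod logs instead of silent.
func uploadWithWatchdog(upload func() error, every time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- upload() }()

	start := time.Now()
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			log.Printf("UploadWeights still blocked after %s", time.Since(start).Round(time.Second))
		}
	}
}

func main() {
	// Stand-in for uploader.UploadWeights(f32); on GB10 the real call never returns.
	blocked := func() error { select {} }
	_ = uploadWithWatchdog(blocked, 30*time.Second)
}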

Stall location

Last log line printed before the hang (timestamp 23:36:13, no further output after 36+ minutes):

trainWithResult: T4.2 pre-uploading 106652 sample tensors + 50 graph params to GPU

This is the log line immediately before uploader.UploadWeights(f32) is called. The call enters ztensor's GPUEngine.UploadWeights (ztensor/compute/gpu_engine.go) and never returns.

What we've tried (6 attempts, same symptom family)

  • Attempts #1–#5: various changes (CUDA graph disabled via ZERFOO_DISABLE_CUDA_GRAPH=1, different epoch/fold counts, different memory limits). Outcome: silent stall past "Using GPU engine"; hang before trainWithResult even started.
  • Attempt #6 (2026-04-28): ztensor v1.6.0 (reverted from v1.7.0 via PR #141), ztensor#93 CUDA-graph-capture fix in place, T9.12 softmax fix applied. Outcome: stall advanced one stage deeper, now inside trainWithResult at the UploadWeights call, but same fingerprint: 0 GPU progress, no error, host wedged.

Prior to attempt #6, attempts #1–#5 stalled earlier (before trainWithResult) due to a separate CUDA graph capture deadlock resolved by ztensor#93. That fix was necessary but not sufficient.

Workaround in use

None on the GPU path. Plan B is CPUEngine[float64] (Wolf T9.22), which bypasses UploadWeights entirely.
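
A rough sketch of what that selection looks like at the call site, with the ztensor backends stubbed locally. Only the names CPUEngine[float64] and GPUEngine.UploadWeights are confirmed by this issue; the Prepare interface and everything else below is illustrative:

package main

import (
	"flag"
	"log"
)

// engine abstracts the two ztensor backends named in this issue; the
// interface itself is an assumption for illustration.
type engine interface {
	Prepare(samples [][]float32) error // pre-training setup, if any
}

type gpuEngine struct{}

// Prepare is where the GB10 hang lives: the bulk UploadWeights-style copy.
func (gpuEngine) Prepare(samples [][]float32) error {
	log.Printf("pre-uploading %d sample tensors to GPU", len(samples))
	return nil // the real ztensor call blocks forever here on GB10
}

type cpuEngine[T float32 | float64] struct{}

// Prepare is a no-op on the CPU path: weights and samples stay in host
// memory, so the UploadWeights step is bypassed entirely (Wolf T9.22 Plan B).
func (cpuEngine[T]) Prepare(samples [][]float32) error { return nil }

func main() {
	useGPU := flag.Bool("gpu", false, "use the GPU engine (currently wedges on GB10)")
	flag.Parse()

	var e engine = cpuEngine[float64]{}
	if *useGPU {
		e = gpuEngine{}
	}
	if err := e.Prepare(make([][]float32, 106652)); err != nil {
		log.Fatal(err)
	}
}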

What we'd need to investigate further

  1. nsys trace — Per Wolf lore entry L-0008, nsys --trace=cuda itself deadlocks GB10 GPU training (CUPTI listener conflict with CUDA graph capture). If ztensor v1.6.0 CUDA graph capture is active during UploadWeights, nsys profiling may not be viable without first disabling capture for the upload phase.
  2. CUDA driver / runtime versions on GB10 — exact libcuda.so version and CUDA runtime version on the aarch64 host would help narrow whether this is a driver-level unified-memory page-migration stall or a CUDA API hang.
  3. GB10/Blackwell-specific UploadWeights constraints — any known restrictions on: maximum single-allocation size, tensor alignment requirements for unified-memory H2D bulk copies, or interaction between CUDA graph capture and large cudaMemcpy calls on GB10.
  4. Does disabling CUDA graph capture for the upload phase unblock it? — If UploadWeights is called while the CUDA graph capture context is active, the memcpy may be recorded rather than executed, causing the hang. A flag or API to temporarily exit capture mode during bulk upload would be a minimal fix to test; a capture-status probe is sketched after this list.
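
For item 4, the capture status can be probed directly against the CUDA runtime before the bulk copy, without touching ztensor internals. cudaStreamIsCapturing is a real CUDA runtime API; the rest of this cgo sketch is scaffolding, and a real probe would need the stream handle ztensor's engine actually uses:

package main

/*
#cgo LDFLAGS: -lcudart
#include <cuda_runtime.h>
*/
import "C"

import "fmt"

// streamIsCapturing reports whether a CUDA stream is inside a graph-capture
// region. If this returns true at the point UploadWeights runs, its copies
// would be recorded into the graph instead of executed, which matches the
// hang fingerprint above.
func streamIsCapturing(stream C.cudaStream_t) (bool, error) {
	var status C.enum_cudaStreamCaptureStatus
	if rc := C.cudaStreamIsCapturing(stream, &status); rc != C.cudaSuccess {
		return false, fmt.Errorf("cudaStreamIsCapturing: %s", C.GoString(C.cudaGetErrorString(rc)))
	}
	return status == C.cudaStreamCaptureStatusActive, nil
}

func main() {
	// Probe the legacy default stream (nil) as a smoke test; a real check
	// would probe the stream ztensor uses for the upload.
	capturing, err := streamIsCapturing(nil)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("default stream capturing:", capturing)
}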

Cross-references

  • Wolf PR #176 (https://github.com/feza-ai/wolf/pull/176): doc-only stall report for T9.8 attempt #6
  • Wolf lore L-0008: nsys --trace=cuda deadlocks Wolf GPU training on GB10 (CUPTI/CUDA-graph conflict)
  • Wolf ADR-062: PyTorch speed parity strategy (context for the GPU training goal)
  • ztensor#93: prior CUDA graph capture hang on GB10 (resolved, but insufficient for this stall)
  • ztensor PR #141: revert of ztensor v1.7.0 (separate deadlock)
