
GB10/Blackwell: training hangs at sample-tensor pre-upload (sample 0/N stalled, no progress) #103

@dndungu

Description


Symptom

On NVIDIA GB10 Blackwell (aarch64, unified memory), the GPU training path hangs silently and permanently during the bulk sample-tensor + graph-parameter pre-upload step. The operator observes:

  • Pod remains phase=running in Spark indefinitely — no log output after the pre-upload init line (see Stall location below).
  • nvidia-smi hangs on the host.
  • DELETE /api/v1/pods/{name} via the Spark HTTP API times out (>180 s); host-side podman is wedged. See the sample delete call after this list.
  • No error is returned; no panic; no CUDA error code surfaced.
  • The issue is reproducible across 6 independent attempts spanning ztensor v1.6.0 and earlier revisions (see What we've tried).
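
For reference, the delete call that wedges; the pod name comes from the manifest below, and --max-time makes curl itself give up rather than hang along with the host:

curl -X DELETE --max-time 180 \
  http://192.168.86.250:8080/api/v1/pods/wolf-train-crossasset-gpu-35087d91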

Hardware / Software

Field                Value
GPU                  NVIDIA GB10 Blackwell
Architecture         aarch64 (linux/arm64)
Memory model         Unified memory (no discrete VRAM)
ztensor              v1.6.0
Zerfoo               github.com/zerfoo/zerfoo v1.48.1-0.20260428224953-7539627aad66 (HEAD at time of hang)
Wolf                 commit 35087d91 (training image localhost:5000/wolf-train-crossasset:t98-35087d91)
CUDA graph capture   Enabled (default in ztensor v1.6.0; ztensor#93 resolved the prior capture hang)

Reproduction

Submit the following Spark Pod manifest to http://<dgx-host>:8080/api/v1/pods:

apiVersion: v1
kind: Pod
metadata:
  name: wolf-train-crossasset-gpu-35087d91
spec:
  priorityClassName: high
  restartPolicy: Never
  containers:
    - name: train
      image: localhost:5000/wolf-train-crossasset:t98-35087d91
      args:
        - -gpu
        - -bars=/data/experiments/coin-all-1m-bars.csv
        - -epochs=50
        - -folds=5
        - -output=/models/crossasset-gpu-results.json
        - -save=/models/coin-crossasset-gpu.zcam
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/cuda/lib64:/opt/zerfoo/lib
      resources:
        limits:
          nvidia.com/gpu: "1"
          cpu: "6"
          memory: "24Gi"
        requests:
          cpu: "6"
          memory: "16Gi"

Single-shot submit command:

curl -X POST http://192.168.86.250:8080/api/v1/pods \
  --data-binary @deploy/spark/train-crossasset-gpu.yaml \
  -H 'Content-Type: application/yaml'

The caller's trainWithResult logic (Wolf internal/crossasset/crossasset.go) calls uploader.UploadWeights(f32), where uploader is the ztensor GPU engine, passing 106,652 sample tensors plus ~50 autograd graph-parameter tensors. The call never returns.
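
Because the call blocks with no error, a watchdog around the upload would at least turn the silent hang into a visible one in pod logs. A minimal sketch in Go, assuming only the blocking upload call described above; every name here is illustrative, not Wolf's actual code:

package main

import (
	"log"
	"time"
)

// uploadWithWatchdog runs a blocking upload call in a goroutine and logs
// periodically while it has not returned. It cannot cancel a wedged CUDA
// call (nothing user-side can, per the symptom above), but it makes the
// stall visible in pod logs instead of silent.
func uploadWithWatchdog(upload func() error, every time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- upload() }()

	start := time.Now()
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			log.Printf("UploadWeights still blocked after %s", time.Since(start).Round(time.Second))
		}
	}
}

func main() {
	// Stand-in for uploader.UploadWeights(f32); on GB10 the real call never returns.
	blocked := func() error { select {} }
	_ = uploadWithWatchdog(blocked, 30*time.Second)
}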

Stall location

Last log line printed before the hang (timestamp 23:36:13, no further output after 36+ minutes):

trainWithResult: T4.2 pre-uploading 106652 sample tensors + 50 graph params to GPU

This is the log line immediately before uploader.UploadWeights(f32) is called. The call enters ztensor's GPUEngine.UploadWeights (ztensor/compute/gpu_engine.go) and never returns.

What we've tried (6 attempts, same symptom family)

  • Attempts #1–#5: various changes (CUDA graph disabled via ZERFOO_DISABLE_CUDA_GRAPH=1, different epoch/fold counts, different memory limits). Outcome: silent stall past "Using GPU engine"; hang before trainWithResult even started.
  • Attempt #6 (2026-04-28): ztensor v1.6.0 (reverted from v1.7.0 via PR #141), ztensor#93 CUDA-graph-capture fix in place, T9.12 softmax fix applied. Outcome: stall advanced one stage deeper, now inside trainWithResult at the UploadWeights call, but same fingerprint: 0 GPU progress, no error, host wedged.

Prior to attempt #6, attempts #1–#5 stalled earlier (before trainWithResult) due to a separate CUDA graph capture deadlock resolved by ztensor#93. That fix was necessary but not sufficient.

Workaround in use

None on the GPU path. Plan B is CPUEngine[float64] (Wolf T9.22), which bypasses UploadWeights entirely.
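
A rough sketch of what that selection looks like at the call site, with the ztensor backends stubbed locally. Only the names CPUEngine[float64] and GPUEngine.UploadWeights are confirmed by this issue; the Prepare interface and everything else below is illustrative:

package main

import (
	"flag"
	"log"
)

// engine abstracts the two ztensor backends named in this issue; the
// interface itself is an assumption for illustration.
type engine interface {
	Prepare(samples [][]float32) error // pre-training setup, if any
}

type gpuEngine struct{}

// Prepare is where the GB10 hang lives: the bulk UploadWeights-style copy.
func (gpuEngine) Prepare(samples [][]float32) error {
	log.Printf("pre-uploading %d sample tensors to GPU", len(samples))
	return nil // the real ztensor call blocks forever here on GB10
}

type cpuEngine[T float32 | float64] struct{}

// Prepare is a no-op on the CPU path: weights and samples stay in host
// memory, so the UploadWeights step is bypassed entirely (Wolf T9.22 Plan B).
func (cpuEngine[T]) Prepare(samples [][]float32) error { return nil }

func main() {
	useGPU := flag.Bool("gpu", false, "use the GPU engine (currently wedges on GB10)")
	flag.Parse()

	var e engine = cpuEngine[float64]{}
	if *useGPU {
		e = gpuEngine{}
	}
	if err := e.Prepare(make([][]float32, 106652)); err != nil {
		log.Fatal(err)
	}
}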

What we'd need to investigate further

  1. nsys trace — Per Wolf lore entry L-0008, nsys --trace=cuda itself deadlocks GB10 GPU training (CUPTI listener conflict with CUDA graph capture). If ztensor v1.6.0 CUDA graph capture is active during UploadWeights, nsys profiling may not be viable without first disabling capture for the upload phase.
  2. CUDA driver / runtime versions on GB10 — exact libcuda.so version and CUDA runtime version on the aarch64 host would help narrow whether this is a driver-level unified-memory page-migration stall or a CUDA API hang.
  3. GB10/Blackwell-specific UploadWeights constraints — any known restrictions on: maximum single-allocation size, tensor alignment requirements for unified-memory H2D bulk copies, or interaction between CUDA graph capture and large cudaMemcpy calls on GB10.
  4. Does disabling CUDA graph capture for the upload phase unblock it? — If UploadWeights is called while the CUDA graph capture context is active, the memcpy may be recorded rather than executed, causing the hang. A flag or API to temporarily exit capture mode during bulk upload would be a minimal fix to test; a capture-status probe is sketched after this list.
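
For item 4, the capture status can be probed directly against the CUDA runtime before the bulk copy, without touching ztensor internals. cudaStreamIsCapturing is a real CUDA runtime API; the rest of this cgo sketch is scaffolding, and a real probe would need the stream handle ztensor's engine actually uses:

package main

/*
#cgo LDFLAGS: -lcudart
#include <cuda_runtime.h>
*/
import "C"

import "fmt"

// streamIsCapturing reports whether a CUDA stream is inside a graph-capture
// region. If this returns true at the point UploadWeights runs, its copies
// would be recorded into the graph instead of executed, which matches the
// hang fingerprint above.
func streamIsCapturing(stream C.cudaStream_t) (bool, error) {
	var status C.enum_cudaStreamCaptureStatus
	if rc := C.cudaStreamIsCapturing(stream, &status); rc != C.cudaSuccess {
		return false, fmt.Errorf("cudaStreamIsCapturing: %s", C.GoString(C.cudaGetErrorString(rc)))
	}
	return status == C.cudaStreamCaptureStatusActive, nil
}

func main() {
	// Probe the legacy default stream (nil) as a smoke test; a real check
	// would probe the stream ztensor uses for the upload.
	capturing, err := streamIsCapturing(nil)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("default stream capturing:", capturing)
}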

Cross-references

  • Wolf PR #176 (https://github.com/feza-ai/wolf/pull/176): doc-only stall report for T9.8 attempt #6
  • Wolf lore L-0008: nsys --trace=cuda deadlocks Wolf GPU training on GB10 (CUPTI/CUDA-graph conflict)
  • Wolf ADR-062: PyTorch speed parity strategy (context for the GPU training goal)
  • ztensor#93: prior CUDA graph capture hang on GB10 (resolved, but insufficient for this stall)
  • ztensor PR #141: revert of ztensor v1.7.0 (separate deadlock)
