[DO NOT MERGE] ci: instrument HF download paths to diagnose stalls by rascani · Pull Request #19352 · pytorch/executorch

rascani · 2026-05-06T23:11:58Z

CUDA jobs that download HuggingFace model weights have been silently hanging mid-download until the 90-min job timeout fires, with no exception logged and no progress-bar updates. Running theories (library version, hf_xet, empty-chunk filtering) are speculation without instrumentation.

Add diagnostic instrumentation to the three known surfaces:

- `export_model_artifact.sh`'s two `snapshot_download` `python -c` calls plus the surrounding env (`HF_HUB_VERBOSITY=debug`, `PYTHONUNBUFFERED=1`)
- `test_huggingface_optimum_model.py`'s top-level setup, before any import that transitively pulls in `huggingface_hub`
- `mlx.yml`'s inline Voxtral `snapshot_download`

Each Python entry point installs `faulthandler.dump_traceback_later(60, repeat=True)` so that a hung process prints a stack trace every 60 s, pinpointing where execution is stuck. DEBUG-level logging on `huggingface_hub`, `httpx`, `httpcore`, and `urllib3` shows the protocol-level conversation.
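For readers who have not seen the diff, the instrumentation described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: the helper name `install_stall_diagnostics` and its `file` parameter are hypothetical, and only the `faulthandler` call and the four logger names come from the description.

```python
import faulthandler
import logging
import sys


def install_stall_diagnostics(interval_s=60, file=None):
    """Hypothetical helper: periodic stack dumps plus DEBUG logs for the HTTP stack."""
    # faulthandler writes to the raw file descriptor, so the target must
    # expose fileno(); sys.stderr does in a normal CI process.
    target = file if file is not None else sys.stderr
    # Dump the tracebacks of all threads every interval_s seconds until cancelled.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=target)

    # Attach a root handler so DEBUG records from the loggers below are emitted.
    logging.basicConfig(
        stream=sys.stderr,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # Logger names taken from the PR description.
    for name in ("huggingface_hub", "httpx", "httpcore", "urllib3"):
        logging.getLogger(name).setLevel(logging.DEBUG)
    # Assumption: huggingface_hub may set its own logger level on import,
    # which is presumably why the PR also exports HF_HUB_VERBOSITY=debug.
```

In this sketch the helper would be called before anything imports `huggingface_hub`, and `faulthandler.cancel_dump_traceback_later()` would stop the periodic dumps once the download returns.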

Revert once we have signal from a stalled run.

Authored with Claude Code.


pytorch-bot · 2026-05-06T23:12:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19352

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs, 60 Unrelated Failures

As of commit 0ad69c3 with merge base 851cffb:

NEW FAILURES - The following jobs have failed:

.github/workflows/mlx.yml (gh)
Build Windows Wheels / pytorch/executorch / build-wheel-py3_10-cpu (gh)
Process completed with exit code 1.
Build Windows Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu (gh)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_x64

CANCELLED JOBS - The following jobs were cancelled. Please retry:

pull / test-voxtral-realtime-xnnpack-linux / linux-job (gh)
##[error]The operation was canceled.
trunk / test-models-macos-cpu (ic4, xnnpack-quantization-delegation) / macos-job (gh)
##[error]The operation was canceled.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / unittest / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
Test CUDA Builds / export-model-cuda-artifact (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, non-quantized) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-tile-packed) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (openai, whisper-small, non-quantized) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (Qwen, Qwen3-0.6B, non-quantized) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (Qwen, Qwen3-0.6B, quantized-int4-tile-packed) / linux-job (gh) (similar failure)
Test CUDA Builds / export-model-cuda-artifact (SocialLocalMobile, Qwen3.5-35B-A3B-HQQ-INT4, quantized-int4-tile-packed) / linux-job (gh) (similar failure)
Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / linux-job (gh) (similar failure)
trunk / test-llama-runner-mac (fp32, xnnpack+custom+quantize_kv) / macos-job (gh) (similar failure)
trunk / test-models-linux-aarch64 (phi_4_mini, portable, linux.arm64.m7g.4xlarge) / linux-job (gh) (similar failure)
trunk / test-models-windows (mobilebert, portable) / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.
trunk / test-models-windows (mobilebert, xnnpack-q8) / windows-job (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-lora-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-lora-multimethod-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-models-linux (ic4, portable, linux.4xlarge.memory) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-models-linux (ic4, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, portable, linux.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-models-linux (phi_4_mini, portable, linux.4xlarge.memory) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-moshi-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-phi-3-mini-runner-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-sqnr-static-llm-qnn-linux (smollm2_135m) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / test-vulkan-models-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest / linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest / macos / macos-job (gh) (trunk failure)
[ FAILED ] OpGridSampler2dTest.BatchSizeMismatchDies
pull / unittest-editable / linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
[ FAILED ] OpGridSampler2dTest.BatchSizeMismatchDies
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-4B-Realtime-2602, quantized-int4-tile-packed) / linux-job (gh) (trunk failure)
Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (mistralai, Voxtral-Mini-3B-2507, non-quantized) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
Test CUDA Windows Export and E2E / export-model-cuda-windows-artifact (mistralai, Voxtral-Mini-4B-Realtime-2602, quantized-int4-tile... / linux-job (gh) (trunk failure)
trunk / test-arm-ootb-linux (run_deit_e2e_ethos_u) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-models-linux-aarch64 (ic4, portable, linux.arm64.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-models-linux-aarch64 (ic4, xnnpack-quantization-delegation, linux.arm64.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-models-linux-aarch64 (mobilebert, portable, linux.arm64.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-models-linux-aarch64 (mobilebert, xnnpack-quantization-delegation, linux.arm64.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-models-linux-aarch64 (qwen2_5_1_5b, portable, linux.arm64.2xlarge) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-model (fp32, conv_former) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-model (fp32, ic4) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-model (fp32, mb) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, cvt) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, distilbert) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, dit) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, focalnet) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, mobilevit_v1) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, mobilevit_v2) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, pvt) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-qnn-optimum-model (fp32, swin) / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (lfm2_5_1_2b, linux.2xlarge, executorch-ubuntu-22.04-clang12... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (lfm2_5_1_2b, linux.arm64.2xlarge, executorch-ubuntu-22.04-g... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (phi_4_mini, linux.2xlarge, executorch-ubuntu-22.04-clang12,... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (phi_4_mini, linux.arm64.2xlarge, executorch-ubuntu-22.04-gc... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (qwen3_4b, linux.2xlarge, executorch-ubuntu-22.04-clang12, x... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / test-torchao-huggingface-checkpoints (qwen3_4b, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc1... / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / unittest-release / linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
[ FAILED ] OpGridSampler2dTest.BatchSizeMismatchDies
trunk / unittest-release / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-05-06T23:13:07Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

HuggingFace's Xet storage backend stalls mid-download on CI runners, causing the 90-minute job timeout to fire before model weights finish downloading. Force standard HTTP downloads instead. (from debug logs in #19352)
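One way to force standard HTTP downloads is sketched below. It assumes the `HF_HUB_DISABLE_XET` environment variable of `huggingface_hub` is the switch in question; the text above does not say which mechanism #19358 actually uses, and the helper name `disable_hf_xet` is hypothetical.

```python
import os


def disable_hf_xet(env=None):
    """Hypothetical helper: ask huggingface_hub to skip the Xet backend."""
    # Assumption: HF_HUB_DISABLE_XET is the relevant switch; the fix in
    # #19358 may use a different mechanism.
    target = os.environ if env is None else env
    target["HF_HUB_DISABLE_XET"] = "1"
    return target


# Set the variable before huggingface_hub is imported, because the library
# reads its environment-based settings at import time.
disable_hf_xet()
```

In a CI script the same effect would come from exporting the variable in the job environment before the Python process starts.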

meta-cla Bot added the CLA Signed label May 6, 2026

rascani added the ciflow/cuda label May 6, 2026

digantdesai mentioned this pull request May 7, 2026

Disable HF Xet storage to fix CI export timeouts #19358

Merged

rascani closed this May 7, 2026

rascani wants to merge 1 commit into pytorch:main from rascani:debug-hf-download-stalls