[DO NOT MERGE] ci: instrument HF download paths to diagnose stalls #19352

Closed
rascani wants to merge 1 commit into pytorch:main from rascani:debug-hf-download-stalls

@rascani rascani commented May 6, 2026

CUDA jobs that download HuggingFace model weights have been silently hanging mid-download until the 90-min job timeout fires, with no exception logged and no progress-bar updates. Running theories (library version, hf_xet, empty-chunk filtering) are speculation without instrumentation.

Add diagnostic instrumentation to the three known surfaces:

  • export_model_artifact.sh's two snapshot_download python -c calls plus the surrounding env (HF_HUB_VERBOSITY=debug, PYTHONUNBUFFERED=1)
  • test_huggingface_optimum_model.py's top-level setup, before any import that transitively pulls in huggingface_hub
  • mlx.yml's inline Voxtral snapshot_download

Each Python entry point installs faulthandler.dump_traceback_later(60, repeat=True), so a hung process dumps a stack trace every 60s and shows exactly where execution is stuck. DEBUG-level logging on huggingface_hub, httpx, httpcore, and urllib3 shows the protocol-level conversation.
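
For reference, a minimal sketch of what such an entry-point hook could look like. This is not the code from the PR's diff: the function name `install_stall_diagnostics` and its parameter are hypothetical, and writing to `sys.__stderr__` is an assumption made so the dump still reaches the job log when `sys.stderr` has been redirected. Setting `HF_HUB_VERBOSITY` via `os.environ.setdefault` mirrors the env var named above and is meant to run before `huggingface_hub` is first imported.

```python
import faulthandler
import logging
import os
import sys

# Logger names taken from the PR description.
NOISY_LOGGERS = ("huggingface_hub", "httpx", "httpcore", "urllib3")


def install_stall_diagnostics(interval_s: int = 60) -> None:
    """Hypothetical helper: call at the very top of a Python entry point."""
    # Ask huggingface_hub for debug verbosity; effective only if this runs
    # before the library is imported (assumption stated in the lead-in).
    os.environ.setdefault("HF_HUB_VERBOSITY", "debug")

    # Dump the traceback of every thread each `interval_s` seconds until
    # cancelled. A watchdog thread does the dump, so it fires even when the
    # main thread is blocked in a socket read.
    faulthandler.dump_traceback_later(
        interval_s, repeat=True, file=sys.__stderr__
    )

    # Send log records to stderr and raise the HTTP-stack loggers to DEBUG.
    logging.basicConfig(
        stream=sys.__stderr__,
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.DEBUG)


if __name__ == "__main__":
    install_stall_diagnostics()
    # ... the download call would follow here ...
    faulthandler.cancel_dump_traceback_later()  # stop dumping once done
```

Because the repeated dump keeps firing for the life of the process, a healthy run also prints a traceback every interval; that noise is the price of catching the stalled one.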

Revert once we have signal from a stalled run.

Authored with Claude Code.

pytorch-bot Bot commented May 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19352

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs, 60 Unrelated Failures

As of commit 0ad69c3 with merge base 851cffb (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label May 6, 2026

github-actions Bot commented May 6, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

digantdesai added a commit that referenced this pull request May 7, 2026
HuggingFace's Xet storage backend stalls mid-download on CI runners,
causing the 90-minute job timeout to fire before model weights finish
downloading. Force standard HTTP downloads instead.

(from debug logs in #19352)
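
The referenced commit's diff is not shown in this thread, so how it forces standard HTTP downloads is not confirmed here. One way this can be done, sketched under that caveat, is the `HF_HUB_DISABLE_XET` environment variable documented by `huggingface_hub`. The helper name `env_without_xet` is hypothetical.

```python
import os
from typing import Mapping, Optional


def env_without_xet(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the Xet backend disabled.

    Assumption: the variable is read by huggingface_hub at import time,
    so it has to be set for the process that performs the download.
    """
    env = dict(os.environ if base is None else base)
    env["HF_HUB_DISABLE_XET"] = "1"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching a download script, which keeps the change scoped to that child process.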
@rascani rascani closed this May 7, 2026

Labels

ciflow/cuda, CLA Signed
