@d4l3k
Member

@d4l3k d4l3k commented Nov 8, 2025

This adds a debug HTTP server for diagnosing stuck or slow jobs. It runs the WorkerServer on every worker and launches a separate Flask process on rank 0 that users connect to for debugging.

This could easily be extended to trigger profilers and to visualize the data much better.
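
The layout described above (one small HTTP server per worker, plus a frontend on rank 0 that queries each of them) can be sketched with the standard library alone. This is not the code from this PR: `make_handler`, `start_worker`, and `collect` are hypothetical names, the `/ping` route is invented for illustration, and the stdlib `http.server` module stands in for both the WorkerServer and the Flask frontend.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(rank: int):
    """Build a request handler class bound to one (simulated) worker rank."""

    class WorkerHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hypothetical route; a real server would expose debug handlers here.
            if self.path != "/ping":
                self.send_error(404)
                return
            body = json.dumps({"rank": rank, "status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return WorkerHandler


def start_worker(rank: int) -> ThreadingHTTPServer:
    """Start a per-worker server on an ephemeral localhost port."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(rank))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def collect(addresses, path="/ping", timeout=5.0):
    """Frontend side: query every worker in turn and gather the JSON replies."""
    results = []
    for host, port in addresses:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("GET", path)
            results.append(json.loads(conn.getresponse().read()))
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    workers = [start_worker(rank) for rank in range(2)]
    try:
        print(collect([w.server_address for w in workers]))
    finally:
        for w in workers:
            w.shutdown()
            w.server_close()
```

In this sketch the frontend queries workers sequentially; a frontend for a large job would likely want to fan out concurrently and tolerate workers that do not respond, since unresponsive workers are exactly the case being debugged.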

Initial handlers:

  • PyTorch profiler
  • FlightRecorder data
  • Python stacks

import os

# Size of the NCCL Flight Recorder ring buffer (number of entries).
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

from torch.distributed.debug import enable_debug_server

enable_debug_server()
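
One of the initial handlers listed above returns Python stacks. As a rough illustration of what such a handler could return, the sketch below collects the current stack of every Python thread in the process using only the standard library. `dump_all_stacks` is a hypothetical helper, not a function from this PR, and the PR's handler may gather stacks by a different mechanism.

```python
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return a text dump of the current stack of every Python thread."""
    # Map thread idents to readable names where the threading module knows them.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        header = f"Thread {names.get(ident, 'unknown')} (ident={ident}):"
        chunks.append(header + "\n" + "".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


if __name__ == "__main__":
    print(dump_all_stacks())
```

A dump like this only shows where each thread currently is in Python code, so it helps most when a job is stuck rather than merely slow.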

Test plan:

torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/debug_test.py
[4 screenshots attached]

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @pragupta @msaroufim @dcci

@pytorch-bot

pytorch-bot bot commented Nov 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167395

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit d9bd0f2 with merge base bfddfde:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 8, 2025
@d4l3k d4l3k requested review from fduwjj and suo November 8, 2025 01:13
@d4l3k d4l3k force-pushed the d4l3k/debug_plane branch 2 times, most recently from 67f237a to 04c8da3 November 13, 2025 00:50
@d4l3k d4l3k force-pushed the d4l3k/debug_plane branch 2 times, most recently from 9e781cb to 38bb262 November 13, 2025 19:16
@d4l3k d4l3k requested a review from jeffdaily as a code owner November 13, 2025 19:16
@d4l3k d4l3k force-pushed the d4l3k/debug_plane branch 2 times, most recently from d8dc8f4 to 0eb983b November 13, 2025 23:45
@d4l3k d4l3k force-pushed the d4l3k/debug_plane branch 3 times, most recently from 34c05d7 to 5bf46e3 November 14, 2025 18:28
Contributor

@fduwjj fduwjj left a comment

Overall lgtm.

@d4l3k
Member Author

d4l3k commented Nov 14, 2025

@atalman these seem like broken infra and unrelated to this PR. Good to land?

@d4l3k d4l3k force-pushed the d4l3k/debug_plane branch from 5bf46e3 to d9bd0f2 November 14, 2025 20:11
@d4l3k
Member Author

d4l3k commented Nov 14, 2025

According to HUD, those failures are also broken on trunk.

@d4l3k
Member Author

d4l3k commented Nov 14, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 14, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@d4l3k
Member Author

d4l3k commented Nov 14, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-py3-clang12-executorch / build

Details for Dev Infra team Raised by workflow job

@d4l3k
Member Author

d4l3k commented Nov 14, 2025

@pytorchbot merge -f "infra failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused regression in:

Please investigate and fix the issues.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.

pytorchmergebot added a commit that referenced this pull request Nov 15, 2025
…obs (#167395)"

This reverts commit 4ed26f7.

Reverted #167395 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable (comment on #167395)
@pytorchmergebot
Collaborator

@d4l3k your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td labels Nov 15, 2025

Labels

ci-no-td, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d), Reverted

4 participants