Skip to content

Conversation

@anijain2305
Copy link
Contributor

@anijain2305 anijain2305 commented Oct 1, 2025

Actually we would like to not graph break even in the case of Dynamo. But there is a weird-unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edege case, but we have not been able to root cause it yet.

But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.

Fixes #ISSUE_NUMBER

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela

…ed for export

Actually we would like to not graph break even in the case of Dynamo.
But there is a weird-unsolved bug with Kineto + Dynamo when there are
distributed jobs that lead to NCCL timeouts. This bug is a rare edege
case, but we have not been able to root cause it yet.

But for export, we do not anticipate JIT tracing in distributed job
training and therefore this PR is safe for export.
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164418

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 937ffab with merge base 566ea4e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

@anijain2305 has imported this pull request. If you are a Meta employee, you can view this in D83708030.

@anijain2305 anijain2305 added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 2, 2025
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…ed for export (pytorch#164418)

Actually we would like to not graph break even in the case of Dynamo. But there is a weird-unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edege case, but we have not been able to root cause it yet.

But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.

Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#164418
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants