Skip to content

Conversation

michaelay
Copy link
Contributor

@michaelay michaelay commented Oct 10, 2024

Summary: In WorkNCCL::handleException, log to c10d logger with strings["work_nccl_exception"].

Test Plan: Test run job to verify NCCL exception is logged.

Differential Revision: D62603322

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Copy link

linux-foundation-easycla bot commented Oct 10, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: michaelay / name: Michael Au-Yeung (16ef007)

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 10, 2024
Copy link

pytorch-bot bot commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137736

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 16ef007 with merge base 8486d3d (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D62603322

@michaelay michaelay changed the title Log exception to AITO error infra (signpost) Log WorkNCCL exception string to C10dLogger Oct 10, 2024
@michaelay michaelay requested a review from c-p-i-o October 10, 2024 21:59
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 11, 2024
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Summary:

In WorkNCCL::handleException, log to c10d logger with `strings["work_nccl_exception"]`.

Test Plan: Test run job to verify NCCL exception is logged.

Reviewed By: fduwjj, c-p-i-o

Differential Revision: D62603322
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D62603322

@cyyever
Copy link
Collaborator

cyyever commented Oct 13, 2024

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants