dongyuzheng
Contributor

Summary: I had a job fail due to rank mismatch but didn't find enough information in the assertion message. This change makes the message more informative.

Test Plan:
CI tests. I also ran a test job, which failed as expected with the new message:

Rank 1 has different values for step: 8016.0. Other ranks: 7870.0

Differential Revision: D51322046
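
To illustrate the kind of message the summary describes, here is a minimal sketch in plain Python. It is not the code changed in this PR: the function name `check_ranks_agree` and the `values_by_rank` mapping are hypothetical, and the sketch assumes the per-rank values have already been gathered onto the current rank by some collective. It only shows how naming the rank, the field, and the conflicting values makes a mismatch easier to diagnose.

```python
from typing import Dict


def check_ranks_agree(name: str, this_rank: int, values_by_rank: Dict[int, float]) -> None:
    """Raise AssertionError if this rank's value differs from any other rank's.

    Hypothetical helper: `values_by_rank` maps rank -> value for the field
    `name` and is assumed to have been gathered from all ranks beforehand.
    """
    mine = values_by_rank[this_rank]
    # Distinct values held by the other ranks, excluding values equal to ours.
    differing = {v for r, v in values_by_rank.items() if r != this_rank and v != mine}
    if differing:
        other_values = ", ".join(str(v) for v in sorted(differing))
        raise AssertionError(
            f"Rank {this_rank} has different values for {name}: {mine}. "
            f"Other ranks: {other_values}"
        )


if __name__ == "__main__":
    # Matching values pass silently.
    check_ranks_agree("step", 0, {0: 7870.0, 1: 7870.0})
    # A mismatch reports the rank, the field name, and both values.
    try:
        check_ranks_agree("step", 1, {0: 7870.0, 1: 8016.0})
    except AssertionError as err:
        print(err)
```

With the values from the test plan, the sketch prints the same text as the failure shown above. A bare assertion without these details would only say that a mismatch exists, not which rank or which value caused it.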


pytorch-bot bot commented Nov 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113765

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dee59a2 with merge base d40d270:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Nov 15, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: dongyuzheng / name: Gary (dee59a2)

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) label Nov 15, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D51322046

Contributor

@wz337 wz337 left a comment

LGTM. cc. @fegin

@fegin
Contributor

fegin commented Nov 15, 2023

@dongyuzheng Please sign CLA

@dongyuzheng
Contributor Author

/easycla

…torch#113765)

Summary:

I had a job fail due to rank mismatch but didn't find enough information in the assertion message. This change makes the message more informative.

Test Plan:
CI tests. I also ran a test job, which failed as expected with the new message:

```
Rank 1 has different values for step: 8016.0. Other ranks: 7870.0
```

Reviewed By: wz337, fegin

Differential Revision: D51322046

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 20, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Labels

ciflow/trunk, fb-exported, Merged, release notes: distributed (fsdp)

5 participants