Collaborator

@ppwwyyxx ppwwyyxx commented Jul 20, 2024

When one process fails, the others are killed immediately. This prevents them from performing necessary cleanup or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
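
To illustrate the idea only, here is a minimal sketch of a grace period, not the code in this PR. The helper name `shutdown_peers` and its `grace_period` argument are hypothetical, and the sketch assumes the workers are `multiprocessing.Process`-like objects exposing `is_alive()`, `join(timeout)`, and `terminate()`. With `grace_period=None` the survivors are terminated right away; otherwise they first get a bounded window to exit on their own.

```python
import time


def shutdown_peers(processes, grace_period=None):
    """Stop the surviving workers after one worker has failed.

    Hypothetical helper: `processes` holds multiprocessing.Process-like
    objects. Returns the workers that had to be terminated forcibly.
    """
    if grace_period is not None:
        # Give peers a chance to finish cleanup or dump debug information.
        # One shared deadline keeps the total wait bounded by grace_period.
        deadline = time.monotonic() + grace_period
        for p in processes:
            remaining = max(0.0, deadline - time.monotonic())
            p.join(remaining)

    terminated = []
    for p in processes:
        if p.is_alive():
            # Still running after the grace period (or no grace period given).
            p.terminate()
            terminated.append(p)
        p.join()  # reap the process so it does not linger as a zombie
    return terminated
```

For example, `shutdown_peers(workers, grace_period=30)` would wait up to 30 seconds in total before terminating whatever is still alive, while `shutdown_peers(workers)` terminates the survivors immediately.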

pytorch-bot bot commented Jul 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131278

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7fbfa5c with merge base ca38f28 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ppwwyyxx ppwwyyxx force-pushed the graceful-shutdown branch from 1bf2e0f to c884f19 Compare July 20, 2024 07:03
@albanD albanD self-requested a review July 23, 2024 22:40
@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 23, 2024
@ppwwyyxx
Collaborator Author

ppwwyyxx commented Aug 5, 2024

@albanD Hi Alban, could you take a look? thx

@albanD
Collaborator

albanD commented Aug 5, 2024

cc @wconstab from the distributed side and @andrewkho on the dataloader side. Any comment on this proposal?

@ppwwyyxx
Collaborator Author

@wconstab any comments?

Collaborator

@albanD albanD left a comment

What's your strategy to test this?

@andrewkho
Contributor

Sorry for the delay in responding. In general the idea looks good to me, though I need to review the implementation more closely. I agree with all of Alban's suggestions, so please make those changes.

@ppwwyyxx
Collaborator Author

@pytorchbot label "release notes: distributed (miscellaneous)"

@ppwwyyxx
Collaborator Author

Added a unit test and addressed @albanD's comments.

Collaborator

@albanD albanD left a comment

Thanks!

@ppwwyyxx
Collaborator Author

ppwwyyxx commented Oct 7, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 7, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels

ciflow/trunk: Trigger trunk jobs on your pull request
Merged
open source
release notes: distributed (miscellaneous)
triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module

5 participants