-
Notifications
You must be signed in to change notification settings - Fork 25.6k
multiprocessing.spawn: allow a grace period when shutdown #131278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131278
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 7fbfa5c with merge base ca38f28 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
1bf2e0f
to
c884f19
Compare
@albanD Hi Alban, could you take a look? thx |
cc @wconstab from the distributed side and @andrewkho on the dataloader side. Any comment on this proposal? |
@wconstab any comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your strategy to test this?
Sorry for the delay in responding, in general the idea looks good to me, need to review the implementation more closely. i do agree with all of Alban's suggestions so please make those |
c884f19
to
49f19e6
Compare
@pytorchbot label "release notes: distributed (miscellaneous)" |
Added a unittest and addressed @albanD 's comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
863189e
to
7fbfa5c
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
When one process fails, others are immediately killed. This prevents other processes to do necessary cleanups, or dump debug information (in particular, the NCCL flight recorder).
This PR adds a grace period. Default behavior is unchanged.