Fix distributed store to use add for the counter of DL shared seed #80348
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge -g
…80348) (#80348)

Summary:
In order to get the result of `_shared_seed_recv_cnt` properly, switch from `store.get` to `store.add(key, 0)`. See the comment from distributed team for the reason:
https://github.com/pytorch/pytorch/blob/590d3e5774110e4657dcaa6acdb387ef69e41b47/torch/distributed/distributed_c10d.py#L242-L246

Pull Request resolved: #80348
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3ec9d34f21edfb330076a2e57dd9b30649070e80

Reviewed By: NivekT
Differential Revision: D37458370
Pulled By: ejguan
fbshipit-source-id: 386457bef43dbb47e3c5b8bb4524d456b5f4343a
…80348) (#80348) (#81860)

Summary:
In order to get the result of `_shared_seed_recv_cnt` properly, switch from `store.get` to `store.add(key, 0)`. See the comment from distributed team for the reason:
https://github.com/pytorch/pytorch/blob/590d3e5774110e4657dcaa6acdb387ef69e41b47/torch/distributed/distributed_c10d.py#L242-L246

Pull Request resolved: #80348
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3ec9d34f21edfb330076a2e57dd9b30649070e80

Reviewed By: NivekT
Differential Revision: D37458370
Pulled By: ejguan
fbshipit-source-id: 386457bef43dbb47e3c5b8bb4524d456b5f4343a

Co-authored-by: erjia (Meta Employee) <erjia@fb.com>
In order to get the result of `_shared_seed_recv_cnt` properly, switch from `store.get` to `store.add(key, 0)`. See the comment from the distributed team for the reason: pytorch/torch/distributed/distributed_c10d.py, lines 242 to 246 in 590d3e5.
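To illustrate the pattern this PR switches to, here is a minimal sketch of reading a shared counter with `add(key, 0)` instead of `get(key)`. `ToyStore` is a hypothetical in-memory stand-in written for this example, not the `torch.distributed` store implementation; it only assumes that `add` returns the updated counter as an integer and that `get` returns raw bytes.

```python
import threading


class ToyStore:
    """Hypothetical in-memory stand-in for a distributed key-value store.

    It mirrors only the two calls discussed in this PR: `add` returns the
    updated counter as an int, and `get` returns the raw stored bytes.
    """

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, amount):
        # Atomically increment the counter and return its new value.
        with self._lock:
            value = int(self._data.get(key, b"0")) + amount
            self._data[key] = str(value).encode()
            return value

    def get(self, key):
        # Return the raw bytes; the caller must know how they are encoded.
        with self._lock:
            return self._data[key]


store = ToyStore()
key = "_shared_seed_recv_cnt"

# Three simulated ranks each report that they received the shared seed.
for _ in range(3):
    store.add(key, 1)

# Adding 0 reads the counter through the same code path that wrote it,
# so the result is already an int and the stored value is unchanged.
count = store.add(key, 0)
print(count)
```

Because `add(key, 0)` goes through the same increment path as the writers, the reader does not depend on how a particular store backend encodes the value that `get` would return.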