Skip to content

Conversation

d4l3k
Copy link
Member

@d4l3k d4l3k commented Apr 9, 2025

This adds a new clone() method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: #150943

Test plan:

pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab

@d4l3k d4l3k requested review from wconstab, fduwjj and H-Huang April 9, 2025 22:23
Copy link

pytorch-bot bot commented Apr 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150966

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 914c66a with merge base d751698 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Apr 9, 2025
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, for the PGNCCL part, do you refer to the use case of store->check()?

@d4l3k
Copy link
Member Author

d4l3k commented Apr 9, 2025

@fduwjj yes, I'm referring to the check call -- with this change we could instead do the more efficient .wait() which will block until the signal is sent.

@d4l3k
Copy link
Member Author

d4l3k commented Apr 9, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 9, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@fduwjj
Copy link
Contributor

fduwjj commented Apr 9, 2025

@d4l3k the non-blocking check is actually intentional because it also checks the heartbeat. Or maybe we are ok with a third thread which only dumps FR and this thread is going to be class member.

@d4l3k d4l3k deleted the d4l3k/store_clone branch April 10, 2025 15:34
@atalman
Copy link
Contributor

atalman commented Apr 10, 2025

@pytorchmergebot revert -c ghfirst -m "failing internally"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Apr 10, 2025
This reverts commit 205881e.

Reverted #150966 on behalf of https://github.com/atalman due to failing internally ([comment](#150966 (comment)))
@pytorchmergebot
Copy link
Collaborator

@d4l3k your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Apr 10, 2025
facebook-github-bot pushed a commit that referenced this pull request Apr 10, 2025
Summary:

This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: #150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/205881ea4a451574c3a3de87c42484043a955d6e

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690
pytorchmergebot pushed a commit that referenced this pull request Apr 11, 2025
Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: #150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/205881ea4a451574c3a3de87c42484043a955d6e

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: #151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: pytorch#150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: pytorch#150966
Approved by: https://github.com/fduwjj
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
This reverts commit 205881e.

Reverted pytorch#150966 on behalf of https://github.com/atalman due to failing internally ([comment](pytorch#150966 (comment)))
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
…rch#151045)

Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: pytorch#150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/205881ea4a451574c3a3de87c42484043a955d6e

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: pytorch#151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: pytorch#150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: pytorch#150966
Approved by: https://github.com/fduwjj
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
This reverts commit 205881e.

Reverted pytorch#150966 on behalf of https://github.com/atalman due to failing internally ([comment](pytorch#150966 (comment)))
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…rch#151045)

Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: pytorch#150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/205881ea4a451574c3a3de87c42484043a955d6e

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: pytorch#151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-no-td Do not run TD on this PR ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants