Fix TCPStore type coercion #49685
💊 CI failures summary and remediations
As of commit 3b00af3 (more details on the Dr. CI page): ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI. This comment has been revised 8 times.
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks for fixing!
torch/csrc/distributed/c10d/init.cpp
Outdated
>>> store.set("first_key", "first_value")
>>> store.get("first_key")
Shall we either keep the naming of the store vars or be clear that `set` is called on server and `get` is called on client?
The data is stored on the server, but the `set` and `get` methods can be used from both the client and the server.
That is what I thought the original comment in the docs (`# Use any of the store methods from either the client or server after initialization`) was referring to, or is my understanding incorrect? cc: @osalpekar
Right, you can call `set` and `get` from both client and server. It's just a bit clearer if we name them client and server as before, to better illustrate that one store was instantiated with `is_server=false`.
Additionally, this change will actually overwrite the `store` variable to be just the client store, which may make the example more confusing.
> Additionally, this change will actually overwrite the `store` variable to be just the client store, which may make the example more confusing.
The example as given doesn't work if all the commands are run together in one file. Since the `world_size` is 2 and TCPStore will block until all workers have connected, the client and server need to be instantiated in separate processes. I had named them both `store` since there would only be one `store` variable per process, and I wanted to show that the methods could be called from either process.
But I see what you are saying about the naming and it is confusing, so I will change it back!
Thanks for fixing this!
Codecov Report
@@            Coverage Diff             @@
##           master   #49685      +/-   ##
==========================================
+ Coverage   75.24%   80.55%   +5.31%
==========================================
  Files        1883     1887       +4
  Lines      204470   204600     +130
==========================================
+ Hits       153851   164821   +10970
+ Misses      50619    39779   -10840
Summary: Fixes pytorch#49052

The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.

Pull Request resolved: pytorch#49685

Test Plan:
```
import torch.distributed as dist
from datetime import timedelta
dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))
```
Now fails with
```
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int, is_master: bool, timeout: datetime.timedelta = datetime.timedelta(seconds=300))

Invoked with: '127.0.0.1', 0, True, datetime.timedelta(seconds=30)
```

Reviewed By: mrshenli, ngimel

Differential Revision: D25668021

Pulled By: H-Huang

fbshipit-source-id: ce40b8648d0a414f0255666fbc680f1a66fae090
Fixes #49052
The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.
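Test: calling `dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))` now fails with `TypeError: __init__(): incompatible constructor arguments.`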
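
The implicit conversion behind this bug can be illustrated in plain Python, without PyTorch. The sketch below is only an analogy: `loose_is_master` and `strict_is_master` are hypothetical helpers, not part of PyTorch or pybind11. They mimic a binding that accepts anything with a truth value versus one that accepts only real `bool` objects, which is roughly the difference the fix appears to introduce for the `is_master` argument.

```python
from datetime import timedelta


def loose_is_master(value):
    """Hypothetical permissive check: coerce any object to its truth value."""
    return bool(value)


def strict_is_master(value):
    """Hypothetical strict check: reject anything that is not a real bool."""
    if not isinstance(value, bool):
        raise TypeError(
            f"is_master must be bool, got {type(value).__name__}"
        )
    return value


timeout = timedelta(seconds=30)

# A non-zero timedelta is truthy, so a permissive check silently
# treats a timeout passed in the is_master position as True.
print(loose_is_master(timeout))

# A zero-length timedelta is falsy, so the same mistake could also
# silently produce False.
print(loose_is_master(timedelta(0)))

# The strict check surfaces the mistake instead of hiding it.
try:
    strict_is_master(timeout)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

With the permissive behavior, a call that omits `world_size` still goes through because every positional value happens to be coercible to the type of the slot it lands in; the strict behavior turns that into an immediate error.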