Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TCPStore type coercion #49685

Closed
wants to merge 3 commits into from
Closed

Conversation

H-Huang
Copy link
Member

@H-Huang H-Huang commented Dec 21, 2020

Fixes #49052

The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.

Test:

import torch.distributed as dist
from datetime import timedelta

dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))

Now fails with

TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int, is_master: bool, timeout: datetime.timedelta = datetime.timedelta(seconds=300))

Invoked with: '127.0.0.1', 0, True, datetime.timedelta(seconds=30)

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Dec 21, 2020

💊 CI failures summary and remediations

As of commit 3b00af3 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 8 times.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

Comment on lines 634 to 635
>>> store.set("first_key", "first_value")
>>> store.get("first_key")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we either keep the naming of the store vars or be clear that set is called on server and get is called on client?

Copy link
Member Author

@H-Huang H-Huang Dec 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data is stored on the server, but the set and get methods can be used from both the client and the server.

That is what I thought the original comment in the docs (# Use any of the store methods from either the client or server after initialization) was mentioning, or do I not have the correct understanding? cc: @osalpekar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, you can call set and get from both client and sever. It's just a bit clearer if we name them client and server as before to better illustrate that one store was instantiated with is_server=false.

Additionally, this change will actually overwrite the store variable to be just the client store, which may make the example more confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, this change will actually overwrite the store variable to be just the client store, which may make the example more confusing.

The example as it is given doesn't work if all the commands are run together in one file. Since the world_size is 2 and TCP store will block until all workers have been connected, the client and server need to be instantiated on separate processes. I had named them both store since there would only be 1 store variable for each process and wanted to show the methods could be called on either process.

But I see what you are saying about the naming and it is confusing, so I will change it back!

Copy link
Member

@osalpekar osalpekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

torch/csrc/distributed/c10d/init.cpp Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 21, 2020

Codecov Report

Merging #49685 (3b00af3) into master (7ed140a) will increase coverage by 5.31%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #49685      +/-   ##
==========================================
+ Coverage   75.24%   80.55%   +5.31%     
==========================================
  Files        1883     1887       +4     
  Lines      204470   204600     +130     
==========================================
+ Hits       153851   164821   +10970     
+ Misses      50619    39779   -10840     

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@H-Huang merged this pull request in 7eb392d.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this pull request Jan 6, 2021
Summary:
Fixes pytorch#49052

The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.

Pull Request resolved: pytorch#49685

Test Plan:
```
import torch.distributed as dist
from datetime import timedelta

dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))
```

Now fails with
```
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int, is_master: bool, timeout: datetime.timedelta = datetime.timedelta(seconds=300))

Invoked with: '127.0.0.1', 0, True, datetime.timedelta(seconds=30)
```

Reviewed By: mrshenli, ngimel

Differential Revision: D25668021

Pulled By: H-Huang

fbshipit-source-id: ce40b8648d0a414f0255666fbc680f1a66fae090
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TCPStore constructor arguments mismatch unexpected behavior
4 participants