
Fix typing errors in torch.distributed.distributed_c10d.* #47532

Closed
wants to merge 10 commits from gh/xuzhao9/5/head

Conversation

@xuzhao9 (Contributor) commented Nov 6, 2020

Stack from ghstack:

Differential Revision: D24952501

@dr-ci (bot) commented Nov 6, 2020

💊 CI failures summary and remediations

As of commit 23127ef (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build docker-pytorch-linux-bionic-py3.8-gcc9 (1/2)

Step: "Check if image should be built" (full log | diagnosis details | 🔁 rerun)

ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.8-gcc9:9cc3a74f0e401cccf5d075d1c2835af60ce8c310 
(truncated docker manifest inspect output, rendered as raw byte values in the captured log)
++ git merge-base HEAD 2981ef28c139b416449642419401d7ae6f3f9a8a 
+ git rev-parse 2981ef28c139b416449642419401d7ae6f3f9a8a:.circleci/docker 
9cc3a74f0e401cccf5d075d1c2835af60ce8c310 
+++ git merge-base HEAD 2981ef28c139b416449642419401d7ae6f3f9a8a 
++ git rev-parse 2981ef28c139b416449642419401d7ae6f3f9a8a:.circleci/docker 
+ PREVIOUS_DOCKER_TAG=9cc3a74f0e401cccf5d075d1c2835af60ce8c310 
+ [[ 9cc3a74f0e401cccf5d075d1c2835af60ce8c310 = \9\c\c\3\a\7\4\f\0\e\4\0\1\c\c\c\f\5\d\0\7\5\d\1\c\2\8\3\5\a\f\6\0\c\e\8\c\3\1\0 ]] 
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch' 
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch 
+ echo '       contact the PyTorch team to restore the original images' 
       contact the PyTorch team to restore the original images 
+ exit 1 

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Nov 16 20:56:34 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Nov 16 20:56:34 At: 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 20:56:34  
Nov 16 20:56:34 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 20:56:34  
Nov 16 20:56:34 At: 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 20:56:34  
Nov 16 20:56:34 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 20:56:34  
Nov 16 20:56:34 At: 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 20:56:34   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 20:56:34  
Nov 16 20:56:35 ok (1.647s) 
Nov 16 20:56:36   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Nov 16 20:56:36 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Nov 16 20:56:36 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Nov 16 20:56:36 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 

This comment was automatically generated by Dr. CI. It has been revised 48 times.

torch/distributed/distributed_c10d.py (resolved review thread)
@@ -1335,7 +1360,7 @@ def all_gather_multigpu(output_tensor_lists,
 
 def _object_to_tensor(obj):
     buffer = pickle.dumps(obj)
-    byte_storage = torch.ByteStorage.from_buffer(buffer)
+    byte_storage = torch.ByteStorage.from_buffer(buffer)  # type: ignore[attr-defined]
xuzhao9 (Contributor, Author):

Neither mypy nor I can find the from_buffer() function in torch.ByteStorage.
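For context, the helper touched by this hunk just pickles a Python object into a uint8 tensor plus a size tensor. The round trip below is a minimal sketch of that idea (an illustration written for this note, not the library's exact code); the ignore is needed because from_buffer() exists at runtime but is apparently not declared in the ByteStorage stubs, so mypy cannot see it.

```python
import pickle
from typing import Any, Tuple

import torch


def object_to_tensor_sketch(obj: Any) -> Tuple[torch.Tensor, torch.Tensor]:
    # Pickle the object and view the raw bytes as a uint8 tensor.
    data = pickle.dumps(obj)
    storage = torch.ByteStorage.from_buffer(data)  # type: ignore[attr-defined]
    byte_tensor = torch.ByteTensor(storage)
    local_size = torch.LongTensor([byte_tensor.numel()])
    return byte_tensor, local_size


def tensor_to_object_sketch(tensor: torch.Tensor, size: int) -> Any:
    # Inverse: trim any padding added by the collective and unpickle.
    return pickle.loads(bytes(tensor[:size].tolist()))
```

For a picklable obj, `t, n = object_to_tensor_sketch(obj)` followed by `tensor_to_object_sketch(t, int(n.item()))` returns an equal object.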

@@ -1389,7 +1414,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
     input_tensor, local_size = _object_to_tensor(obj)
     group_backend = get_backend(group)
     is_nccl_backend = group_backend == Backend.NCCL
-    current_device = torch.device("cpu")
+    current_device: Union[int, torch.device] = torch.device("cpu")
xuzhao9 (Contributor, Author):

torch.cuda.current_device() returns an int, while torch.device() returns a torch.device.
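A tiny standalone sketch of the same situation (illustrative only, simplified from the PR's function) shows why the annotation is needed: without the Union, mypy infers torch.device from the initializer and rejects the int reassignment in the CUDA branch.

```python
from typing import Union

import torch

# Annotated as a Union because the two branches below produce different types.
current_device: Union[int, torch.device] = torch.device("cpu")

if torch.cuda.is_available():
    # torch.cuda.current_device() is typed as returning an int (the device
    # index); without the Union annotation mypy would reject this assignment.
    current_device = torch.cuda.current_device()

print(current_device)
```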

@@ -1400,7 +1425,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
     # Gather all local sizes. This is so that we can find the max size, and index
     # until the correct size when deserializing the tensors.
     group_size = get_world_size(group=group)
-    object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)
+    object_sizes_tensor = torch.zeros(group_size, dtype=int, device=current_device)  # type: ignore[call-overload]
xuzhao9 (Contributor, Author):
If is_nccl_backend is set, current_device is an int, which will not be accepted by torch.zeros().
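The torch.empty() and Tensor.to() ignores further down stem from the same root cause: once current_device may be a plain int, any call whose stub expects a torch.device stops type-checking. A hedged alternative, shown below as a sketch rather than what this PR does (the PR intentionally leaves runtime behaviour unchanged and only silences mypy), would be to normalize the index into a torch.device up front; group_size here is a placeholder for get_world_size(group=group).

```python
import torch

# Build a torch.device in both branches so torch.zeros(), torch.empty()
# and Tensor.to() all type-check without any "type: ignore" comments.
if torch.cuda.is_available():
    current_device = torch.device("cuda", torch.cuda.current_device())
else:
    current_device = torch.device("cpu")

group_size = 4  # placeholder for get_world_size(group=group)
# torch.long is the stub-friendly spelling of the builtin dtype=int used in
# the original call (both map to 64-bit integers at runtime).
object_sizes_tensor = torch.zeros(group_size, dtype=torch.long, device=current_device)
```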

@@ -1410,7 +1435,7 @@ def all_gather_object(object_list, obj, group=group.WORLD):
     # Resize tensor to max size across all ranks.
     input_tensor.resize_(max_object_size)
     coalesced_output_tensor = torch.empty(
-        max_object_size * group_size, dtype=torch.uint8, device=current_device
+        max_object_size * group_size, dtype=torch.uint8, device=current_device  # type: ignore[arg-type]
xuzhao9 (Contributor, Author):
If is_nccl_backend is set, current_device is an int, which will not be accepted by torch.empty().

torch/distributed/distributed_c10d.py (resolved review thread)
     if is_nccl_backend:
         # See note about using torch.cuda.current_device() here in docstring.
         # We cannot simply use my_rank since rank == device is not necessarily
         # true.
         current_device = torch.cuda.current_device()
-        object_sizes_tensor = object_sizes_tensor.to(current_device)
+        object_sizes_tensor = object_sizes_tensor.to(current_device)  # type: ignore[call-overload]
xuzhao9 (Contributor, Author):
If is_nccl_backend is set, current_device is an int, which will not be accepted by torch.Tensor.to().

Nine further review threads on torch/_C/_distributed_c10d.pyi and torch/distributed/distributed_c10d.py were marked outdated and resolved.
@facebook-github-bot (Contributor):

@xuzhao9 merged this pull request in 915050e.

@facebook-github-bot deleted the gh/xuzhao9/5/head branch on November 20, 2020 at 15:18.
Labels: cla signed, Merged, oncall: distributed

4 participants