
Building with USE_TENSORPIPE=0 causes errors on import torch for MacOS #56417

Open
rohan-varma opened this issue Apr 19, 2021 · 3 comments

Labels
module: tensorpipe (Related to TensorPipe RPC Agent)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

rohan-varma (Member) commented Apr 19, 2021

🐛 Bug

Building on macOS with the following command:

USE_TENSORPIPE=0 USE_GLOO=1 USE_CUDNN=0 BUILD_CAFFE2_OPS=0 USE_NNPACK=0 USE_QNNPACK=0 USE_MKLDNN=0 USE_FBGEMM=0 USE_CUDA=0 USE_MKLDNN=0 USE_DISTRIBUTED=1 python setup.py develop

Results in the following error:

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rvarm1/Desktop/pytorch/torch/__init__.py", line 632, in <module>
    from .functional import *  # noqa: F403
  File "/Users/rvarm1/Desktop/pytorch/torch/functional.py", line 6, in <module>
    import torch.nn.functional as F
  File "/Users/rvarm1/Desktop/pytorch/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/Users/rvarm1/Desktop/pytorch/torch/nn/modules/__init__.py", line 2, in <module>
    from .linear import Identity, Linear, Bilinear, LazyLinear
  File "/Users/rvarm1/Desktop/pytorch/torch/nn/modules/linear.py", line 6, in <module>
    from .. import functional as F
  File "/Users/rvarm1/Desktop/pytorch/torch/nn/functional.py", line 11, in <module>
    from .._jit_internal import boolean_dispatch, _overload, BroadcastingList1, BroadcastingList2, BroadcastingList3
  File "/Users/rvarm1/Desktop/pytorch/torch/_jit_internal.py", line 21, in <module>
    import torch.distributed.rpc
  File "/Users/rvarm1/Desktop/pytorch/torch/distributed/rpc/__init__.py", line 25, in <module>
    from . import api, backend_registry, functions
  File "/Users/rvarm1/Desktop/pytorch/torch/distributed/rpc/api.py", line 39, in <module>
    from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
  File "/Users/rvarm1/Desktop/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
    from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)

To Reproduce

Build PyTorch on macOS with the command above.

Expected behavior

Since TensorPipe is the default RPC backend, setting USE_DISTRIBUTED=1 should arguably imply USE_TENSORPIPE=1, unless there are platforms where distributed is enabled but TensorPipe is not. Either way, the failure should surface at build time, not when importing torch.
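One way to surface the failure at build time would be a flag check early in the build. The sketch below is hypothetical (`validate_build_flags` is not part of setup.py); it just illustrates rejecting the inconsistent flag combination up front:

```python
def validate_build_flags(env):
    """Hedged sketch: fail at build time instead of at `import torch` time.

    `validate_build_flags` is a hypothetical helper, not an existing
    setup.py function; the env dict mirrors os.environ.
    """
    use_distributed = env.get("USE_DISTRIBUTED", "0") == "1"
    use_tensorpipe = env.get("USE_TENSORPIPE", "1") == "1"
    if use_distributed and not use_tensorpipe:
        raise RuntimeError(
            "USE_DISTRIBUTED=1 requires USE_TENSORPIPE=1: "
            "the RPC module depends on the TensorPipe agent")

# The flag combination from this report would be rejected immediately:
try:
    validate_build_flags({"USE_DISTRIBUTED": "1", "USE_TENSORPIPE": "0"})
except RuntimeError as e:
    print("build rejected:", e)
```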

Looking at the issue, it appears that _DEFAULT_NUM_WORKER_THREADS is only defined when #ifdef USE_TENSORPIPE is true, but the Python side imports it without checking whether TensorPipe is available.

cc @osalpekar @jiayisuse @lw @beauby @pritamdamania87 @mrshenli @jjlilley @gqchen @rohan-varma @pietern @zhaojuanmao @satgera @aazzolini @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

rohan-varma added the oncall: distributed label Apr 19, 2021
pritamdamania87 added the module: tensorpipe label Apr 20, 2021
lw (Contributor) commented Apr 21, 2021

I think this is the same issue as #54325?

The specific issue at hand could be resolved easily (add a proper guard around _DEFAULT_NUM_WORKER_THREADS), but the core problem would remain: when USE_TENSORPIPE=0 the RPC module would be unusable (especially after 1.9, when we'll remove the ProcessGroup agent). It's unclear what to do there: make TensorPipe a mandatory dependency, or find a way to disable the entire RPC module when TensorPipe isn't available? (Imagine something like torch.distributed.rpc.is_available().)

rohan-varma (Member, Author) commented
@lw Yeah, looks like the same issue. Looks like @mrshenli mentioned to disable RPC if tensorpipe is not available, should we go ahead and do that?

lw (Contributor) commented Apr 26, 2021

> @lw Yeah, looks like the same issue. Looks like @mrshenli mentioned to disable RPC if tensorpipe is not available, should we go ahead and do that?

It's easier said than done ;)

I see there is a USE_RPC flag in our CMake files, but it appears it's only there to disable RPC on Windows:

USE_RPC

list(APPEND TORCH_PYTHON_COMPILE_DEFINITIONS USE_RPC)

Hence it's not possible for the user to control it (I believe), and it's not clear whether it has been tested on Linux/macOS. Moreover, I think it's only a C++ preprocessor flag, which doesn't control anything within Python: the .py files would still be installed, but something bad would probably happen if someone imported and tried to use the RPC module. So we'd need to check that this works and add a torch.distributed.rpc.is_available() function. This is not too hard, but it's still some work.
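An is_available() probe could be sketched roughly as below. This is hypothetical, not PyTorch's actual implementation: the real function would check the compiled extension, while the sketch probes a stand-in module name (`_fake_rpc_ext`) so it runs anywhere:

```python
import importlib.util

def rpc_is_available() -> bool:
    """Hedged sketch of a torch.distributed.rpc.is_available()-style check.

    A real implementation would probe the compiled torch._C._distributed_rpc
    bindings; "_fake_rpc_ext" is a stand-in module name for illustration.
    """
    return importlib.util.find_spec("_fake_rpc_ext") is not None

# Callers can then fail with a clear message instead of a cryptic ImportError:
if not rpc_is_available():
    print("RPC not available: PyTorch was built with USE_TENSORPIPE=0")
```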

Another option would be to make USE_TENSORPIPE mandatory whenever USE_DISTRIBUTED is enabled.

If we want to fix the issue at hand (i.e., the missing _DEFAULT_NUM_WORKER_THREADS), it's probably best to send an ad-hoc PR.
