[AWS EC2 P3DN, EFA is enabled] Torch RPC tensorpipe/common/ibv.h:172 "": Operation not supported #65022
After I changed the backend to ProcessGroup, it works. But I want to use TensorPipe to improve the performance.
@lw Was wondering if you could help out here; this seems to be coming from TensorPipe's ibv support.
I also have another question: for the PyTorch pipeline, I think we have two ways to pass tensors to a skipped layer. Which way has higher performance?
@chaoyanghe Can you try the following workaround for the TensorPipe issue:
@pritamdamania87 Is this what you expect?
@pritamdamania87 I tested by just adding `_transports=["uv"]`, and it works now. Thank you! It would be better if the PyTorch APIs handled this automatically.
@pritamdamania87 After I integrated this demo into our project, I ran into this issue:
Another bug:
@chaoyanghe I cannot find the error log in the repo you pointed to.
Never mind, I went back in the history and found it. The logs are all garbled because multiple processes wrote to the same file without synchronization, but the error seems to be this:
This is something I've seen before: on AWS, the EFA card presents itself as capable of carrying InfiniBand traffic, which "tricks" TensorPipe into trying to use it, but the card then turns out not to support some of the features TensorPipe relies on. We could add more nuanced detection logic that probes for these features up front, but I haven't gotten to it yet (I need to figure out how to use AWS). For now, the workaround proposed by @pritamdamania87 is the best I can offer.
@chaoyanghe Do you have a repro for this issue? @lw Looks like something was failing even after `_transports=["uv"]` was set.
I see, those EOF errors. In my experience these tend to just be side effects of another worker abruptly crashing, so it may help to search the logs of the mentioned workers for the real root cause.
## Title

Fix rpc bug on AWS

## Description

- rpc.TensorPipeRpcBackendOptions returns an error when run on AWS.

![image](https://github.com/EleutherAI/oslo/assets/26476095/4bb98124-1e0a-4d02-b473-cbe3ddaf7610)

Related issues:
- pytorch/pytorch#65022
- pytorch/tensorpipe#413
- pytorch/pytorch#65093

## Linked Issues

- resolved #00
🐛 Bug
I got the following error when using sync.Pipe and initializing RPC.
To Reproduce
I provide a small GitHub repo with a script to reproduce this issue (https://github.com/chaoyanghe/pytorch_bug_reproduce). A full error log is also maintained there.
Expected behavior
Successfully run the Pipe demo.
Environment
I run my source code on an AWS EC2 P3DN GPU server with EFA enabled.
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-9)
Clang version: 7.0.1 (Amazon Linux 2 7.0.1-1.amzn2.0.2)
CMake version: version 3.18.2
Libc version: glibc-2.2.5
Python version: 2.7.18 (default, Aug 27 2020, 21:22:52) [GCC 7.3.1 20180712 (Red Hat 7.3.1-9)] (64-bit runtime)
Python platform: Linux-4.14.200-155.322.amzn2.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: N/A
CUDA runtime version: 11.0.221
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip] numpy==1.16.6
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @cbalioglu @gcramer23 @jjlilley @mrzzd @lw @beauby