CUDA Extension Bug for Pytorch 2.2.0 #118842
Comments
Has tutel been rebuilt for PyTorch 2.2? I would expect a rebuild to be needed.
Yes, it is a fresh installation. I verified it is not specific to Tutel: every CUDAExtension fails to register with the PyTorch operator registry. It is also not related to the Python version (reproduced on Python 3.8 through 3.12).
Do you have a repro that involves directly loading a cpp extension from the Python API (e.g., passing in C++ source code)?
For Tutel, you can disable these lines to install the extension as C++-only instead of CUDA.
When I build and install tutel against a from-source build of PyTorch, it works. So it's either an environment problem or a problem specific to the prebuilt binaries.
I have encountered the same issue (on Python 3.10), but in my case after installing Detectron2 and using it; I thought the error was on their side. Thanks for the temporary solution @ghostplant (downgrading to a previous 2.x version).
@ezyang Can you try Ubuntu 18.04 + Python (any version) + PyTorch 2.2.0? The issue seems to be always reproducible on Ubuntu 18.04. The package below is the last version that works with the CUDA extension on Ubuntu 18.04:
@ezyang Seems like Ubuntu 18.04 is the root cause of the PyTorch CUDA extension no longer working since 2.2.0, while a lot of machines still stick to the 18.04 environment. The problem can be reproduced with this environment:

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04
...
RUN apt install python3-pip python3.8 -y
...
```

Next, this one works using 2.1.0:

This setting doesn't work using 2.2.0:

When disabling the CUDA extension, this setting still works using 2.2.0:
@malfet is this the gcc upgrade thing? Sounds like the gcc upgrade thing.
I encountered the same problem using CUDAExtension on PyTorch 2.2.0, 2.2.1, and nightly, and there was no useful log. System environment:

Compiling torch from source did not work either.
No doubt that the PyTorch 2.2.0 release is not compatible with Ubuntu 18.04 (10-year LTS). This is the last version working well; is it possible to revert to this state:
High pri based on activity.
This looks related to #120020

Very likely yes, switch from
If one cherry-picks the change from #120126 into the 2.2 branch, the error becomes obvious, but installing gcc-9 fixes it as expected. I.e. if one runs artifacts from the following Dockerfile, tutel works as expected:

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu18.04 as ubuntu18.04-torch2.2.0
RUN apt update && \
    apt install software-properties-common -y && \
    add-apt-repository ppa:ubuntu-toolchain-r/test -y && \
    apt-get update -y && \
    apt install python3-pip python3.8-dev g++-9 git -y && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave /usr/bin/g++ g++ /usr/bin/g++-9 && \
    update-alternatives --install /usr/bin/x86_64-linux-gnu-gcc x86_64-linux-gnu-gcc /usr/bin/gcc-9 60 --slave /usr/bin/x86_64-linux-gnu-g++ x86_64-linux-gnu-g++ /usr/bin/g++-9 && \
    python3.8 -mpip install --upgrade pip && \
    python3.8 -mpip install torch==2.2.0
```

I.e.
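As a quick sanity check (my own hypothetical helper, not part of the thread), a small stdlib-only Python snippet can report which gcc the build toolchain would pick up, since the fix above relies on gcc-9 being the system default after the `update-alternatives` switch:

```python
# Hypothetical helper: report the default gcc that extension builds would
# invoke. Returns None when no gcc is on PATH.
import shutil
import subprocess
from typing import Optional


def default_gcc_version() -> Optional[str]:
    """Return the first line of `gcc --version`, or None if gcc is absent."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    out = subprocess.run([gcc, "--version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else None


if __name__ == "__main__":
    print(default_gcc_version())
```

Running this inside the container built from the Dockerfile above should show a gcc 9.x line if the alternatives switch took effect.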
On Ubuntu 18.04, gcc versions above 7 are not officially supported. Is gcc-7 still compatible?
@ghostplant I'm not sure I understand what you mean. In your opinion, what are the benefits of continued support of gcc-7 compatibility?
OK. If gcc-7 is not supported, I think getting Ubuntu 18.04 environments to support PyTorch 2.2.0 would be a big problem. Is it possible for users who still stick to an Ubuntu 18.04 (gcc-7 based) environment to avoid errors like this by avoiding using
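To summarize the compatibility picture discussed above as a hedged sketch (the function and its thresholds are my own reading of this thread, not an official PyTorch support matrix):

```python
def is_affected(torch_version: str, gcc_major: int) -> bool:
    """Hypothetical predicate for the breakage reported in this thread:
    prebuilt torch >= 2.2.0 CUDA wheels loaded on a host whose default
    gcc is older than 9 (e.g. Ubuntu 18.04's stock gcc-7)."""
    base = torch_version.split("+")[0]          # drop a local tag like "+cu118"
    major, minor = (int(p) for p in base.split(".")[:2])
    return (major, minor) >= (2, 2) and gcc_major < 9
```

Under this reading, `is_affected("2.2.0+cu118", 7)` is the broken combination, while either downgrading torch or upgrading gcc avoids it.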
Moving to milestone 2.3.0 since the cherry-picking window for 2.2.2 is closed.

@atalman Is this issue fixed by the current daily build for 2.3.0? If so, I'll give it a try. Thanks!
Looks like Ubuntu 20.04 also has this issue with PyTorch 2.2 when Python is provided by Conda (2023):

```
Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import tutel_custom_kernel
terminate called after throwing an instance of 'c10::Error'
  what():  !dispatch_key_.has_value() INTERNAL ASSERT FAILED at "../aten/src/ATen/core/library.cpp":82, please report a bug to PyTorch. (Error occurred while processing TORCH_LIBRARY block at ./tutel/custom/custom_kernel.cpp:891)
Exception raised from Library at ../aten/src/ATen/core/library.cpp:82 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6c1333ed87 in /anaconda/envs/py38_default/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f6c132ef75f in /anaconda/envs/py38_default/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3f (0x7f6c1333c8bf in /anaconda/envs/py38_default/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: torch::Library::Library(torch::Library::Kind, std::string, std::optional<c10::DispatchKey>, char const*, unsigned int) + 0x96c (0x7f6c474c720c in /anaconda/envs/py38_default/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xc136 (0x7f6b0e43f136 in /anaconda/envs/py38_default/lib/python3.8/site-packages/tutel_custom_kernel.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x11b9a (0x7f6c5f732b9a in /lib64/ld-linux-x86-64.so.2)
frame #6: <unknown function> + 0x11ca1 (0x7f6c5f732ca1 in /lib64/ld-linux-x86-64.so.2)
frame #7: _dl_catch_exception + 0xe5 (0x7f6c5f4f3ba5 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x160cf (0x7f6c5f7370cf in /lib64/ld-linux-x86-64.so.2)
frame #9: _dl_catch_exception + 0x88 (0x7f6c5f4f3b48 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0x1560a (0x7f6c5f73660a in /lib64/ld-linux-x86-64.so.2)
frame #11: <unknown function> + 0x134c (0x7f6c5f6da34c in /lib/x86_64-linux-gnu/libdl.so.2)
frame #12: _dl_catch_exception + 0x88 (0x7f6c5f4f3b48 in /lib/x86_64-linux-gnu/libc.so.6)
frame #13: _dl_catch_error + 0x33 (0x7f6c5f4f3c13 in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: <unknown function> + 0x1b59 (0x7f6c5f6dab59 in /lib/x86_64-linux-gnu/libdl.so.2)
frame #15: dlopen + 0x4a (0x7f6c5f6da3da in /lib/x86_64-linux-gnu/libdl.so.2)
<omitting python frames>
frame #18: python3() [0x5a3250]
frame #19: python3() [0x4e8cfb]
frame #34: python3() [0x4e7aeb]
frame #41: python3() [0x5a5bd1]
frame #42: python3() [0x5a4bdf]
frame #43: python3() [0x4c0e24]
frame #46: python3() [0x45000c]
frame #48: __libc_start_main + 0xf3 (0x7f6c5f3b7083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #49: python3() [0x579d3d]
Aborted (core dumped)
```
I assume Conda's Python is built against Ubuntu 18.04's GLIBC in order to be compatible with Ubuntu 18.04 environments.
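One way to probe which libc baseline a given Python build reports (a stdlib-only check of my own, not something posted in the thread) is `platform.libc_ver()`, which inspects the running interpreter's binary:

```python
# Probe the C library the running Python interpreter reports. On a conda
# Python built for broad Linux compatibility, this can reflect an older
# glibc baseline than the host distribution actually ships.
import platform

libc, version = platform.libc_ver()
print(libc, version)
```

On a typical Linux host this prints something like `glibc` followed by a version string; comparing that against the host's own glibc could help confirm or rule out the compatibility guess above.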
🐛 Describe the bug
Seems like the released PyTorch 2.2.0 CUDA build for Linux has a bug running C++ extensions. The CPU-only PyTorch 2.2.0 build is not affected, so this looks like a "2.2.0 + CUDA-only" bug.
To reproduce:
If I switch to PyTorch 2.0.0 or 2.1.0 with

```shell
python3 -m pip install https://download.pytorch.org/whl/cu118/torch-2.1.0%2Bcu118-cp38-cp38-linux_x86_64.whl
```

which is an all-in-one package, everything is fine. BTW, I also tried other C++ extensions; PyTorch 2.2.0 for CUDA always produces the same error.
Versions
Verified that it is not related to the Python version; this bug happens in both Python 3.8 and Python 3.12.
cc @ezyang @gchanan @zou3519 @kadeng @malfet @ptrblck