
CPU-only c++ extension libraries (functorch, torchtext) built against PyTorch wheels are not fully compatible with PyTorch wheels #80489

Closed
zou3519 opened this issue Jun 29, 2022 · 15 comments
Labels
high priority module: build Build system issues module: cpp-extensions Related to torch.utils.cpp_extension topic: binaries triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Milestone
1.12.1

Comments

@zou3519
Contributor

zou3519 commented Jun 29, 2022

🐛 Describe the bug

When installing functorch alongside a different PyTorch wheel (torch 1.12 {cpu, cu102, cu113, cu116}) than the one it was built against, we are experiencing either

  1. missing symbol issues on import functorch, or
  2. exception handling issues, where exceptions raised from functorch produce unexpected output. Independently, torchtext exhibits the same issue.

These seem to stem from different symbols existing in the torch (cpu, cu113, cu116) wheels vs the torch (cu102) wheels. Possibly related: pytorch/builder#1028 .

We (@malfet and I) are not sure if this is a problem with PyTorch or the way we build extensions. FWIW this did not happen during the last functorch releases (0.1.x).

functorch repro

See pytorch/functorch#916 for original issue.

Case 1: built functorch against the torch 1.12 (cpu) wheels.

  • When installing functorch with torch (cu102) on the AWS cluster, import torch; import functorch errors with the missing symbol _ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev (see the sketch after this list)
  • When installing functorch with torch (cpu, cu113, cu116), there is no noticeable problem
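A quick way to confirm which side of the boundary is missing the symbol is to inspect the extension's dynamic symbol table. This is a minimal sketch, not from the original report: the extension path is a placeholder and GNU nm is assumed to be available.

```python
# Sketch: check whether functorch's compiled extension expects libstdc++/libtorch
# to provide the symbol that fails to resolve at import time.
# The path below is a placeholder; adjust it to the actual site-packages location
# (we cannot `import functorch` here because that import is what fails).
import subprocess

EXT = "/path/to/site-packages/functorch/_C.so"  # placeholder path
SYMBOL = "_ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev"

# `nm -D --undefined-only` lists the dynamic symbols the shared object expects
# its dependencies to provide at load time.
undefined = subprocess.run(
    ["nm", "-D", "--undefined-only", EXT],
    capture_output=True, text=True, check=True,
).stdout

print(SYMBOL in undefined)  # True -> _C.so was built expecting this symbol
```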

Case 2: built functorch against the torch 1.12 (cu102) wheels.

  • When installing functorch with torch (cu102): repro.py gives the expected output
  • When installing functorch with torch (cpu, cu113, cu116): repro.py gives unexpected output

# repro.py
import torch
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)
Expected output

>>> vmap(lambda x: x, out_dims=3)(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 361, in wrapped
    return _flat_vmap(
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 488, in _flat_vmap
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)

Unexpected output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 366, in wrapped
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
RuntimeError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
Exception raised from maybe_wrap_dim_slow at ../c10/core/WrapDimMinimal.cpp:29 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10a018e612 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::maybe_wrap_dim_slow(long, long, bool) + 0x3d3 (0x7f10a017c023 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: at::functorch::_remove_batch_dim(at::Tensor const&, long, long, long) + 0x5e8 (0x7f0ff6088678 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
frame #3: <unknown function> + 0x23b502 (0x7f0ff608c502 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
frame #4: <unknown function> + 0x1ff6e2 (0x7f0ff60506e2 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
<omitting python frames>
frame #27: __libc_start_main + 0xf3 (0x7f10f1ae70b3 in /lib/x86_64-linux-gnu/libc.so.6)

The exception handling appears to be incorrect.
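For completeness, the difference between the two behaviors above can be checked programmatically. This is a small sketch built from the repro script; the interpretation in the comments reflects the outputs shown in this issue.

```python
# Sketch: with a matching torch/functorch pair the c10 error is translated to
# IndexError; with the mismatch described here it surfaces as RuntimeError
# with a C++ stack trace appended.
import torch
from functorch import vmap

x = torch.randn(2, 3, 5)
try:
    vmap(lambda t: t, out_dims=3)(x)
except IndexError:
    print("expected: IndexError (exception translation works)")
except RuntimeError:
    print("unexpected: RuntimeError (likely ABI mismatch)")
```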

torchtext repro

torchtext is built against torch (cu102).

import torchtext
torchtext._torchtext._build_vocab_from_text_file_using_python_tokenizer("doesnotexist", 10, 10)

When installing torchtext with torch (cpu) and running the above two lines, we get the following error message:

Error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Cannot open input file doesnotexist
Exception raised from _infer_lines at /root/project/torchtext/csrc/vocab.cpp:143 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fbbf0feebbe in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7fbbf0fc9e38 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: torchtext::_infer_lines(std::string const&) + 0x254 (0x7fbb4e94cd84 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/lib/libtorchtext.so)
frame #3: <unknown function> + 0x14bcb (0x7fbb4e674bcb in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
frame #4: <unknown function> + 0x34fb1 (0x7fbb4e694fb1 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
frame #5: <unknown function> + 0x2d7c9 (0x7fbb4e68d7c9 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
<omitting python frames>
frame #19: __libc_start_main + 0xf3 (0x7fbc0bd660b3 in /lib/x86_64-linux-gnu/libc.so.6)

This exhibits the same behavior as the functorch repro: the additional information about the C++ stack trace is not expected.

Versions

PyTorch 1.12 (latest release)
torchtext 0.13 (latest release)
functorch RC binaries

cc @ezyang @gchanan @zou3519 @malfet @seemethere

@zou3519 zou3519 added high priority module: build Build system issues module: cpp-extensions Related to torch.utils.cpp_extension topic: binaries labels Jun 29, 2022
@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

For some more context: this is currently blocking the functorch release. We've brainstormed a few options for now:

  • Option 1: just release functorch binaries that were built against torch (cu102) and live with the exception handling issues
  • Option 2: build a different functorch binary for each cuda version (cpu, cu102, cu113, cu116)
  • Option 3 (from Nikita): root-cause/fix the compatibility issue
  • Option 4 (from Nikita): rebuild all of PyTorch with the same version of compiler (gcd(cuda_supported_compilers) is, alas, gcc-7)
  • Option 5 (from Ed): functorch drops support for cu102

@atalman
Contributor

atalman commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

@malfet
Contributor

malfet commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

Sure, but the problem is bigger than cu102: i.e., if we release PyTorch, do we force devs to use exactly the same version of compiler to build extensions, or do we allow some leeway here? If the latter, we need to figure out what is going on.

@atalman
Contributor

atalman commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

Sure, but the problem is bigger than cu102: i.e., if we release PyTorch, do we force devs to use exactly the same version of compiler to build extensions, or do we allow some leeway here? If the latter, we need to figure out what is going on.

Yes, I agree we need to figure out what's going on anyway, just to understand all our possible options here.

@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

Does the devtoolset change (gcc 9 vs 7) also apply to the conda binaries? (I'm trying to determine if we need to build conda binaries as well) In the past functorch has not published conda binaries (instead, our pip wheels have worked with pytorch pip wheels and conda binaries, but maybe this is not expected)

@atalman
Contributor

atalman commented Jun 29, 2022

Does the devtoolset change (gcc 9 vs 7) also apply to the conda binaries? (I'm trying to determine if we need to build conda binaries as well) In the past functorch has not published conda binaries (instead, our pip wheels have worked with pytorch pip wheels and conda binaries, but maybe this is not expected)

Yes, it's the same with conda, ref 1030

@ezyang
Contributor

ezyang commented Jun 29, 2022

We already force people to run the same version of compiler.

ABI_INCOMPATIBILITY_WARNING = '''

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler ({}) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
'''


In fact, I'm guessing upgrading the devtoolset fixes #51039
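For reference, that warning is emitted by the compiler check in torch.utils.cpp_extension. A minimal sketch of invoking the check directly, assuming the helper is importable under this name in the installed torch version:

```python
# Sketch: run the ABI compatibility check that produces the warning above.
# As the warning text suggests, it enforces compatibility with GCC 5.0+,
# not an exact compiler-version match, which is the gap discussed here.
from torch.utils.cpp_extension import check_compiler_abi_compatibility

print(check_compiler_abi_compatibility("g++"))  # True if the check passes
```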

@malfet
Contributor

malfet commented Jun 29, 2022

We can update the toolset as frequently as we want, but we can't get rid of _GLIBCXX_USE_CXX11_ABI=0, as all manylinux standards expect it to be set.
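For what it's worth, the ABI flavor a given torch build uses can be confirmed from Python. A minimal sketch; both attributes are present in the torch releases discussed here, as far as I can tell:

```python
# Sketch: report the libstdc++ ABI flavor of the installed torch build.
# The official manylinux wheels are expected to report False / ABI=0,
# per the comment above.
import torch

print(torch.compiled_with_cxx11_abi())  # False for the pre-cxx11 (ABI=0) builds
print(torch._C._GLIBCXX_USE_CXX11_ABI)  # the underlying build flag
```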

@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

To reproduce the functorch failures easily:

To reproduce the torchtext problems:

  • download the released torch-cpu wheel
  • download the released torchtext wheel
  • run the script mentioned in the issue

@zou3519
Contributor Author

zou3519 commented Jun 30, 2022

We're unblocking the functorch release by going with Option 5 (drop support for cuda 10.2), but we should still continue to root-cause this (because it may matter for the future, even if we drop cuda 10.2 support from PyTorch)

@atalman atalman added this to the 1.12.1 milestone Jun 30, 2022
@seemethere
Member

We can probably add a check for this in our binary smoke test as well, to make sure we account for it.

@malfet
Contributor

malfet commented Jul 5, 2022

The problem originates from the fact that the cu102 binaries are compiled with gcc-7 (as CUDA 10.2 is not compatible with gcc-9), but the rest of the wheels/conda packages are built using gcc-9. There is a slight C++ ABI change between the two compilers (see https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html), which is recorded in torch._C._PYBIND11_BUILD_ABI.

I.e., for the torch-cpu wheel it returns _cxxabi1013, but for torch-cu102 it returns _cxxabi1011.

We should add a check that all PyTorch Linux nightly binaries are shipped with the same ABI suffix.
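A minimal sketch of such a check using the attribute mentioned above. The expected value is an assumption for illustration; this thread reports _cxxabi1013 for the gcc-9 builds and _cxxabi1011 for the gcc-7/cu102 build:

```python
# Sketch of a nightly-binary smoke check: fail if the installed wheel's
# pybind11 ABI tag differs from the expected tag for this release train.
import torch

EXPECTED_ABI = "_cxxabi1013"  # assumption: the gcc-9 value reported above
actual = torch._C._PYBIND11_BUILD_ABI

if actual != EXPECTED_ABI:
    raise RuntimeError(f"ABI tag mismatch: expected {EXPECTED_ABI}, got {actual}")
print(f"ABI tag OK: {actual}")
```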

@zou3519
Contributor Author

zou3519 commented Jul 7, 2022

FYI, the torchdata release seems to have the same issue because it builds binaries against one of the PyTorch binaries (https://github.com/pytorch/data/blob/release/0.4.0/.github/workflows/_build_test_upload.yml#L57), so it will also need a dot release. cc @ejguan

@ejguan
Contributor

ejguan commented Jul 8, 2022

@zou3519 Thanks for flagging this issue. I don't think this would affect torchdata though, because we only provide CPU binaries and torchdata only depends on the PyTorch Python API rather than libtorch. Let me test.

Edit: It works for torchdata (0.4.0) with torch-cu102 (1.12.0)

@zou3519
Contributor Author

zou3519 commented Jul 8, 2022

@ejguan and I discussed offline; torchdata isn't impacted because it doesn't depend on libtorch.

@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 12, 2022
facebook-github-bot pushed a commit that referenced this issue Jul 13, 2022
…aries (#81058) (#81058)

Summary:
Fixes: #80489

Test using cuda 11.3 manywheel binary:
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
```

Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
```

Functorch test torch : 1.13.0.dev20220707+cu113, functorch with cu102
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)
```

Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
/home/atalman/temp/testc1.py:5: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:73.)
  x = torch.randn(2, 3, 5)
Traceback (most recent call last):
  File "/home/atalman/temp/testc1.py", line 6, in <module>
    vmap(lambda x: x, out_dims=3)(x)
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 361, in wrapped
    return _flat_vmap(
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 488, in _flat_vmap
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

Related Builder  PR: pytorch/builder#1083

Test PR: #81232

Pull Request resolved: #81058
Approved by: https://github.com/zou3519, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d552ba3b4f53da9b6a5f6e0463111e43b367ef8a

Reviewed By: DanilBaibak

Differential Revision: D37813240

Pulled By: atalman

fbshipit-source-id: 94d94e777b0e9d5da106173c06117b3019ba71c4
atalman added a commit to atalman/pytorch that referenced this issue Jul 21, 2022
…aries (pytorch#81058) (pytorch#81058)

atalman added a commit that referenced this issue Jul 21, 2022
…aries (#81058) (#81058) (#81884)
