
CPU-only c++ extension libraries (functorch, torchtext) built against PyTorch wheels are not fully compatible with PyTorch wheels #80489

Closed
zou3519 opened this issue Jun 29, 2022 · 15 comments
Labels
high priority module: build Build system issues module: cpp-extensions Related to torch.utils.cpp_extension topic: binaries triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Milestone
1.12.1

Comments

@zou3519
Contributor

zou3519 commented Jun 29, 2022

🐛 Describe the bug

When installing functorch alongside a different PyTorch wheel (torch 1.12 {cpu, cu102, cu113, cu116}) than the one it was built against, we are experiencing either

  1. missing symbol issues on import functorch, or
  2. exception handling issues, where exceptions raised from functorch produce unexpected output. Independently, torchtext exhibits the same issue.

These seem to stem from different symbols existing in the torch (cpu, cu113, cu116) wheels vs the torch (cu102) wheels. Possibly related: pytorch/builder#1028 .

We (@malfet and I) are not sure if this is a problem with PyTorch or the way we build extensions. FWIW this did not happen during the last functorch releases (0.1.x).

functorch repro

See pytorch/functorch#916 for original issue.

Case 1: built functorch against the torch 1.12 (cpu) wheels.

  • When installing functorch with torch (cu102) on the AWS cluster, import torch; import functorch errors with the missing symbol _ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev (see the sketch after this list)
  • When installing functorch with torch (cpu, cu113, cu116), there is no noticeable problem
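A quick way to confirm which side of the boundary is missing the symbol is to inspect the extension's dynamic symbol table. This is a minimal sketch, not from the original report: the extension path is a placeholder and GNU nm is assumed to be available.

```python
# Sketch: check whether functorch's compiled extension expects libstdc++/libtorch
# to provide the symbol that fails to resolve at import time.
# The path below is a placeholder; adjust it to the actual site-packages location
# (we cannot `import functorch` here because that import is what fails).
import subprocess

EXT = "/path/to/site-packages/functorch/_C.so"  # placeholder path
SYMBOL = "_ZNSt19basic_ostringstreamIcSt11char_traitsIcESaIcEEC1Ev"

# `nm -D --undefined-only` lists the dynamic symbols the shared object expects
# its dependencies to provide at load time.
undefined = subprocess.run(
    ["nm", "-D", "--undefined-only", EXT],
    capture_output=True, text=True, check=True,
).stdout

print(SYMBOL in undefined)  # True -> _C.so was built expecting this symbol
```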

Case 2: built functorch against the torch 1.12 (cu102) wheels.

  • When installing functorch with torch (cu102): repro.py gives the expected output
  • When installing functorch with torch (cpu, cu113, cu116): repro.py gives unexpected output

# repro.py
import torch
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)
Expected output

>>> vmap(lambda x: x, out_dims=3)(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 361, in wrapped
    return _flat_vmap(
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 488, in _flat_vmap
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/functorch4/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)

Unexpected output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 366, in wrapped
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
RuntimeError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
Exception raised from maybe_wrap_dim_slow at ../c10/core/WrapDimMinimal.cpp:29 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10a018e612 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::maybe_wrap_dim_slow(long, long, bool) + 0x3d3 (0x7f10a017c023 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: at::functorch::_remove_batch_dim(at::Tensor const&, long, long, long) + 0x5e8 (0x7f0ff6088678 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
frame #3: <unknown function> + 0x23b502 (0x7f0ff608c502 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
frame #4: <unknown function> + 0x1ff6e2 (0x7f0ff60506e2 in /private/home/rzou/local/miniconda3/envs/py39/lib/python3.9/site-packages/functorch/_C.so)
<omitting python frames>
frame #27: __libc_start_main + 0xf3 (0x7f10f1ae70b3 in /lib/x86_64-linux-gnu/libc.so.6)

The exception handling appears to be incorrect.
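For completeness, the difference between the two behaviors above can be checked programmatically. This is a small sketch built from the repro script; the interpretation in the comments reflects the outputs shown in this issue.

```python
# Sketch: with a matching torch/functorch pair the c10 error is translated to
# IndexError; with the mismatch described here it surfaces as RuntimeError
# with a C++ stack trace appended.
import torch
from functorch import vmap

x = torch.randn(2, 3, 5)
try:
    vmap(lambda t: t, out_dims=3)(x)
except IndexError:
    print("expected: IndexError (exception translation works)")
except RuntimeError:
    print("unexpected: RuntimeError (likely ABI mismatch)")
```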

torchtext repro

torchtext is built against torch (cu102).

import torchtext
torchtext._torchtext._build_vocab_from_text_file_using_python_tokenizer("doesnotexist", 10, 10)

When installing torchtext with torch (cpu) and running the above two lines, we get the following error message:

Error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Cannot open input file doesnotexist
Exception raised from _infer_lines at /root/project/torchtext/csrc/vocab.cpp:143 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fbbf0feebbe in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7fbbf0fc9e38 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: torchtext::_infer_lines(std::string const&) + 0x254 (0x7fbb4e94cd84 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/lib/libtorchtext.so)
frame #3: <unknown function> + 0x14bcb (0x7fbb4e674bcb in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
frame #4: <unknown function> + 0x34fb1 (0x7fbb4e694fb1 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
frame #5: <unknown function> + 0x2d7c9 (0x7fbb4e68d7c9 in /private/home/rzou/local/miniconda3/envs/py310/lib/python3.10/site-packages/torchtext/_torchtext.so)
<omitting python frames>
frame #19: __libc_start_main + 0xf3 (0x7fbc0bd660b3 in /lib/x86_64-linux-gnu/libc.so.6)

This exhibits the same behavior as the functorch repro: the additional information about the C++ stack trace is not expected.

Versions

PyTorch 1.12 (latest release)
torchtext 0.13 (latest release)
functorch RC binaries

cc @ezyang @gchanan @zou3519 @malfet @seemethere

@zou3519 zou3519 added high priority module: build Build system issues module: cpp-extensions Related to torch.utils.cpp_extension topic: binaries labels Jun 29, 2022
@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

For some more context: this is currently blocking the functorch release. We've brainstormed a few options for now:

  • Option 1: just release functorch binaries that were built against torch (cu102) and live with the exception handling issues
  • Option 2: build a different functorch binary for each cuda version (cpu, cu102, cu113, cu116)
  • Option 3 (from Nikita): root-cause/fix the compatibility issue
  • Option 4 (from Nikita): rebuild all of PyTorch with the same version of compiler (gcd(cuda_supported_compilers) is, alas, gcc-7)
  • Option 5 (from Ed): functorch drops support for cu102

@atalman
Contributor

atalman commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

@malfet
Contributor

malfet commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

Sure, but the problem is bigger than cu102: i.e., if we release PyTorch, do we force devs to use exactly the same version of compiler to build extensions, or do we allow some leeway here? If the latter, we need to figure out what is going on.

@atalman
Contributor

atalman commented Jun 29, 2022

+1 for Option 5 (from Ed). We plan on dropping cu102 for the next 1.13 release; here is the reference issue: 1026

Sure, but the problem is bigger than cu102: i.e., if we release PyTorch, do we force devs to use exactly the same version of compiler to build extensions, or do we allow some leeway here? If the latter, we need to figure out what is going on.

Yes, I agree we need to figure out what's going on anyway, just to understand all our possible options here.

@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

Does the devtoolset change (gcc 9 vs 7) also apply to the conda binaries? (I'm trying to determine if we need to build conda binaries as well) In the past functorch has not published conda binaries (instead, our pip wheels have worked with pytorch pip wheels and conda binaries, but maybe this is not expected)

@atalman
Contributor

atalman commented Jun 29, 2022

Does the devtoolset change (gcc 9 vs 7) also apply to the conda binaries? (I'm trying to determine if we need to build conda binaries as well) In the past functorch has not published conda binaries (instead, our pip wheels have worked with pytorch pip wheels and conda binaries, but maybe this is not expected)

Yes, it's the same with conda, ref 1030

@ezyang
Contributor

ezyang commented Jun 29, 2022

We already force people to run the same version of compiler.

ABI_INCOMPATIBILITY_WARNING = '''

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler ({}) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
'''


In fact, I'm guessing upgrading the devtoolset fixes #51039
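For reference, that warning is emitted by the compiler check in torch.utils.cpp_extension. A minimal sketch of invoking the check directly, assuming the helper is importable under this name in the installed torch version:

```python
# Sketch: run the ABI compatibility check that produces the warning above.
# As the warning text suggests, it enforces compatibility with GCC 5.0+,
# not an exact compiler-version match, which is the gap discussed here.
from torch.utils.cpp_extension import check_compiler_abi_compatibility

print(check_compiler_abi_compatibility("g++"))  # True if the check passes
```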

@malfet
Contributor

malfet commented Jun 29, 2022

We can update the toolset as frequently as we want, but we can't get rid of _GLIBCXX_USE_CXX11_ABI=0, as all manylinux standards expect it to be set.
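For what it's worth, the ABI flavor a given torch build uses can be confirmed from Python. A minimal sketch; both attributes are present in the torch releases discussed here, as far as I can tell:

```python
# Sketch: report the libstdc++ ABI flavor of the installed torch build.
# The official manylinux wheels are expected to report False / ABI=0,
# per the comment above.
import torch

print(torch.compiled_with_cxx11_abi())  # False for the pre-cxx11 (ABI=0) builds
print(torch._C._GLIBCXX_USE_CXX11_ABI)  # the underlying build flag
```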

@zou3519
Contributor Author

zou3519 commented Jun 29, 2022

To reproduce the functorch failures easily:

To reproduce the torchtext problems:

  • download the released torch-cpu wheel
  • download the released torchtext wheel
  • run the script mentioned in the issue

@zou3519
Contributor Author

zou3519 commented Jun 30, 2022

We're unblocking the functorch release by going with Option 5 (drop support for cuda 10.2), but we should still continue to root-cause this (because it may matter for the future, even if we drop cuda 10.2 support from PyTorch)

@atalman atalman added this to the 1.12.1 milestone Jun 30, 2022
@seemethere
Member

We can probably add a check for this in our binary smoke test as well, to make sure we account for it.

@malfet
Contributor

malfet commented Jul 5, 2022

The problem originates from the fact that the cu102 binaries are compiled with gcc-7 (as CUDA 10.2 is not compatible with gcc-9), but the rest of the wheels/conda packages are built using gcc-9. There is a slight C++ ABI change between the two compilers (see https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Dialect-Options.html), which is recorded in torch._C._PYBIND11_BUILD_ABI.

I.e., for the torch-cpu wheel it returns _cxxabi1013, but for torch-cu102 it returns _cxxabi1011.

We should add a check that all PyTorch Linux nightly binaries are shipped with the same ABI suffix.
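A minimal sketch of such a check using the attribute mentioned above. The expected value is an assumption for illustration; this thread reports _cxxabi1013 for the gcc-9 builds and _cxxabi1011 for the gcc-7/cu102 build:

```python
# Sketch of a nightly-binary smoke check: fail if the installed wheel's
# pybind11 ABI tag differs from the expected tag for this release train.
import torch

EXPECTED_ABI = "_cxxabi1013"  # assumption: the gcc-9 value reported above
actual = torch._C._PYBIND11_BUILD_ABI

if actual != EXPECTED_ABI:
    raise RuntimeError(f"ABI tag mismatch: expected {EXPECTED_ABI}, got {actual}")
print(f"ABI tag OK: {actual}")
```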

@zou3519
Contributor Author

zou3519 commented Jul 7, 2022

FYI, the torchdata release seems to have the same issue because it builds binaries against one of the PyTorch binaries (https://github.com/pytorch/data/blob/release/0.4.0/.github/workflows/_build_test_upload.yml#L57), so it will also need a dot release. cc @ejguan

@ejguan
Contributor

ejguan commented Jul 8, 2022

@zou3519 Thanks for flagging this issue. I don't think this would affect torchdata though, because we only provide CPU binaries and torchdata only depends on the PyTorch Python API rather than libtorch. Let me test.

Edit: It works for torchdata (0.4.0) with torch-cu102 (1.12.0)

@zou3519
Contributor Author

zou3519 commented Jul 8, 2022

@ejguan and I discussed offline; torchdata isn't impacted because it doesn't depend on libtorch.

@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 12, 2022
facebook-github-bot pushed a commit that referenced this issue Jul 13, 2022
…aries (#81058) (#81058)

Summary:
Fixes: #80489

Test using cuda 11.3 manywheel binary:
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
```

Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
```

Functorch test torch : 1.13.0.dev20220707+cu113, functorch with cu102
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)
```

Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
/home/atalman/temp/testc1.py:5: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:73.)
  x = torch.randn(2, 3, 5)
Traceback (most recent call last):
  File "/home/atalman/temp/testc1.py", line 6, in <module>
    vmap(lambda x: x, out_dims=3)(x)
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 361, in wrapped
    return _flat_vmap(
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 488, in _flat_vmap
    return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
    flat_outputs = [
  File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
    _remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

Related Builder  PR: pytorch/builder#1083

Test PR: #81232

Pull Request resolved: #81058
Approved by: https://github.com/zou3519, https://github.com/malfet

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/d552ba3b4f53da9b6a5f6e0463111e43b367ef8a

Reviewed By: DanilBaibak

Differential Revision: D37813240

Pulled By: atalman

fbshipit-source-id: 94d94e777b0e9d5da106173c06117b3019ba71c4
atalman added a commit to atalman/pytorch that referenced this issue Jul 21, 2022
…aries (pytorch#81058) (pytorch#81058)

atalman added a commit that referenced this issue Jul 21, 2022
…aries (#81058) (#81058) (#81884)
