C++17 for PyTorch #56055
Comments
This was also my first thought. It doesn't help with total download size, but at least it would help stay under 1 GB per wheel, which is the absolute maximum PyPI will allow. It looks like it's just possible to split the package into two roughly equally sized parts that would each stay under 1 GB.

Package sizes

To get an impression of file sizes, I unpacked a nightly wheel. It's 1.9 GB as a wheel.
Then I did the same for a conda nightly package. It's 1.3 GB packaged, and 3.1 GB unpacked.
Dynamic vs static linking
I'm not sure I understand this. Dynamic linking should be much better for package size; just comparing the conda packages and wheels shows that. Also compare the conda-forge pytorch packages, which do everything dynamically and are 425 MB for cuda10.2, versus 700 MB for the pytorch nightlies, which I believe still use static linking partially. There may be other differences, but I think it's mostly dynamic linking - I have checked that conda-forge uses the default cuDNN version for each CUDA version.
If this is based on "but then we have to bundle in dynamic libs from CUDA itself", then I don't think that is the right comparison - doing that is not even allowed by the CUDA EULA as far as I know.

Other ideas
Longer term & related ideas

Right now only a single CUDA version can be put on PyPI. Currently that is CUDA 10.2; CUDA 11.1, ROCm 4.0 and CPU-only are all in a separate wheelhouse like https://download.pytorch.org/whl/torch_stable.html. That makes it very difficult to depend on; in effect, distributing a downstream package that depends on those is impractical. This should be solved somehow - and if the solution involves making it easier/better to do package hosting outside of PyPI (e.g. via custom wheelhouses or interacting with another package manager), then the limitations of PyPI may become less of a limiting factor here as well. Relevant discussion and blog post:
My personal impression is that the problem is only going to get worse (CUDA continues to grow).

Possible next step

None of these solutions are ideal; there's no clear winner imho. It would be great to have a better idea of what each technical change would bring - something like a "package size budget", where the total size is broken down into contributions from each dependency and build option. Does anyone have anything like this, and if not, would it be worth producing it? |
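As a rough sketch of what such a "package size budget" could look like in practice - assuming the wheel has been unpacked into a local `torch/` directory, and noting this is not an existing PyTorch tool - one could sum file sizes per top-level component (using std::filesystem, itself a C++17 addition):

```cpp
// Hedged sketch of a per-component "package size budget": sum file sizes per top-level
// directory of an unpacked wheel. The "torch" path is an assumption about where the
// wheel was unpacked; this is not an existing PyTorch tool.
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <map>
#include <string>

namespace fs = std::filesystem;

int main() {
    const fs::path root = "torch";  // unpacked wheel contents (assumed location)
    std::map<std::string, std::uintmax_t> budget;

    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file()) continue;
        // Attribute each file to the top-level component it sits in (e.g. "lib", "include").
        const fs::path rel = fs::relative(entry.path(), root);
        budget[rel.begin()->string()] += entry.file_size();
    }

    for (const auto& [component, bytes] : budget)
        std::cout << component << ": " << bytes / (1024.0 * 1024.0) << " MiB\n";
}
```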
Just noticed that gh-49050 (which split |
Thanks for the detailed analysis! CC @eqy who is looking into the memory requirements as well. |
One thing about dropping CUDA 10: the Jetsons are currently incompatible with CUDA 11 (they don't use PCIe, so they need their own driver, which has not yet been updated as far as I understand). |
Just to keep a list, here are some performance reasons to switch to C++17:
|
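One example of the kind of C++17 feature often cited for performance is `if constexpr`, which discards the untaken branch at compile time instead of requiring extra overloads or runtime dispatch. A minimal sketch (not from the PyTorch codebase):

```cpp
// Hedged illustration of a performance-relevant C++17 feature: `if constexpr`
// prunes the untaken branch at compile time, so no runtime check or extra
// template overload is needed.
#include <iostream>
#include <string>
#include <type_traits>

template <typename T>
double as_double(const T& value) {
    if constexpr (std::is_arithmetic_v<T>) {
        return static_cast<double>(value);         // arithmetic types: direct cast
    } else {
        return static_cast<double>(value.size());  // containers/strings: use their size
    }
}

int main() {
    std::cout << as_double(3) << " " << as_double(std::string("abc")) << "\n";
}
```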
Note: PyTorch 1.12 uses |
@xloem So pytorch requires C++17 in fact. |
Line 159 in af0160c
pytorch/cmake/TorchConfig.cmake.in, line 184 in af0160c
|
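On the question of which standard downstream code must build with: a tiny, hypothetical compile-time guard (not part of the PyTorch headers) can fail fast when a translation unit is not compiled as C++17 or newer:

```cpp
// Hypothetical guard (not PyTorch code): fail early if the TU is not C++17 or newer.
// MSVC only reports an accurate __cplusplus with /Zc:__cplusplus, so _MSVC_LANG is
// checked first as a fallback.
#if defined(_MSVC_LANG)
static_assert(_MSVC_LANG >= 201703L, "C++17 or newer is required");
#else
static_assert(__cplusplus >= 201703L, "C++17 or newer is required");
#endif

int main() { return 0; }
```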
I believe the |
FWIW, as a framework dependency, PyTorch/XLA is now transitioning to C++17 as well. |
Just to follow up here: I haven't looked into the details of correctness. Some time ago I attempted to compile PyTorch with clang, and the `aligned_alloc` code would not compile until I enabled C++17. I'm afraid I no longer have the LLVM version in question. The `aligned_alloc` concern should be in a separate issue, and I apologise for clogging this thread. Since I am the only person who has mentioned it, it is probably minor. |
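For context on `std::aligned_alloc`: it was only added to the C++ standard library in C++17, and some toolchains (e.g. older Android libc++ builds, as the PR below also notes) don't provide it. A hedged, POSIX-only sketch of the usual guard-and-fallback pattern, not PyTorch's actual allocator code:

```cpp
// Hedged, POSIX-only sketch (not PyTorch's allocator): std::aligned_alloc is a C++17
// library addition, so code that must also build pre-C++17 or on toolchains without it
// typically guards the call and falls back to posix_memalign.
#include <cstdio>
#include <cstdlib>

void* alloc_aligned(std::size_t alignment, std::size_t size) {
#if defined(__ANDROID__) || __cplusplus < 201703L
    // Fallback: posix_memalign requires alignment to be a power of two and a
    // multiple of sizeof(void*).
    void* ptr = nullptr;
    return posix_memalign(&ptr, alignment, size) == 0 ? ptr : nullptr;
#else
    // C++17 path: size must be an integral multiple of alignment.
    return std::aligned_alloc(alignment, size);
#endif
}

int main() {
    void* p = alloc_aligned(64, 256);  // 256 is a multiple of 64, as aligned_alloc requires
    std::printf("aligned pointer: %p\n", p);
    std::free(p);
}
```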
With CUDA-10.2 gone we can finally do it! This PR mostly contains build-system-related changes; invasive functional ones are to follow. Among many expected tweaks to the build system, here are a few unexpected ones:
- Force the onnx_proto project to be updated to C++17 to avoid `duplicate symbols` errors when compiled by gcc-7.5.0, as the storage rule for `constexpr` changed in C++17, but gcc does not seem to follow it.
- Do not use `std::apply` on CUDA but rely on the built-in variant, as it results in test failures when the CUDA runtime picks the host rather than the device function when `std::apply` is invoked from CUDA code.
- `std::decay_t` -> `::std::decay_t` and `std::move` -> `::std::move`, as VC++ for some reason claims that the `std` symbol is ambiguous.
- Disable use of `std::aligned_alloc` on Android, as its `libc++` does not implement it.

Some prerequisites:
- pytorch#89297
- pytorch#89605
- pytorch#90228
- pytorch#90389
- pytorch#90379
- pytorch#89570
- facebookincubator/gloo#336
- facebookincubator/gloo#343
- pytorch/builder@919676f

Fixes pytorch#56055

Pull Request resolved: pytorch#85969
Approved by: https://github.com/ezyang, https://github.com/kulinseth
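The `duplicate symbols` / `constexpr` item above refers to a language change in C++17: static constexpr data members became implicitly `inline`. A minimal illustration of the rule, not code from the PR:

```cpp
// Minimal illustration (not from the PR) of the C++17 storage-rule change for constexpr.
// Since C++17, a `static constexpr` data member is implicitly `inline`, so the in-class
// initializer alone provides its single definition across translation units. In C++14,
// odr-using the member (e.g. taking its address) additionally required exactly one
// out-of-line definition:
//     constexpr int Config::kAlignment;   // required pre-C++17, redundant (deprecated) in C++17
#include <iostream>

struct Config {
    static constexpr int kAlignment = 64;
};

int main() {
    const int* p = &Config::kAlignment;  // odr-use; links fine in C++17 without a definition
    std::cout << *p << "\n";
}
```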
I just opened a new issue, but was any thought given to upgrading the C standard to C17 as well? |
uhh, do we have that much C code? lol |
We are planning to migrate the PyTorch codebase to C++17, but are currently blocked by CUDA. This issue summarizes a discussion I had with @malfet and @ngimel about this.
CUDA 11 is the first CUDA version to support C++17, so we'd have to drop support for CUDA 10, but there are good reasons for us to keep CUDA 10 around still:
These issues would have to be fixed in CUDA; there's not much we can do on the PyTorch side.
Workarounds we considered (kudos to @malfet and @ngimel for the ideas)
- Passing `-std=c++14 '-Xcompiler -std=c++17'` to nvcc, i.e. keeping device code at C++14 while the host compiler is invoked in C++17 mode.
- `-fvisibility=hidden`, so it could cause symbol conflicts with other libraries.
- Replacing the `cuda_kernel<<<1, 1>>>(arguments, argument, argument)` launch syntax with `cuLaunchKernel`; then we would be able to avoid nvcc for code outside of CUDA kernels, which would allow us to write it in C++17 (see the rough driver-API sketch below). However, this would be a nontrivial engineering effort. Also, kineto isn't correctly recording kernels launched with `cuLaunchKernel`, so that would have to be fixed too.

cc @malfet @seemethere @walterddr @ngimel
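To make the `cuLaunchKernel` idea concrete, here is a rough, hedged sketch of a driver-API launch written as plain C++ host code (no nvcc needed for this file, link with `-lcuda`). It assumes a kernel named `saxpy` has already been compiled separately into `saxpy.ptx`; error handling is omitted, and this is not how PyTorch actually launches kernels:

```cpp
// Hedged sketch (not PyTorch code): launching a kernel via the CUDA driver API from
// plain C++ host code, instead of the nvcc-only <<<...>>> syntax.
#include <cuda.h>   // CUDA driver API
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;   cuModuleLoad(&mod, "saxpy.ptx");        // kernel compiled separately (assumed file)
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "saxpy"); // assumed kernel name

    int n = 1024; float a = 2.0f;
    CUdeviceptr x, y;
    cuMemAlloc(&x, n * sizeof(float));
    cuMemAlloc(&y, n * sizeof(float));

    // Equivalent of: saxpy<<<grid, block>>>(n, a, x, y);
    void* args[] = { &n, &a, &x, &y };
    cuLaunchKernel(fn,
                   /*grid*/  (n + 255) / 256, 1, 1,
                   /*block*/ 256, 1, 1,
                   /*sharedMemBytes*/ 0, /*stream*/ nullptr,
                   args, /*extra*/ nullptr);
    cuCtxSynchronize();

    cuMemFree(x); cuMemFree(y);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    std::printf("done\n");
    return 0;
}
```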