Description
🐛 Bug
The test test_jit_cuda_archflags JIT-compiles some CUDA code using the compute capability of the currently used GPU.
This test fails if that GPU is newer than the nvcc that was used to compile PyTorch and that runs the tests.
Example: We have an HPC cluster with CUDA 10.1 and A100 GPUs, which have compute capability 8.0, while nvcc supports only up to 7.5.
The problem is that the test ignores TORCH_CUDA_ARCH_LIST and blindly overwrites it.
Note: Upgrading the compiler is not an option here. The software installed on HPC clusters is usually grouped by toolchain generation. Upgrades do happen, but rarely, and they involve installing a full new tree of software built with the newer toolchain, which is quite expensive.
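A minimal sketch of the mechanism at fault: the test derives an nvcc -gencode flag directly from the detected device capability, so a capability newer than the installed toolkit produces an arch flag nvcc rejects. The helper below is illustrative only, not PyTorch's actual code.

```python
def capability_to_gencode(major, minor):
    # Turn a detected compute capability such as (8, 0) into the
    # -gencode flag that shows up in the failing nvcc command line
    # further down in this report. Illustrative sketch, not the
    # real helper used by torch.utils.cpp_extension.
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"

# An nvcc from CUDA 10.1 only knows arches up to compute_75, so the
# flag generated for an A100 (capability 8.0) is rejected outright.
print(capability_to_gencode(8, 0))
```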
To Reproduce
Steps to reproduce the behavior:
- Use CUDA 10.1 nvcc and a machine with A100 GPUs
- Build PyTorch as per the instructions, setting TORCH_CUDA_ARCH_LIST=7.5+PTX
- Run test_jit_cuda_archflags
ERROR: test_jit_cuda_archflags (__main__.TestCppExtensionJIT)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/subprocess.py", line 487, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test_cpp_extensions_jit.py", line 198, in test_jit_cuda_archflags
self._run_jit_cuda_archflags(flags, expected)
File "test_cpp_extensions_jit.py", line 152, in _run_jit_cuda_archflags
build_directory=temp_dir,
File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cudaext_archflags'
test_jit_cuda_archflags (__main__.TestCppExtensionJIT) ... /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_61 sm_70 compute_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/easybuild-tmp/eb-rbMKJa/tmpuextljdy/build.ninja...
Building extension module cudaext_archflags...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o
FAILED: cuda_extension.cuda.o
/sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_80'
Expected behavior
Tests succeed
Environment
- PyTorch Version (e.g., 1.0): 1.7.1
- How you installed PyTorch (conda, pip, source): source
- CUDA/cuDNN version: 10.1
- GPU models and configuration: A100
Additional context
I'd suggest either checking for "Unsupported gpu architecture" errors in the test and ignoring such failures, or limiting the test to subsets of the TORCH_CUDA_ARCH_LIST specified by the user.
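The second option could look roughly like the sketch below: parse the user-specified TORCH_CUDA_ARCH_LIST and only exercise flag combinations that are a subset of it. The function names and the handling of "+PTX" suffixes are assumptions for illustration, not the test's real helpers.

```python
def archs_from_list(value):
    # Parse a TORCH_CUDA_ARCH_LIST-style string such as "7.0;7.5+PTX"
    # (semicolon- or space-separated) into a set of base architectures.
    return {token.replace("+PTX", "")
            for token in value.replace(";", " ").split()}

def usable_test_flags(candidate_flags, user_arch_list):
    # Keep only the candidate flag sets whose architectures all appear
    # in the user-specified list, so the test never hands nvcc an arch
    # the build was not configured for.
    allowed = archs_from_list(user_arch_list)
    return [f for f in candidate_flags if archs_from_list(f) <= allowed]

# With TORCH_CUDA_ARCH_LIST=7.0;7.5+PTX, an 8.0 candidate is skipped.
print(usable_test_flags(["7.5", "8.0", "7.0+PTX"], "7.0;7.5+PTX"))
```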
cc @gmagogsfm