
test_jit_cuda_archflags fails if current GPU is newer than nvcc #51950

@Flamefire

Description


🐛 Bug

The test_jit_cuda_archflags test JIT-compiles some CUDA code using the compute capability of the currently active GPU.

This test fails if that GPU is newer than the nvcc that was used to compile PyTorch and that is used to run the tests.

Example: We have an HPC cluster with CUDA 10.1 and A100 GPUs. The A100 supports compute capability 8.0, while CUDA 10.1's nvcc supports at most 7.5.
The problem is that the test ignores TORCH_CUDA_ARCH_LIST and blindly overwrites it.
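The failing -gencode flag visible in the log below is derived directly from the device's compute capability, regardless of what the installed nvcc supports. A minimal sketch of that derivation (the function name is illustrative, not PyTorch's actual helper):

```python
def gencode_flag(capability):
    """Build the nvcc -gencode flag from a (major, minor) compute capability.

    On an A100 this yields compute_80/sm_80, which CUDA 10.1's nvcc
    rejects with "Unsupported gpu architecture 'compute_80'".
    """
    major, minor = capability
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"

print(gencode_flag((8, 0)))  # -> -gencode=arch=compute_80,code=sm_80
```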

Note: Upgrading the compiler is not an option here. Software installed on HPC clusters is usually grouped by toolchain generation. Such upgrades do happen, but rarely: they involve installing a full new tree of software built with the newer toolchain and are therefore quite expensive.

To Reproduce

Steps to reproduce the behavior:

  1. Use CUDA 10.1 nvcc and a machine with A100 GPUs
  2. Build PyTorch from source as per the instructions, setting TORCH_CUDA_ARCH_LIST=7.5+PTX
  3. Run test_jit_cuda_archflags
ERROR: test_jit_cuda_archflags (__main__.TestCppExtensionJIT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_cpp_extensions_jit.py", line 198, in test_jit_cuda_archflags
    self._run_jit_cuda_archflags(flags, expected)
  File "test_cpp_extensions_jit.py", line 152, in _run_jit_cuda_archflags
    build_directory=temp_dir,
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cudaext_archflags'
test_jit_cuda_archflags (__main__.TestCppExtensionJIT) ... /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_61 sm_70 compute_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/easybuild-tmp/eb-rbMKJa/tmpuextljdy/build.ninja...
Building extension module cudaext_archflags...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o 
FAILED: cuda_extension.cuda.o 
/sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o 
nvcc fatal   : Unsupported gpu architecture 'compute_80'

Expected behavior

The test succeeds.

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • How you installed PyTorch (conda, pip, source): source
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: A100

Additional context

I'd suggest either checking for "Unsupported gpu architecture" errors in the test and ignoring that failure, or limiting the test to subsets of the TORCH_CUDA_ARCH_LIST specified by the user.
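The second suggestion could look like the sketch below: filter the arch flags the test exercises down to those both requested via TORCH_CUDA_ARCH_LIST and compilable by the installed nvcc. The helper names and the (partial) capability table are illustrative assumptions, not PyTorch's actual API.

```python
# Maximum compute capability each CUDA toolkit's nvcc can target
# (partial, illustrative table; CUDA 10.1 tops out at sm_75).
MAX_CAPABILITY = {
    "10.1": (7, 5),
    "10.2": (7, 5),
    "11.0": (8, 0),
    "11.1": (8, 6),
}

def parse_arch(entry):
    """Turn a TORCH_CUDA_ARCH_LIST entry like '7.5+PTX' into (7, 5)."""
    major, minor = entry.replace("+PTX", "").split(".")
    return int(major), int(minor)

def supported_archs(arch_list, cuda_version):
    """Keep only the user-requested archs this nvcc can compile for."""
    limit = MAX_CAPABILITY.get(cuda_version, (99, 9))
    return [a for a in arch_list if parse_arch(a) <= limit]

# With CUDA 10.1, an 8.0 entry (A100) would be filtered out:
print(supported_archs(["7.5+PTX", "8.0"], "10.1"))  # -> ['7.5+PTX']
```

If the filtered list is empty, the test could be skipped outright instead of failing.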

cc @gmagogsfm

Metadata

  • Assignees: none
  • Labels: oncall: jit
  • Status: Done
  • Milestone: none