
test_jit_cuda_archflags fails if current GPU is newer than nvcc #51950

@Flamefire

Description


🐛 Bug

The test_jit_cuda_archflags test JIT-compiles some CUDA code using the compute capability of the currently active GPU.

This test fails if that GPU is newer than the nvcc that was used to compile PyTorch and that is used to run the tests.

Example: We have an HPC cluster with CUDA 10.1 and A100 GPUs. The A100 supports compute capability 8.0, while CUDA 10.1's nvcc supports at most 7.5.
The problem is that the test ignores TORCH_CUDA_ARCH_LIST and blindly overwrites it.
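The failing -gencode flag visible in the log below is derived directly from the device's compute capability, regardless of what the installed nvcc supports. A minimal sketch of that derivation (the function name is illustrative, not PyTorch's actual helper):

```python
def gencode_flag(capability):
    """Build the nvcc -gencode flag from a (major, minor) compute capability.

    On an A100 this yields compute_80/sm_80, which CUDA 10.1's nvcc
    rejects with "Unsupported gpu architecture 'compute_80'".
    """
    major, minor = capability
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"

print(gencode_flag((8, 0)))  # -> -gencode=arch=compute_80,code=sm_80
```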

Note: Upgrading the compiler is not an option here. Software installed on HPC clusters is usually grouped by toolchain generation. Such upgrades do happen, but rarely: they involve installing a full new tree of software built with the newer toolchain and are therefore quite expensive.

To Reproduce

Steps to reproduce the behavior:

  1. Use CUDA 10.1 nvcc and a machine with A100 GPUs
  2. Build PyTorch from source as per the instructions, setting TORCH_CUDA_ARCH_LIST=7.5+PTX
  3. Run test_jit_cuda_archflags
ERROR: test_jit_cuda_archflags (__main__.TestCppExtensionJIT)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/sw/installed/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_cpp_extensions_jit.py", line 198, in test_jit_cuda_archflags
    self._run_jit_cuda_archflags(flags, expected)
  File "test_cpp_extensions_jit.py", line 152, in _run_jit_cuda_archflags
    build_directory=temp_dir,
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cudaext_archflags'
test_jit_cuda_archflags (__main__.TestCppExtensionJIT) ... /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_61 sm_70 compute_70.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/easybuild-tmp/eb-rbMKJa/tmpuextljdy/build.ninja...
Building extension module cudaext_archflags...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o 
FAILED: cuda_extension.cuda.o 
/sw/installed/CUDA/10.1.243/bin/nvcc -ccbin gcc -DTORCH_EXTENSION_NAME=cudaext_archflags -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/TH -isystem /tmp/easybuild-tmp/eb-rbMKJa/tmpSqyw7V/lib/python3.7/site-packages/torch/include/THC -isystem /sw/installed/CUDA/10.1.243/include -isystem /sw/installed/Python/3.7.4-GCCcore-8.3.0/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O2 -std=c++14 -c /tmp/s3248973-EasyBuild/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/cpp_extensions/cuda_extension.cu -o cuda_extension.cuda.o 
nvcc fatal   : Unsupported gpu architecture 'compute_80'

Expected behavior

The test succeeds.

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • How you installed PyTorch (conda, pip, source): source
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: A100

Additional context

I'd suggest either checking for "Unsupported gpu architecture" errors in the test and ignoring that failure, or limiting the test to subsets of the TORCH_CUDA_ARCH_LIST specified by the user.
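The second suggestion could look like the sketch below: filter the arch flags the test exercises down to those both requested via TORCH_CUDA_ARCH_LIST and compilable by the installed nvcc. The helper names and the (partial) capability table are illustrative assumptions, not PyTorch's actual API.

```python
# Maximum compute capability each CUDA toolkit's nvcc can target
# (partial, illustrative table; CUDA 10.1 tops out at sm_75).
MAX_CAPABILITY = {
    "10.1": (7, 5),
    "10.2": (7, 5),
    "11.0": (8, 0),
    "11.1": (8, 6),
}

def parse_arch(entry):
    """Turn a TORCH_CUDA_ARCH_LIST entry like '7.5+PTX' into (7, 5)."""
    major, minor = entry.replace("+PTX", "").split(".")
    return int(major), int(minor)

def supported_archs(arch_list, cuda_version):
    """Keep only the user-requested archs this nvcc can compile for."""
    limit = MAX_CAPABILITY.get(cuda_version, (99, 9))
    return [a for a in arch_list if parse_arch(a) <= limit]

# With CUDA 10.1, an 8.0 entry (A100) would be filtered out:
print(supported_archs(["7.5+PTX", "8.0"], "10.1"))  # -> ['7.5+PTX']
```

If the filtered list is empty, the test could be skipped outright instead of failing.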

cc @gmagogsfm

Metadata

  • Assignees: none
  • Labels: oncall: jit
  • Status: Done
  • Milestone: none