Build failure with "TORCH_CUDA_API" is undefined and more #52331

edrozenberg · 2021-02-16T23:51:24Z

🐛 Bug

Failing to build from source. Have built successfully some months ago (pytorch-20200514_bbfd0ef), but failing to build now. For the earlier successful build the OS packages were older, gcc was older, nvidia stack was older, pytorch was older.

To Reproduce

Steps to reproduce the behavior:

git clone the source
git submodule sync
git submodule update --init --recursive
Set env vars
python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

Build issue appears to start at this section of the build output:

[4860/5986] Building NVCC (Device) object ...src/THC/torch_cuda_generated_THCSleep.cu.
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/torch_cuda_generated_THCSleep.cu.o 
cd /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC && /usr/bin/cmake -E make_directory /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/. && /usr/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=Release -D generated_file:STRING=/usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/./torch_cuda_generated_THCSleep.cu.o -D generated_cubin_file:STRING=/usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/./torch_cuda_generated_THCSleep.cu.o.cubin.txt -P /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/torch_cuda_generated_THCSleep.cu.o.Release.cmake
/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: identifier "TORCH_CUDA_API" is undefined

/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: "THCState" has already been declared in the current scope

/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: expected a ";"

Complete build output messages:
https://gist.githubusercontent.com/edrozenberg/6e2a25c76d7c62533204974bd4499a47/raw/4c7d3875063f186a8044afc15958c723e3f87732/pytorch%2520build%2520log%25202021-02-16.txt

Expected behavior

Successful build to the target dir

Environment

PyTorch Version (e.g., 1.0): 2021-02-16 52af23b
OS (e.g., Linux): Slackware Linux 64 -current (pre-15.0)
How you installed PyTorch (conda, pip, source): source
Build command you used (if compiling from source): python3 setup.py install --root=/usr/local/src/pytorch/pkg/new
Python version: 3.9.1
CUDA/cuDNN version: cuda-11.2.1 / cudnn-8.1.0.77_11.2
GPU models and configuration: TITAN X (Pascal) (12GB), GeForce GT 630 (2GB)
Any other relevant information: magma-2.5.4, nvidia-driver-460.39, nvidia-nccl-2.8.4.1_11.2

Additional context

Using the following build approach:

#!/usr/bin/bash

export TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5;8.0;8.6"
export NCCL_INCLUDE_DIR="/opt/nvidia/nccl/include"
export NCCL_ROOT_DIR="/opt/nvidia/nccl"
export USE_SYSTEM_NCCL=1

cd pytorch-git

python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

Built and installed magma from source

Linux 5.10.15 #1 SMP Wed Feb 10 14:06:55 CST 2021 x86_64 
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz GenuineIntel GNU/Linux

magma-2.5.4

gcc-10.2.0
gcc-brig-10.2.0
gcc-g++-10.2.0
gcc-gdc-10.2.0
gcc-gfortran-10.2.0
gcc-gnat-10.2.0
gcc-go-10.2.0
gcc-objc-10.2.0
gccmakedep-1.0.3

automake-1.16.2
cmake-3.19.4
gccmakedep-1.0.3
imake-1.0.8
make-4.3
makedepend-1.0.6
pmake-1.111

nvidia-cuda-11.2.1
nvidia-cudnn-8.1.0.77_11.2
nvidia-driver-460.39
nvidia-kernel-460.39_5.10.15
nvidia-ml-py3-7.352.0
nvidia-nccl-2.8.4.1_11.2
nvidia-tensorrt-7.2.2.3_11.1

cc @malfet @seemethere @walterddr @ngimel

edrozenberg · 2021-02-16T23:56:48Z

Have looked at existing issues, forum posts etc. The only clue is that it might be related to NCCL, but I don't have any strong idea. Also tried with pytorch-1.7.1 and pytorch-1.8.0-rc2; same issues, I believe.

pascal-soveaux · 2021-02-17T15:51:25Z

Hi,

Maybe related: I once ran into this issue when I had a different pytorch version in my PATH.

PATH=/third_party/libtorch/include 1.XXX <= From my install dir
pytorch1.YYY/build$ make <= Current compile

-DCMAKE_INSTALL_PREFIX=/usr/local/src/pytorch/src/pytorch-git/torch
/usr/local/src/pytorch/src/pytorch-git/torch is your install dir.
/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: identifier "TORCH_CUDA_API" is undefined

Run make clean first, or remove the previous install.

Pascal
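
The check for a second PyTorch install on a search path can be sketched as below. This is only an illustration: `path_entries_matching` is a hypothetical helper name, and which variable actually matters for a given build (PATH, CPATH, CMAKE_PREFIX_PATH, or something else) is an assumption to verify locally.

```shell
# Hypothetical helper: print the entries of a colon-separated search path
# that contain the given substring. Returns non-zero when nothing matches.
path_entries_matching() {
  local path_list="$1" needle="$2"
  printf '%s\n' "$path_list" | tr ':' '\n' | grep -F -- "$needle"
}

# Example use (the variables to inspect are an assumption, adjust as needed):
# path_entries_matching "$PATH" torch
# path_entries_matching "$CPATH" torch
# path_entries_matching "$CMAKE_PREFIX_PATH" torch
```

Any entry printed here that points at a different PyTorch or libtorch tree is worth removing from the environment before rebuilding.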

walterddr · 2021-02-17T20:42:30Z

Looks like #49050 changed TORCH_CUDA_API and split it into two. Could you follow the instructions here to do a clean build?

edrozenberg · 2021-02-17T22:13:12Z

@walterddr thanks, same issue after a clean build with the latest git pull. I have always been running make clean before the builds anyway (which deletes the build dir). I also tried export USE_SYSTEM_NCCL=0 just to see what happens: same problem.

Currently unable to build any version of torch (the 1.7.1 build fails for a different reason related to an fbgemm issue filed by someone else already). Wonder what magic spells people are casting to get this project to build.

Only success so far was with export USE_CUDA=0 but that's not useful.

edrozenberg · 2021-02-17T22:24:17Z

My other option for now is to use the wheel https://files.pythonhosted.org/packages/d6/c1/70f2fd464a895844a9bf4cf1d93b09eb6cd5edf8274d19a7fed2ed6c4cc3/torch-1.7.1-cp39-cp39-win_amd64.whl

But that comes with its own set of issues:

Built with Intel MKL - packaging Intel's messy MKL is its own project, and MKL doesn't work on AMD without ugly hacks that will eventually stop working. Prefer OpenBLAS even if perf is not quite as good.
CUDA arch stops at 75 which excludes the latest devices
Built with old GCC 7
Built for old CUDA 10 and old CuDNN 7 (but I guess still works, still compatible)

edrozenberg · 2021-02-19T23:26:15Z

Thanks to @walterddr for mentioning TORCH_CUDA_API - that put me on the path to figuring out what was going on, after another few days and a few dozen failed build attempts.

The issue was the git checkout was inconsistent/not in a properly updated state, despite my doing make clean; git reset --hard; git pull; git submodule update --init --recursive

I trusted git, or my knowledge of git. That was a mistake. Doing a brand new clean checkout should have been step 1, not step 1057, on my way to troubleshooting this. Lesson learned.
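
A likely explanation, though not confirmed in this thread: `git reset --hard` only restores tracked files, so untracked or ignored build output stays on disk. The sketch below does a dry run that lists such leftovers without deleting anything. `list_leftovers` is a hypothetical helper name.

```shell
# Hypothetical helper: dry run listing untracked and ignored paths that
# survive `git reset --hard`. The -n flag means nothing is deleted;
# -x includes ignored files and -d includes untracked directories.
list_leftovers() {
  git -C "$1" clean -xdn
}

# Example use:
# list_leftovers pytorch-git
```

If the dry run lists stale build output, `git clean -xfd` removes it, as does a fresh clone. Note that `git clean -xfd` also deletes any local untracked work in the checkout.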

The inconsistency was the following -

Old git checkout (after reset, pull, clean, submodule update etc.):

# grep -r TORCH_CUDA_API pytorch-git.old
pytorch-git.old/torch/include/THC/THCGeneral.h:#define THC_API TORCH_CUDA_API
pytorch-git.old/torch/include/THC/THCGeneral.h:#define THC_CLASS TORCH_CUDA_API
pytorch-git.old/torch/include/THC/THCAllocator.h:class TORCH_CUDA_API THCIpcDeleter {
pytorch-git.old/torch/include/torch/csrc/jit/codegen/cuda/tensor_meta.h:#include <torch/csrc/WindowsTorchApiMacro.h> // TORCH_CUDA_API
pytorch-git.old/torch/include/torch/csrc/jit/codegen/cuda/tensor_meta.h:struct TORCH_CUDA_API TensorContiguity {
pytorch-git.old/torch/include/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API
pytorch-git.old/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API

New fresh git checkout:

# grep -r TORCH_CUDA_API pytorch-git
pytorch-git/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API

The old checkout still had some code referencing TORCH_CUDA_API, causing the build failure.
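
The grep comparison above can be turned into a quick check before rebuilding. This is only a sketch: `stale_macro_headers` is a hypothetical helper name, and it assumes that `torch/include` holds headers copied there by an earlier build (the CMAKE_INSTALL_PREFIX quoted earlier in the thread points at the `torch` directory). It lists files that mention the macro anywhere except on a line that starts with `//`.

```shell
# Hypothetical helper: list files under <checkout>/torch/include that still
# mention a macro, skipping lines that are pure // comments.
stale_macro_headers() {
  local checkout="$1" macro="$2"
  local include_dir="$checkout/torch/include"
  # Nothing to check if no earlier build populated the directory.
  [ -d "$include_dir" ] || return 0
  grep -rHw -- "$macro" "$include_dir" \
    | grep -v '^[^:]*:[[:space:]]*//' \
    | cut -d: -f1 \
    | sort -u
}

# Example use:
# stale_macro_headers pytorch-git TORCH_CUDA_API
```

Any path printed is a candidate stale header. An empty result does not prove the tree is clean; it only means this one macro was not found in that directory.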

For reference, this is the build script I currently use; it successfully built pytorch 2021-02-19 941ebec:

#!/usr/bin/bash

# NOTES
# o We need to disable the fbgemm build for versions 1.7.1 and 1.8 which fail to 
#   build with it enabled - seems only the torch git checkout source can build
#   with the current fbgemm source.

# Set source version
VERSION=${VERSION:-git}

# Define build params
export BLAS='OpenBLAS'
#export USE_FBGEMM=0
#export USE_CUDA=0
export BUILD_SPLIT_CUDA=1
export TORCH_CUDA_ARCH_LIST='6.1;7.0;7.5;8.0;8.6'
export USE_SYSTEM_NCCL=1
export NCCL_ROOT_DIR='/opt/nvidia/nccl'
export NCCL_INCLUDE_DIR='/opt/nvidia/nccl/include'

# Go to the source
cd "pytorch-${VERSION}" || exit 1

# Clean the source
make clean

# Update source if git
if [ "$VERSION" = "git" ];
then
  git pull
  git submodule sync
  git submodule update --init --recursive
fi

# Build the source
python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

My only remaining wish is that the thousands of deprecation warnings and out-of-bounds warnings get fixed. That would make errors easier to find in the output, and make it much easier to diff and scroll through the output of multiple build attempts. Hopefully the pytorch project will get around to those cleanups.

bryan-lunt · 2021-03-25T17:00:39Z

I'm having this same issue trying to build v1.8.0 .

edrozenberg · 2021-03-25T17:17:24Z

> I'm having this same issue trying to build v1.8.0.

@bryan-lunt yes I've had 0 luck building 1.8.0 or any older release. Only thing that fully builds is a git checkout of latest dev source. My 1.8.0 failure is related to fbgemm and there is no solution currently that I could find (other than disabling fbgemm with USE_FBGEMM=0 when building 1.8.0).

bryan-lunt · 2021-03-25T17:19:52Z

Even when I disable fbgemm I still get the TORCH_CUDA_API undefined error.

edrozenberg · 2021-03-25T17:23:02Z

> Even when I disable fbgemm I still get the TORCH_CUDA_API undefined error.

Would suggest abandoning your attempt to build 1.8.0 unless a torch developer tells you how to do it, I certainly have no clue. Build git main instead, fresh checkout, with build params including BUILD_SPLIT_CUDA=1

bryan-lunt · 2021-03-25T18:18:46Z

I checked out the master branch and I'm still getting

pytorch/torch/include/THC/THCGeneral.h(41): error: identifier "TORCH_CUDA_API" is undefined

:(

bryan-lunt · 2021-03-25T19:05:06Z

The problem seems to be in cuda generated code. I wonder if it has something to do with using too old a version of the cuda toolkit.

edrozenberg · 2021-03-25T20:57:59Z

Could be old CUDA. The build worked for me earlier today using my build script above with git checkout 2021-03-25 911b8b1. Make sure nvidia-smi works, and run some of the basic samples that come with CUDA (for example samples/0_Simple/matrixMulCUBLAS) to confirm the CUDA stack works. No other ideas, sorry.

nvidia-cuda-11.2.2
nvidia-cudnn-8.1.1.33_11.2
nvidia-driver-460.67
nvidia-kernel-460.67_5.10.25
nvidia-ml-py3-7.352.0
nvidia-nccl-2.8.4.1_11.2
nvidia-tensorrt-7.2.3.4_11.1
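
The sanity checks suggested above can start with confirming that the CUDA tools are on PATH at all. A small sketch follows; `missing_tools` is a hypothetical helper name, and it only checks for presence, not for a toolkit version that PyTorch accepts.

```shell
# Hypothetical helper: print one "missing: <name>" line for each named
# command that is not found on PATH. Prints nothing if all are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

# Example use:
# missing_tools nvidia-smi nvcc
```

If both tools are present, `nvidia-smi` reports the driver and visible GPUs and `nvcc --version` reports the toolkit release, which can then be compared against the package list above.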

ejguan added the module: build (Build system issues) and triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module) labels Feb 17, 2021

malfet added the module: cuda (Related to torch.cuda, and CUDA support in general) label Feb 17, 2021

edrozenberg closed this as completed Feb 19, 2021
