Build failure with "TORCH_CUDA_API" is undefined and more #52331

Closed
edrozenberg opened this issue Feb 16, 2021 · 13 comments
Labels
module: build Build system issues module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@edrozenberg

edrozenberg commented Feb 16, 2021

🐛 Bug

Building from source fails. A build succeeded some months ago (pytorch-20200514_bbfd0ef), but it fails now. For that earlier successful build the OS packages, gcc, the NVIDIA stack, and pytorch itself were all older.

To Reproduce

Steps to reproduce the behavior:

  1. git clone the source
  2. git submodule sync
  3. git submodule update --init --recursive
  4. Set env vars
  5. python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

Build issue appears to start at this section of the build output:

[4860/5986] Building NVCC (Device) object ...src/THC/torch_cuda_generated_THCSleep.cu.
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/torch_cuda_generated_THCSleep.cu.o 
cd /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC && /usr/bin/cmake -E make_directory /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/. && /usr/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=Release -D generated_file:STRING=/usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/./torch_cuda_generated_THCSleep.cu.o -D generated_cubin_file:STRING=/usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/./torch_cuda_generated_THCSleep.cu.o.cubin.txt -P /usr/local/src/pytorch/src/pytorch-git/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/THC/torch_cuda_generated_THCSleep.cu.o.Release.cmake
/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: identifier "TORCH_CUDA_API" is undefined

/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: "THCState" has already been declared in the current scope

/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: expected a ";"

Complete build output messages:
https://gist.githubusercontent.com/edrozenberg/6e2a25c76d7c62533204974bd4499a47/raw/4c7d3875063f186a8044afc15958c723e3f87732/pytorch%2520build%2520log%25202021-02-16.txt

Expected behavior

Successful build to the target dir

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

  • PyTorch Version (e.g., 1.0): 2021-02-16 52af23b
  • OS (e.g., Linux): Slackware Linux 64 -current (pre-15.0)
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): python3 setup.py install --root=/usr/local/src/pytorch/pkg/new
  • Python version: 3.9.1
  • CUDA/cuDNN version: cuda-11.2.1 / cudnn-8.1.0.77_11.2
  • GPU models and configuration: TITAN X (Pascal) (12GB), GeForce GT 630 (2GB)
  • Any other relevant information: magma-2.5.4, nvidia-driver-460.39, nvidia-nccl-2.8.4.1_11.2

Additional context

Using the following build approach:

#!/usr/bin/bash

export TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5;8.0;8.6"
export NCCL_INCLUDE_DIR="/opt/nvidia/nccl/include"
export NCCL_ROOT_DIR="/opt/nvidia/nccl"
export USE_SYSTEM_NCCL=1

cd pytorch-git

python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

Built and installed magma from source

Linux 5.10.15 #1 SMP Wed Feb 10 14:06:55 CST 2021 x86_64 
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz GenuineIntel GNU/Linux

magma-2.5.4

gcc-10.2.0
gcc-brig-10.2.0
gcc-g++-10.2.0
gcc-gdc-10.2.0
gcc-gfortran-10.2.0
gcc-gnat-10.2.0
gcc-go-10.2.0
gcc-objc-10.2.0
gccmakedep-1.0.3

automake-1.16.2
cmake-3.19.4
gccmakedep-1.0.3
imake-1.0.8
make-4.3
makedepend-1.0.6
pmake-1.111

nvidia-cuda-11.2.1
nvidia-cudnn-8.1.0.77_11.2
nvidia-driver-460.39
nvidia-kernel-460.39_5.10.15
nvidia-ml-py3-7.352.0
nvidia-nccl-2.8.4.1_11.2
nvidia-tensorrt-7.2.2.3_11.1

cc @malfet @seemethere @walterddr @ngimel

@edrozenberg
Author

I have looked at existing issues, forum posts, etc. The only clue is that it might be related to NCCL, but I have no strong lead. I also tried pytorch-1.7.1 and pytorch-1.8.0-rc2 and hit what I believe are the same issues.

@pascal-soveaux

pascal-soveaux commented Feb 17, 2021

Hi,

Maybe related: I once ran into this issue when I had a different pytorch version in my PATH.

PATH=/third_party/libtorch/include 1.XXX <= From my install dir
pytorch1.YYY/build$ make <= Current compile

-DCMAKE_INSTALL_PREFIX=/usr/local/src/pytorch/src/pytorch-git/torch
/usr/local/src/pytorch/src/pytorch-git/torch is your install dir.
/usr/local/src/pytorch/src/pytorch-git/torch/include/THC/THCGeneral.h(39): error: identifier "TORCH_CUDA_API" is undefined

Run make clean first, or remove the previous install.

Pascal
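
One way to check for the situation described above is to list every copy of the failing header that the compiler could pick up. This is a minimal sketch: find_header_copies is a hypothetical helper (not part of PyTorch), and the directories in the example are simply the ones mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list every copy of a header under the given directories, so a
# stale copy left behind by an earlier install or build can be spotted.
# find_header_copies is a hypothetical helper, not a PyTorch tool.
find_header_copies() {
  local name="$1"
  shift
  # -type f: regular files only; sort keeps the output stable.
  find "$@" -type f -name "$name" 2>/dev/null | sort
}

# Example (paths from this thread; adjust to your own layout):
# find_header_copies THCGeneral.h \
#   /usr/local/src/pytorch/src/pytorch-git /third_party/libtorch/include
```

More than one hit does not prove a conflict by itself, but it shows which copies are worth comparing.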

@ejguan ejguan added module: build Build system issues triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Feb 17, 2021
@malfet malfet added the module: cuda Related to torch.cuda, and CUDA support in general label Feb 17, 2021
@walterddr
Contributor

Looks like #49050 changed TORCH_CUDA_API and split it into two. Could you follow the instructions here to do a clean build?

@edrozenberg
Author

@walterddr thanks, same issue after a clean build with the latest git pull. I have always been running make clean before builds anyway (which deletes the build dir). I also tried export USE_SYSTEM_NCCL=0 just to see what happens, same problem.

Currently unable to build any version of torch (the 1.7.1 build fails for a different reason related to an fbgemm issue filed by someone else already). Wonder what magic spells people are casting to get this project to build.

Only success so far was with export USE_CUDA=0 but that's not useful.

@edrozenberg
Author

My other option for now is to use the wheel https://files.pythonhosted.org/packages/d6/c1/70f2fd464a895844a9bf4cf1d93b09eb6cd5edf8274d19a7fed2ed6c4cc3/torch-1.7.1-cp39-cp39-win_amd64.whl

But that comes with its own set of issues:

  • Built with Intel MKL - packaging Intel's messy MKL is its own project, and MKL doesn't work on AMD without ugly hacks that will eventually stop working. Prefer OpenBLAS even if perf is not quite as good.
  • CUDA arch stops at 75 which excludes the latest devices
  • Built with old GCC 7
  • Built for old CUDA 10 and old CuDNN 7 (but I guess still works, still compatible)

@edrozenberg
Author

edrozenberg commented Feb 19, 2021

Thanks to @walterddr for mentioning TORCH_CUDA_API - that put me on the path to figuring out what was going on, after another few days and a few dozen failed build attempts.

The issue was the git checkout was inconsistent/not in a properly updated state, despite my doing make clean; git reset --hard; git pull; git submodule update --init --recursive

I trusted git, or my knowledge of git. That was a mistake. Doing a brand new clean checkout should have been step 1, not step 1057, on my way to troubleshooting this. Lesson learned.
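
For anyone hitting the same thing: git reset --hard and git pull only rewrite tracked files, so untracked and ignored files (build outputs, for example) survive them. A minimal sketch for listing what is left behind follows; list_leftovers is a hypothetical helper name, and it only reads the working tree.

```shell
#!/usr/bin/env bash
# Sketch: print untracked and ignored paths in a checkout, i.e. the files
# that `git reset --hard` and `git pull` leave untouched.
# list_leftovers is a hypothetical helper, not a git or PyTorch command.
list_leftovers() {
  local repo="${1:-.}"
  # --porcelain gives a stable format: "??" = untracked, "!!" = ignored.
  git -C "$repo" status --porcelain --ignored \
    | sed -n -e 's/^?? //p' -e 's/^!! //p'
}

# Example:
# list_leftovers pytorch-git
# A dry run of what a full cleanup would delete (nothing is removed with -n):
# git -C pytorch-git clean -ndx
# git -C pytorch-git submodule foreach --recursive git clean -ndx
```

If the list contains generated headers, either clean them out deliberately or start from a new clone.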

The inconsistency was the following -

Old git checkout (after reset, pull, clean, submodule update etc.):

# grep -r TORCH_CUDA_API pytorch-git.old
pytorch-git.old/torch/include/THC/THCGeneral.h:#define THC_API TORCH_CUDA_API
pytorch-git.old/torch/include/THC/THCGeneral.h:#define THC_CLASS TORCH_CUDA_API
pytorch-git.old/torch/include/THC/THCAllocator.h:class TORCH_CUDA_API THCIpcDeleter {
pytorch-git.old/torch/include/torch/csrc/jit/codegen/cuda/tensor_meta.h:#include <torch/csrc/WindowsTorchApiMacro.h> // TORCH_CUDA_API
pytorch-git.old/torch/include/torch/csrc/jit/codegen/cuda/tensor_meta.h:struct TORCH_CUDA_API TensorContiguity {
pytorch-git.old/torch/include/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API
pytorch-git.old/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API

New fresh git checkout:

# grep -r TORCH_CUDA_API pytorch-git
pytorch-git/c10/macros/Export.h:// HIPify should translate TORCH_CUDA_API to TORCH_HIP_API

The old checkout still had some code referencing TORCH_CUDA_API, causing the build failure.
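
The comparison above can be repeated on any checkout with a small wrapper around grep. This is a sketch only: find_macro_uses is a hypothetical helper, and a hit is not automatically a problem, since even the fresh checkout above mentions the macro in a comment in c10/macros/Export.h.

```shell
#!/usr/bin/env bash
# Sketch: list files under a tree that still mention a given macro.
# find_macro_uses is a hypothetical helper, not part of PyTorch.
find_macro_uses() {
  local macro="$1" tree="$2"
  # -r recurse, -l print file names only, -w whole-word match,
  # -F treat the macro name as a fixed string, not a regex.
  grep -rlwF -- "$macro" "$tree" | sort
}

# Example (directory names as used in this thread):
# find_macro_uses TORCH_CUDA_API pytorch-git.old/torch/include
# find_macro_uses TORCH_CUDA_API pytorch-git
```

Hits under torch/include in the old tree but not in the new one point at leftover headers rather than current source.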

For reference, this is the build script I currently use; it successfully built pytorch 2021-02-19 941ebec:

#!/usr/bin/bash

# NOTES
# o We need to disable the fbgemm build for versions 1.7.1 and 1.8 which fail to 
#   build with it enabled - seems only the torch git checkout source can build
#   with the current fbgemm source.

# Set source version
VERSION=${VERSION:-git}

# Define build params
export BLAS='OpenBLAS'
#export USE_FBGEMM=0
#export USE_CUDA=0
export BUILD_SPLIT_CUDA=1
export TORCH_CUDA_ARCH_LIST='6.1;7.0;7.5;8.0;8.6'
export USE_SYSTEM_NCCL=1
export NCCL_ROOT_DIR='/opt/nvidia/nccl'
export NCCL_INCLUDE_DIR='/opt/nvidia/nccl/include'

# Go to the source
cd pytorch-${VERSION}

# Clean the source
make clean

# Update source if git
if [ "$VERSION" = "git" ];
then
  git pull
  git submodule sync
  git submodule update --init --recursive
fi

# Build the source
python3 setup.py install --root=/usr/local/src/pytorch/pkg/new

My only remaining wish is that the thousands of deprecation warnings and out-of-bounds warnings get fixed. That would make errors easier to find in the output, and make it much easier to diff and scroll through the output of multiple build attempts. Hopefully the pytorch project will get around to those cleanups.
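
Until then, one workaround is to save the build output to a file and filter it. The sketch below assumes the failure lines look like the ones quoted at the top of this issue ("FAILED:" and "error:"); extract_build_errors and build.log are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: pull the failure lines out of a saved build log.
# extract_build_errors is a hypothetical helper; the patterns are taken
# from the "FAILED:" and "error:" lines quoted earlier in this issue.
extract_build_errors() {
  local log="$1"
  # -n prefixes each hit with its line number in the log.
  grep -n -E 'FAILED:|error:' "$log"
}

# Example: capture stdout and stderr while still watching the build.
# python3 setup.py install --root=/usr/local/src/pytorch/pkg/new 2>&1 | tee build.log
# extract_build_errors build.log
```

The line numbers make it easy to jump to the first failure in the full log.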

@bryan-lunt

I'm having this same issue trying to build v1.8.0.

@edrozenberg
Author

edrozenberg commented Mar 25, 2021

> I'm having this same issue trying to build v1.8.0.

@bryan-lunt yes I've had 0 luck building 1.8.0 or any older release. Only thing that fully builds is a git checkout of latest dev source. My 1.8.0 failure is related to fbgemm and there is no solution currently that I could find (other than disabling fbgemm with USE_FBGEMM=0 when building 1.8.0).

@bryan-lunt

Even when I disable fbgemm I still get the TORCH_CUDA_API undefined error.

@edrozenberg
Author

> Even when I disable fbgemm I still get the TORCH_CUDA_API undefined error.

I would suggest abandoning the attempt to build 1.8.0 unless a torch developer tells you how to do it; I certainly have no clue. Build git main instead, from a fresh checkout, with build params including BUILD_SPLIT_CUDA=1.
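
A fresh checkout along those lines could be scripted roughly as below. This is a sketch under assumptions: fresh_checkout is a hypothetical helper, and it deliberately refuses to reuse an existing directory so that no stale files can carry over.

```shell
#!/usr/bin/env bash
# Sketch: clone PyTorch with its submodules into a directory that must
# not exist yet. fresh_checkout is a hypothetical helper name.
fresh_checkout() {
  local dest="$1"
  if [ -z "$dest" ]; then
    echo "usage: fresh_checkout <new-directory>" >&2
    return 2
  fi
  if [ -e "$dest" ]; then
    echo "refusing to reuse existing path: $dest" >&2
    return 1
  fi
  # --recursive also initializes and clones all submodules.
  git clone --recursive https://github.com/pytorch/pytorch "$dest"
}

# Example, reusing the build parameters from the script earlier in this thread:
# fresh_checkout pytorch-git.new && cd pytorch-git.new
# export BUILD_SPLIT_CUDA=1
# python3 setup.py install --root=/usr/local/src/pytorch/pkg/new
```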

@bryan-lunt

I checked out the master branch and I'm still getting

pytorch/torch/include/THC/THCGeneral.h(41): error: identifier "TORCH_CUDA_API" is undefined

:(

@bryan-lunt

The problem seems to be in cuda generated code. I wonder if it has something to do with using too old a version of the cuda toolkit.

@edrozenberg
Author

Could be old cuda. The build worked for me earlier today using my build script above with git checkout 2021-03-25 911b8b1. Make sure nvidia-smi works, and run some of the basic cuda samples that come with cuda (for example samples/0_Simple/matrixMulCUBLAS) to make sure the cuda stack works. No other ideas, sorry.

nvidia-cuda-11.2.2
nvidia-cudnn-8.1.1.33_11.2
nvidia-driver-460.67
nvidia-kernel-460.67_5.10.25
nvidia-ml-py3-7.352.0
nvidia-nccl-2.8.4.1_11.2
nvidia-tensorrt-7.2.3.4_11.1
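
The first part of that sanity check can be scripted before starting a long build. This is a minimal sketch; require_tool is a hypothetical helper that only checks whether a command is on PATH, not whether the driver or toolkit actually works.

```shell
#!/usr/bin/env bash
# Sketch: fail early if a required command is not on PATH.
# require_tool is a hypothetical helper, not part of CUDA or PyTorch.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing: $1" >&2
    return 1
  }
}

# Example: report the driver state and the toolkit version if present.
# require_tool nvidia-smi && nvidia-smi
# require_tool nvcc && nvcc --version
```

Running one of the CUDA samples afterwards is still the better test of the whole stack.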
