Commit breaks CUDA9 builds [NativeFunctions: support backend-specific dispatch] #3807

Closed
VladislavZavadskyy opened this Issue Nov 21, 2017 · 19 comments

@VladislavZavadskyy
Contributor

VladislavZavadskyy commented Nov 21, 2017

This one: 445cc1f
Here's the log. The setup is Ubuntu 17.10, gcc 6.4.0, CUDA 9.0, and cuDNN 7.0.3.
Hard resetting to the previous commit makes the build go smoothly, so I'm pretty sure it's the one.

@sdmonov

Contributor

sdmonov commented Nov 21, 2017

Can you try running setup.py again after you see this error? I see the same error, and when I run setup.py a second time it works for me. I guess this is just a temporary workaround until the problem is fixed.

@soumith soumith changed the title from This commit breaks build on cuda 9, find out how... to Commit breaks CUDA9 builds [NativeFunctions: support backend-specific dispatch] Nov 21, 2017

@vadimkantorov

vadimkantorov commented Nov 22, 2017

The current version has a subtle bug that makes a build from source pick up the headers of a previously installed version: #3669

You may try completely removing the existing install and starting over (if you haven't tried that yet).

@peterjc123

Contributor

peterjc123 commented Nov 22, 2017

A temporary fix for that would be to call gen.py before calling setup.py. That is:

mkdir torch\lib\build\ATen\src\ATen
cd torch\lib\build\ATen\src\ATen
mkdir ATen
python  ../../../../../../aten/src/ATen/gen.py -s  ../../../../../../aten/src/ATen  ../../../../../../aten/src/ATen/Declarations.cwrap  ../../../../../../aten/src/THNN/generic/THNN.h  ../../../../../../aten/src/THCUNN/generic/THCUNN.h  ../../../../../../aten/src/ATen/nn.yaml  ../../../../../../aten/src/ATen/native/native_functions.yaml
cd ../../../../../..

Also a PR for that is #3757, which is not so clean but works.
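
For readers who are not on Windows, the same pre-generation step can be sketched in Python so that the paths do not depend on the shell. This is only a sketch under assumptions: the helper names `gen_command` and `run_gen` are hypothetical, and the argument list is copied from the command above rather than from any documented interface of gen.py.

```python
import subprocess
import sys
from pathlib import Path

# Inputs passed to gen.py, relative to the repository root
# (copied from the command above, not from gen.py documentation).
GEN_INPUTS = [
    "aten/src/ATen/Declarations.cwrap",
    "aten/src/THNN/generic/THNN.h",
    "aten/src/THCUNN/generic/THCUNN.h",
    "aten/src/ATen/nn.yaml",
    "aten/src/ATen/native/native_functions.yaml",
]


def gen_command(repo_root):
    """Build the gen.py command line with absolute paths (hypothetical helper)."""
    root = Path(repo_root).resolve()
    aten = root / "aten" / "src" / "ATen"
    cmd = [sys.executable, str(aten / "gen.py"), "-s", str(aten)]
    cmd += [str(root / rel) for rel in GEN_INPUTS]
    return cmd


def run_gen(repo_root):
    """Create the output directories, then run gen.py inside the build directory."""
    root = Path(repo_root).resolve()
    build_dir = root / "torch" / "lib" / "build" / "ATen" / "src" / "ATen"
    (build_dir / "ATen").mkdir(parents=True, exist_ok=True)
    subprocess.run(gen_command(root), cwd=build_dir, check=True)
```

Calling `run_gen(".")` from the repository root and then running `python setup.py install` mirrors the steps above.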

@napsternxg

napsternxg commented Nov 22, 2017

I am facing a similar issue, reported here: #3822

After following the steps @peterjc123 suggested, I am still getting the same error.

mkdir -p torch/lib/build/ATen/src/ATen/ATen
cd torch/lib/build/ATen/src/ATen
python  ../../../../../../aten/src/ATen/gen.py -s  ../../../../../../aten/src/ATen  ../../../../../../aten/src/ATen/Declarations.cwrap  ../../../../../../aten/src/THNN/generic/THNN.h  ../../../../../../aten/src/THCUNN/generic/THCUNN.h  ../../../../../../aten/src/ATen/nn.yaml  ../../../../../../aten/src/ATen/native/native_functions.yaml
cd ../../../../../..
python setup.py install

@vadimkantorov @soumith any suggested workarounds ?

@vadimkantorov

vadimkantorov commented Nov 22, 2017

@peterjc123 #3757 only adds a hack for Windows, right?

If you're also referring to the problem of obsolete includes being used, I think a proper fix would be to figure out how to place the pytorch headers before the system headers. At least that is the problem in #3669, confirmed by the preprocessor output obtained by passing the -E flag to NVCC.

@HenryJia

HenryJia commented Jan 12, 2018

Any updates on this bug? I still can't seem to find a workaround.

@mikael10j

mikael10j commented Jan 17, 2018

Hi,
it seems to be related to gcc 6. I had the same issue with Ubuntu 17.04, gcc 6.3 and CUDA 9.0.
I was able to compile with gcc 5.4.1, but only without gloo (NO_DISTRIBUTED=1); otherwise it fails.

@HenryJia

HenryJia commented Jan 17, 2018

Compiling with gcc 5.4.0-1 leads to "error: unrecognized command line option ‘-fno-plt’" for me.
I'm on Arch Linux.

@soumith

Member

soumith commented Jan 18, 2018

@HenryJia

HenryJia commented Jan 18, 2018

@soumith Well, simply doing gcc-5 -fno-plt is enough to raise that error, I'm not sure how I'm supposed to get around this easily?
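
The check described here can be scripted for any compiler. The following is a minimal sketch, assuming a gcc-compatible command-line interface; `compiler_accepts_flag` is a hypothetical helper name, and the trivial C program is only there to give the compiler something to parse.

```python
import subprocess


def compiler_accepts_flag(compiler, flag):
    """Return True if `compiler` accepts `flag` while checking a trivial C program."""
    try:
        result = subprocess.run(
            # "-x c" tells the compiler the input on stdin ("-") is C;
            # "-fsyntax-only" avoids producing any output file.
            [compiler, flag, "-x", "c", "-fsyntax-only", "-"],
            input="int main(void) { return 0; }\n",
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # The compiler is not installed or cannot be executed.
        return False
    return result.returncode == 0
```

For example, `compiler_accepts_flag("gcc-5", "-fno-plt")` should come back false on the setup described in this comment, since the compiler rejects the option outright.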

@soumith

Member

soumith commented Jan 18, 2018

@HenryJia from reading that reddit / archlinux thread it is like this:

  • gcc-5 doesn't have -fno-plt.
  • Arch Linux sets its CFLAGS to include -fno-plt because it assumes the default compiler is gcc 6+.

@HenryJia

HenryJia commented Jan 18, 2018

@soumith Hmmm, OK, but how do I stop it from doing that? my $CFLAGS and $CXXFLAGS are empty?

@soumith

Member

soumith commented Jan 18, 2018

I am not entirely sure; I don't have Arch Linux. Try asking on some of the Arch forums, they must have figured it out by now.

@HenryJia

HenryJia commented Jan 18, 2018

Alright, will do

@ppwwyyxx

Contributor

ppwwyyxx commented Jan 20, 2018

On Arch Linux with gcc5, I was able to build successfully with the following change:

diff --git i/setup.py w/setup.py
index e484692f..75e1836e 100644
--- i/setup.py
+++ w/setup.py
@@ -90,7 +90,7 @@ import distutils.sysconfig
 cfg_vars = distutils.sysconfig.get_config_vars()
 for key, value in cfg_vars.items():
     if type(value) == str:
-        cfg_vars[key] = value.replace("-Wstrict-prototypes", "")
+        cfg_vars[key] = value.replace("-Wstrict-prototypes", "").replace("-fno-plt", "")

 ################################################################################
 # Custom build commands

@HenryJia

HenryJia commented Jan 21, 2018

Adding CFLAGS="${CFLAGS/-fno-plt/}" CXXFLAGS="${CXXFLAGS/-fno-plt/}" to the environment of the python setup.py command also seems to do the trick. I think it essentially does the same thing.
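
The flag can show up even when `$CFLAGS` and `$CXXFLAGS` are empty in the shell, most likely because the build also reads the flags recorded when Python itself was built, which is what the setup.py patch above edits. Below is a minimal sketch for inspecting them. It uses the standard-library `sysconfig` module rather than `distutils.sysconfig` (an assumption made so the snippet also runs on newer Python versions), and `vars_containing` and `strip_flag` are hypothetical helper names.

```python
import sysconfig


def vars_containing(flag, cfg_vars=None):
    """Return the names of config variables whose string value mentions `flag`."""
    if cfg_vars is None:
        # Defaults to the values recorded for the running interpreter.
        cfg_vars = sysconfig.get_config_vars()
    return sorted(
        key for key, value in cfg_vars.items()
        if isinstance(value, str) and flag in value
    )


def strip_flag(cfg_vars, flag):
    """Return a copy of `cfg_vars` with `flag` removed from every string value."""
    return {
        key: value.replace(flag, "") if isinstance(value, str) else value
        for key, value in cfg_vars.items()
    }


if __name__ == "__main__":
    # Lists which build-time variables carry the flag on this machine.
    print(vars_containing("-fno-plt"))
```

If the printed list is non-empty, the flag is coming from Python's own build configuration rather than from the shell environment.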

@wranai

wranai commented Mar 10, 2018

It works with gcc-5 from the Ubuntu 18.04 repositories:

  • cuda-9-1 (9.1.85-1)
  • g{cc,++}-5 (5.5.0-8ubuntu1)
  • 248c933

@nmilosev

nmilosev commented Jun 9, 2018

Got a similar error on Fedora 28.

Got it to build though:

nvidia-smi doesn't report a running process, but I see the speedup compared to CPU and also the memory usage in nvidia-smi. This is probably Optimus related.

Another issue is that if I try to use prebuilt pytorch binaries I run into this: https://discuss.pytorch.org/t/runtimeerror-cudnn-status-mapping-error/4370

Thanks for the useful information in this thread! <3

@SsnL

Contributor

SsnL commented Jun 26, 2018

This is an NVIDIA NVCC bug, which is fixed in CUDA 9.2.

Closed via #8863, which adds a clear error message.

@SsnL SsnL closed this Jun 26, 2018