
Need binary release with CUDA 8.0 support #2559

Closed
fayeshine opened this issue May 28, 2016 · 26 comments


@chroneus

commented May 29, 2016

so far have this problem with build
ERROR: /tensorflow/tensorflow/stream_executor/BUILD:5:1: C++ compilation of rule '//tensorflow/stream_executor:stream_executor' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object ... (remaining 100 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1. tensorflow/stream_executor/cuda/cuda_blas.cc: In member function 'virtual bool perftools::gputools::cuda::CUDABlas::DoBlasGemm(perftools::gputools::Stream*, perftools::gputools::blas::Transpose, perftools::gputools::blas::Transpose, tensorflow::uint64, tensorflow::uint64, tensorflow::uint64, float, const perftools::gputools::DeviceMemory<Eigen::half>&, int, const perftools::gputools::DeviceMemory<Eigen::half>&, int, float, perftools::gputools::DeviceMemory<Eigen::half>*, int)': tensorflow/stream_executor/cuda/cuda_blas.cc:1683:22: error: 'CUBLAS_DATA_HALF' was not declared in this scope CUDAMemory(a), CUBLAS_DATA_HALF, lda, ^

@kashif

Contributor

commented May 29, 2016

#2556 fixes this issue

@peyush

commented May 31, 2016

Building with bazel with GPU support still gives the same error for CUDA 8.0

@orionr

Contributor

commented May 31, 2016

@kashif - 👍 Temporarily applying your changes to the source (master branch) seems to work for CUDA 8.0, since this command succeeds:

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

I needed to upgrade bazel as well, but that's just an FYI.
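For reference, one way to pull the unmerged fix into a local checkout before building; this is only a sketch and assumes origin points at the GitHub repo:

# fetch PR #2556 via GitHub's pull-request ref and merge it locally
git fetch origin pull/2556/head
git merge --no-edit FETCH_HEAD
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package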

@orionr

Contributor

commented May 31, 2016

@peyush - not sure why it's failing for you.

@jendap

Contributor

commented Jun 6, 2016

#2614 is merged on master. We will cherry-pick it into 0.9 tomorrow and build the artifacts...

@jstaker7

commented Jun 7, 2016

Ran into an error very near the end of the build:

ERROR: /Users/Peace/Projects/external/tensorflow/tensorflow/contrib/session_bundle/example/BUILD:38:1: Executing genrule //tensorflow/contrib/session_bundle/example:half_plus_two failed: bash failed: error executing command
(cd /private/var/tmp/_bazel_Peace/25fdd91e698016ec32416f69e3fa9b23/execroot/tensorflow &&
exec env -
PATH=/Library/Frameworks/Python.framework/Versions/3.5/bin:/Developer/NVIDIA/CUDA-8.0/bin:/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/local/lib/:/Developer/NVIDIA/CUDA-8.0/bin/
TMPDIR=/var/folders/0j/lmh6kxcx54s8b1cxcxb15n8w0000gn/T/
/bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; rm -rf /tmp/half_plus_two; /Users/Peace/Projects/venvs/deep/bin/python bazel-out/host/bin/tensorflow/contrib/session_bundle/example/export_half_plus_two; cp -r /tmp/half_plus_two/* bazel-out/local_darwin-opt/genfiles/tensorflow/contrib/session_bundle/example/half_plus_two'): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 245.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.8.0.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.8.0.dylib locally
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 2067.414s, Critical Path: 2055.33s

Python throws open a window saying it quit unexpectedly, and displays a Python path that is different from the one I provided to the ./configure script. So perhaps numpy or some other requirement is not being satisfied. (Edit: I checked the other path and numpy exists there as well.) Any ideas on how I can fix this?

Bazel version: bazel release 0.2.3-2016-06-02 (@c728a63)

@Sohojoe

commented Jun 17, 2016

@jendap @jstaker7 did this get resolved? will CUDA 8 be supported when installing via pip?

@jstaker7

commented Jun 18, 2016

@Sohojoe Thanks for the followup. Not yet.

Expected output during compilation:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally

What I get:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally

So somehow it's failing when trying to open libcuda.so.1, and no other errors are output, so it is hard to debug. I am using OS X, so perhaps that's the problem; things might not be playing nicely yet in that regard.
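A couple of quick checks that might narrow it down; the paths below assume a default CUDA install under /usr/local/cuda on OS X:

# is the CUDA driver library present where the loader can find it?
ls /usr/local/cuda/lib/libcuda*
# are the CUDA library directories on the loader paths?
echo $DYLD_LIBRARY_PATH
echo $LD_LIBRARY_PATH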

@aselle aselle removed the triaged label Jul 28, 2016

@suharshs

Member

commented Aug 15, 2016

@jendap @jstaker7 Any update on this? According to #2614 this has been resolved?

@vrv

Contributor

commented Aug 15, 2016

That bug was a duplicate of this one; it hasn't been resolved. Basically, we either decide to build additional pip packages for an unsupported release candidate, or we close this bug and tell people they can build from source (which they sort of can, modulo a bunch of Bazel-related issues that are being resolved).

@FlorinAndrei

commented Sep 7, 2016

Building TF from source with CUDA 8 currently fails in numerous ways, not all of which have workarounds.

#4105

#4190

#4214

If anyone has a recipe for how to successfully complete the compilation from source, it'd be fantastic. The environment I have in mind is along the lines of "use some popular Linux distribution with default package versions and add CUDA 8 to it". E.g.:

  • CUDA 8.0 RC + the gcc patch
  • cuDNN 5.1
  • nvidia-driver 367 or 370-beta
  • Ubuntu 16.04
  • python 2.7

A decent set of enabled compute capabilities, up to and including 6.1 (for the new Pascal GPUs) would be great.

@moonlightlane

commented Sep 7, 2016

I followed section 6 of this post and succeeded in building TensorFlow with CUDA 8 and cuDNN 5 under Ubuntu 16.04 and Python 2.7 on the second try (I forget what went wrong on the first try; it might be that something was not set up correctly during the ./configure phase. If a certain combination of configure parameters fails, try cuDNN 5 instead of cuDNN 5.1). You will see numerous warnings during compilation and installation, but eventually importing tensorflow into python works...

P.S. I did not follow any other setup procedure in that post, but judging from its section headings, my setup is probably similar to what it describes.

@wagonhelm

Contributor

commented Sep 9, 2016

http://wp.me/p7GvOc-2H - a tutorial I made for building from source on Ubuntu 16.04 with CUDA 8.0 RC and cuDNN 5.1

@FlorinAndrei

commented Sep 9, 2016

All the compilation failures I've seen before only happen on the master branch. None of these issues persist when I switch to the r0.10 branch. Last night I was able to complete a build on Ubuntu 16.04 with CUDA 8.0 RC + the compiler patch, cuDNN 5.1, nvidia-driver-370, Python 2.7, and compute capability 6.1 (for a Pascal GPU), as soon as I switched to r0.10.

@xpe

commented Sep 14, 2016

@FlorinAndrei Congratulations on that build! 👍 Can you share your choices for ./configure and any other details that you found tricky?

What GPU hardware are you using? I noticed you are using the 370 Nvidia driver.

As for me:

  • GeForce GTX 1080
  • Nvidia driver: 367.44
  • Ubuntu 16.04
  • CUDA 8.0 RC + the compiler patch
  • cuDNN 5.1.5

I have gotten my Nvidia driver, CUDA, and cuDNN installed and working; that wasn't obvious at all. But I haven't gotten my TensorFlow build to work yet.
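In case it helps others in the same spot, some basic sanity checks for the stack itself; these assume a default install under /usr/local/cuda, so adjust the paths if yours differ:

# driver loaded and GPU visible?
nvidia-smi
# CUDA toolkit on the PATH and reporting the expected version?
nvcc --version
# which cuDNN version is actually installed next to the toolkit?
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h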

@FlorinAndrei

commented Sep 14, 2016

@xpe - Clone this repo https://github.com/FlorinAndrei/ml-setup and checkout the ubuntu1604 branch. Then just follow the README and run the Ansible playbooks.

The overview is this (look at the playbooks to fill in details such as extra packages to install):

  • install Ubuntu 16.04
  • install linux-headers, dkms, build-essential, and the Python development packages
  • do a dist-upgrade
  • enable ppa:graphics-drivers/ppa
  • download and install CUDA 8, the gcc patch, and cuDNN, all from the runfiles
    • sh {{ packages_folder }}/{{ cuda_file }} --silent --toolkit --samples --verbose --override
    • sh {{ packages_folder }}/{{ cuda_patch1_file }} --silent --accept-eula
    • do the manual install for cuDNN
  • create CUDA_HOME and other env vars per CUDA docs
  • apt-get install nvidia-370 (this must happen after CUDA; 370 is now the stable version of the Linux driver, no need to go back to 367)
  • install Java from ppa:webupd8team/java and Bazel from http://storage.googleapis.com/bazel-apt per official docs

This is where the Ansible playbooks stop. You could just run Ansible and it will take you this far. I have not automated the steps after that.

The manual process after that is:

  • clone TensorFlow repo, checkout the r0.10 branch (this is the crucial step; the master branch keeps failing, whereas r0.10 compiled on first try)
  • run ./configure and accept default values except:
    • enable GPU
    • CUDA version must be: 8.0 (autodetect doesn't work)
    • compute capability: whatever it is that your card supports
  • bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • bazel-bin/tensorflow/tools/pip_package/build_pip_package $HOME

And that should be it.
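To actually use the result, the last mile looks roughly like this; the wheel name is a wildcard because the exact filename depends on your Python and TensorFlow versions:

# install the freshly built wheel and check that the GPU is picked up
pip install --upgrade $HOME/tensorflow-*.whl
python -c 'import tensorflow as tf; tf.Session()'   # the log should show the GPU being found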

I've tested some pretty complex models and I have not seen any strange failures this far.

@xpe

commented Sep 21, 2016

@FlorinAndrei Excellent work! Thank you!

For others: note that the Nvidia drivers page does not necessarily list results in chronological order. The first time around, I missed the latest driver. As Florin mentioned above, at the time of writing, 370.28 is the latest stable driver for much of the GeForce product line:

http://www.geforce.com/drivers/results/107408
Linux x64 (AMD64/EM64T) Display Driver
Version 370.28
Release Date Thu Sep 08, 2016

@devymex

commented Sep 29, 2016

Has anyone succeeded in installing TensorFlow with the following combination?

  • Ubuntu 16.04 (freshly installed and apt-upgraded)
  • Nvidia driver: nvidia-370 (sudo apt install nvidia-370)
  • Java 8 and Bazel 0.3.1 (installed from the official package):
chmod +x bazel-0.3.1-installer-linux-x86_64.sh
./bazel-0.3.1-installer-linux-x86_64.sh --user
  • gcc: 5.4.0 (default with 16.04)
  • CUDA 8.0.44 (for GTX 1070, installed from official .run file without driver)
export CUDA_HOME=/usr/local/cuda-8.0
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
export PATH="$CUDA_HOME/bin:$PATH"
  • cuDNN 5.1.5 (installed from official package)
  • tensorflow master branch (cloned from the official GitHub repo)
  • python 2 and python 3 (installed from the official Ubuntu repositories)

Then I ran tensorflow's ./configure:

./configure
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python3
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N] 
No Hadoop File System support will be enabled for TensorFlow
Found possible Python library paths:
  /usr/local/lib/python2.7/dist-packages
  /usr/lib/python2.7/dist-packages
Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]
/usr/lib/python3/dist-packages
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 5.1.5
Please specify the location where cuDNN 5.1.5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 

It finished with:

INFO: Starting clean (this may take a while). Consider using --expunge_async if the clean takes more than several minutes.
.
INFO: All external dependencies fetched successfully.
Configuration finished

and then I ran the build:

bazel build -c opt --config=cuda

An error occurred immediately:

ERROR: /home/devymex/.cache/bazel/_bazel_devymex/05af4cc48fb50d1cc8f7e879f4c1ce83/external/local_config_cuda/crosstool/BUILD:4:1: Traceback (most recent call last):
    File "/home/devymex/.cache/bazel/_bazel_devymex/05af4cc48fb50d1cc8f7e879f4c1ce83/external/local_config_cuda/crosstool/BUILD", line 4
        error_gpu_disabled()
    File "/home/devymex/.cache/bazel/_bazel_devymex/05af4cc48fb50d1cc8f7e879f4c1ce83/external/local_config_cuda/crosstool/error_gpu_disabled.bzl", line 3, in error_gpu_disabled
        fail("ERROR: Building with --config=c...")
ERROR: Building with --config=cuda but TensorFlow is not configured to build with GPU support. Please re-run ./configure and enter 'Y' at the prompt to build with GPU support.
...

I've tried:

  • reinstalling Ubuntu, many times
  • reinstalling Bazel (from the third-party source), many times too
  • trying different Bazel versions (0.3.1, 0.3.0)
  • switching to the older CUDA release 8.0.27 (CUDA 7 doesn't support the GTX 1070)
  • trying different CUDA paths, "/usr/local/cuda" and "/usr/local/cuda-8.0" (both of which exist)
  • re-running ./configure many times and re-running bazel build -c opt --config=cuda
  • running "bazel clean" before re-running ./configure
  • re-downloading the official tensorflow package and re-cloning the official tensorflow GitHub repo
  • deleting ~/.config/bazel before running ./configure
  • both python2 and python3
  • every other action I could think of that might be related to this error

but the error always occurs, with an identical message.

I've also tried the tensorflow r0.10 package (downloaded from GitHub); the error message then becomes:

ERROR: no such package '@local_config_cuda//crosstool': BUILD file not found on package path.
ERROR: no such package '@local_config_cuda//crosstool': BUILD file not found on package path.

With tensorflow r0.10, none of the actions listed above has any effect on these two error messages.
I'm very frustrated after four days of failures. Could someone help me? Thanks!

@leezu

commented Sep 29, 2016

@devymex I have the exact same problem (symptoms / errors), just on CentOS 7.2

@devymex

commented Sep 29, 2016

@leezu I also tried editing ./tensorflow/tools/bazel.rc, commenting out the following line:
#build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
Then I ran the build command and it began to compile, but after a while it got stuck on another error:

ERROR: /home/devymex/Software/tensorflow/tensorflow/stream_executor/BUILD:5:1: C++ compilation of rule '//tensorflow/stream_executor:stream_executor' failed: gcc failed: error executing command 
  (cd /home/devymex/.cache/bazel/_bazel_devymex/05af4cc48fb50d1cc8f7e879f4c1ce83/execroot/tensorflow && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64 \
    PATH=/usr/local/cuda-8.0/bin:/home/devymex/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin \

I'm absolutely devastated.

@jart

Contributor

commented Oct 13, 2016

Update: We're actively working on CUDA 8.0 support. One potential blocker so far is #4931.

@flx42

Contributor

commented Oct 13, 2016

@jart: FWIW, CUDA 8.0 has been working fine for us with TF r0.10, and I just tried r0.11; it seems fine too.
These are my fixes to the official Dockerfile:

diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index b4dc923..771838c 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:7.5-cudnn5-devel
+FROM nvidia/cuda:8.0-cudnn5-devel

 MAINTAINER Craig Citro <craigcitro@google.com>

@@ -87,8 +87,7 @@ WORKDIR /tensorflow
 # Configure the build for our CUDA configuration.
 ENV CUDA_PATH /usr/local/cuda
 ENV CUDA_TOOLKIT_PATH /usr/local/cuda
-ENV CUDNN_INSTALL_PATH /usr/lib/x86_64-linux-gnu
-ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
+ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
 ENV TF_NEED_CUDA 1
 ENV TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2
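With those changes, building and running the image is roughly the following; the tag is arbitrary, and nvidia-docker is the usual way to expose the GPU to the container:

# from the tensorflow checkout
cd tensorflow/tools/docker
docker build -t tf-devel-gpu-cuda8 -f Dockerfile.devel-gpu .
nvidia-docker run -it tf-devel-gpu-cuda8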
@jart jart referenced this issue Oct 13, 2016

@gunan gunan assigned yifeif and unassigned jendap Oct 14, 2016

@jart jart changed the title feature request: support for cuda 8.0 rc Need binary release with CUDA 8.0 support Oct 14, 2016

@yaroslavvb

Contributor

commented Oct 25, 2016

0.11rc1 has CUDA 8 support and was posted in the official downloads last Friday:
https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html
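For pip users that means the usual procedure from that page; the URL below is a placeholder, so copy the actual 0.11rc1 GPU wheel URL for your platform from the os_setup page:

# TF_BINARY_URL: the GPU wheel URL for your Python/OS, taken from the page above
export TF_BINARY_URL=<wheel URL from the os_setup page>
pip install --upgrade $TF_BINARY_URL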

On Sun, Oct 23, 2016 at 1:38 PM, Steven Shi notifications@github.com wrote:

Any updates on this?



@xru

commented Nov 2, 2016

@yaroslavvb Solved, it works now.
Finally! Thanks.

@yifeif yifeif closed this Nov 3, 2016

@AntonyM55

commented Nov 10, 2016

@devymex @leezu

try:
$ export TF_NEED_CUDA=1
$ ./configure
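More generally, ./configure reads its answers from environment variables, several of which also appear in the Dockerfile.devel-gpu diff earlier in this thread (the exact variable names are taken from the configure script of that era, so double-check them against your checkout; the values below are examples):

# supply the GPU answers to ./configure via the environment, then build
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=8.0
export CUDA_TOOLKIT_PATH=/usr/local/cuda-8.0
export TF_CUDNN_VERSION=5
export CUDNN_INSTALL_PATH=/usr/local/cuda-8.0
export TF_CUDA_COMPUTE_CAPABILITIES=6.1
./configure
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package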
