Build fails on Ubuntu 16.04 LTS, CUDA Toolkit 8.0, cuDNN 5.0.5, and Bazel 0.3.0-jdk7 #3526

Closed
adam-erickson opened this Issue Jul 27, 2016 · 5 comments


@adam-erickson
adam-erickson commented Jul 27, 2016 edited

Hi Everyone,

I downgraded gcc to 5.3.0 by building it from source so that I could install CUDA Toolkit 8.0 with cuDNN 5.0.5. I also installed the OpenCL, freeglut3, and mesa libraries via apt-get, built Bazel from source using the installer script, and installed the TensorFlow and Google Cloud Platform Python dependencies. I then cloned the tensorflow GitHub repository and modified the CROSSTOOL file's cxx_builtin_include_directory entries to include the gcc 5.3.0 location. Finally, I ran ./configure with default settings and tried to build with Bazel, but the build always fails with an error like the one below, which appears to be a gcc issue:

WARNING: /root/.cache/bazel/_bazel_root/fbc06f9baef46cade6e35d9e4137e37c/external/protobuf/WORKSPACE:1: Workspace name in /root/.cache/bazel/_bazel_root/fbc06f9baef46cade6e35d9e4137e37c/external/protobuf/WORKSPACE (@main) does not match the name given in the repository's definition (@protobuf); this will cause a build error in future versions.

ERROR: /root/.cache/bazel/_bazel_root/fbc06f9baef46cade6e35d9e4137e37c/external/zlib_archive/BUILD:7:1: undeclared inclusion(s) in rule '@zlib_archive//:zlib'

This rule is missing dependency declarations for the following files included by 'external/zlib_archive/zlib-1.2.8/inftrees.c':
'/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include-fixed/limits.h'
'/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include-fixed/syslimits.h'
'/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include/stddef.h'
'/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include/stdarg.h'.
Target //tensorflow/cc:tutorials_example_trainer failed to build
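
For reference, the build sequence was roughly the standard source install (the target comes from the error above; exact flags from my notes may differ slightly):
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
./configure
bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer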

If I change the gcc version to 4.8 (installed via apt-get) in ./configure and revert CROSSTOOL, I get many warnings like:
INFO: ... warning: variable 'parsed_colon' set but not used

This warning is followed by an error:
ERROR: /opt/tensorflow/tensorflow/core/kernels/BUILD:1527:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:depth_space_ops_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/spacetodepth_op_gpu.cu.cc':
'/usr/local/cuda-8.0/include/cuda_runtime.h'
'/usr/local/cuda-8.0/include/host_config.h'
'/usr/local/cuda-8.0/include/builtin_types.h'
'/usr/local/cuda-8.0/include/device_types.h'
'/usr/local/cuda-8.0/include/host_defines.h'
'/usr/local/cuda-8.0/include/driver_types.h'
...

This time, it appears to be an issue with the CUDA Toolkit 8.0 include paths. Everything seems to work flawlessly up until building TensorFlow from source.

Thanks,

Adam

@adam-erickson adam-erickson changed the title from TensorFlow build fails on Ubuntu 16.04 LTS, CUDA Toolkit 8.0, cuDNN 5.0.5, and Bazel 0.3.0-jdk7 to Build fails on Ubuntu 16.04 LTS, CUDA Toolkit 8.0, cuDNN 5.0.5, and Bazel 0.3.0-jdk7 Jul 27, 2016
@JohnAllen
Contributor
JohnAllen commented Jul 27, 2016 edited

Try the one-earlier release, Bazel 0.2.3 or whatever it is; 0.3 never worked for me. Rolling back worked for me on both Ubuntu 16 and 14 with the exact same CUDA/cuDNN versions. Make sure Bazel 0.3 is uninstalled first, obviously.
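
If it helps, the rollback is just running the older installer (the file name below is assumed from the usual Bazel release naming, so double-check it on the releases page):
wget https://github.com/bazelbuild/bazel/releases/download/0.2.3/bazel-0.2.3-installer-linux-x86_64.sh
chmod +x bazel-0.2.3-installer-linux-x86_64.sh
./bazel-0.2.3-installer-linux-x86_64.sh --user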

@yaroslavvb
Contributor

I worked around a similar issue by adding cxx_builtin_include_directory entries:
#3431
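
Concretely, that means adding lines like these to the CROSSTOOL used for the build (the paths below are just the ones reported in the errors in this issue; adjust them to your local gcc/CUDA install):
cxx_builtin_include_directory: "/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include"
cxx_builtin_include_directory: "/usr/local/lib/gcc/x86_64-unknown-linux-gnu/5.3.0/include-fixed"
cxx_builtin_include_directory: "/usr/local/cuda-8.0/include"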

@michaelisard
Member

@adam-erickson please update if the above comments don't help.

@adam-erickson
adam-erickson commented Jul 28, 2016 edited

It looks like @yaroslavvb's solution should work for the build with gcc 4.8.

Since I'm not running Pascal-architecture GPUs, but rather a node with four GeForce GTX Titan X GPUs, I ended up installing the 367.35 display driver from the graphics-drivers PPA (the display driver included with CUDA Toolkit 7.5 causes nvidia-smi to freeze on Ubuntu 16.04), CUDA Toolkit 7.5 from the Ubuntu 16.04 LTS package repositories, and cuDNN 5.0.5 from the NVIDIA site. I then built and ran the samples from source. One function appears to error in the tests, but perhaps that's because the samples are intended for Ubuntu 15.04 with cuDNN 4. I'm happily back to the standard gcc now, and TensorFlow is working well with the standard Python distribution. Here is my full process, after removing any previous installations:

Recommended: Install OpenCL libraries
Update the package list:
apt-get update
Install OpenCL libraries:
apt-get install mesa-common-dev freeglut3-dev
apt-get install libxmu-dev libxi-dev

Install CUDA Toolkit 7.5 and 367.xx display driver from Ubuntu 16.04 apt-get
Install Python dependencies:
apt-get install python-pip python-dev
Remove existing CUDA installation:
apt-get purge nvidia-*
Install CUDA display driver 367.35:
add-apt-repository ppa:graphics-drivers/ppa
apt-get update
apt-get install nvidia-367
reboot
Install CUDA Toolkit 7.5:
apt-get install nvidia-cuda-toolkit
apt-get install nvidia-nsight
apt-get install nvidia-profiler
apt-get install libcupti-dev zlib1g-dev
Link files (the symlinks are created inside /usr/local/cuda):
mkdir /usr/local/cuda
cd /usr/local/cuda
ln -s /usr/lib/x86_64-linux-gnu/ lib64
ln -s /usr/include/ include
ln -s /usr/bin/ bin
ln -s /usr/lib/x86_64-linux-gnu/ nvvm
mkdir -p extras/CUPTI
cd extras/CUPTI
ln -s /usr/lib/x86_64-linux-gnu/ lib64
ln -s /usr/include/ include
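Quick sanity check that the toolkit and driver are visible (the apt packages should put nvcc on the PATH):
nvcc --version
nvidia-smi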

Install cuDNN 5.0.5
cd /opt
Download cuDNN 5.0.5 from: https://developer.nvidia.com/rdp/cudnn-download
tar xvf cudnn-8.0-linux-x64-v5.0-ga.tar
Add to ~/.bashrc: export LD_LIBRARY_PATH=/opt/cuda:$LD_LIBRARY_PATH
Copy cudnn files to the default CUDA directories and set permissions:
cp cuda/include/cudnn.h /usr/local/cuda/include/
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
chmod a+r /usr/local/cuda/include/cudnn.h
chmod a+r /usr/local/cuda/lib64/libcudnn*
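Optionally confirm the copied cuDNN version from the header macros:
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h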
Download and run the CUDA Toolkit 7.5 samples:
Get the full run file here: https://developer.nvidia.com/cuda-toolkit
Extract to path:
mkdir /opt/cuda/cudatoolkit
sh cuda_7.5.xxx_linux_64.run -extract=/opt/cuda/cudatoolkit
cd /opt/cuda/cudatoolkit
Install only the samples to /opt/cuda/samples:
sh cuda-samples-linux-7.5.xx-xxxx.run
cd ..
rm -rf cudatoolkit
Run the Device Query tool:
cd samples/1_Utilities/deviceQuery
make
./deviceQuery
Run the bandwidth test:
cd /opt/cuda/samples/1_Utilities/bandwidthTest
make
./bandwidthTest
Run the n-body sample with nvprof:
cd ../..
cd 5_Simulations/nbody
make
nvprof ./nbody -benchmark -numdevices=4
nvprof --print-gpu-trace ./nbody -benchmark -numdevices=4

Install TensorFlow from binary for CUDA Toolkit 7.5
Set the TF_BINARY_URL variable and install:
cd /opt
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0-cp27-none-linux_x86_64.whl
pip install --upgrade $TF_BINARY_URL
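A minimal check that the wheel actually sees the GPUs (TF 0.9 API; log_device_placement prints the device mapping when the session is created):
python -c "import tensorflow as tf; print(tf.__version__); tf.Session(config=tf.ConfigProto(log_device_placement=True))"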

@michaelisard
Member

Good to hear it! I'm closing the issue but please reopen if there is something that needs to be addressed.
