Failed to build from source due to missing libcudart.so.7.5 #2053

Closed
akors opened this issue Apr 21, 2016 · 22 comments

@akors

akors commented Apr 21, 2016

Hi! I tried to compile the tutorials_example_trainer target, and I have quite a journey behind me: I recompiled GCC several times, recompiled bazel dozens of times, and did a fair share of CROSSTOOL editing.
At this point, I am stuck. The compilation fails with the message:

bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc: error while loading shared libraries: libcudart.so.7.5: cannot open shared object file: No such file or directory
Target //tensorflow/cc:tutorials_example_trainer failed to build

I am using TensorFlow HEAD (currently 7b536cd) and have CUDA 7.5 installed in /usr/local/cuda-7.5.

My LD_LIBRARY_PATH is set to :/usr/local/cuda/lib64:/usr/local/cuda/lib64, and the files exist there.
I tried to point bazel to the library directory by adding

+ linker_flag: "-L/usr/local/cuda/lib64"

to the CROSSTOOL file.

Environment info

Operating System: Fedora 23

Installed version of CUDA and cuDNN: 7.5 and 4.0.7

$ ls /usr/local/cuda-7.5/lib64/libcud*
/usr/local/cuda-7.5/lib64/libcudadevrt.a    /usr/local/cuda-7.5/lib64/libcudart.so.7.5.18  /usr/local/cuda-7.5/lib64/libcudnn.so.4
/usr/local/cuda-7.5/lib64/libcudart.so      /usr/local/cuda-7.5/lib64/libcudart_static.a   /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
/usr/local/cuda-7.5/lib64/libcudart.so.7.5  /usr/local/cuda-7.5/lib64/libcudnn.so          /usr/local/cuda-7.5/lib64/libcudnn_static.a

If installed from sources, provide the commit hash: 7b536cd

Steps to reproduce

  1. Recompiled Bazel, version 0.2.1 with the following patch: https://gist.github.com/akors/5db13e874c144b3b111f0e0326d7b771#file-bazel-custom-gcc-patch
  2. Applied the following patch to TensorFlow: https://gist.github.com/akors/5db13e874c144b3b111f0e0326d7b771#file-tensorflow-custom-gcc-patch
  3. Ran ./configure with /opt/gcc-4.9/bin/gcc as compiler, but default otherwise.
  4. Ran bazel build -c opt --config=cuda --local_resources 4096,3.0,1.0 -j 4 //tensorflow/cc:tutorials_example_trainer --verbose_failures

What have you tried?

  1. I cried a lot.
  2. Added linker_flag: "-L/usr/local/cuda/lib64" to third_party/gpus/crosstool/CROSSTOOL.
  3. Added linker_flag: "-Wl,-R/usr/local/cuda/lib64" to third_party/gpus/crosstool/CROSSTOOL.
  4. Recompiled bazel with linker_flag: "-Wl,-R/usr/local/cuda/lib64" added in tools/cpp/CROSSTOOL.
  5. Created a file /etc/ld.so.conf.d/LOCAL_cuda-lib64.conf with the contents /usr/local/cuda/lib64 and ran ldconfig (a sketch of this is shown below).
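
For completeness, item 5 boiled down to roughly this (assuming root privileges; the exact invocation may have differed slightly):

$ echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/LOCAL_cuda-lib64.conf
$ sudo ldconfig
$ ldconfig -p | grep libcudart    # check whether the runtime library is now registered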

Logs or other output that would be helpful

Here's the full output of the last operation:

ERROR: /home/alexander/.local/src/tensorflow/tensorflow/cc/BUILD:28:1: Executing genrule //tensorflow/cc:random_ops_genrule failed: namespace-sandbox failed: error executing command 
  (cd /home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow && \
  exec env - \
    PATH=/usr/lib/ccache:/usr/lib/ccache:/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/alexander/.local/bin:/home/alexander/bin:/home/alexander/.local/bin:/home/alexander/bin \
  /home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow/_bin/namespace-sandbox @/home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow/bazel-sandbox/7d27406c-6595-46a2-a8dc-b7faf7c28c88-0.params -- /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc bazel-out/local_linux-opt/genfiles/tensorflow/cc/ops/random_ops.h bazel-out/local_linux-opt/genfiles/tensorflow/cc/ops/random_ops.cc 0').
bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc: error while loading shared libraries: libcudart.so.7.5: cannot open shared object file: No such file or directory
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 630.365s, Critical Path: 137.07s

P.S.: Out of curiosity, what are you TensorFlow devs using as development machines? Have any of you actually tried compiling on a modern Linux distribution that ships a GCC newer than 4.9? You really should. It's quite the experience.

@akors
Author

akors commented Apr 21, 2016

Fun fact: symlinking all CUDA shared libraries into the custom GCC library directory also doesn't work.

@black-puppydog

I have been having this issue too. For now I have resorted to using the nightly builds, which work just fine, and I hope I won't have to touch the C++ portions of TensorFlow or contribute a patch that requires rebuilding...
There was another issue that seemed related (#1701), also on Fedora 23.
I am on Fedora 21 here, with GCC 4.9.2.

@akors
Author

akors commented Apr 22, 2016

@black-puppydog Thanks for the information.
Do I understand correctly that
a) you are using the nightly builds as a "user", but do not build them from source, and
b) even on Fedora 21, the build fails with the missing libcudart.so.7.5?

Another fun fact:
Symlinking /usr/lib64/libcudart.so.7.5 -> /usr/local/cuda/lib64/libcudart.so.7.5 also doesn't work.
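
For reference, the symlink attempt was roughly this (reconstructed from memory; the exact invocation may have differed):

$ sudo ln -s /usr/local/cuda/lib64/libcudart.so.7.5 /usr/lib64/libcudart.so.7.5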

I'm out of ideas over here. I will go back to the CPU-only build, which seems to work, and come back once some dev has shed light on this issue.

@vrv

vrv commented Apr 22, 2016

GCC 4.9.2 is what I currently use and it has worked for me. Assigning @keveman in case he has any ideas; otherwise we may have to rope in the bazel devs.

vrv assigned vrv and keveman and unassigned vrv Apr 22, 2016
@akors
Author

akors commented Apr 24, 2016

@vrv Thanks for the reply. And you are able to build the GPU-version of TensorFlow from source?
Which distribution (with version) do you have? Is the GCC 4.9.2 from your distro? Or self-compiled?

@vrv

vrv commented Apr 24, 2016

I also just tried from another machine and it still works for me (just synced to HEAD today).

On this machine, I'm using Ubuntu 14.04 with GCC 4.8.4 provided by the distro.

If you manually run bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc, do you still see the failure?

Do you see a bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/libcudart.so symlink?
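
Roughly, those two checks would look like this (paths taken from your error output, so adjust as needed):

$ bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc
$ ls -l bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/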

@vrv

vrv commented Apr 24, 2016

Btw, the reason we don't use GCC 5+ is that nvcc currently isn't compatible with it, so we're all stuck on 4.8 or 4.9 :(

@akors
Author

akors commented Apr 24, 2016

If you manually run bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc, do you still see the failure?

No, then the library can be found.

$ bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc
Usage: bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc out.h out.cc include_internal
  include_internal: 1 means include internal ops
[alexander@desktop-fedora tensorflow]$ ldd bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc
    linux-vdso.so.1 (0x00007ffcef5b6000)
    libcudart.so.7.5 => /home/alexander/.local/src/tensorflow/bazel-out/host/bin/tensorflow/cc/ops/../../../_solib_local/_U_S_Sthird_Uparty_Sgpus_Scuda_Ccudart___Uthird_Uparty_Sgpus_Scuda_Slib64/libcudart.so.7.5 (0x00007f5991bd4000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f59918a1000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f599169d000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f5991487000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5991269000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f5990ee7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5990cd0000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f599090e000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f5990706000)
    /lib64/ld-linux-x86-64.so.2 (0x000055fcecd2f000)

Do you see a bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/libcudart.so symlink?

Not libcudart.so, but libcudart.so.7.5. The link is valid and points into the bazel cache directory.

Btw the reason we don't use gcc 5+ is that nvcc currently isn't compatible with it, so we're all stuck on 4.8 or 4.9 :(

I understand. Did you also have to patch your CROSSTOOL file?

On this machine, I'm using ubuntu 14.04, gcc 4.8.4 provided by distro.

Thanks. Does this mean you are using CUDA 7.0, not 7.5?

@vrv

vrv commented Apr 24, 2016

Hmm, this is probably a bazel-related problem, since the binary itself does seem to have the right linkages. Pinging @damienmg for some help, in case he has ideas.

(No, it works for me with 7.5 too; I did not have to patch my CROSSTOOL file.) To be completely honest, I'm not sure why some configurations work and others don't.

@damienmg
Contributor

It is running with sandboxing, so it cannot find the sjared library.

@damienmg
Contributor

Sorry, shared. I should not use my phone configured in French ever again...

Try adding --genrule_strategy=standalone when building TensorFlow.

@akors
Author

akors commented Apr 25, 2016

Thanks a lot! Adding --genrule_strategy=standalone to the bazel command line works! I can now compile and run the trainer with GPU support.

For the curious, the full command line is now:
bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,3.0,1.0 -j 2 //tensorflow/cc:tutorials_example_trainer --verbose_failures
Totally obvious ;)

I will now try to minimize all the changes that I did to get it to run on my system, and then write them up in a comprehensible way.

As for this bug: compiling in standalone mode is not very obvious, so at the very least I think it should go in the documentation. I do believe that some fixing is required (either for bazel or for the tensorflow build rules), but you can close this issue at your discretion.

@damienmg
Contributor

Actually, TensorFlow's configure should add that flag. What is in tools/bazel.rc for you?

@akors
Author

akors commented Apr 25, 2016

This is my tools/bazel.rc after ./configure.

# Autogenerated by configure: DO NOT EDIT
build:cuda --crosstool_top=//third_party/gpus/crosstool
build:cuda --define=using_cuda=true

build --force_python=py2
build --python2_path=/usr/bin/python
build --define=use_fast_cpp_protos=true
build --define=allow_oversize_protos=true

build --spawn_strategy=standalone
test --spawn_strategy=standalone
run --spawn_strategy=standalone

Is it possible that the --spawn_strategy=standalone is only added to the build target, but not to the build:cuda target?

@damienmg
Contributor

/cc @aehlig, who knows exactly how this file is parsed. Yes, it is definitely possible, but it should not happen.

@aehlig

aehlig commented Apr 26, 2016

/cc @aehlig who knows exactly how this file is parsed.

My understanding is that build:cuda options are more specific than plain build options and hence take precedence where conflicting options are specified for build and build:cuda; otherwise they are simply added on top. So, in your case you should get --spawn_strategy=standalone. You can also verify which options are taken from the rc-files by adding --announce_rc to the command-line.

Note, however, that on the command-line you specified --genrule_strategy=standalone whereas from the rc-file you inherit --spawn_strategy=standalone and I'm not sure how those two strategy options interact.

@akors
Author

akors commented Apr 26, 2016

Hi, this is for the people coming here from a Google search, trying to get TensorFlow to compile on their machines:

How to compile TensorFlow from source on Fedora 23 with a custom compiler.

Compiling TensorFlow with GPU support is possible, but a bit tricky on Fedora 23 and up.
The compilation requires a specific GCC version that is not available from the Fedora repositories, and specifying the compiler is more complicated than it should be.

Compiling GCC

For CUDA version 7.5, you need to obtain the source code for GCC version 4.9; you can download it from the GCC mirrors.

Next, you need to install GCC compile-time dependencies:

sudo dnf install mpfr-devel gmp-devel libmpc-devel isl-devel

Now you have to configure the GCC build. For details, check out the GCC configuration page. I suggest installing into a custom prefix, such as /opt/gcc-4.9, and enabling only C and C++, skipping the rest of the GCC languages to save time and disk space. The options --with-as, --with-ld and --with-nm are required; without them the TensorFlow build will fail, complaining that those binaries cannot be found.

./configure --prefix=/opt/gcc-4.9 --disable-nls --enable-languages=c,c++ --with-ld=/bin/ld --with-nm=/bin/nm --with-as=/usr/bin/as

When this step is done, you can compile GCC with the following command:

make -j4

This assumes you want to use 4 processing cores. You can use more or fewer, or omit the -j option entirely.

Finally, run as root:

make install
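
To verify the installation afterwards (just a sanity check, not strictly required):

$ /opt/gcc-4.9/bin/gcc --version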

Compiling bazel

Obtain the bazel source code. You need the current master branch, NOT any of the recent releases.

git clone https://github.com/bazelbuild/bazel.git

To compile bazel, you need to point it at the custom compiler:

export CC=/opt/gcc-4.9/bin/gcc
./compile

This will produce the bazel binary in path/to/bazel/output/bazel
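
A quick check that the freshly built binary works (optional):

$ path/to/bazel/output/bazel version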

Compiling TensorFlow

Obtain the TensorFlow source code

git clone --recurse-submodules https://github.com/tensorflow/tensorflow

Modify the file third_party/gpus/crosstool/CROSSTOOL: find the toolchain entry where the toolchain_identifier is set to "local_linux". Only change entries there; the rest is irrelevant.

Replace the following lines:

cxx_builtin_include_directory: "/usr/lib/gcc/"
cxx_builtin_include_directory: "/usr/local/include"

with the following lines:

cxx_builtin_include_directory: "/opt/gcc-4.9/lib/gcc/"
cxx_builtin_include_directory: "/opt/gcc-4.9/local/include"
cxx_builtin_include_directory: "/opt/gcc-4.9/include"
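
If you are unsure which include directories your custom GCC actually uses, one way to check (my own suggestion, not required by the build) is:

$ echo | /opt/gcc-4.9/bin/gcc -E -x c++ - -v 2>&1 | sed -n '/#include <...> search starts here:/,/End of search list./p'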

Next, run ./configure and set your options. Specify your self-compiled GCC (for me, /opt/gcc-4.9/bin/gcc) as the compiler to be used by nvcc.

To compile the source, use the following command line:

path/to/bazel/output/bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,4.0,1.0 -j 4 //tensorflow/cc:tutorials_example_trainer

Explanations:

  • build: what bazel should do
  • -c opt: Build with optimizations; the documentation says to use it.
  • --config=cuda: Compile with CUDA support. Don't ask me why you have to specify that again, even though you did so in ./configure.
  • --genrule_strategy=standalone: Runs genrules outside of bazel's sandbox. This is required so that generated tool binaries can find the libcudart.so they are linked against (see issue Failed to build from source due to missing libcudart.so.7.5 #2053).
  • --local_resources 4096,4.0,1.0 -j 4: Use at most 4096 MB of memory, 4.0 CPUs, 1.0 of I/O, and 4 parallel jobs. This is required so that the compilation doesn't crash due to running out of memory (I have 8 GB of physical memory and 4 GB of swap). The 4096 is still a lie, because the compilation used more anyway - but at least it didn't crash.
  • //tensorflow/cc:tutorials_example_trainer: What to build.
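
Once the build finishes, the trainer can be run with GPU support; as far as I can tell the invocation is:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu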

I sincerely hope that this guide will be obsolete very soon, and you can just get cracking without all these workarounds. But for now, this will probably be useful.

@akors
Copy link
Author

akors commented Apr 26, 2016

@aehlig

Here's the output with --announce_rc but not --genrule_strategy=standalone

$ bazel build -c opt --config=cuda --local_resources 4096,4.0,1.0 -j 4 --announce_rc //tensorflow/cc:tutorials_example_trainer
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=173
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --force_python=py2 --python2_path=/usr/bin/python --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --crosstool_top=//third_party/gpus/crosstool --define=using_cuda=true

Here's the output with both

$ bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,4.0,1.0 -j 4 --announce_rc //tensorflow/cc:tutorials_example_trainer
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=173
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --force_python=py2 --python2_path=/usr/bin/python --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --crosstool_top=//third_party/gpus/crosstool --define=using_cuda=true

I really don't know how --spawn_strategy=standalone interacts with --genrule_strategy=standalone, but I know that without the latter being passed on the command line, the compilation fails with the missing libcudart.so.

@damienmg
Contributor

As discussed offline with @aehlig, --spawn_strategy and --genrule_strategy are two different things; I totally missed that TensorFlow does not have the latter in its rc file.
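
A minimal sketch of what the rc-file fix could look like (a suggestion, not what configure currently writes):

# appended to tools/bazel.rc
build --genrule_strategy=standalone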


@chemelnucfin
Contributor

--genrule_strategy=standalone also helped me build on Fedora 23. Now I can run the example trainer.

@akors
Author

akors commented May 11, 2016

Hi @itsmeolivia, is there any particular reason why you closed this issue?

Because I just checked again with the latest head (5681406), and I still have the same error as described in the original message.

I believe that compilation should succeed without any magical options that are not described in the tutorial. And my setup really isn't that exotic ;)

@schmiflo

@akors Thanks a lot for pointing this out - it took me quite some time until I found this thread...
