Failed to build from source due to missing libcudart.so.7.5 #2053

Closed
akors opened this issue Apr 21, 2016 · 22 comments

@akors

akors commented Apr 21, 2016

Hi! I tried to compile the tutorials_example_trainer target, and I have quite a journey behind me: I recompiled GCC several times, recompiled bazel dozens of times, and did a fair share of CROSSTOOL editing.
At this point, I am stuck. The compilation fails with the message:

bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc: error while loading shared libraries: libcudart.so.7.5: cannot open shared object file: No such file or directory
Target //tensorflow/cc:tutorials_example_trainer failed to build

I am using TensorFlow HEAD (currently 7b536cd) and have CUDA 7.5 installed in /usr/local/cuda-7.5.

My LD_LIBRARY_PATH is set to :/usr/local/cuda/lib64:/usr/local/cuda/lib64, and the files exist there.
I tried to point bazel to the library directory by adding

+ linker_flag: "-L/usr/local/cuda/lib64"

to the CROSSTOOL file.

Environment info

Operating System: Fedora 23

Installed version of CUDA and cuDNN: 7.5 and 4.0.7

$ ls /usr/local/cuda-7.5/lib64/libcud*
/usr/local/cuda-7.5/lib64/libcudadevrt.a    /usr/local/cuda-7.5/lib64/libcudart.so.7.5.18  /usr/local/cuda-7.5/lib64/libcudnn.so.4
/usr/local/cuda-7.5/lib64/libcudart.so      /usr/local/cuda-7.5/lib64/libcudart_static.a   /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
/usr/local/cuda-7.5/lib64/libcudart.so.7.5  /usr/local/cuda-7.5/lib64/libcudnn.so          /usr/local/cuda-7.5/lib64/libcudnn_static.a

If installed from sources, provide the commit hash: 7b536cd

Steps to reproduce

  1. Recompiled Bazel, version 0.2.1 with the following patch: https://gist.github.com/akors/5db13e874c144b3b111f0e0326d7b771#file-bazel-custom-gcc-patch
  2. Applied the following patch to TensorFlow: https://gist.github.com/akors/5db13e874c144b3b111f0e0326d7b771#file-tensorflow-custom-gcc-patch
  3. Ran ./configure with /opt/gcc-4.9/bin/gcc as compiler, but default otherwise.
  4. Ran bazel build -c opt --config=cuda --local_resources 4096,3.0,1.0 -j 4 //tensorflow/cc:tutorials_example_trainer --verbose_failures

What have you tried?

  1. I cried a lot.
  2. Added linker_flag: "-L/usr/local/cuda/lib64" to third_party/gpus/crosstool/CROSSTOOL.
  3. Added linker_flag: "-Wl,-R/usr/local/cuda/lib64" to third_party/gpus/crosstool/CROSSTOOL.
  4. Recompiled bazel with linker_flag: "-Wl,-R/usr/local/cuda/lib64" added in tools/cpp/CROSSTOOL.
  5. Created a file /etc/ld.so.conf.d/LOCAL_cuda-lib64.conf with the contents /usr/local/cuda/lib64 and ran ldconfig (a sketch of this is shown below).
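
For completeness, item 5 boiled down to roughly this (assuming root privileges; the exact invocation may have differed slightly):

$ echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/LOCAL_cuda-lib64.conf
$ sudo ldconfig
$ ldconfig -p | grep libcudart    # check whether the runtime library is now registered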

Logs or other output that would be helpful

Here's the full output of the last operation:

ERROR: /home/alexander/.local/src/tensorflow/tensorflow/cc/BUILD:28:1: Executing genrule //tensorflow/cc:random_ops_genrule failed: namespace-sandbox failed: error executing command 
  (cd /home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow && \
  exec env - \
    PATH=/usr/lib/ccache:/usr/lib/ccache:/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/home/alexander/.local/bin:/home/alexander/bin:/home/alexander/.local/bin:/home/alexander/bin \
  /home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow/_bin/namespace-sandbox @/home/alexander/.cache/bazel/_bazel_alexander/3fc3a90944d6c7fe99106d6e515412c7/tensorflow/bazel-sandbox/7d27406c-6595-46a2-a8dc-b7faf7c28c88-0.params -- /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc bazel-out/local_linux-opt/genfiles/tensorflow/cc/ops/random_ops.h bazel-out/local_linux-opt/genfiles/tensorflow/cc/ops/random_ops.cc 0').
bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc: error while loading shared libraries: libcudart.so.7.5: cannot open shared object file: No such file or directory
Target //tensorflow/cc:tutorials_example_trainer failed to build
INFO: Elapsed time: 630.365s, Critical Path: 137.07s

P.S.: Out of curiosity, what are you TensorFlow devs using as development machines? Have any of you actually tried compiling on a modern Linux distribution that ships a GCC newer than 4.9? You really should. It's quite the experience.

@akors
Author

akors commented Apr 21, 2016

Fun fact: symlinking all CUDA shared libraries into the custom GCC library directory also doesn't work.

@black-puppydog

I have been having this issue too. For now I have resorted to using the nightly builds, which work just fine, and I hope I won't have to touch the C++ portions of TensorFlow or contribute a patch that requires rebuilding...
There was another issue that seemed related (#1701), also on Fedora 23.
I am on Fedora 21 here, with GCC 4.9.2.

@akors
Author

akors commented Apr 22, 2016

@black-puppydog Thanks for the information.
Do I understand correctly that
a) you are using the nightly builds as a "user", but do not build them from source, and
b) even on Fedora 21, the build fails with the missing libcudart.so.7.5?

Another fun fact:
Symlinking /usr/lib64/libcudart.so.7.5 -> /usr/local/cuda/lib64/libcudart.so.7.5 also doesn't work.
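
For reference, the symlink attempt was roughly this (reconstructed from memory; the exact invocation may have differed):

$ sudo ln -s /usr/local/cuda/lib64/libcudart.so.7.5 /usr/lib64/libcudart.so.7.5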

I'm out of ideas over here. I will go back to the CPU-only build, which seems to work, and come back once some dev has shed light on this issue.

@vrv

vrv commented Apr 22, 2016

GCC 4.9.2 is what I currently use and it has worked for me. Assigning @keveman in case he has any ideas; otherwise we may have to rope in the bazel devs.

vrv assigned vrv and keveman and unassigned vrv Apr 22, 2016
@akors
Author

akors commented Apr 24, 2016

@vrv Thanks for the reply. And you are able to build the GPU-version of TensorFlow from source?
Which distribution (with version) do you have? Is the GCC 4.9.2 from your distro? Or self-compiled?

@vrv

vrv commented Apr 24, 2016

I also just tried from another machine and it still works for me (just synced to HEAD today).

On this machine, I'm using Ubuntu 14.04 with GCC 4.8.4 provided by the distro.

If you manually run bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc, do you still see the failure?

Do you see a bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/libcudart.so symlink?
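
Roughly, those two checks would look like this (paths taken from your error output, so adjust as needed):

$ bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc
$ ls -l bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/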

@vrv

vrv commented Apr 24, 2016

Btw, the reason we don't use GCC 5+ is that nvcc currently isn't compatible with it, so we're all stuck on 4.8 or 4.9 :(

@akors
Author

akors commented Apr 24, 2016

If you manually run bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc, do you still see the failure?

No, then the library can be found.

$ bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc
Usage: bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc out.h out.cc include_internal
  include_internal: 1 means include internal ops
[alexander@desktop-fedora tensorflow]$ ldd bazel-out/host/bin/tensorflow/cc/ops/logging_ops_gen_cc
    linux-vdso.so.1 (0x00007ffcef5b6000)
    libcudart.so.7.5 => /home/alexander/.local/src/tensorflow/bazel-out/host/bin/tensorflow/cc/ops/../../../_solib_local/_U_S_Sthird_Uparty_Sgpus_Scuda_Ccudart___Uthird_Uparty_Sgpus_Scuda_Slib64/libcudart.so.7.5 (0x00007f5991bd4000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f59918a1000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f599169d000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f5991487000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5991269000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f5990ee7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5990cd0000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f599090e000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f5990706000)
    /lib64/ld-linux-x86-64.so.2 (0x000055fcecd2f000)

Do you see a bazel-out/host/bin/tensorflow/cc/ops/random_ops_gen_cc.runfiles/third_party/gpus/cuda/lib64/libcudart.so symlink?

Not libcudart.so, but libcudart.so.7.5. The link is valid and points into the bazel cache directory.

Btw the reason we don't use gcc 5+ is that nvcc currently isn't compatible with it, so we're all stuck on 4.8 or 4.9 :(

I understand. Did you also have to patch your CROSSTOOL file?

On this machine, I'm using ubuntu 14.04, gcc 4.8.4 provided by distro.

Thanks. Does this mean you are using CUDA 7.0, not 7.5?

@vrv

vrv commented Apr 24, 2016

Hmm, this is probably a bazel-related problem, since the binary itself does seem to have the right linkages. Pinging @damienmg for some help, in case he has ideas.

(No, it works for me with 7.5 too; I did not have to patch my CROSSTOOL file.) To be completely honest, I'm not sure why some configurations work and others don't.

@damienmg
Contributor

It is running with sandboxing, so it cannot find the sjared library.

@damienmg
Contributor

Sorry, shared. I should not use my phone configured in French ever again...

Try adding --genrule_strategy=standalone when building TensorFlow.

@akors
Author

akors commented Apr 25, 2016

Thanks a lot! Adding --genrule_strategy=standalone to the bazel command line works! I can now compile and run the trainer with GPU support.

For the curious, the full command line is now:
bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,3.0,1.0 -j 2 //tensorflow/cc:tutorials_example_trainer --verbose_failures
Totally obvious ;)

I will now try to minimize all the changes that I did to get it to run on my system, and then write them up in a comprehensible way.

As for this bug: compiling in standalone mode is not very obvious, so at the very least I think it should go in the documentation. I do believe that some fixing is required (either for bazel or for the tensorflow build rules), but you can close this issue at your discretion.

@damienmg
Contributor

Actually, TensorFlow's configure should add that flag. What is in tools/bazel.rc for you?

@akors
Author

akors commented Apr 25, 2016

This is my tools/bazel.rc after ./configure.

# Autogenerated by configure: DO NOT EDIT
build:cuda --crosstool_top=//third_party/gpus/crosstool
build:cuda --define=using_cuda=true

build --force_python=py2
build --python2_path=/usr/bin/python
build --define=use_fast_cpp_protos=true
build --define=allow_oversize_protos=true

build --spawn_strategy=standalone
test --spawn_strategy=standalone
run --spawn_strategy=standalone

Is it possible that the --spawn_strategy=standalone is only added to the build target, but not to the build:cuda target?

@damienmg
Contributor

/cc @aehlig, who knows exactly how this file is parsed. Yes, it is definitely possible, but it should not happen.

@aehlig

aehlig commented Apr 26, 2016

/cc @aehlig who knows exactly how this file is parsed.

My understanding is that build:cuda options are more specific than plain build options and hence take precedence where conflicting options are specified for build and build:cuda; otherwise they are simply added on top. So, in your case you should get --spawn_strategy=standalone. You can also verify which options are taken from the rc-files by adding --announce_rc to the command-line.

Note, however, that on the command-line you specified --genrule_strategy=standalone whereas from the rc-file you inherit --spawn_strategy=standalone and I'm not sure how those two strategy options interact.

@akors
Author

akors commented Apr 26, 2016

Hi, this is for the people coming here from a Google search, trying to get TensorFlow to compile on their machines:

How to compile TensorFlow from source on Fedora 23 with a custom compiler.

Compiling TensorFlow with GPU support is possible, but a bit tricky on Fedora 23 and up.
The compilation requires a specific GCC version that is not available from the Fedora repositories, and specifying the compiler is more complicated than it should be.

Compiling GCC

For CUDA version 7.5, you need to obtain the source code for GCC version 4.9; you can download it from the GCC mirrors.

Next, you need to install GCC compile-time dependencies:

sudo dnf install mpfr-devel gmp-devel libmpc-devel isl-devel

Now you have to configure the GCC build. For details, check out the GCC configuration page. I suggest installing into a custom prefix, such as /opt/gcc-4.9, and enabling only C and C++, skipping the rest of the GCC languages to save time and disk space. The options --with-as, --with-ld and --with-nm are required; without them the TensorFlow build will fail, complaining that those binaries cannot be found.

./configure --prefix=/opt/gcc-4.9 --disable-nls --enable-languages=c,c++ --with-ld=/bin/ld --with-nm=/bin/nm --with-as=/usr/bin/as

When this step is done, you can compile GCC with the following command:

make -j4

This assumes you want to use 4 processing cores. You can use more or fewer, or omit the -j option entirely.

Finally, run as root:

make install
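
To verify the installation afterwards (just a sanity check, not strictly required):

$ /opt/gcc-4.9/bin/gcc --version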

Compiling bazel

Obtain the bazel source code. You need the current master branch, NOT any of the recent releases.

git clone https://github.com/bazelbuild/bazel.git

To compile bazel, you need to point it at the custom compiler:

export CC=/opt/gcc-4.9/bin/gcc
./compile

This will produce the bazel binary in path/to/bazel/output/bazel
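
A quick check that the freshly built binary works (optional):

$ path/to/bazel/output/bazel version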

Compiling TensorFlow

Obtain the TensorFlow source code

git clone --recurse-submodules https://github.com/tensorflow/tensorflow

Modify the file third_party/gpus/crosstool/CROSSTOOL: find the toolchain entry where the toolchain_identifier is set to "local_linux". Only change entries there; the rest is irrelevant.

Replace the following lines:

cxx_builtin_include_directory: "/usr/lib/gcc/"
cxx_builtin_include_directory: "/usr/local/include"

with the following lines:

cxx_builtin_include_directory: "/opt/gcc-4.9/lib/gcc/"
cxx_builtin_include_directory: "/opt/gcc-4.9/local/include"
cxx_builtin_include_directory: "/opt/gcc-4.9/include"
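
If you are unsure which include directories your custom GCC actually uses, one way to check (my own suggestion, not required by the build) is:

$ echo | /opt/gcc-4.9/bin/gcc -E -x c++ - -v 2>&1 | sed -n '/#include <...> search starts here:/,/End of search list./p'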

Next, run ./configure and set your options. Specify your self-compiled GCC (for me, /opt/gcc-4.9/bin/gcc) as the compiler to be used by nvcc.

To compile the source, use the following command line:

path/to/bazel/output/bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,4.0,1.0 -j 4 //tensorflow/cc:tutorials_example_trainer

Explanations:

  • build: what bazel should do
  • -c opt: Build with optimizations; the documentation says to use it.
  • --config=cuda: Compile with CUDA support. Don't ask me why you have to specify that again, even though you did so in ./configure.
  • --genrule_strategy=standalone: Runs genrules outside of bazel's sandbox. This is required so that generated tool binaries can find the libcudart.so they are linked against (see issue Failed to build from source due to missing libcudart.so.7.5 #2053).
  • --local_resources 4096,4.0,1.0 -j 4: Use at most 4096 MB of memory, 4.0 CPUs, 1.0 of I/O, and 4 parallel jobs. This is required so that the compilation doesn't crash due to running out of memory (I have 8 GB of physical memory and 4 GB of swap). The 4096 is still a lie, because the compilation used more anyway - but at least it didn't crash.
  • //tensorflow/cc:tutorials_example_trainer: What to build.
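
Once the build finishes, the trainer can be run with GPU support; as far as I can tell the invocation is:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu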

I sincerely hope that this guide will be obsolete very soon, and you can just get cracking without all these workarounds. But for now, this will probably be useful.

@akors
Copy link
Author

akors commented Apr 26, 2016

@aehlig

Here's the output with --announce_rc but not --genrule_strategy=standalone

$ bazel build -c opt --config=cuda --local_resources 4096,4.0,1.0 -j 4 --announce_rc //tensorflow/cc:tutorials_example_trainer
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=173
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --force_python=py2 --python2_path=/usr/bin/python --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --crosstool_top=//third_party/gpus/crosstool --define=using_cuda=true

Here's the output with both

$ bazel build -c opt --config=cuda --genrule_strategy=standalone --local_resources 4096,4.0,1.0 -j 4 --announce_rc //tensorflow/cc:tutorials_example_trainer
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=173
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --force_python=py2 --python2_path=/usr/bin/python --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone
INFO: Reading options for 'build' from /home/alexander/.local/src/tensorflow/tools/bazel.rc:
  'build' options: --crosstool_top=//third_party/gpus/crosstool --define=using_cuda=true

I really don't know how --spawn_strategy=standalone interacts with --genrule_strategy=standalone, but I know that without the latter being passed on the command line, the compilation fails with the missing libcudart.so.

@damienmg
Contributor

As discussed offline with @aehlig, --spawn_strategy and --genrule_strategy are two different things; I totally missed that TensorFlow does not have the latter in its rc file.
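
A minimal sketch of what the rc-file fix could look like (a suggestion, not what configure currently writes):

# appended to tools/bazel.rc
build --genrule_strategy=standalone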


@chemelnucfin
Contributor

--genrule_strategy=standalone also helped me build on Fedora 23. Now I can run the example trainer.

@akors
Author

akors commented May 11, 2016

Hi @itsmeolivia, is there any particular reason why you closed this issue?

Because I just checked again with the latest head (5681406), and I still have the same error as described in the original message.

I believe that compilation should succeed without any magical options that are not described in the tutorial. And my setup really isn't that exotic ;)

@schmiflo

@akors Thanks a lot for pointing this out - it took me quite some time until I found this thread...
