
Build failure: undefined reference to protobuf symbols #34117

Closed
dbonner opened this issue Nov 9, 2019 · 43 comments
Assignees
Labels
subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues type:build/install Build and install issues

Comments

@dbonner

dbonner commented Nov 9, 2019

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04.3
  • Python version: Tried with python 3.7.3 and python 3.8
  • Installed using virtualenv? pip? conda?: conda python 3.7.3 and virtualenv python 3.8
  • Bazel version (if compiling from source): 0.29.1
  • GCC/Compiler version (if compiling from source): 7.4.0
  • CUDA/cuDNN version: CUDA 10.0 and cuDNN 7.6.5
  • GPU model and memory: Nvidia RTX 2080 Ti

Describe the problem
Build fails most of the way through.

Provide the exact sequence of commands / steps that you executed before running into the problem

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout -b mybranch   # create a new branch at the current HEAD
bazel build --config=opt --config=cuda --config=v2 --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package

Here is the last part of the terminal's output (attached text file):
tensorflow_build_fails.txt

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@gadagashwini-zz gadagashwini-zz self-assigned this Nov 11, 2019
@gadagashwini-zz gadagashwini-zz added subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues type:build/install Build and install issues labels Nov 11, 2019
@gadagashwini-zz
Contributor

@dbonner, did you run ./configure? If yes, please attach the output of ./configure. Thanks!

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Nov 11, 2019
@normanheckscher
Contributor

https://www.tensorflow.org/install/source#linux

Tested build configuration differs from the OP.

Tested build uses:
gcc 7.3.1
cuDNN 7.4

@dbonner
Author

dbonner commented Nov 12, 2019

@gadagashwini
Here is the output as requested:
./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.29.1 installed.
Please specify the location of python. [Default is /home/daniel/anaconda3/envs/tfgpu/bin/python]:

Found possible Python library paths:
/home/daniel/anaconda3/envs/tfgpu/lib/python3.7/site-packages
Please input the desired Python library path to use. Default is [/home/daniel/anaconda3/envs/tfgpu/lib/python3.7/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: Y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: N
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.

Found CUDA 10.0 in:
/usr/local/cuda/lib64
/usr/local/cuda/include
Found cuDNN 7 in:
/usr/lib/x86_64-linux-gnu
/usr/include
Found TensorRT 6 in:
/usr/lib/x86_64-linux-gnu
/usr/include/x86_64-linux-gnu

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: 7.5

Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
--config=v2 # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Nov 12, 2019
@bas-aarts

I've also not been able to compile TF from source for the last 4 days because of this.

@benbarsdell
Contributor

I see the same errors too.

@pooyadavoodi

I also see the same problem. I tried bazel 0.27.1 and 0.27.2. It seems related to protobuf. What version of protobuf is TF using?

@mihaimaruseac
Collaborator

mihaimaruseac commented Nov 13, 2019

Issue title is misleading; the first error is not an undefined reference to tensorflow::register_op, but:

bazel-out/host/bin/tensorflow/core/libfunctional_ops_op_lib.lo(functional_ops.o): In function `tensorflow::Status tensorflow::errors::InvalidArgument<char const*, unsigned long, char const*, int>(char const*, unsigned long, char const*, int)':
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0x42): undefined reference to `tensorflow::strings::FastInt32ToBufferLeft(int, char*)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0x82): undefined reference to `tensorflow::strings::FastUInt64ToBufferLeft(unsigned long long, char*)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0xd4): undefined reference to `tensorflow::strings::StrCat[abi:cxx11](tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0xef): undefined reference to `tensorflow::Status::Status(tensorflow::error::Code, absl::string_view)'
...

In any case, can you try building again from a fresh clone and attach the entire log of bazel build -s --config=opt --config=cuda --config=v2 --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package &> log.txt?

@pooyadavoodi

The first error I see is this one:

ERROR: /bench/tensorflow-source/tensorflow/python/BUILD:2317:1: Linking of rule '//tensorflow/python:gen_boosted_trees_ops_py_wrappers_cc' failed (Exit 1)
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `google::protobuf::internal::ArenaStringPtr::CreateInstance(google::protobuf::Arena*, std::__cxx11::basic_string<char, std::char_trait
s<char>, std::allocator<char> > const*)':
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPN
S0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0x36): undefined reference to `google::protobuf::internal::ArenaImpl::AllocateAlignedAndAddCleanup(unsigned long, void (*)(void*))'
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPN
S0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0xc0): undefined reference to `google::protobuf::Arena::OnArenaAllocation(std::type_info const*, unsigned long) const'

@dbonner dbonner changed the title Building from head fails - undefined reference to tensorflow::register_op Building from head fails - functional_ops.cc:(.text.startup._Z41__static_initialization_and_destruction_0ii.constprop.70+0x1ac7): undefined reference to `tensorflow::register_op::OpDefBuilderReceiver::OpDefBuilderReceiver(tensorflow::register_op::OpDefBuilderWrapper<true> const&)' Nov 14, 2019
@dbonner
Author

dbonner commented Nov 14, 2019

@mihaimaruseac
log.txt is 119 MB, so it can't be uploaded to GitHub.
You can download it from this link:
https://www.dropbox.com/s/c7xc9y37k617z93/log.txt?dl=0
Many thanks for helping with this issue.

@mihaimaruseac
Collaborator

First error is

ERROR: /home/daniel/tensorflow/tensorflow/python/BUILD:2599:1: Linking of rule '//tensorflow/python:gen_tpu_ops_py_wrappers_cc' failed (Exit 1)
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `google::protobuf::internal::ArenaStringPtr::CreateInstance(google::protobuf::Arena*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*)':
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0x36): undefined reference to `google::protobuf::internal::ArenaImpl::AllocateAlignedAndAddCleanup(unsigned long, void (*)(void*))'
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0xc0): undefined reference to `google::protobuf::Arena::OnArenaAllocation(std::type_info const*, unsigned long) const'

@mihaimaruseac mihaimaruseac changed the title Building from head fails - functional_ops.cc:(.text.startup._Z41__static_initialization_and_destruction_0ii.constprop.70+0x1ac7): undefined reference to `tensorflow::register_op::OpDefBuilderReceiver::OpDefBuilderReceiver(tensorflow::register_op::OpDefBuilderWrapper<true> const&)' Build failure: undefined reference to protobuf symbols Nov 14, 2019
@cliffwoolley
Contributor

@pkanwar23 , can you please help to drive this one? It is blocking several people on our team.

@pooyadavoodi

Issue title is misleading; the first error is not an undefined reference to tensorflow::register_op, but:

bazel-out/host/bin/tensorflow/core/libfunctional_ops_op_lib.lo(functional_ops.o): In function `tensorflow::Status tensorflow::errors::InvalidArgument<char const*, unsigned long, char const*, int>(char const*, unsigned long, char const*, int)':
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0x42): undefined reference to `tensorflow::strings::FastInt32ToBufferLeft(int, char*)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0x82): undefined reference to `tensorflow::strings::FastUInt64ToBufferLeft(unsigned long long, char*)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0xd4): undefined reference to `tensorflow::strings::StrCat[abi:cxx11](tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&, tensorflow::strings::AlphaNum const&)'
functional_ops.cc:(.text._ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_[_ZN10tensorflow6errors15InvalidArgumentIJPKcmS3_iEEENS_6StatusEDpT_]+0xef): undefined reference to `tensorflow::Status::Status(tensorflow::error::Code, absl::string_view)'
...

In any case, can you try building again from a fresh clone and attach the entire log of bazel build -s --config=opt --config=cuda --config=v2 --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package &> log.txt?

Just tried this command on the latest master, and got the same error.
I uploaded the log here: https://drive.google.com/open?id=1MobJh4K6bk_ySptRn77F1dKl4srY3k0I

The first error seems to be this one:

ERROR: /bench/tensorflow-source/tensorflow/cc/BUILD:506:1: Linking of rule '//tensorflow/cc:ops/training_ops_gen_cc' failed (Exit 1)
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `google::protobuf::internal::ArenaStringPtr::CreateInstance(google::protobuf::Arena*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*)':
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0x36): undefined reference to `google::protobuf::internal::ArenaImpl::AllocateAlignedAndAddCleanup(unsigned long, void (*)(void*))'
op_gen_lib.cc:(.text._ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE[_ZN6google8protobuf8internal14ArenaStringPtr14CreateInstanceEPNS0_5ArenaEPKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE]+0xc0): undefined reference to `google::protobuf::Arena::OnArenaAllocation(std::type_info const*, unsigned long) const'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `tensorflow::(anonymous namespace)::MergeArg(tensorflow::ApiDef_Arg*, tensorflow::ApiDef_Arg const&)':
op_gen_lib.cc:(.text._ZN10tensorflow12_GLOBAL__N_18MergeArgEPNS_10ApiDef_ArgERKS1_+0x4c): undefined reference to `google::protobuf::internal::fixed_address_empty_string[abi:cxx11]'
op_gen_lib.cc:(.text._ZN10tensorflow12_GLOBAL__N_18MergeArgEPNS_10ApiDef_ArgERKS1_+0x77): undefined reference to `google::protobuf::internal::fixed_address_empty_string[abi:cxx11]'
bazel-out/host/bin/tensorflow/core/libop_gen_lib.a(op_gen_lib.o): In function `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > absl::strings_internal::JoinRange<google::protobuf::RepeatedPtrField<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(google::protobuf::RepeatedPtrField<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, absl::string_view)':
op_gen_lib.cc:(.text._ZN4absl16strings_internal9JoinRangeIN6google8protobuf16RepeatedPtrFieldINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEESA_RKT_NS_11string_viewE[_ZN4absl16strings_internal9JoinRangeIN6google8protobuf16RepeatedPtrFieldINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEEESA_RKT_NS_11string_viewE]+0x21): undefined reference to `google::protobuf::RepeatedPtrField<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::end() const'

@bas-aarts

That's the exact error I'm seeing. I'm compiling with just:
--config=opt --config=cuda
I tried bazel 0.27, 0.29, 1.0, and 1.1.

@benbarsdell
Contributor

Just noticed this: if I check out a random commit from Oct 2, I see the linker errors when I build with bazel 0.27.1, but not when I build with bazel 0.24.1.

@bas-aarts

I've come to the same conclusion.
Up to 17a2394 (Nov 6) I can build with 0.24.1, but not with 0.27.x.

@artemry-nv

It seems we're observing a similar issue after a recent Bazel version upgrade.

@paapu88

paapu88 commented Nov 18, 2019

I also have problems with this; I'm now trying bazel 0.24.1.
I'm on a Jetson Nano, where tensorflow2 has been compiled from source (I have not tried it yet):
https://jkjung-avt.github.io/build-tensorflow-2.0.0/

@dbonner
Author

dbonner commented Nov 19, 2019

Yes, 0.27.1 still fails with the patch, but it builds with bazel 0.26.1.
You have to run ./configure with bazel 0.27.1 or higher (otherwise configure.py won't let you run ./configure).
Then uninstall bazel (sudo apt remove bazel) and install bazel 0.26.1.
Then the build succeeds.

@sergei-mironov

sergei-mironov commented Nov 19, 2019

Yes, 0.27.1 still fails with the patch, but it builds with bazel 0.26.1.
You have to run ./configure with bazel 0.27.1 or higher (otherwise configure.py won't let you run ./configure).
Then uninstall bazel (sudo apt remove bazel) and install bazel 0.26.1.
Then the build succeeds.

And I'd call this a solution! Here is a simple patch that lowers the minimum bazel requirement to 0.26.1; with it, bazel 0.26.1 worked for me. Hope we'll find a cleverer fix.

diff --git a/configure.py b/configure.py
index d29f3d4464..cd3ce61dec 100644
--- a/configure.py
+++ b/configure.py
@@ -49,7 +49,7 @@ _TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'
 _TF_WORKSPACE_ROOT = ''
 _TF_BAZELRC = ''
 _TF_CURRENT_BAZEL_VERSION = None
-_TF_MIN_BAZEL_VERSION = '0.27.1'
+_TF_MIN_BAZEL_VERSION = '0.26.1'
 _TF_MAX_BAZEL_VERSION = '0.29.1'

 NCCL_LIB_PATHS = [

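As an aside, here is a hypothetical sketch of the kind of min/max version gate the patched constants feed (not configure.py's actual implementation):

```python
# Hypothetical illustration of a bazel version-range check like the one the
# _TF_MIN_BAZEL_VERSION / _TF_MAX_BAZEL_VERSION constants above control.
def parse_version(v: str) -> tuple:
    """Turn '0.26.1' into (0, 26, 1) so tuples compare numerically."""
    return tuple(int(part) for part in v.split("."))

def bazel_in_range(version: str, vmin: str, vmax: str) -> bool:
    """True if version is within [vmin, vmax] inclusive."""
    return parse_version(vmin) <= parse_version(version) <= parse_version(vmax)

print(bazel_in_range("0.26.1", "0.26.1", "0.29.1"))  # accepted after the patch
print(bazel_in_range("0.25.0", "0.26.1", "0.29.1"))  # rejected: too old
```

Tuple comparison is why plain string comparison ("0.9" > "0.27") would be wrong here; the parse step is the essential part.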
FYI I'm running configure with the following setting:

PYTHON_BIN_PATH="/usr/local/bin/python" PYTHON_LIB_PATH="/usr/local/lib/python3.6/dist-packages" TF_ENABLE_XLA="y" TF_NEED_OPENCL_SYCL="n" TF_NEED_ROCM="n" TF_NEED_CUDA="y" TF_DOWNLOAD_CLANG="n" GCC_HOST_COMPILER_PATH="/usr/bin/gcc" CC_OPT_FLAGS="-march=native -Wno-sign-compare" TF_SET_ANDROID_WORKSPACE="n" TF_NEED_TENSORRT=n TF_CUDA_COMPUTE_CAPABILITIES="3.5,6.0" TF_CUDA_CLANG=n TF_NEED_MPI=n ./configure;

sergei-mironov pushed a commit to Huawei-MRC-OSI/tensorflow that referenced this issue Nov 19, 2019
@scentini
Contributor

We have a fix for this issue on the TF side. However, given that the build succeeds with 0.26.1 and is broken with 0.27.1, we should locate the change in Bazel that caused this behavior.
An incompatible change that modifies the linking order was introduced in 0.27 through --incompatible_do_not_split_linking_cmdline. Can someone please try building with 0.27.1 or a newer version, adding --noincompatible_do_not_split_linking_cmdline?
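For reference, the suggested experiment would look something like the following invocation, run from a TensorFlow checkout (other flags taken from the OP's command; treat this as a sketch):

```shell
bazel build --config=opt --config=cuda --config=v2 \
    --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" \
    --noincompatible_do_not_split_linking_cmdline \
    //tensorflow/tools/pip_package:build_pip_package
```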

@scentini
Contributor

Yes, 0.27.1 still fails with the patch.
It builds with bazel 0.26.1.

@dbonner can you please clarify -- was the build with 0.26.1 successful before a73d7ac?
We're seeing failures at HEAD for Windows and macOS due to this commit, and it'll most likely be at least temporarily reverted. I'd like to know if that impacts this issue in any way.

@bas-aarts

Building at a73d7ac with bazel 0.27.2 and --noincompatible_do_not_split_linking_cmdline produces the same linker errors.

@mihaimaruseac
Collaborator

@bmzhao has a fix for this which would be incoming soon.

@dbonner
Author

dbonner commented Nov 19, 2019

@scentini
I don't know if the build would have been successful with bazel 0.26.1 before patch a73d7ac. I tested building after the patch with bazel 0.26.1 and it worked.
I am building the r2.1 branch with bazel 0.26.1 at the moment. It has the same error when you try to build it with bazel 0.27.1.

tensorflow-copybara pushed a commit that referenced this issue Nov 19, 2019
Bazel's change to legacy_whole_archive behavior is not the cause of TF's linking issues with protobuf. Protobuf's implementation and runtime are correctly being linked into TF here: https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/tensorflow/core/platform/default/build_config.bzl#L239 and https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/third_party/protobuf/protobuf.patch#L18, and I've confirmed that protobuf symbols are still present in libtensorflow_framework.so via nm.

After examining the linker flags that bazel passes to gcc, https://gist.github.com/bmzhao/f51bbdef50e9db9b24acd5b5acc95080, I discovered that the order of the linker flags was what was causing the undefined reference.

See https://eli.thegreenplace.net/2013/07/09/library-order-in-static-linking/ and https://stackoverflow.com/a/12272890. Basically, linkers discard objects they've been asked to link if those objects do not export any symbols the linker is currently tracking as "undefined".
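The link-order behavior described above can be reproduced with a tiny standalone example (file names are hypothetical; assumes gcc and binutils are installed):

```shell
# Minimal demonstration that static archive order matters at link time.
cat > foo.c <<'EOF'
int foo(void) { return 42; }
EOF
cat > main.c <<'EOF'
int foo(void);
int main(void) { return foo() == 42 ? 0 : 1; }
EOF
gcc -c foo.c -o foo.o && ar rcs libfoo.a foo.o

# Archive before the object that needs it: the linker scans libfoo.a while it
# has no undefined symbols yet, so foo.o is discarded and the link fails.
if ! gcc libfoo.a main.c -o demo 2>/dev/null; then
  echo "library-first link failed as expected"
fi

# Archive after the object: 'foo' is already undefined when libfoo.a is
# scanned, so the member is pulled in and the link succeeds.
gcc main.c libfoo.a -o demo && echo "object-first link succeeded"
```

This is the same mechanism that made moving -l:libtensorflow_framework.so.2 to the end of the flag order fix the build.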

To prove this was the issue, I was able to successfully link after moving the linking shared object flag (-l:libtensorflow_framework.so.2) to the bottom of the flag order, and manually invoking g++.

This change uses cc_import to link against a .so in the "deps" of tf_cc_binary, rather than as the "srcs" of tf_cc_binary. This technique was inspired by the comment here: https://github.com/bazelbuild/bazel/blob/387c610d09b99536f7f5b8ecb883d14ee6063fdd/examples/windows/dll/windows_dll_library.bzl#L47-L48
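A minimal BUILD-file sketch of the cc_import technique described above (target and file names are hypothetical, not the actual TF change):

```starlark
# Import the prebuilt shared library as its own target, then depend on it,
# so the linker sees the .so after the objects that need its symbols.
cc_import(
    name = "tf_framework",
    shared_library = "libtensorflow_framework.so.2",
)

cc_binary(
    name = "gen_ops_tool",
    srcs = ["main.cc"],
    deps = [":tf_framework"],
)
```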

Successfully built on vanilla Ubuntu 18.04 VM:
bmzhao@bmzhao-tf-build-failure-reproing:~/tf-fix/tf$ bazel build -c opt --config=cuda --config=v2 --host_force_python=PY3 //tensorflow/tools/pip_package:build_pip_package
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 2067.380s, Critical Path: 828.19s
INFO: 12942 processes: 51 remote cache hit, 12891 local.
INFO: Build completed successfully, 14877 total actions

The root cause might instead be bazelbuild/bazel#7687, which is pending further investigation.

PiperOrigin-RevId: 281341817
Change-Id: Ia240eb050d9514ed5ac95b7b5fb7e0e98b7d1e83
@bmzhao
Member

bmzhao commented Nov 19, 2019

Hello!

5caa9e8 is now in master. I've manually tested building with it using bazel 1.1, with the following command:

bazel build -c opt --config=cuda --config=v2 --host_force_python=PY3 //tensorflow/tools/pip_package:build_pip_package


Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 2067.380s, Critical Path: 828.19s
INFO: 12942 processes: 51 remote cache hit, 12891 local.
INFO: Build completed successfully, 14877 total actions

@dbonner can you confirm if Tensorflow head now builds for you as well?

@bas-aarts

@bmzhao, I was able to successfully build TF with bazel 1.1 at 5caa9e8 on Ubuntu 18.04
I do see some failures for Windows and Mac in CI, like the ones mentioned by @scentini before.
Does this mean this change could be reverted?

@bmzhao
Member

bmzhao commented Nov 19, 2019

Hey @bas-aarts,

After double-checking with our buildcop, it looks like the root causes of the current Windows and Mac breakages are other commits (not 5caa9e8). Therefore, I don't expect this change to be rolled back.

@alanpurple
Contributor

Is this error related?

ERROR: /home/vai/repo/tensorflow/tensorflow/python/keras/api/BUILD:115:1: Executing genrule //tensorflow/python/keras/api:keras_python_api_gen_compat_v1 failed (Exit 1)
2019-11-20 13:15:41.004621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
Traceback (most recent call last):
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 771, in <module>
    main()
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 767, in main
    lazy_loading, args.use_relative_imports)
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 625, in create_api_files
    api_version, compat_api_versions, lazy_loading, use_relative_imports)
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 502, in get_api_init_text
    _, attr = tf_decorator.unwrap(attr)
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/util/tf_decorator.py", line 219, in unwrap
    elif _has_tf_decorator_attr(cur):
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/util/tf_decorator.py", line 124, in _has_tf_decorator_attr
    hasattr(obj, '_tf_decorator') and
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/util/lazy_loader.py", line 62, in __getattr__
    module = self._load()
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/python/util/lazy_loader.py", line 45, in _load
    module = importlib.import_module(self.__name__)
  File "/home/vai/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/compiler/tf2tensorrt/wrap_py_utils.py", line 28, in <module>
    _wrap_py_utils = swig_import_helper()
  File "/home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/compiler/tf2tensorrt/wrap_py_utils.py", line 24, in swig_import_helper
    _mod = imp.load_module('_wrap_py_utils', fp, pathname, description)
  File "/home/vai/anaconda3/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/home/vai/anaconda3/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 583, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1043, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /home/vai/.cache/bazel/_bazel_vai/964c7018fd2d0d2d2cf98e15f592d3c8/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/python/keras/api/create_tensorflow.python_api_1_keras_python_api_gen_compat_v1.runfiles/org_tensorflow/tensorflow/compiler/tf2tensorrt/_wrap_py_utils.so: undefined symbol: _ZN10tensorflow3Env7DefaultEv
----------------
Note: The failure of target //tensorflow/python/keras/api:create_tensorflow.python_api_1_keras_python_api_gen_compat_v1 (with exit code 1) may have been caused by the fact that it is a Python 2 program that was built in the host configuration, which uses Python 3. You can change the host configuration (for the entire build) to instead use Python 2 by setting --host_force_python=PY2.

If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /home/vai/repo/tensorflow/tensorflow/python/tools/BUILD:141:1 Executing genrule //tensorflow/python/keras/api:keras_python_api_gen_compat_v1 failed (Exit 1)
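As an aside, mangled symbols like the one in the ImportError above can be decoded with binutils' c++filt; a small sketch that falls back to the raw name when the tool is absent:

```python
import shutil
import subprocess

def demangle(symbol: str) -> str:
    """Demangle an Itanium-ABI C++ symbol via c++filt, if it is installed."""
    if shutil.which("c++filt") is None:
        return symbol  # binutils not available; return the mangled name as-is
    result = subprocess.run(["c++filt", symbol], capture_output=True, text=True)
    return result.stdout.strip() or symbol

# The undefined symbol from the log above, which c++filt renders as
# tensorflow::Env::Default():
print(demangle("_ZN10tensorflow3Env7DefaultEv"))
```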

@bmzhao
Member

bmzhao commented Nov 20, 2019

@alanpurple could you file a separate issue including all relevant information to reproduce the error? Please see https://github.com/tensorflow/tensorflow/blob/master/ISSUE_TEMPLATE.md#system-information

@bmzhao bmzhao closed this as completed Nov 20, 2019

@dbonner
Author

dbonner commented Nov 20, 2019

I am still in the process of building with bazel 1.1.0 and will report back how it goes.
If successful, can we please update configure.py to allow a maximum bazel version of 1.1.0?

@scentini
Contributor

I believe the undefined symbols errors are caused by 2 different Bazel flags:
--incompatible_remove_legacy_whole_archive
--incompatible_do_not_split_linking_cmdline

@bmzhao confirmed that --noincompatible_do_not_split_linking_cmdline allows a successful build even without the fix 5caa9e8

The --incompatible_remove_legacy_whole_archive flag has proven difficult to keep enabled due to regressions and undefined symbols at runtime. @alanpurple, I suspect the linking issue you're running into comes from it. Due to lack of test coverage, the codebase can regress with respect to this flag over time.

I am working on setting the default for --incompatible_remove_legacy_whole_archive to false. a73d7ac did that, but it was reverted in 5caa9e8 because it caused Windows and macOS failures due to multiply-defined symbols. I'm looking into fixing that now.

@dbonner configure.py already sets the max version to 1.1.0. Or is there another configure.py somewhere?
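For context, the version gate in configure.py is essentially a min/max comparison on the Bazel version string. A minimal sketch of that kind of check (function names here are illustrative, not TensorFlow's actual code):

```python
def parse_version(v):
    """Turn a version string like '1.1.0' into a comparable tuple (1, 1, 0)."""
    return tuple(int(part) for part in v.split("."))

def bazel_in_range(version, min_version="0.27.1", max_version="1.1.0"):
    """Return True if `version` falls inside the supported window."""
    return parse_version(min_version) <= parse_version(version) <= parse_version(max_version)

print(bazel_in_range("0.29.1"))  # True
print(bazel_in_range("1.2.0"))   # False
```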

@artemry-nv

JFYI
In our similar case, the problem has not been reproducible with TensorFlow master since yesterday (it looks like some fixes landed). Bazel version: 0.29.1.
For example, this TensorFlow commit (95051b9) is built successfully.

@Molkree
Contributor

Molkree commented Nov 20, 2019

I was having the same issue building 76f94a5 with Bazel 1.1.0, but adding --noincompatible_do_not_split_linking_cmdline as @scentini suggested allowed it to build successfully (thank you!).

mihaimaruseac pushed a commit that referenced this issue Dec 4, 2019
Bazel's change to legacy_whole_archive behavior is not the cause for TF's linking issues with protobuf. Protobuf's implementation and runtime are correctly being linked into TF here: https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/tensorflow/core/platform/default/build_config.bzl#L239 and https://github.com/tensorflow/tensorflow/blob/da5765ebad2e1d3c25d11ee45aceef0b60da499f/third_party/protobuf/protobuf.patch#L18, and I've confirmed that protobuf symbols are still present in libtensorflow_framework.so via nm.

After examining the linker flags that Bazel passes to gcc (https://gist.github.com/bmzhao/f51bbdef50e9db9b24acd5b5acc95080), I discovered that the order of the linker flags was causing the undefined references.

See https://eli.thegreenplace.net/2013/07/09/library-order-in-static-linking/ and https://stackoverflow.com/a/12272890. Basically, the linker discards an object it has been asked to link if that object does not export any symbols that the linker currently tracks as "undefined".

To prove this was the issue, I moved the shared-object flag (-l:libtensorflow_framework.so.2) to the end of the flag order, manually invoked g++, and the link succeeded.

This change uses cc_import to link against a .so in the "deps" of tf_cc_binary, rather than in the "srcs" of tf_cc_binary. This technique was inspired by the comment here: https://github.com/bazelbuild/bazel/blob/387c610d09b99536f7f5b8ecb883d14ee6063fdd/examples/windows/dll/windows_dll_library.bzl#L47-L48
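A sketch of that technique in a BUILD file (target names here are illustrative, not TensorFlow's actual rules):

```python
cc_import(
    name = "tf_framework_import",
    shared_library = ":libtensorflow_framework.so.2",
)

cc_binary(
    name = "my_tool",
    srcs = ["my_tool.cc"],
    # Linking via deps (not srcs) lets Bazel place the .so after the
    # objects that reference its symbols on the link command line.
    deps = [":tf_framework_import"],
)
```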

Successfully built on vanilla Ubuntu 18.04 VM:
bmzhao@bmzhao-tf-build-failure-reproing:~/tf-fix/tf$ bazel build -c opt --config=cuda --config=v2 --host_force_python=PY3 //tensorflow/tools/pip_package:build_pip_package
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 2067.380s, Critical Path: 828.19s
INFO: 12942 processes: 51 remote cache hit, 12891 local.
INFO: Build completed successfully, 14877 total actions

The root cause might instead be bazelbuild/bazel#7687, which is pending further investigation.

PiperOrigin-RevId: 281341817
Change-Id: Ia240eb050d9514ed5ac95b7b5fb7e0e98b7d1e83
@pooyadavoodi

@mihaimaruseac
Looks like --noincompatible_do_not_split_linking_cmdline is still needed with Bazel 1.1.0 to compile master as of an hour ago.

@scentini
Contributor

scentini commented Dec 5, 2019

cc @hlopko to deal with this from Bazel side.

@trentlo
Contributor

trentlo commented Dec 5, 2019

I also hit a related issue: I need the flag --noincompatible_do_not_split_linking_cmdline to pass XLA tests.

For example:
bazel test -c opt --config=cuda --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 --noincompatible_do_not_split_linking_cmdline //tensorflow/compiler/xla/service/gpu:horizontal_fusion_test

If I do not add the flag, I hit the error below due to the linkage change.

2019-12-05 02:01:06.565142: F tensorflow/compiler/xla/tests/hlo_test_base.cc:53] Non-OK-status: result.status() status: Not found: Could not find registered platform with name: "interpreter"could not get interpreter platform

*** Received signal 6 ***

*** BEGIN MANGLED STACK TRACE ***
