Errors_impl - NotFoundError - stringpiece #6473

Closed
MircoT opened this Issue Dec 23, 2016 · 14 comments

@MircoT
Contributor

MircoT commented Dec 23, 2016

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

None.

Environment info

Operating System: Ubuntu 16.04.1 LTS

Installed version of CUDA and cuDNN: None

Binary pip package info:

  1. A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.0rc0-cp35-cp35m-linux_x86_64.whl
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)": 0.12.0

The error comes from the binary packages indicated above. I had no problems with the package built from source.

Source info:

  1. The commit hash (git rev-parse HEAD): 48fb73a
  2. The output of bazel version:
Build label: 0.4.1
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Nov 30 09:13:36 2016 (1480497216)
Build timestamp: 1480497216
Build timestamp as int: 1480497216

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

I suspect the problem is that my new op uses some functions from this part of the core:

#include "tensorflow/core/lib/core/stringpiece.h"

What other attempted solutions have you tried?

This is strange because, as I said, with the package built from source I get no error when running the script, but with the binary package from the official website I get the runtime error below.

Logs or other output that would be helpful

tensorflow.python.framework.errors_impl.NotFoundError: /.../newop.so: undefined symbol: _ZN10tensorflow9LogMemory21RecordRawDeallocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEExPvPNS_9AllocatorEb

@MircoT

Contributor

MircoT commented Dec 27, 2016

No more problems with the new version 0.12 on macOS.

@MircoT MircoT closed this Dec 27, 2016

@MircoT MircoT reopened this Dec 27, 2016

@MircoT

Contributor

MircoT commented Dec 27, 2016

The problem still occurs on Linux, as described above. One more piece of information:

To compile the plugin correctly on Linux I had to add libprotobuf and libprotobuf_lite as dependencies. I took the libraries from the build output directory after running bazel build.

@aselle

Member

aselle commented Dec 28, 2016

@jart, could you take a look at this issue?

@jart

Member

jart commented Dec 29, 2016

So you're writing a custom op. Can you post your Bazel build configuration?

@MircoT

Contributor

MircoT commented Dec 29, 2016

I'm sorry, but what do you mean by build configuration?

By the way, I was working on a clean system with Ubuntu 16.04. The fresh installation has only the basics needed to build TensorFlow. I followed the instructions from the main website, and I installed bazel from the repository.

The system was previously configured with the CUDA framework, with the proper video driver and toolkit, and also with cuDNN.

To compile the TensorFlow library I enabled the AVX instruction set.

To build the custom op I'm using gcc as described in the tutorial, so I don't have a bazel project for the custom op.

I hope this information is useful for the moment. If you need more specific details I will respond as soon as I can.

Thank you for the support.

@jart

Member

jart commented Dec 30, 2016

Thank you for the support.

I'm glad to be of service. But please note for future reference that support is community-driven on StackOverflow. That is a more appropriate venue for issues like these. We try to keep this issue tracker limited to bugs and feature requests.

That said, you're most likely forgetting to link one of the TensorFlow shared objects into your program. Wild guess but try saying from tensorflow.python import pywrap_tensorflow before you call tf.load_op_library.

Also, I strongly recommend using Bazel. I don't know why the documentation says to use gcc. Basically there's a directory called tensorflow/user_ops that shows you how to do what you want to do. You can customize that directory to your heart's content.

If this doesn't solve your issue, let me know and I'll reopen this.

@jart jart closed this Dec 30, 2016

@MircoT

Contributor

MircoT commented Dec 30, 2016

I will switch to bazel, but basically I followed these instructions:

https://www.tensorflow.org/versions/r0.11/how_tos/adding_an_op/#building_the_op_library

I wrote here because the problem happens only with the binary package provided on the official website, and I didn't have any problems with the package compiled from source. For similar cases in the future I will use Stack Overflow.

Thank you for the response.

@jart

Member

jart commented Dec 30, 2016

Thank you for clarifying. So this only happens if you try to use something in stringpiece.h with the binary release, but compiling TensorFlow from source on your computer works fine. So maybe this is an ABI compatibility type issue. Maybe you're using a different version of GCC than what we used to build the release.

Hey @mrry would we consider something like this to be a bug? Does undefined symbol: _ZN10tensorflow9LogMemory21RecordRawDeallocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEExPvPNS_9AllocatorEb mean anything to you?

@mrry

Contributor

mrry commented Dec 30, 2016

Yes it looks like the extension and the binary package use a different definition of std::string. The demangled name from the error message expands to:

tensorflow::LogMemory::RecordRawDeallocation(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long long, void*, tensorflow::Allocator*, bool)

When I look for the corresponding symbol in the binary package I get the following:

$ objdump -t lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so | grep RecordRawDealloc | c++filt
0000000001f68110 g     F .text  00000000000003a3              tensorflow::LogMemory::RecordRawDeallocation(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long long, void*, tensorflow::Allocator*, bool)

Note the difference between std::__cxx11::basic_string<...> in the extension and std::basic_string<...> in the binary package.

There are a few workarounds, which seem to involve defining _GLIBCXX_USE_CXX11_ABI to 0 when you compile your extension. See this Stack Overflow answer for details.
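The difference described above is also visible in the mangled name itself, without running c++filt: the std::__cxx11 namespace is encoded as St7__cxx11. A minimal sketch of that check follows; the helper name uses_cxx11_abi is hypothetical, and the check is only a heuristic, since symbols whose signatures contain no std::string (or other dual-ABI type) carry no such tag under either ABI.

```python
# Heuristic: does an Itanium-mangled C++ symbol mention the std::__cxx11
# inline namespace that libstdc++ uses for its newer (GCC 5+) ABI?
CXX11_TAG = "St7__cxx11"  # mangled spelling of std::__cxx11


def uses_cxx11_abi(mangled_symbol):
    """Return True if the mangled name references std::__cxx11."""
    return CXX11_TAG in mangled_symbol


# The undefined symbol reported in this issue (split only to keep lines short).
missing = (
    "_ZN10tensorflow9LogMemory21RecordRawDeallocationERKNSt7__cxx11"
    "12basic_stringIcSt11char_traitsIcESaIcEEExPvPNS_9AllocatorEb"
)

print(uses_cxx11_abi(missing))  # True: the extension expects the new ABI
```

A True result for a symbol the loader cannot find suggests the extension was compiled against the new ABI while the library providing the symbol was not.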

@MircoT

Contributor

MircoT commented Dec 31, 2016

I already compiled the extension with the -D_GLIBCXX_USE_CXX11_ABI=0 flag, but I still have to try the suggested define inside my sources. I will do it soon, but I think the problem is perfectly explained.

Is it possible to add that suggestion inside the documentation?

Thank you so much for the support, please be patient for a response, I can't try the fix right now.

@mrry

Contributor

mrry commented Jan 3, 2017

I just took a look at the docs, and apparently there is a note buried in there:

Note on gcc version 5: gcc5 uses the new C++ ABI. The binary pip packages available on the TensorFlow website are built with gcc4 that uses the older ABI. If you compile your op library with gcc5, add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line to make the library compatible with the older abi.

If there's something you'd like to improve there, please feel free to submit a PR!
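As a rough illustration of where the flag from that note goes, here is a sketch that assembles a g++ command line for an op library. The function name op_build_command, the file names, and the include path are hypothetical placeholders, and every flag other than the ABI define is an assumption modeled on a typical shared-library build rather than a quotation from the docs. The script only prints the command; it does not run the compiler.

```python
import shlex


def op_build_command(source, output, tf_include_dir, old_abi=True):
    """Assemble (but do not run) a g++ command for a custom op library."""
    cmd = ["g++", "-std=c++11", "-shared", source, "-o", output,
           "-fPIC", "-I", tf_include_dir]
    if old_abi:
        # Match binary packages built with the older libstdc++ ABI.
        cmd.append("-D_GLIBCXX_USE_CXX11_ABI=0")
    return cmd


# Hypothetical file names and include path, for illustration only.
cmd = op_build_command("newop.cc", "newop.so", "/path/to/tensorflow/include")
print(" ".join(shlex.quote(part) for part in cmd))
```

The define has to reach every translation unit and every library that exchanges std::string with TensorFlow, which is why keeping it in one shared command builder is convenient.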

@MircoT

Contributor

MircoT commented Jan 5, 2017

I solved the problem with the tip you gave me, but I want to explain the situation better:

I have a new op for TensorFlow that I compile with gcc. I was already using the flag mentioned above (the one in the documentation note), but I forgot it for some libraries, which caused the symbol mismatch.

With the op built correctly (using -D_GLIBCXX_USE_CXX11_ABI=0) I had similar problems, because I use a TensorFlow Python package built from source with bazel. Building from source without the option --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" caused the same problem with a different function.

For the moment I don't have a bazel project for the new op, so I rebuilt the TensorFlow package from source with that option and everything worked.

My tentative conclusion: working in the TF source tree and building the new op from there causes no problem, whether or not you use --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0". Otherwise, if you have TF compiled from source and want to build the op with gcc, it is better to use the same ABI for both.

I do not know whether it is appropriate to add a note about building from source with that flag, to stay aligned with the official package. This is only a possible source of confusion, because you can choose to compile the new op with different options.

I'm sorry for the delay, but I ran some tests to be sure about the situation.
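One way to check in advance that the op and the TensorFlow library agree is to compare the symbols the op needs with the symbols the library provides. The sketch below parses text in the style printed by nm -D and lists TensorFlow symbols the op requires that the library does not define. All names here (parse_nm, missing_tf_symbols, the sample strings) are hypothetical; the sample text is illustrative rather than captured tool output, the old-ABI mangled name is my own reconstruction and does not appear in the thread, and the parser ignores details such as versioned symbol suffixes.

```python
def parse_nm(nm_output):
    """Split `nm -D`-style text into (undefined, defined) symbol-name sets."""
    undefined, defined = set(), set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "U":
            undefined.add(fields[1])   # "U <name>": needed, not provided
        elif len(fields) == 3:
            defined.add(fields[2])     # "<address> <type> <name>": provided
    return undefined, defined


def missing_tf_symbols(op_nm_output, tf_nm_output):
    """TensorFlow symbols the op needs that the TensorFlow library lacks."""
    op_undefined, _ = parse_nm(op_nm_output)
    _, tf_defined = parse_nm(tf_nm_output)
    return sorted(s for s in op_undefined
                  if "10tensorflow" in s and s not in tf_defined)


# Illustrative sample text, not captured tool output.
NEW_ABI = ("_ZN10tensorflow9LogMemory21RecordRawDeallocationERKNSt7__cxx11"
           "12basic_stringIcSt11char_traitsIcESaIcEEExPvPNS_9AllocatorEb")
# Reconstructed old-ABI spelling of the same function (an assumption).
OLD_ABI = "_ZN10tensorflow9LogMemory21RecordRawDeallocationERKSsxPvPNS_9AllocatorEb"

OP_NM = "                 U " + NEW_ABI + "\n                 U memcpy\n"
TF_NM = "0000000001f68110 T " + OLD_ABI + "\n"

for name in missing_tf_symbols(OP_NM, TF_NM):
    print("unresolved:", name)
```

An empty result does not prove the two builds are compatible, but a non-empty one points directly at the functions that would fail at load time.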

@jart

Member

jart commented Jan 14, 2017

Thank you for providing more useful information for future people googling this issue. If you want to contribute to the documentation, we would be happy to review any pull requests you would be generous enough to contribute.

@jart jart closed this Jan 14, 2017

@MircoT

Contributor

MircoT commented Jan 16, 2017

I will try to write something for the documentation. Thank you for the support!

MircoT added a commit to MircoT/tensorflow that referenced this issue Jan 21, 2017

decentralion added a commit that referenced this issue Mar 7, 2017

Add note to new op and source build doc (#6473) (#6997)
* Add note to new op and source build doc (#6473)

* Add install source note

* Fix note style