Using tensorflow.contrib with cv_bridge causes tcmalloc error #8146

Closed
ethanabrooks opened this issue Mar 6, 2017 · 12 comments

ethanabrooks commented Mar 6, 2017

NOTE: Only file GitHub issues for bugs and feature requests. All other topics will be closed.

For general support from the community, see StackOverflow.
To make bugs and feature requests easier to find and organize, we close issues that are deemed
out of scope for GitHub Issues and point people to StackOverflow.

For bugs or installation issues, please provide the following information.
The more information you provide, the more easily we will be able to offer
help and advice.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

None.

Environment info

Operating System:

❯ uname -a 
Linux dos 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

❯ ls -l /path/to/cuda/lib/libcud*
ls: cannot access /path/to/cuda/lib/libcud*: No such file or directory

If installed from binary pip package, provide:

  1. A link to the pip package you installed:
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
❯ python -c "import tensorflow; print(tensorflow.__version__)"
1.0.0

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

import tensorflow.contrib
import cv_bridge

import rospy
rospy.init_node('node')

This throws the following error:

/usr/bin/python2.7 /home/ethan/.PyCharmCE2016.3/config/scratches/scratch_4.py
src/tcmalloc.cc:277] Attempt to free invalid pointer 0xa2e78616d5f7475 

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I'll also post to stackoverflow and to the cv_bridge page (ros-perception/vision_opencv#161).

What other attempted solutions have you tried?

I tried reinstalling ros and tensorflow. No change. I also tried print(cv_bridge.__file__) to make sure I was importing the right directory for cv_bridge.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

Member

prb12 commented Mar 7, 2017

@jhseu Could you please comment on whether recent jemalloc/tcmalloc changes might affect this?

Contributor

jhseu commented Mar 7, 2017

It's unrelated to jemalloc. My guess is it's an issue with your usage of tcmalloc.

That error would happen if memory is allocated with tcmalloc's malloc() and then freed with glibc's free(), or vice versa. Can you disable tcmalloc?
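
One way to check whether tcmalloc is actually mapped into the running Python process is to look at /proc/self/maps. This is a minimal sketch, assuming Linux; the helper names are hypothetical and not part of TensorFlow, ROS, or cv_bridge.

```python
import os


def allocators_in_maps(lines):
    """Return sorted paths of mapped files whose name mentions tcmalloc or jemalloc.

    `lines` is any iterable of lines in /proc/<pid>/maps format:
    address perms offset dev inode [pathname]
    """
    needles = ("tcmalloc", "jemalloc")
    found = set()
    for line in lines:
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname column
        path = fields[5].strip()
        if any(needle in os.path.basename(path) for needle in needles):
            found.add(path)
    return sorted(found)


def mapped_allocators(maps_path="/proc/self/maps"):
    """Inspect the current process; returns [] where the maps file does not exist."""
    if not os.path.exists(maps_path):
        return []
    with open(maps_path) as maps:
        return allocators_in_maps(maps)
```

Calling mapped_allocators() at the top of the failing script, before any other import, would show whether an alternative allocator was already loaded when the interpreter started.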

Author

ethanabrooks commented Mar 7, 2017

@jhseu, I'm not exactly sure how to disable tcmalloc. I assume it's getting called from either tensorflow or cv_bridge, so would the best way be to find the actual tcmalloc function call and change it to malloc?

Contributor

jhseu commented Mar 7, 2017

It's definitely not in TensorFlow. We don't use tcmalloc anywhere.

So it's either coming from your environment or being used by cv_bridge. You can track it down by starting gdb python and then, at the gdb prompt, entering run /path/to/your/script.py.
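
As a complement to gdb, the standard-library faulthandler module can write the Python-level traceback when the process receives a fatal signal such as SIGABRT. This is a minimal sketch, assuming Python 3.3 or later (the interpreter in this thread is Python 2.7, where the module is not in the standard library); enable_crash_traceback and the log path are hypothetical names.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as the handler
    is enabled, because faulthandler writes to its file descriptor.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling enable_crash_traceback("crash.log") before the imports in the reproduction script should leave a record of which Python line each thread was executing when the abort happened.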

Author

ethanabrooks commented Mar 7, 2017

This was the output:

(gdb) run test.py
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2eda700 (LWP 5777)]
[New Thread 0x7ffff26d9700 (LWP 5778)]
[New Thread 0x7fffefed8700 (LWP 5779)]
[New Thread 0x7fffed6d7700 (LWP 5780)]
[New Thread 0x7fffeaed6700 (LWP 5781)]
[New Thread 0x7fffe86d5700 (LWP 5782)]
[New Thread 0x7fffe5ed4700 (LWP 5783)]
[Thread 0x7fffe86d5700 (LWP 5782) exited]
[Thread 0x7fffe5ed4700 (LWP 5783) exited]
[Thread 0x7fffed6d7700 (LWP 5780) exited]
[Thread 0x7fffeaed6700 (LWP 5781) exited]
[Thread 0x7ffff2eda700 (LWP 5777) exited]
[Thread 0x7fffefed8700 (LWP 5779) exited]
[Thread 0x7ffff26d9700 (LWP 5778) exited]
[New Thread 0x7fffe5ed4700 (LWP 5788)]
src/tcmalloc.cc:277] Attempt to free invalid pointer 0xa2e78616d5f7475 

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe5ed4700 (LWP 5788)]
0x00007ffff75e2cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.

I wasn't really able to make sense of it. I also searched through all of /opt/ros/indigo/ for tcmalloc with no results.

In a debugger, I stepped through the program until it threw the error. The offending line was /opt/ros/indigo/lib/python2.7/dist-packages/rosgraph/xmlrpc.py:199:

    def start(self):
        """
        Initiate a thread to run the XML RPC server. Uses thread.start_new_thread.
        """
        _thread.start_new_thread(self.run, ())

Is it possible that cv_bridge is using a version of OpenCV that is not compatible with the recent TensorFlow update?

❯ pkg-config --modversion opencv
2.4.13

Contributor

jhseu commented Mar 7, 2017

We don't depend on opencv in TensorFlow, so I'm not sure. Closing out, though, because this bug is unlikely to be an issue in TensorFlow.

jhseu closed this as completed Mar 7, 2017

Author

ethanabrooks commented Mar 7, 2017

That may be, but the script does not throw the error without the import tensorflow.contrib line.

Contributor

jhseu commented Mar 7, 2017

It's still unlikely to be in TensorFlow. My best guess without trying it out is that there's a shared module dependency somewhere, TF is using glibc malloc upon module import, and somewhere along the line someone is using tcmalloc and freeing.

Libraries shouldn't be switching out malloc implementations unless the replacement's usage is completely self-contained.

Author

ethanabrooks commented Mar 7, 2017

I (sort of) fixed it:

import cv_bridge  # note: now imported first,
import tensorflow.contrib  # before tensorflow.contrib (order swapped)
import rospy
rospy.init_node('node')

This does not throw an error. Why the order of imports matters is beyond me. These kinds of things seem to crop up often when working with ROS.

Contributor

jhseu commented Mar 7, 2017

Yeah, import order affects symbol resolution order. My explanation before is most likely right, and it's a bug in cv_bridge.
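
To see which shared objects each import brings into the process, and therefore what changes when the order is swapped, the mapped files can be compared before and after an import. This is a rough sketch, assuming Linux and /proc/self/maps; mapped_objects and objects_added_by_import are hypothetical helpers.

```python
import importlib
import os


def mapped_objects(maps_path="/proc/self/maps"):
    """Return the set of mapped file paths that look like shared objects."""
    if not os.path.exists(maps_path):
        return set()
    paths = set()
    with open(maps_path) as maps:
        for line in maps:
            fields = line.split(None, 5)
            # Only lines with a pathname column have six fields.
            if len(fields) == 6 and ".so" in fields[5]:
                paths.add(fields[5].strip())
    return paths


def objects_added_by_import(module_name):
    """Import module_name and return the shared objects that were newly mapped."""
    before = mapped_objects()
    importlib.import_module(module_name)
    return sorted(mapped_objects() - before)
```

Running objects_added_by_import("cv_bridge") followed by objects_added_by_import("tensorflow.contrib"), then repeating in the opposite order in a fresh interpreter, would show what each import loads. A library loaded at process start is already mapped before the first call, so it never appears in these differences.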

Member

prb12 commented Mar 7, 2017

@lobachevzky Thanks for following up with the workaround. It does indeed look like cv_bridge is doing something bad with tcmalloc.

Author

ethanabrooks commented Mar 9, 2017

Ok. I did a little more digging and I found this line in my .zshrc:

export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"  

Commenting this out solved the problem. I'm not really sure which of the three libraries that were involved would be responsible, but it might be good to include a more informative error message.
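
A script can check for this situation itself by reading LD_PRELOAD from its environment. This is a minimal sketch; the function names are hypothetical, and it relies only on the documented behaviour that the dynamic linker accepts spaces or colons as separators in LD_PRELOAD.

```python
import os


def preloaded_libraries(environ=None):
    """Return the list of libraries named in LD_PRELOAD (empty if unset)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("LD_PRELOAD", "")
    # Entries may be separated by spaces or colons.
    return raw.replace(":", " ").split()


def environment_without_preload(environ=None):
    """Return a copy of the environment with LD_PRELOAD removed."""
    env = dict(os.environ if environ is None else environ)
    env.pop("LD_PRELOAD", None)
    return env
```

Changing LD_PRELOAD inside an already running process does not unload anything, so the cleaned environment is only useful for a child process, for example by passing env=environment_without_preload() to subprocess.call.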
