OS X segfault on import #2278

Closed
pikeas opened this Issue May 9, 2016 · 19 comments

pikeas commented May 9, 2016

OS X 10.11.2, with CUDA:

/Developer/NVIDIA/CUDA-7.5/lib/libcudadevrt.a
/Developer/NVIDIA/CUDA-7.5/lib/libcudart.7.5.dylib
/Developer/NVIDIA/CUDA-7.5/lib/libcudart.dylib -> libcudart.7.5.dylib
/Developer/NVIDIA/CUDA-7.5/lib/libcudart_static.a
/Developer/NVIDIA/CUDA-7.5/lib/libcudnn.5.dylib
/Developer/NVIDIA/CUDA-7.5/lib/libcudnn.dylib -> libcudnn.5.dylib
/Developer/NVIDIA/CUDA-7.5/lib/libcudnn_static.a

TensorFlow was built according to https://medium.com/@fabmilo/how-to-compile-tensorflow-with-cuda-support-on-osx-fd27108e27e1#.v8ibv617m, the main differences being that the CUDA toolkit was installed from the NVIDIA installer instead of via brew cask install cuda, and that I used Homebrew Python 3.5 instead of Anaconda Python.

In other words:

  1. Install CUDA toolkit.
  2. Download cudnn-7.5-osx-x64-v5.0-rc.tgz and move files to /Developer/NVIDIA/CUDA-7.5/{include,lib}
  3. Install bazel 0.2.1 via brew.
  4. Create Python 3.5 virtualenv, install numpy 1.11 into it so tensorflow can build against it(?).
  5. Clone tensorflow repo.
  6. Build with:
PYTHON_BIN_PATH="/Users/pikeas/.virtualenvs/hnn/bin/python" CUDA_TOOLKIT_PATH="/Developer/NVIDIA/CUDA-7.5" CUDNN_INSTALL_PATH="/Developer/NVIDIA/CUDA-7.5" TF_UNOFFICIAL_SETTING=1 TF_NEED_CUDA=1 TF_CUDA_COMPUTE_CAPABILITIES="3.0" TF_CUDNN_VERSION="5" TF_CUDA_VERSION="7.5" TF_CUDA_VERSION_TOOLKIT=7.5 ./configure
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package
  7. export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-7.5/lib
  8. Install built tensorflow-0.8.0-py3-none-any.whl into virtualenv.
  9. import tensorflow fails with:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.7.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.7.5.dylib locally
[1]    78583 segmentation fault  python

I've tried removing scipy per the recent similar Linux issue, which didn't help.

pikeas commented May 9, 2016

I created a test file which contains just import tensorflow. Then:

$ gdb -ex r --args python test.py
warning: `/Users/travis/build/MacPython/numpy-wheels/numpy/build/temp.macosx-10.6-intel-3.5/Users/travis/build/MacPython/numpy-wheels/numpy/numpy/_build_utils/src/apple_sgemv_fix.o': can't open to read symbols: No such file or directory.
...
Dozens of lines like this for different numpy files.
My username is not Travis and I'm not running OS X 10.6, so I have no idea where these references are from.
...
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.7.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.7.5.dylib locally

Program received signal SIGSEGV, Segmentation fault.
0x00007fff8f874152 in strlen () from /usr/lib/system/libsystem_c.dylib
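
A complementary way to see which Python-level import was running at the moment of the crash is the standard-library faulthandler module, which dumps the Python traceback when the process receives SIGSEGV. The following is a minimal sketch, assuming the import is probed in a child interpreter so the crash does not take down the calling process; probe_import is a hypothetical helper name, not part of TensorFlow.

```python
import subprocess
import sys


def probe_import(module_name, python=sys.executable):
    """Import module_name in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a negative returncode -N means
    the child was killed by signal N (for example -11 for SIGSEGV).
    """
    result = subprocess.run(
        [python, "-X", "faulthandler", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Example: code, err = probe_import("tensorflow"); print(code); print(err)
```

If the child segfaults, stderr should contain the Python stack that was active at the time, which narrows down the failing module before reaching for gdb.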

petewarden (Member) commented May 9, 2016

From looking at the warnings, it does seem like there's a numpy installation problem. Are you able to run a simple test program that imports numpy?

pikeas commented May 9, 2016

Yes, numpy installed correctly - standard pip install into the virtualenv later used by PYTHON_BIN_PATH. It can be imported and used without issue.

qbx2 (Contributor) commented May 10, 2016

Could you show me a system call trace using dtruss?

pikeas commented May 10, 2016

It's 2200 lines of output, so I've created a gist: https://gist.github.com/pikeas/293556206511cda72b94bcf154c35ddc

It's my first use of dtrace, so let me know if this output is what you need. I invoked it by doing:

sudo -i
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-7.5/lib
dtruss /Users/pikeas/.virtualenvs/hnn/bin/python foo.py > dtruss.out 2>&1

foo.py only imports tensorflow.

qbx2 (Contributor) commented May 10, 2016

ImportError: dlopen(/Users/pikeas/.virtualenvs/hnn/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so, 10): Library not loaded: @rpath/libcudart.7.5.dylib

Could you check whether that file exists there and, if it does not, try putting it there?

pikeas commented May 10, 2016

@qbx2 Hm, I'm not sure what you mean? _pywrap_tensorflow.so exists in that location, and if you're referring to @rpath/libcudart.7.5.dylib, the file is in /Developer/NVIDIA/CUDA-7.5/lib, which is where my CUDA was installed (default location from NVIDIA installer). That's why I set DYLD_LIBRARY_PATH.
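
To confirm which directory on a search path actually contains a given library, a small check like the one below can help. It is only a sketch: find_in_search_path is a hypothetical helper, and it models nothing more than a left-to-right scan of a colon-separated path, which is a simplification of what dyld really does with @rpath and its fallback paths.

```python
import os


def find_in_search_path(lib_name, search_path):
    """Return every existing file named lib_name on a search path.

    search_path is split on os.pathsep (':' on macOS and Linux), and the
    directories are checked in order.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, lib_name)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


# Example:
# path = os.environ.get("DYLD_LIBRARY_PATH", "")
# print(find_in_search_path("libcudart.7.5.dylib", path))
```

An empty result for a library that the loader asks for means the path needs another directory.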

qbx2 (Contributor) commented May 10, 2016

Would you try dtruss using the following method:

  1. Launch python from any shell, and find its pid. (For me, it's 5136)
  2. sudo dtruss -p 5136
  3. import tensorflow in python interactive shell
  4. Check dtruss output
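
The attach-then-import steps above can be made less fiddly by having the Python process print its own PID and pause before importing. This is a sketch only; import_after_attach is a hypothetical helper, and the wait parameter exists just so the pause can be replaced when the function is not run interactively.

```python
import importlib
import os


def import_after_attach(module_name, wait=input):
    """Print this process's PID, pause, then import module_name.

    The pause gives time to run 'sudo dtruss -p <pid>' in another terminal.
    """
    pid = os.getpid()
    wait(f"PID {pid}: run 'sudo dtruss -p {pid}' elsewhere, then press Enter ")
    return importlib.import_module(module_name)


# Example (interactive): import_after_attach("tensorflow")
```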

pikeas commented May 10, 2016

@qbx2 I really appreciate the help! Here's the dtruss output for import tensorflow run in a separate REPL - https://gist.github.com/pikeas/ea3cfca6ee49b190319f1f2da65e57c3

I think the original dtruss output may have been from a shell containing a tensorflow directory, which would have caused the output in the first Gist. Sorry for the confusion.

qbx2 (Contributor) commented May 10, 2016

Yes, it seems to be a numpy.random problem. Could you check if numpy.random works?

pikeas commented May 10, 2016

Yep, works:

$ /Users/pikeas/.virtualenvs/hnn/bin/python -c 'from numpy.random import random; print(random(5))'
[ 0.87424813  0.06297989  0.89478638  0.2749335   0.93622187]

qbx2 (Contributor) commented May 10, 2016

I think the gist is truncated?

pikeas commented May 10, 2016

@qbx2 I agree that the gist looks truncated, but I've just re-run and got the same result.

It looks like some dtruss runs stop early at the numpy.random output, but some runs reach a more useful trace:

write_nocancel(0x2, "I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.7.5.dylib locally\nNone\n\n        Returns\n        -------\n        result : MaskedArray\n            The imaginary part of the masked array.\n\n        See Also\n        ---", 0x6C)       = 108 0
getattrlist("/Users\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)      = 0 0
getattrlist("/Users/pikeas\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)       = 0 0
getattrlist("/Users/pikeas/.virtualenvs\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn/bin\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn/bin/python\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)       = 0 0
readlink("/Users/pikeas/.virtualenvs/hnn/bin/python\0", 0x7FFF51E1E930, 0x3FF)       = 9 0
getattrlist("/Users/pikeas/.virtualenvs/hnn/bin/python3.5\0", 0x7FFF93BD8DC4, 0x7FFF51E1F530)        = 0 0
getattrlist("/Users\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)      = 0 0
getattrlist("/Users/pikeas\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)       = 0 0
getattrlist("/Users/pikeas/.virtualenvs\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn/bin\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)      = 0 0
getattrlist("/Users/pikeas/.virtualenvs/hnn/bin/driver\0", 0x7FFF93BD8DC4, 0x7FFF51E1FAE0)       = -1 Err#2
stat64("/Developer/NVIDIA/CUDA-7.5/lib/libcuda.dylib\0", 0x7FFF51E1F998, 0x7FFF51E1FAE0)         = -1 Err#2
stat64("$ORIGIN/../../_solib_darwin/_U_S_Sthird_Uparty_Sgpus_Scuda_Ccudart___Uthird_Uparty_Sgpus_Scuda_Slib/libcuda.dylib\0", 0x7FFF51E1F548, 0x7FFF51E1FAE0)        = -1 Err#2
stat64("third_party/gpus/cuda/lib/libcuda.dylib\0", 0x7FFF51E1F598, 0x7FFF51E1FAE0)      = -1 Err#2
stat64("third_party/gpus/cuda/extras/CUPTI/lib/libcuda.dylib\0", 0x7FFF51E1F588, 0x7FFF51E1FAE0)         = -1 Err#2
stat64("libcuda.dylib\0", 0x7FFF51E1F5C8, 0x7FFF51E1FAE0)        = -1 Err#2
stat64("/Users/pikeas/lib/libcuda.dylib\0", 0x7FFF51E1F9A8, 0x7FFF51E1FAE0)      = -1 Err#2
stat64("/usr/local/lib/libcuda.dylib\0", 0x7FFF51E1F9A8, 0x7FFF51E1FAE0)         = -1 Err#2
stat64("/usr/lib/libcuda.dylib\0", 0x7FFF51E1F9B8, 0x7FFF51E1FAE0)       = -1 Err#2
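
In a trace like this, the interesting lines are the calls that return -1 with Err#2 (ENOENT, "no such file or directory"). A rough sketch for pulling them out of a saved dtruss log follows, assuming the line layout shown above; missing_files is a hypothetical helper, and the regular expression is tuned only to this output format.

```python
import os
import re

# Matches lines such as:
#   stat64("/usr/lib/libcuda.dylib\0", 0x7FFF..., 0x7FFF...)   = -1 Err#2
# Group 2 captures the path without the trailing literal "\0".
FAILED_CALL = re.compile(r'^(\w+)\("([^"]*?)(?:\\0)?",.*=\s*-1\s+Err#2\s*$')


def missing_files(dtruss_text):
    """Map each file basename to the paths where a lookup failed with Err#2."""
    missing = {}
    for line in dtruss_text.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match:
            path = match.group(2)
            missing.setdefault(os.path.basename(path), []).append(path)
    return missing


# Example:
# with open("dtruss.out") as handle:
#     for name, paths in missing_files(handle.read()).items():
#         print(name, "not found in", len(paths), "locations")
```

A basename that fails in every location it is tried, as libcuda.dylib does here, is a strong hint about what the loader could not find.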

pikeas commented May 10, 2016

Solved!

Gist problem - dtruss sometimes truncated its output. When I re-ran, I got a slightly longer trace that mentioned libcuda.dylib. This file is not in /Developer/NVIDIA/CUDA-7.5/lib, but it is in /usr/local/cuda/lib.

In other words, the solution is adding to my dylib export: export DYLD_LIBRARY_PATH="/Developer/NVIDIA/CUDA-7.5/lib:/usr/local/cuda/lib"
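
Before importing tensorflow, the same conclusion can be checked directly by asking the dynamic loader to open each CUDA library by name. This is a hedged sketch using the standard-library ctypes module; unloadable_libraries is a hypothetical helper, and the library names in the example are simply the ones that appear in this thread. Note that DYLD_LIBRARY_PATH has to be exported before the Python interpreter starts, since the loader reads it at process launch.

```python
import ctypes


def unloadable_libraries(names):
    """Try to load each library by name; return {name: error} for failures."""
    failures = {}
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError as exc:
            failures[name] = str(exc)
    return failures


# Example, using the library names seen in this thread:
# print(unloadable_libraries(
#     ["libcudart.7.5.dylib", "libcudnn.5.dylib", "libcuda.dylib"]))
```

An empty dictionary suggests the loader can find everything on the list; any entry names a library that still needs its directory added to the path.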

Please note that I used stock everything: CUDA from NVIDIA, Python from homebrew, numpy from pip, tensorflow from source. As far as I can tell, anyone building under Mac OS X El Capitan, and very likely Yosemite/Mavericks as well, will experience the same problem.

I strongly urge the project to create and maintain OS X build instructions.

qbx2 (Contributor) commented May 10, 2016

The setup tutorial ( https://www.tensorflow.org/versions/r0.8/get_started/os_setup.html )
says to install CUDA and cuDNN in /usr/local/cuda.
I think you missed that step.

eagleflo commented Sep 6, 2016

I experienced a very similar issue with prebuilt TensorFlow 0.10 binary and CUDA installed according to instructions.

It turns out TensorFlow wants to load libcuda.1.dylib, not the libcuda.dylib that NVIDIA's CUDA installer installed. Manually creating a symbolic link named libcuda.1.dylib that points to libcuda.dylib in /usr/local/cuda/lib fixed the issue for me.

martinianodl commented Oct 26, 2016

It also worked for me running on a MacBook Pro (Retina, 15-inch, Late 2013) + Tensorflow r0.11 + CUDA 8.0 + cuDNN v8 + Anaconda3

I just used the command:
sudo ln -s /usr/local/cuda/lib/libcuda.dylib /usr/local/cuda/lib/libcuda.1.dylib

and now it is working, with no more Segmentation fault: 11.
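
For anyone who prefers to script the workaround, the ln -s command above corresponds roughly to the sketch below. ensure_alias is a hypothetical helper; it refuses to overwrite an existing file and, for /usr/local/cuda/lib, would still need to be run with sufficient privileges.

```python
import os


def ensure_alias(lib_dir, existing="libcuda.dylib", alias="libcuda.1.dylib"):
    """Create lib_dir/alias as a symlink to lib_dir/existing if it is absent.

    Returns True if a link was created, False if alias was already present.
    """
    target = os.path.join(lib_dir, existing)
    link = os.path.join(lib_dir, alias)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if os.path.lexists(link):
        return False  # leave any existing file or link untouched
    os.symlink(target, link)
    return True


# Example (needs write access to the directory):
# ensure_alias("/usr/local/cuda/lib")
```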

dangrie158 commented Oct 30, 2016

Same here. Linking libcuda.dylib to libcuda.1.dylib fixed the problem. Maybe this step should be added to the documentation, as it seems to happen reproducibly with CUDA 8.0.

djwbrown commented Jan 6, 2017

If this is the naming convention from NVIDIA, maybe a change in the source related to stream_executor/dso_loader.cc is needed where it's getting tripped up.
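
To illustrate the kind of change being suggested, a loader could try the versioned file name first and fall back to the unversioned one. The sketch below only builds that list of names in Python; candidate_names is a hypothetical function and does not reflect how dso_loader.cc is actually written.

```python
def candidate_names(base, version=None, suffix=".dylib"):
    """List file names to try for a library, versioned name first."""
    names = []
    if version:
        names.append(f"lib{base}.{version}{suffix}")
    names.append(f"lib{base}{suffix}")
    return names


# Example: candidate_names("cuda", "1") lists the versioned name first,
# then the unversioned name that NVIDIA's installer provides.
```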
