Installer not setting rpath for MAGMA (OS X w/ GPU) #27409

Open
elbamos opened this issue Oct 4, 2019 · 19 comments
Labels
module: build (Build system issues)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: macos (Mac OS related issues)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Comments

@elbamos

elbamos commented Oct 4, 2019

🐛 Bug

The installation scripts aren't adding the MAGMA library path to the dylib's rpath. This goes back at least as far as 1.1, and still exists in the current master.

It's easily fixable post-install with install_name_tool -add_rpath /usr/local/magma/lib /path/to/libtorch.dylib (actually, in 1.1 it's the caffe_gpu dylib), but of course this should be set properly by the installer.
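For anyone who wants to script the workaround, here is a minimal Python sketch. It only wraps the install_name_tool command quoted above; the helper names (add_rpath_command, apply_rpath) are made up for illustration and are not part of PyTorch or its build scripts.

```python
import subprocess
import sys


def add_rpath_command(dylib_path, rpath_dir):
    """Build the install_name_tool invocation that appends an rpath entry.

    Hypothetical helper: it mirrors the manual workaround from this issue.
    """
    return ["install_name_tool", "-add_rpath", rpath_dir, dylib_path]


def apply_rpath(dylib_path, rpath_dir="/usr/local/magma/lib"):
    """Run the command on macOS only; return False on any other platform.

    The default directory is the MAGMA location reported in this issue,
    which is an assumption about the local install.
    """
    if sys.platform != "darwin":
        return False
    subprocess.run(add_rpath_command(dylib_path, rpath_dir), check=True)
    return True
```

Calling apply_rpath("/path/to/libtorch.dylib") after installation should have the same effect as running the command by hand. Note that install_name_tool fails if the same rpath entry is already present, so check=True will raise in that case.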

To Reproduce

Steps to reproduce the behavior:

  1. Compile 1.1 or later on OS X with GPU and MAGMA support.
  2. Launch python, import torch
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/home500/anaconda/envs/pytorch1.2/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: dlopen(/Volumes/home500/anaconda/envs/pytorch1.2/lib/python3.6/site-packages/torch/_C.cpython-36m-darwin.so, 9): Library not loaded: @rpath/libmagma.so
  Referenced from: /Volumes/home500/anaconda/envs/pytorch1.2/lib/python3.6/site-packages/torch/lib/libtorch.dylib
  Reason: image not found
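To read an error like this programmatically, a small sketch such as the following could pull out the library that dyld failed to load and the image that referenced it. The function name parse_dyld_error is hypothetical, and the regular expressions assume the message layout shown in the traceback above and paths without spaces.

```python
import re

# Patterns based on the dyld message layout shown in this issue (an assumption;
# other macOS versions may word the message differently).
_MISSING = re.compile(r"Library not loaded: (\S+)")
_REFERRER = re.compile(r"Referenced from: (\S+)")


def parse_dyld_error(message):
    """Return (missing_library, referencing_image); each is None if absent."""
    missing = _MISSING.search(message)
    referrer = _REFERRER.search(message)
    return (
        missing.group(1) if missing else None,
        referrer.group(1) if referrer else None,
    )
```

A result starting with @rpath/ means dyld searched the rpath entries of the referencing image and found the library in none of them.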

Expected behavior

import torch should not throw an exception; it should return silently and run properly.

Environment

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 10.0

OS: Mac OSX 10.13.6
GCC version: Could not collect
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GeForce GTX 1080 Ti
Nvidia driver version: 1.1.0
cuDNN version: Probably one of the following:
/usr/local/cuda/lib/libcudnn.7.dylib
/usr/local/cuda/lib/libcudnn_static.a

Versions of relevant libraries:
[pip3] numpy==1.16.4
[conda] blas 1.0 mkl
[conda] gpytorch 0.3.5 pypi_0 pypi
[conda] mkl 2019.4 233
[conda] mkl-include 2019.4 233
[conda] mkl-service 2.3.0 py36hfbe908c_0
[conda] mkl_fft 1.0.14 py36h5e564d8_0
[conda] mkl_random 1.1.0 py36ha771720_0
[conda] torch 1.1.0 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchnet 0.0.4 pypi_0 pypi
[conda] torchtext 0.4.0 pypi_0 pypi
[conda] torchvision 0.4.0a0+d31eafa pypi_0 pypi

pietern added the module: build, module: cuda, module: macos, and triaged labels on Oct 8, 2019
@pietern
Contributor

pietern commented Oct 8, 2019

Thanks for reporting. I suppose this is only an issue if MAGMA is installed in a non-system path?

Is this something you could submit a PR for?

cc @soumith for MAGMA

@elbamos
Author

elbamos commented Oct 9, 2019

Yes, I'm sure it wouldn't arise if MAGMA was installed in a system path. There isn't a package installer on OSX that supports MAGMA, which needs to get compiled against the system's CUDA anyway. MAGMA from source wants to install at /usr/local/magma.

The pytorch build process knows to look for, and properly finds, MAGMA at that path.

The pytorch build process has become so complex at this point, I'm reluctant to submit a PR that would touch it. Also, since not many of the recent master builds are passing CI, I wouldn't really have an effective way of testing the PR against platforms other than my own.

@pietern
Contributor

pietern commented Nov 20, 2019

@soumith I think you're more familiar with magma et al. Who should take a look at this?

@soumith
Member

soumith commented Nov 21, 2019

in terms of cmake / rpath, maybe @xuhdev would know.

@xuhdev
Collaborator

xuhdev commented Nov 21, 2019

Could you try from the latest source? A lot of things have changed since then, and I doubt whether it still exists in the latest version. For the old version, I don't think it hurts to stick to your workaround (i.e., install_name_tool -add_rpath /usr/local/magma/lib /path/to/libtorch.dylib).

@elbamos
Author

elbamos commented Nov 22, 2019

@xuhdev I just tested fea963d, and the issue is still there.

I think what's going on is that the installer expects magma to have been installed via a python package and therefore to be accessible from the python library path.

@xuhdev
Collaborator

xuhdev commented Nov 22, 2019

Thanks for the info. I'll try to look into this on Monday.

@xuhdev
Collaborator

xuhdev commented Nov 25, 2019

Did you install from the source? If so, would you mind showing the output of

grep MAGMA_LIBRARIES build/CMakeCache.txt

@elbamos
Author

elbamos commented Dec 1, 2019

@xuhdev

MAGMA_LIBRARIES:FILEPATH=/usr/local/magma/lib/libmagma.so

That is where they live.
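As an illustration of how that cache entry relates to the rpath that would be needed, here is a rough Python sketch that reads NAME:TYPE=VALUE lines from CMakeCache.txt text and derives the directory of each listed library. The helper names are hypothetical, and treating the value as a semicolon-separated list is an assumption based on how CMake stores lists.

```python
import posixpath


def cmake_cache_values(cache_text, name):
    """Return the values of every NAME:TYPE=VALUE entry whose NAME matches."""
    values = []
    for line in cache_text.splitlines():
        line = line.strip()
        # CMakeCache.txt uses '#' and '//' for comment lines.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        var = key.partition(":")[0]
        if var == name:
            values.append(value)
    return values


def candidate_rpath_dirs(cache_text, name="MAGMA_LIBRARIES"):
    """Directories containing the libraries listed under the given cache entry."""
    dirs = []
    for value in cmake_cache_values(cache_text, name):
        for path in value.split(";"):
            directory = posixpath.dirname(path)
            if directory and directory not in dirs:
                dirs.append(directory)
    return dirs
```

With the entry quoted above, candidate_rpath_dirs would yield /usr/local/magma/lib, which is the same directory the install_name_tool workaround adds.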

@xuhdev
Collaborator

xuhdev commented Dec 2, 2019

(Sorry for asking more questions; because I can't reproduce this issue, I have to rely on your info.)

Could you show the path printed from otool -L /path/to/libtorch.dylib, both before and after you run the install_name_tool workaround?

@elbamos
Author

elbamos commented Dec 2, 2019

@xuhdev Hey I'm happy to help any way I can! (Sorry for the delay to your prior question - I was out of town on business.)

Here's what I get from a fresh compile:

/Volumes/home500/anaconda/envs/pytorch1.3/lib/python3.6/site-packages/torch/lib/libtorch.dylib:
	@rpath/libtorch.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcudart.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libiomp5.dylib (compatibility version 5.0.0, current version 5.0.0)
	@rpath/libmkl_intel_lp64.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libmkl_intel_thread.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libmkl_core.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.50.4)
	@rpath/libc10_cuda.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libnvrtc.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libnvToolsExt.1.dylib (compatibility version 0.0.0, current version 1.0.0)
	@rpath/libcusparse.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libcurand.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libmagma.so (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcudnn.7.dylib (compatibility version 0.0.0, current version 7.6.4)
	@rpath/libc10.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcufft.10.0.dylib (compatibility version 0.0.0, current version 10.0.145)
	@rpath/libcublas.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)

And after install_name_tool, of course the otool output doesn't change:

/Volumes/home500/anaconda/envs/pytorch1.3/lib/python3.6/site-packages/torch/lib/libtorch.dylib:
	@rpath/libtorch.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcudart.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libiomp5.dylib (compatibility version 5.0.0, current version 5.0.0)
	@rpath/libmkl_intel_lp64.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libmkl_intel_thread.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libmkl_core.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.50.4)
	@rpath/libc10_cuda.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libnvrtc.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libnvToolsExt.1.dylib (compatibility version 0.0.0, current version 1.0.0)
	@rpath/libcusparse.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libcurand.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	@rpath/libmagma.so (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcudnn.7.dylib (compatibility version 0.0.0, current version 7.6.4)
	@rpath/libc10.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libcufft.10.0.dylib (compatibility version 0.0.0, current version 10.0.145)
	@rpath/libcublas.10.0.dylib (compatibility version 0.0.0, current version 10.0.130)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
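The two listings are identical because otool -L prints the install names of the linked libraries, not the rpath search directories that install_name_tool -add_rpath modifies. A small sketch of how one might list the @rpath-relative entries from such output and check them against a set of directories follows. The function names are hypothetical, and the check ignores @loader_path and @executable_path expansion, which dyld would perform.

```python
import os
import posixpath


def rpath_relative_deps(otool_L_text):
    """Names listed as @rpath/<name> in `otool -L` output.

    The first output line is the inspected file followed by a colon, so it is
    skipped. For a dylib, the first entry is usually its own install name.
    """
    deps = []
    for line in otool_L_text.splitlines()[1:]:
        entry = line.strip()
        if entry.startswith("@rpath/"):
            name = entry.split(" (compatibility", 1)[0]
            deps.append(name[len("@rpath/"):])
    return deps


def unresolved(deps, rpath_dirs, exists=os.path.exists):
    """Dependencies that are not found in any of the given directories."""
    return [
        dep for dep in deps
        if not any(exists(posixpath.join(d, dep)) for d in rpath_dirs)
    ]
```

Passing the directories from the LC_RPATH load commands as rpath_dirs gives a rough picture of which libraries dyld would fail to locate.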

@xuhdev
Collaborator

xuhdev commented Dec 2, 2019

Oops, I'm sorry, I meant otool -l /path/to/libtorch.dylib

@elbamos
Author

elbamos commented Dec 2, 2019

Thought you might...

Before:
before.txt

After:
after.txt

And the diff is:

4c4
<  0xfeedfacf 16777223          3  0x00           6    33       3224 0x00918085
---
>  0xfeedfacf 16777223          3  0x00           6    34       3264 0x00918085
494a495,498
> Load command 33
>           cmd LC_RPATH
>       cmdsize 40
>          path /usr/local/magma/lib/ (offset 12)

@xuhdev
Collaborator

xuhdev commented Dec 4, 2019

When you built PyTorch, did you have DYLD_LIBRARY_PATH, DYLD_FALLBACK_LIBRARY_PATH, or LIBRARY_PATH set? Is the path to libmagma.so in any of these variables?

@elbamos
Author

elbamos commented Dec 5, 2019

Nope, and nope.

[pytorch1.3] master(+1/-1)+* ± env | grep LIBRARY
CAML_LD_LIBRARY_PATH=/Users/aelberg/.opam/system/lib/stublibs:/usr/local/lib/ocaml/stublibs
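The same check can be expressed as a short Python sketch that looks at the three variables named in the question and reports which of them, if any, contain a given directory. The function name vars_containing is hypothetical; the variable list is taken from the question above.

```python
import os

# The variables asked about in this thread.
DYLD_VARS = ("DYLD_LIBRARY_PATH", "DYLD_FALLBACK_LIBRARY_PATH", "LIBRARY_PATH")


def vars_containing(directory, env=None):
    """Map each listed variable that includes `directory` to its path entries."""
    env = os.environ if env is None else env
    target = directory.rstrip("/")
    hits = {}
    for var in DYLD_VARS:
        entries = [p for p in env.get(var, "").split(":") if p]
        if any(p.rstrip("/") == target for p in entries):
            hits[var] = entries
    return hits
```

With the environment shown above, where only CAML_LD_LIBRARY_PATH is set, vars_containing("/usr/local/magma/lib") returns an empty dict.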

@xuhdev
Collaborator

xuhdev commented Dec 5, 2019

I have no idea what's going on in your situation. Your RPATH is empty after the build. I'll probably revisit this after I have some other thoughts. Thanks for the info so far, though!
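For reference, the "RPATH is empty" observation can be reproduced from otool -l output with a sketch like the one below, which collects the path of every LC_RPATH load command. The function name is hypothetical, and the parsing assumes the layout visible in the diff earlier in this thread (a "cmd LC_RPATH" line followed by a "path ... (offset N)" line).

```python
import re

# Matches lines such as: "         path /usr/local/magma/lib/ (offset 12)"
_PATH_LINE = re.compile(r"^\s*path (.+) \(offset \d+\)\s*$")


def rpaths_from_otool_l(text):
    """Return the rpath directories found in `otool -l` output, in order."""
    rpaths = []
    in_rpath = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Load command "):
            in_rpath = False          # a new load command begins
        elif stripped == "cmd LC_RPATH":
            in_rpath = True           # the next path line belongs to an rpath
        elif in_rpath:
            match = _PATH_LINE.match(line)
            if match:
                rpaths.append(match.group(1))
                in_rpath = False
    return rpaths
```

An empty list for the freshly built libtorch.dylib, and a list containing /usr/local/magma/lib/ after the workaround, would match the before/after diff posted above.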

@elbamos
Author

elbamos commented Dec 5, 2019 via email

@xuhdev
Collaborator

xuhdev commented Dec 5, 2019

@elbamos Sure; let's see whether we can sniff something out there.

@elbamos
Author

elbamos commented Dec 12, 2019

@xuhdev Here you go:
buildlog.zip

4 participants