updating the torch and torch_xla wheels in the colab notebook #572

Closed
backpropper opened this issue Apr 6, 2019 · 24 comments

@backpropper

backpropper commented Apr 6, 2019

The pip wheels listed here seem to be outdated (also discussed with @asuhan on Slack and in #528). I am using the nightly builds of TF (1.14.1).

I get the following error when importing torch:
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

These wheels seem to be compiled against CUDA libraries. Do you have corresponding CPU-only versions?

Also, is there an official page listing the nightly builds of torch_xla?

I also tried building from source, but that didn't help either.
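
For anyone hitting the same ImportError: one way to confirm that an installed wheel was linked against CUDA, without importing it, is to inspect its shared libraries. This is just a sketch; the site-packages layout and library names below are assumptions and vary between torch versions.

# Locate the installed torch package without importing it (the import itself fails here)
TORCH_DIR="$(pip show torch | awk '/^Location:/ {print $2}')/torch"
# List any CUDA runtime dependencies baked into the extension libraries;
# if libcudart shows up, the wheel is a CUDA build rather than a CPU-only one.
ldd "$TORCH_DIR"/lib/*.so 2>/dev/null | grep -i cudart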

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

Yes, the PIPs are for Colab, which has proper CUDA libraries.

What issue did you get compiling from source?

@backpropper
Author

backpropper commented Apr 7, 2019

But even with the CUDA 10 libraries, importing torch gives the same error. So do you mean that I cannot use these pip wheels for normal runs outside Colab? Is there an updated version of the wheels available, and is that logged somewhere?

@backpropper
Author

So does this mean I also need to install tf-nightly-gpu?

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

Are you planning to use Colab, or Cloud TPU?

If you have gotten TF nightly whitelisting, it must be the latter, so I suggest you build from source for now.
You did not mention the error you got when building from source...

@backpropper
Author

Yes I am using the Cloud TPU.

When I build from source, it installs fine, although I do have to disable CUDA, otherwise it gives an error similar to this one. But after that, when I try to run test/test_train_mnist_tensor.py, I get

return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: Must not create a new variable from a variable, use its .data()

and test/test_operations.py gives a Segmentation Fault

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

Yes, you have to build with NO_CUDA=1 if you do not have a CUDA environment (this is described in the PT build-from-source document).

Do you have a deeper stack trace for the above error?
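
For reference, a CPU-only source build roughly follows the steps below. This is only a sketch: the checkout layout and flags (NO_CUDA, where the xla repo lives) are assumptions here, and the authoritative steps are the PT and PT/XLA build-from-source documents.

# Sketch of a CPU-only build of PyTorch plus torch_xla
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
NO_CUDA=1 python setup.py install      # build PyTorch without CUDA support

# build torch_xla against the PyTorch checkout above
git clone --recursive https://github.com/pytorch/xla
cd xla
python setup.py install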

@backpropper
Author

And if I install the nightly GPU builds of TF, PyTorch, and torch_xla, I get this error while importing torch_xla:
ImportError: .....python3.6/site-packages/_XLAC.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit15specializeUndefERNS0_5GraphE

No, I do have the CUDA drivers installed, but it still gave that error, so I used USE_CUDA=False.
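
An undefined-symbol error like this usually means the _XLAC extension was compiled against a different PyTorch build than the one installed. One quick check (a sketch; library names and paths are assumptions and differ across versions) is to look for the missing symbol in the installed libtorch:

# Does the installed libtorch export the symbol _XLAC is asking for?
TORCH_LIB="$(pip show torch | awk '/^Location:/ {print $2}')/torch/lib"
nm -D "$TORCH_LIB"/libtorch*.so* | grep specializeUndef \
    || echo "symbol not found: torch and torch_xla need to be rebuilt together"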

@backpropper
Author

I can try again (building from source) if there's no other option (i.e., no nightly pips to use). Is the current master stable?

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

Until we have streamlined the pip wheel building, I suggest building from source.
Since we use the PT C++ APIs, the PT/XLA code base is tightly coupled with the PT one, so the PT wheel and the PT/XLA one MUST be built together.

The current master is as stable as the older pip wheels, but you get the new bits as well.
We do not have a stable/development release process yet.

@backpropper
Author

Also, just to verify: do I need to have TensorFlow installed before building PyTorch and XLA (I know that XLA compiles it from source)? And does it work with TensorBoard?

@backpropper
Author

Also, what Python version is recommended? Is 3.7 supported?

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

No, the PT/XLA repo carries the TF code as a submodule.

But if you want to use TF standalone, then yes, you need it of course.
For PT/XLA only, you do not need to install anything TF-related.

TensorBoard? I am not sure PT produces model checkpoints that are compatible with the TF ones.

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

We use 3.6 and it is known to work.
I suggest 3.6 if you can choose.
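
For completeness, a Python 3.6 environment for the build might look like the following; conda is just one option here (an assumption, not a requirement), and any clean environment with a 3.6 interpreter should do.

# Example only: create and use a Python 3.6 environment for the source build
conda create -y -n pytorch-xla python=3.6
conda activate pytorch-xla
python --version   # should report 3.6.x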

@backpropper
Author

What I meant was: would having the TF binary installed separately interfere with the PT/XLA installation?

@backpropper
Author

Also, I plan to install both repos in develop mode, since xla is constantly being updated.

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

No, you can have TF installed, and PT/XLA, and they will not interfere.

@backpropper
Author

@asuhan said otherwise. He also advised me to install using COMPILE_PARALLEL=0.

@backpropper
Author

Should I set NO_DISTRIBUTED=1 too?

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

COMPILE_PARALLEL=0 might be needed, but only if your PT/XLA build hangs.
We have seen this happen on some machines; it might be 3.7-related.

I do not set NO_DISTRIBUTED=1 and it works for me.
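
So for a hanging build, the retry would look roughly like this (a sketch; COMPILE_PARALLEL is the flag mentioned in this thread and only affects the PT/XLA build):

# If the PT/XLA build hangs, retry it single-threaded
cd pytorch/xla
COMPILE_PARALLEL=0 python setup.py install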

@backpropper
Author

Yes, it hung for me as well earlier.

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

Then use COMPILE_PARALLEL=0

@dlibenzi
Collaborator

dlibenzi commented Apr 7, 2019

As far as TF goes, we build the TF lib statically, so we carry no dependency on libtensorflow.so:

(pytorch) dlibenzi@dlibenzi2:~/google-git/pytorch/xla$ ldd build/lib.linux-x86_64-3.6/torch_xla/lib/libxla_computation_client.so 
	linux-vdso.so.1 (0x00007ffc8ed8e000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f8f3c1a8000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8f3bea4000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8f3bc87000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f8f3ba7f000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8f3b6fa000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8f3b4e2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8f3b143000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f8f475d8000)

@backpropper
Author

Cool, thanks!

@ailzhang
Contributor

ailzhang commented Sep 5, 2019

Closing this issue as resolved. Please feel free to reopen if you have follow-up questions.

@ailzhang ailzhang closed this as completed Sep 5, 2019