
TPU crash during importing Trainer from transformers #6990

Closed
asiff00 opened this issue Apr 29, 2024 · 6 comments
asiff00 commented Apr 29, 2024

🐛 Bug

The Colab/Kaggle notebook crashes while trying to import 'Trainer' from the transformers library.


To Reproduce

    !pip install transformers
    !pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
    from transformers import Trainer

or

    !pip install transformers
    !pip install torch_xla[tpu]
    from transformers import Trainer

or

    !pip install transformers
    !pip install torch_xla
    from transformers import Trainer

Steps to reproduce the behavior:

  1. Install xla
  2. Import `Trainer` from the transformers library.
  3. Your environment crashes with the following error:
    ERROR: Unknown command line flag 'xla_latency_hiding_scheduler_rerun'

Environment

  • Reproducible on XLA backend [TPU]:
  • torch_xla version: 2.2.0+libtpu
JackCaoG (Collaborator) commented:

The flag comes from https://github.com/pytorch/xla/blob/r2.2/torch_xla/__init__.py#L43-L44. I am trying to get my Kaggle TPU to see if I can repro this.

JackCaoG (Collaborator) commented:

OK, I was able to confirm that it did crash. I tried installing the new torch 2.3 on my TPU VM with

    pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 -f https://storage.googleapis.com/libtpu-releases/index.html

and this seems to work

>>> import torch
>>> import torch_xla
>>> t1 = torch.randn(5,5, device='xla:0')
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1714426096.507062 2529487 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/jackcao/.local/lib/python3.8/site-packages/libtpu/libtpu.so
I0000 00:00:1714426096.507156 2529487 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1714426096.507162 2529487 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.46. The framework PJRT API version is 0.46.
>>> t1
tensor([[ 1.1453, -0.9900,  0.5783,  1.7081,  1.1962],
        [ 0.6340,  1.6611,  0.2455, -0.7434,  3.1036],
        [-1.1664,  0.5326,  1.7286, -0.7094,  1.1267],
        [ 1.2665, -0.2168, -3.1145, -1.9214, -1.2044],
        [ 1.8507,  0.0055,  1.2275, -0.2037, -0.7610]], device='xla:0')
>>> torch_xla.__version__
'2.3.0'
>>> from transformers import Trainer
>>> Trainer
<class 'transformers.trainer.Trainer'>

There is also an in-flight PR to update the default torch version to 2.3. Do you mind manually installing 2.3 for now?
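As a sanity check after reinstalling, the reported version string can be compared against the fixed release; a minimal sketch (the helper `needs_upgrade` and the 2.3.0 cutoff are assumptions for illustration, not part of torch_xla's API):

```python
# Check whether an installed torch_xla version predates the 2.3 release
# that this thread reports as working. Local-version suffixes such as
# "+libtpu" are stripped before comparing.

def needs_upgrade(version_str: str, minimum=(2, 3, 0)) -> bool:
    """Return True if version_str is older than `minimum`."""
    base = version_str.split("+")[0]               # "2.2.0+libtpu" -> "2.2.0"
    parts = tuple(int(p) for p in base.split("."))
    return parts < minimum

print(needs_upgrade("2.2.0+libtpu"))  # version from this report -> True
print(needs_upgrade("2.3.0"))         # fixed release -> False
```

In a notebook, `torch_xla.__version__` would be the string to pass in.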

JackCaoG (Collaborator) commented:

Ah, I know: Kaggle preinstalls TensorFlow, and HF transformers will try to import TensorFlow, which loads TensorFlow's libtpu, which is not compatible with PyTorch/XLA.

    !yes | pip3 uninstall tensorflow

fixed the issue on my end.
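A notebook can guard against this conflict before importing transformers; a minimal sketch (the helper name `tensorflow_present` and the warning text are assumptions for illustration, not the warning pytorch/xla later added):

```python
import importlib.util

def tensorflow_present() -> bool:
    """Return True if a 'tensorflow' package is importable in this environment."""
    return importlib.util.find_spec("tensorflow") is not None

# Warn before transformers gets a chance to import TensorFlow and load
# its bundled libtpu, which conflicts with torch_xla's libtpu on TPU.
if tensorflow_present():
    print("Warning: tensorflow is installed; its bundled libtpu may conflict "
          "with torch_xla on TPU. Consider `pip uninstall tensorflow`.")
```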

JackCaoG (Collaborator) commented:

I will assign this bug to @will-cromar to add a warning message that makes this clearer in future releases.

asiff00 (Author) commented Apr 29, 2024


This specific problem was solved with (#6990 (comment)):

    !yes | pip3 uninstall tensorflow
    !pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 -f https://storage.googleapis.com/libtpu-releases/index.html

I'll test a few other transformers classes before closing it.

will-cromar (Collaborator) commented:

Let us know if you're still having issues.
