Runs on GPU, error on TPU: Computation requires more parameters (546) than supported (limit 236) #1963

@hrbigelow

Description

❓ Questions and Help

Hi all,

Could anyone give me a clue as to what might be going wrong? I ran this commit from this colab,

which produced this output: debug run

Some lines from it are:

Exception in device=TPU:0: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
         [[{{node XRTCompile}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "ae-wavenet/train.py", line 56, in _mp_fn
    m.train(index)
  File "/content/ae-wavenet/chassis.py", line 127, in train
    loss = self.optim_step_fn()
  File "/content/ae-wavenet/chassis.py", line 95, in <lambda>
    optimizer_args={'closure': self.loss_fn}))
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 538, in optimizer_step
    loss = optimizer.step(**optimizer_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 62, in step
    loss = closure()
  File "/content/ae-wavenet/chassis.py", line 178, in loss_fn
    self.run_batch()
  File "/content/ae-wavenet/chassis.py", line 170, in run_batch
    batch = next(self.data_iter)
  File "/content/ae-wavenet/chassis.py", line 34, in __next__
    vb = self.per_dev_loader.__next__()[0]
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 477, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
         [[{{node XRTCompile}}]]
Writing run results to /tmp/debug_run-eef90b0a0f8e-root-0
XLA Environment:
  XRT_TPU_CONFIG=tpu_worker;0;10.74.90.234:8470
  TF_FORCE_GPU_ALLOW_GROWTH=true
  XLA_IR_DEBUG=1
  XLA_HLO_DEBUG=1
  TF_CPP_LOG_THREAD_ID=1
  TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1
  XLA_SAVE_TENSORS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/graphs
  XLA_METRICS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/metrics
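For context, the traceback shows the closure-style optimizer step: chassis.py passes `optimizer_args={'closure': self.loss_fn}` to `xm.optimizer_step`, so the loss function (which also pulls the next batch and hits `xm.mark_step()`) runs inside `Adam.step`. Here is a minimal plain-Python sketch of that closure pattern, with hypothetical names (`ToyOptimizer`, `loss_fn`) standing in for `torch.optim.Adam` and the real `loss_fn`:

```python
class ToyOptimizer:
    """Mimics torch.optim.Optimizer.step(closure): the closure re-evaluates
    the loss (and, in real code, recomputes gradients) before the update."""

    def __init__(self, params, lr=0.1):
        # params: list of (value, gradient) pairs; a stand-in for tensors
        self.params = params
        self.lr = lr

    def step(self, closure):
        loss = closure()  # forward pass; in the issue this also fetches a batch
        # plain gradient-descent update in place of Adam's update rule
        self.params = [(p - self.lr * g, g) for p, g in self.params]
        return loss


opt = ToyOptimizer([(1.0, 2.0)])

def loss_fn():
    # in chassis.py this calls run_batch(), which advances the TPU
    # parallel loader and triggers xm.mark_step() / graph compilation
    return sum(p * p for p, _ in opt.params)

loss = opt.step(loss_fn)  # loss computed before the parameter update
```

This is only to illustrate where in the call chain the XLA compilation (and hence the parameter-limit error) is triggered; it is not the actual torch_xla code path.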

The same code has run successfully in my GTX 1070 Max-Q laptop environment with PyTorch version 1.3.1.

I've never seen this error before (though it has been several months since I last used torch_xla).

Thanks in advance!
