❓ Questions and Help
Hi all,
Could anyone give me a clue as to what might be going wrong? I ran this commit from this colab, which produced this output: debug run.
Some lines from it are:
```
Exception in device=TPU:0: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
	 [[{{node XRTCompile}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "ae-wavenet/train.py", line 56, in _mp_fn
    m.train(index)
  File "/content/ae-wavenet/chassis.py", line 127, in train
    loss = self.optim_step_fn()
  File "/content/ae-wavenet/chassis.py", line 95, in <lambda>
    optimizer_args={'closure': self.loss_fn}))
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 538, in optimizer_step
    loss = optimizer.step(**optimizer_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/adam.py", line 62, in step
    loss = closure()
  File "/content/ae-wavenet/chassis.py", line 178, in loss_fn
    self.run_batch()
  File "/content/ae-wavenet/chassis.py", line 170, in run_batch
    batch = next(self.data_iter)
  File "/content/ae-wavenet/chassis.py", line 34, in __next__
    vb = self.per_dev_loader.__next__()[0]
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 31, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in next
    xm.mark_step()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 477, in mark_step
    wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
RuntimeError: Invalid argument: From /job:tpu_worker/replica:0/task:0:
Computation requires more parameters (546) than supported (limit 236).
	 [[{{node XRTCompile}}]]
Writing run results to /tmp/debug_run-eef90b0a0f8e-root-0
XLA Environment:
  XRT_TPU_CONFIG=tpu_worker;0;10.74.90.234:8470
  TF_FORCE_GPU_ALLOW_GROWTH=true
  XLA_IR_DEBUG=1
  XLA_HLO_DEBUG=1
  TF_CPP_LOG_THREAD_ID=1
  TF_CPP_VMODULE=tensor=5,computation_client=5,xrt_computation_client=5,aten_xla_type=1
  XLA_SAVE_TENSORS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/graphs
  XLA_METRICS_FILE=/tmp/debug_run-eef90b0a0f8e-root-0/metrics
```
The same code has run successfully in my GTX 1070 Max-Q laptop environment with PyTorch 1.3.1.
I've never seen this error before (though it has been several months since I last used torch_xla).
Thanks in advance!
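For what it's worth, the "parameters" the error counts seem to be the tensors fed into the compiled XLA graph, which for a training step is at least the model's parameter tensors. A quick sanity check of that count, using a small stand-in model (not the actual ae-wavenet one), would be something like:

```python
import torch.nn as nn

# Stand-in model, NOT the ae-wavenet model: just illustrates counting the
# parameter tensors a module exposes, which is the quantity the XRTCompile
# error appears to bound (546 required vs. a limit of 236).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

n_param_tensors = sum(1 for _ in model.parameters())
print(n_param_tensors)  # two Linear layers -> 2 weights + 2 biases = 4
```

If that count for the real model is well above the reported limit, that would at least confirm where the 546 is coming from.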