Instructions for updating:
renamed to `run`
0%|| 0/16 [00:37<?, ?it/s]
epochs: 0%|| 0/6 [00:37<?, ?it/s]
Traceback (most recent call last):
File "example_t5.py", line 47, in <module>
trainer.train(model, strategy, tokenizer, inputs)
File "/root/ttt/ttt/t2t_trainer.py", line 227, in train
epoch_total_loss += loss.numpy()
File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
maybe_arr = self._numpy() # pylint: disable=protected-access
File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
context.async_wait()
File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
context().sync_executors()
File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
(0) Unavailable: Socket closed
(1) Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
0 successful operations.
0 derived errors ignored.
2020-10-23 19:51:06.239763: W 3876 ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1603482666.236322988","description":"Error received from peer ipv4:x.x.x.x:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2020-10-23 19:51:06.241849: W 3781 tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
The latest commit solved the following bug: