
TPU error in Google GCP - fixed #2

Open
wangcongcong123 opened this issue Oct 23, 2020 · 0 comments

wangcongcong123 (Owner) commented:
The latest commit fixes the following bug:

Instructions for updating:
renamed to `run`
  0%|          | 0/16 [00:37<?, ?it/s]
epochs:   0%|          | 0/6 [00:37<?, ?it/s]
Traceback (most recent call last):
  File "example_t5.py", line 47, in <module>
    trainer.train(model, strategy, tokenizer, inputs)
  File "/root/ttt/ttt/t2t_trainer.py", line 227, in train
    epoch_total_loss += loss.numpy()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1063, in numpy
    maybe_arr = self._numpy()  # pylint: disable=protected-access
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1031, in _numpy
    six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnavailableError: Socket closed
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
    context.async_wait()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
    context().sync_executors()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
    pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnavailableError: 2 root error(s) found.
  (0) Unavailable: Socket closed
  (1) Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
0 successful operations.
0 derived errors ignored.
2020-10-23 19:51:06.239763: W    3876 ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1603482666.236322988","description":"Error received from peer ipv4:x.x.x.x:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
2020-10-23 19:51:06.241849: W    3781 tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Unable to find a context_id matching the specified one (14250917626996280268). Perhaps the worker was restarted, or the context was GC'd?
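For context on the failure mode: the crash originates at `epoch_total_loss += loss.numpy()` in the training loop. Calling `.numpy()` on every step forces a blocking device-to-host transfer, so any transient TPU worker restart or gRPC hiccup surfaces as the `UnavailableError: Socket closed` above. The sketch below is a hypothetical illustration of the general pattern (not the actual commit that closed this issue): accumulate the loss as a tensor on device and sync to the host only once per epoch. `run_epoch` and `train_step` are names invented for this example.

```python
import tensorflow as tf

def run_epoch(strategy, train_step, dist_dataset):
    """Accumulate per-replica losses on device; transfer to host once.

    Keeping the running loss as a tf.Tensor avoids a blocking
    .numpy() call on every training step, which is where the
    UnavailableError in the traceback above was raised.
    """
    total_loss = tf.constant(0.0)
    num_steps = 0
    for batch in dist_dataset:
        per_replica_loss = strategy.run(train_step, args=(batch,))
        # Reduce across replicas while staying on device.
        total_loss += strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
        num_steps += 1
    # Single device-to-host transfer at the end of the epoch.
    return (total_loss / num_steps).numpy()
```

The same loop works unchanged under the default (single-device) strategy, since `strategy.run` and `strategy.reduce` degenerate to plain calls there.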
@wangcongcong123 wangcongcong123 changed the title TPU error in Google GCP TPU error in Google GCP - solved Oct 23, 2020
@wangcongcong123 wangcongcong123 changed the title TPU error in Google GCP - solved TPU error in Google GCP - fixed Oct 23, 2020