ERROR:tensorflow: Failed to close session after error.Other threads may hang. #44824

Closed
etetteh opened this issue Nov 13, 2020 · 6 comments
Labels
comp:tpus tpu, tpuestimator type:support Support issues

Comments

@etetteh

etetteh commented Nov 13, 2020

I am trying to pretrain an ELECTRA base model, but I keep getting this output:

Running training

2020-11-13 08:00:18.044763: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Model is built!
2020-11-13 08:00:48.956655: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node _inference_tf_data_experimental_map_and_batch_69}} Key: segment_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[input_pipeline_task0/while/IteratorGetNext]]
ERROR:tensorflow:Closing session due to error From /job:worker/replica:0/task:0:
{{function_node _inference_tf_data_experimental_map_and_batch_69}} Key: segment_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[input_pipeline_task0/while/IteratorGetNext]]
2020-11-13 08:01:08.642776: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642525410","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-11-13 08:01:08.642779: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642549072","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
ERROR:tensorflow:

Failed to close session after error.Other threads may hang.

2020-11-13 08:01:50.857700: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node _inference_tf_data_experimental_map_and_batch_69}} Key: segment_ids. Can't parse serialized Example.
[[{{node ParseSingleExample/ParseSingleExample}}]]
[[input_pipeline_task0/while/IteratorGetNext]]
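The `Key: segment_ids. Can't parse serialized Example.` message is often a sign that the stored feature does not have the length the input pipeline expects, for example when the TFRecords were built with a different maximum sequence length than the one used for training. The thread does not confirm this as the cause, so the following is only a minimal sketch for checking it. The feature names other than `segment_ids`, the expected length, and the helper names are all assumptions for illustration.

```python
# Feature names are assumed (BERT-style); only "segment_ids" appears in the log.
SEQ_KEYS = ("input_ids", "input_mask", "segment_ids")


def find_length_mismatches(features, expected_len, keys=SEQ_KEYS):
    """Return {key: actual_length} for keys whose length differs from expected_len.

    A missing key is reported with a length of None.
    """
    problems = {}
    for key in keys:
        values = features.get(key)
        actual = None if values is None else len(values)
        if actual != expected_len:
            problems[key] = actual
    return problems


def iter_int64_features(tfrecord_path):
    """Yield the int64 features of each record as a dict of Python lists."""
    # Imported lazily so the pure-Python check above works without TensorFlow.
    import tensorflow as tf

    for raw in tf.compat.v1.io.tf_record_iterator(tfrecord_path):
        example = tf.train.Example.FromString(raw)
        yield {
            name: list(feature.int64_list.value)
            for name, feature in example.features.feature.items()
        }


def scan(tfrecord_path, expected_len, limit=100):
    """Print the first `limit` records whose sequence features have the wrong length."""
    for index, features in enumerate(iter_int64_features(tfrecord_path)):
        if index >= limit:
            break
        problems = find_length_mismatches(features, expected_len)
        if problems:
            print("record", index, "has unexpected lengths:", problems)
```

Calling `scan("<path_to_tfrecord>", expected_len)` with the sequence length the model is configured for would list any records that do not match. An empty result would point away from a length mismatch and toward another cause, such as a corrupted file.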

@etetteh etetteh added the type:feature Feature requests label Nov 13, 2020
@Saduf2019 Saduf2019 assigned Saduf2019 and unassigned ravikyram Nov 13, 2020
@Saduf2019
Contributor

@etetteh
The issue template has not been filled in. Could you please complete it? It helps us analyse the issue. In particular, we need the TensorFlow version, the steps you followed before you ran into this error, and standalone code that reproduces it.

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Nov 13, 2020
@etetteh
Author

etetteh commented Nov 13, 2020

I am using TensorFlow 1.15.4 with a Google Cloud TPU running the same version on GCP.
I am also using Python 3.6.

import logging
import tensorflow as tf
 
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s :  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

Essentially, I am running the following command:

python run_pretraining.py \
    --data-dir <my_data_dir> \
    --model-name <my_model_name> \
    --hparams '{"model_size": "base", "use_tpu": true, "num_tpu_cores": 8, "tpu_name": "<my_tpu_name>", "tpu_zone": "us-central1-f", "gcp_project": "<my_gcp_id>"}'
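Assuming `run_pretraining.py` parses the `--hparams` string with `json.loads` (an assumption about the script, not something this thread confirms), the value has to be strict JSON: lowercase `true`/`false`, and double quotes around every key and string value. A minimal sketch for checking a string before launching a TPU job follows; `parse_hparams` is a hypothetical helper, not part of ELECTRA.

```python
import json


def parse_hparams(raw):
    """Parse an --hparams string as strict JSON and return it as a dict."""
    try:
        hparams = json.loads(raw)
    except json.JSONDecodeError as err:
        # Re-raise with a message that points at the offending column.
        raise ValueError(
            f"--hparams is not valid JSON: {err.msg} at column {err.colno}"
        ) from err
    if not isinstance(hparams, dict):
        raise ValueError("--hparams must be a JSON object")
    return hparams


# Strict JSON: lowercase true, quoted keys.
good = '{"model_size": "base", "use_tpu": true, "num_tpu_cores": 8}'
# Python-style True is not valid JSON and is rejected.
bad = '{"model_size": "base", "use_tpu": True}'

print(parse_hparams(good))
try:
    parse_hparams(bad)
except ValueError as err:
    print(err)
```

Running the check locally costs nothing and surfaces quoting mistakes before any TPU time is spent.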

@Saduf2019
Contributor

@etetteh
Could you please upgrade to TensorFlow 2.x, as 1.x is no longer supported, and let us know if you still face any issues?

@etetteh
Author

etetteh commented Nov 13, 2020

@Saduf2019 The Google ELECTRA code base is written for TensorFlow 1.15. Or do you mean that I should upgrade the version of the TPU?

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Nov 15, 2020
@Saduf2019 Saduf2019 added the comp:tpus tpu, tpuestimator label Dec 10, 2020
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 Dec 10, 2020
@ymodak ymodak added type:support Support issues and removed type:feature Feature requests labels Dec 14, 2020
@ymodak
Contributor

ymodak commented Dec 14, 2020

Closing this issue since it has been resolved on another thread. Thanks!

@ymodak ymodak closed this as completed Dec 14, 2020
