TPU failing to reinitialize after eval #1202
Comments
We see this error internally as well -- best to workaround it for now as
you've done, will try to get to the bottom of this.
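One way to structure such a workaround (a minimal sketch, not Tensor2Tensor code; `train_fn` and `eval_fn` are hypothetical stand-ins for `estimator.train` and `estimator.evaluate`) is to alternate explicit train and eval calls, so each phase runs in its own session instead of relying on the TPU re-initialization inside `continuous_train_and_eval`:

```python
def alternate_train_and_eval(train_fn, eval_fn, total_steps, eval_every):
    """Train in chunks of `eval_every` steps, evaluating between chunks.

    train_fn(steps) and eval_fn() are placeholders for estimator.train /
    estimator.evaluate; each call opens a fresh session, so the TPU system
    is initialized from scratch every time.
    """
    step = 0
    while step < total_steps:
        chunk = min(eval_every, total_steps - step)
        train_fn(chunk)
        step += chunk
        eval_fn()
    return step
```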
(On Mon, Nov 5, 2018, Michael Mezher wrote the original report, quoted in full in the issue body below.)

Afroz Mohiuddin
Same problem.
This still seems to be a problem in T2T 1.12.0.
The same here: it's able to train, but it fails to reinitialize training after evaluation. Trace:
Has there been any progress on this issue? I think I might be encountering the same problem, except while using InMemoryEvaluatorHook: https://stackoverflow.com/questions/55071546/using-inmemoryevaluatorhook-with-tpu-throws-exception
Same with tf-gpu 1.13.1, t2t@3b38635f12348036ea1e0166857f43a6b971ab07, TPU v3, while running schedule "train_and_evaluate"; the same code trains on a GPU without error.
Is there any progress with this issue?
Getting similar errors with TPU on TF 1.13 while trying to implement a CheckpointSaverHook that does the same thing.
I have the same problem. At the end of the evaluation session, the
TPUs running tensorflow 1.11.0 are failing to reinitialize after evaluation during "continuous_train_and_eval" schedule. Not sure if this is a bug in Tensor2Tensor or TF's scheduling for TPUs.
This problem does not exist with TPUs running tensorflow 1.9.0
Still able to train, but have to set schedule to "train" and avoid evaluation.
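The schedule flag mentioned above selects which experiment method runs; the traceback's `getattr(exp, FLAGS.schedule)()` dispatch can be sketched like this (a toy illustration only, not the actual trainer code), which is why `--schedule=train` skips the failing eval path entirely:

```python
# Toy sketch of t2t_trainer's schedule dispatch (mirroring the
# `getattr(exp, FLAGS.schedule)()` call visible in the traceback);
# the class and method bodies are illustrative, not real T2T code.
class Experiment:
    def train(self):
        # Workaround path: training only, so the TPU system is never
        # shut down and re-initialized mid-run.
        return "train"

    def continuous_train_and_eval(self):
        # Path that hits "The TPU system has not been initialized"
        # when training resumes after an eval pass.
        return "train+eval"

def execute_schedule(exp, schedule):
    return getattr(exp, schedule)()
```

With `--schedule=train`, the dispatch never enters the eval path, matching the workaround described above.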
Environment information
tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-metadata==0.9.0
tensorflow-serving-api==1.11.0
tensorflow-transform==0.9.0
Trace:
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-05-18:14:34
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (30) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (30) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2018-11-05-18:15:24
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 5.689253, targets/metrics-routes_packed/accuracy = 0.14732198, targets/metrics-routes_packed/accuracy_per_sequence = 0.0, targets/metrics-routes_packed/accuracy_top5 = 0.31858853, targets/metrics-routes_packed/neg_log_perplexity = -5.773135, targets_position/metrics-routes_packed/accuracy = 0.019026693, targets_position/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_position/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_position/metrics-routes_packed/neg_log_perplexity = -12.372437, targets_segmentation/metrics-routes_packed/accuracy = 0.019026693, targets_segmentation/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_segmentation/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_segmentation/metrics-routes_packed/neg_log_perplexity = -12.372437
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7897573, step = 1000 (88.605 sec)
INFO:tensorflow:global_step/sec: 1.1286
INFO:tensorflow:examples/sec: 9.02881
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: The TPU system has not been initialized.
[[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompileNumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
[[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recvclient_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]
INFO:tensorflow:Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 33, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 379, in main
execute_schedule(exp)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 341, in execute_schedule
getattr(exp, FLAGS.schedule)()
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 391, in continuous_train_and_eval
self._eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
rendezvous.raise_errors()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: The TPU system has not been initialized.
[[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompileNumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
[[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recvclient_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]