
TPU failing to reinitialize after eval #1202

Open
mikeymezher opened this issue Nov 5, 2018 · 9 comments

Comments

@mikeymezher
Contributor

TPUs running TensorFlow 1.11.0 fail to reinitialize after evaluation during the "continuous_train_and_eval" schedule. I'm not sure whether this is a bug in Tensor2Tensor or in TF's TPU scheduling.

This problem does not exist with TPUs running TensorFlow 1.9.0.

I'm still able to train, but I have to set the schedule to "train" and avoid evaluation, e.g. with a command like the one below.
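
For reference, the workaround looks something like this (the bucket paths, problem, and model names here are placeholders, not my actual setup):

```
t2t-trainer \
  --use_tpu \
  --cloud_tpu_name=$TPU_NAME \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=my_problem \
  --data_dir=gs://my-bucket/data \
  --output_dir=gs://my-bucket/output \
  --train_steps=100000 \
  --schedule=train
```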

Environment information

tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-metadata==0.9.0
tensorflow-serving-api==1.11.0
tensorflow-transform==0.9.0

Trace:
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-05-18:14:34
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (30) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (30) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2018-11-05-18:15:24
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 5.689253, targets/metrics-routes_packed/accuracy = 0.14732198, targets/metrics-routes_packed/accuracy_per_sequence = 0.0, targets/metrics-routes_packed/accuracy_top5 = 0.31858853, targets/metrics-routes_packed/neg_log_perplexity = -5.773135, targets_position/metrics-routes_packed/accuracy = 0.019026693, targets_position/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_position/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_position/metrics-routes_packed/neg_log_perplexity = -12.372437, targets_segmentation/metrics-routes_packed/accuracy = 0.019026693, targets_segmentation/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_segmentation/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_segmentation/metrics-routes_packed/neg_log_perplexity = -12.372437
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7897573, step = 1000 (88.605 sec)
INFO:tensorflow:global_step/sec: 1.1286
INFO:tensorflow:examples/sec: 9.02881
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: The TPU system has not been initialized.
[[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompileNumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
[[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recvclient_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]
INFO:tensorflow:Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 33, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 379, in main
execute_schedule(exp)
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 341, in execute_schedule
getattr(exp, FLAGS.schedule)()
File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 391, in continuous_train_and_eval
self._eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
rendezvous.raise_errors()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: The TPU system has not been initialized.
[[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompileNumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
[[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recvclient_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge_1039_...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]]

@afrozenator
Contributor

afrozenator commented Nov 5, 2018 via email

@BogdanDidenko

Same problem.

@mikeymezher
Contributor Author

This still seems to be a problem in T2T 1.12.0.
Is this the case internally as well? @afrozenator

@edumotya

Same here: it's able to train, but it fails to reinitialize training after evaluation. Trace:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 7 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2019-02-10-19:23:16
INFO:tensorflow:Saving dict for global step 100: global_step = 100, loss = 7.272978, mae = 2.8202152
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: gs://xxxx/model.ckpt-100
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7624865, step = 100
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: Combined status information from 9 operations:

Status code: Failed precondition [9x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]] [1x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
  	 [[{{node cluster/_variable_copy/_980}}]] [3x]
  The TPU system has not been initialized.
  	 [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
  	 [[{{node cluster/_variable_copy/_980}}]]
  	 [[{{node cluster/_variable_copy/_980}}]] [5x]
(0 successful operations.)

@dapatil211

Has there been any progress on this issue? I think I might be encountering the same problem, except while using InMemoryEvaluatorHook: https://stackoverflow.com/questions/55071546/using-inmemoryevaluatorhook-with-tpu-throws-exception
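
For context, the pattern is roughly this (a minimal sketch, not my exact code; `model_fn`, `run_config`, `train_input_fn`, and `eval_input_fn` are placeholders):

```python
import tensorflow as tf

# Placeholders: model_fn, run_config (a tf.contrib.tpu.RunConfig pointing at
# the TPU), train_input_fn and eval_input_fn are assumed to be defined.
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024,
    eval_batch_size=1024)

# Evaluate in-process every 1000 training steps.
eval_hook = tf.estimator.experimental.InMemoryEvaluatorHook(
    estimator, eval_input_fn, every_n_iter=1000)

# The first in-process eval shuts down the TPU system; the training step that
# follows then fails with "The TPU system has not been initialized."
estimator.train(train_input_fn, hooks=[eval_hook], max_steps=100000)
```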

@cwbeitel
Contributor

Same with tf-gpu 1.13.1, t2t@3b38635f12348036ea1e0166857f43a6b971ab07, and a TPU v3, running the "train_and_evaluate" schedule; the same code trains on a GPU without error.

@emukans

emukans commented Nov 6, 2019

Is there any progress with this issue?

@jpark621

Getting similar errors with TPU 1.13 while trying to implement a CheckpointSaverHook that does the same thing, roughly as sketched below.
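
Roughly, the idea is a saving listener that evaluates after each checkpoint (a sketch; `estimator`, `eval_input_fn`, and `output_dir` are placeholders):

```python
import tensorflow as tf

class EvalAfterSaveListener(tf.train.CheckpointSaverListener):
  """Runs evaluation right after each checkpoint is written."""

  def __init__(self, estimator, eval_input_fn):
    self._estimator = estimator
    self._eval_input_fn = eval_input_fn

  def after_save(self, session, global_step_value):
    tf.logging.info("Evaluating at step %d", global_step_value)
    # On TPU, this eval shuts down the TPU system; the training loop that
    # resumes afterwards then fails to compile.
    self._estimator.evaluate(self._eval_input_fn, steps=100)

saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir=output_dir,
    save_steps=1000,
    listeners=[EvalAfterSaveListener(estimator, eval_input_fn)])
```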

@Nephalen

I have the same problem.

At the end of the evaluation session, TPUInfeedOutfeedSessionHook calls a routine to shut down the TPU system. However, since the training session is not re-created (its initialization only runs once if you trigger evaluation from a saving listener or hook), the TPU system is never re-initialized. See the sketch below.
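
To illustrate the two ops involved (a rough sketch using the tf.contrib.tpu API; `tpu_master` is a placeholder grpc:// address, and this is not a fix):

```python
import tensorflow as tf

# The ops that TPUInfeedOutfeedSessionHook runs at the start and end of a
# session's lifetime.
init_op = tf.contrib.tpu.initialize_system()
shutdown_op = tf.contrib.tpu.shutdown_system()

with tf.Session(tpu_master) as sess:
  sess.run(init_op)       # run once when the session is created
  # ... enqueue infeed / run train or eval steps ...
  sess.run(shutdown_op)   # run when the session finishes (e.g. end of eval)

# The eval session runs shutdown_system() when it finishes, but the long-lived
# training session is never re-created, so initialize_system() is not run again
# and the next TPUCompile fails with FailedPreconditionError.
```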
