
Checkpoint cannot be found when running in distributed mode with 2 ps #394

Closed
tobyyouup opened this issue Oct 31, 2017 · 6 comments

tobyyouup commented Oct 31, 2017

Hi @rsepassi @lukaszkaiser, this is an old problem mentioned in #45; I am reopening it because I am trying to run in distributed mode again. I use 2 ps + 2 workers, with the 2 ps on two different physical machines and no shared file system between them. I found that each ps stores only part of the checkpoint. For example, there are 4 checkpoint shards in total: the ps on machine A stores shards 00 and 01, while the ps on machine B stores shards 02 and 03. The worker then reports that it cannot find "model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index".

Is there anything wrong with my setup? How can I run ps tasks on different physical machines without a shared file system?
The details of the problem are as follows:

When I run the ps and worker tasks on two machines (2 ps and 2 workers, each machine with one ps and one worker), the worker with id=0 (the first worker) shows this error:

```
INFO:tensorflow:Saving checkpoints for 0 into t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
Traceback (most recent call last):
File "/usr/anaconda2/bin/t2t-trainer", line 83, in
tf.app.run()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
return _execute_schedule(experiment, schedule)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
return task()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 960, in run
run_metadata=run_metadata))
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 368, in after_run
self._save(global_step, run_context.session)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 384, in _save
self._get_saver().save(session, self._save_path, global_step=step)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1488, in save
raise exc
tensorflow.python.framework.errors_impl.NotFoundError: t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
[[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]

Caused by op u'save/MergeV2Checkpoints', defined at:
File "/usr/anaconda2/bin/t2t-trainer", line 83, in
tf.app.run()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
schedule=FLAGS.schedule)
File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
return _execute_schedule(experiment, schedule)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
return task()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
hooks=self._train_monitors + extra_hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
monitors=hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
return func(*args, **kwargs)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
config=self._session_config
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 648, in init
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 477, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 822, in init
_WrappedSession.init(self, self._create_session())
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
return self._sess_creator.create_session()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session
self._scaffold.finalize()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 205, in finalize
self._saver.build()
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
restore_sequentially=self._restore_sequentially)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 685, in build
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 361, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 185, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
self._traceback = _extract_stack()

NotFoundError (see above for traceback): t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
[[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]
```

It says the checkpoint shard "part-00000-of-00004.index" does not exist, but the file is actually there. The listing on the first machine is:
-rw-rw-r-- 1 root root 12 Jul 15 18:41 part-00000-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 177 Jul 15 18:41 part-00000-of-00004.index
-rw-rw-r-- 1 root root 107839496 Jul 15 18:41 part-00001-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 15176 Jul 15 18:41 part-00001-of-00004.index

Shards 00002 and 00003 are on the second machine:
-rw-rw-r-- 1 root root 12 Jul 15 03:38 part-00002-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 217 Jul 15 03:38 part-00002-of-00004.index
-rw-rw-r-- 1 root root 635314176 Jul 15 03:38 part-00003-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 18960 Jul 15 03:38 part-00003-of-00004.index

The two ps tasks and the second worker run normally; only the first worker fails.
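
My understanding, as a minimal sketch (assuming the TF 1.x sharded Saver API; the variables below are made up), is that each parameter device writes its own part-XXXXX-of-XXXXX shard locally, and a single MergeV2Checkpoints op, which the trace shows is placed on /job:ps/replica:0/task:1, then merges all shards into the final checkpoint. That merge can only succeed if the ps running it can read every shard, which is impossible without a shared file system:

```python
# Minimal sketch (TF 1.x, made-up variable shapes) of how a sharded saver
# behaves with two ps tasks: each parameter device gets its own
# part-XXXXX-of-XXXXX shard, and a single MergeV2Checkpoints op (placed on
# one ps) merges them, so that ps must be able to read every shard file.
import tensorflow as tf

with tf.device("/job:ps/task:0"):
    w0 = tf.get_variable("w0", shape=[1024, 512])
with tf.device("/job:ps/task:1"):
    w1 = tf.get_variable("w1", shape=[512, 256])

# sharded=True is what produces the per-device part-* files seen above.
saver = tf.train.Saver(sharded=True)
# In a running cluster session this would trigger the merge step that fails:
# saver.save(sess,
#            "t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt",
#            global_step=0)
```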
The script used to start the ps and worker tasks is:

```
PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=./t2t_data_new
model_name=t2t_train_dis
TRAIN_DIR=$model_name/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $TRAIN_DIR

# workers:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "master"}}' t2t-trainer --master=grpc://10.150.147.74:5002 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=0 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR

TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "master"}}' t2t-trainer --master=grpc://10.150.144.48:5000 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=1 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR

# ps:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "ps"}}' t2t-trainer --master=grpc://10.150.147.74:10002 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR

TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "ps"}}' t2t-trainer --master=grpc://10.150.144.48:10000 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
```
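
For reference, a sketch of what the TF_CONFIG set for worker 0 above decodes to (values copied from the command; tf.contrib.learn reads this environment variable to build the cluster spec and identify the current task):

```python
import json
import os

# Decode the TF_CONFIG environment variable exported for worker 0 above.
tf_config = json.loads(os.environ["TF_CONFIG"])

print(tf_config["cluster"])
# {u'ps': [u'10.150.147.74:10002', u'10.150.144.48:10000'],
#  u'master': [u'10.150.147.74:5002', u'10.150.144.48:5000']}
print(tf_config["task"])
# {u'index': 0, u'type': u'master'}
```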


rsepassi commented Oct 31, 2017 via email

@tobyyouup

@rsepassi Thanks for your reply. If I want to run on multiple physical machines without AWS or GCP, just a local cluster, how can I use the same output dir on the different machines? With NFS?


rsepassi commented Nov 1, 2017 via email

@tobyyouup

@rsepassi Yes, I have solved this with NFS. Thanks for your help.
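
In case it helps others, a rough sanity check (hypothetical, not part of tensor2tensor; the marker filename is made up) that can be run on each machine to confirm the NFS-mounted output dir is visible and writable at the same path everywhere:

```python
import os
import socket
import tensorflow as tf

# The shared --output_dir used in the commands above.
output_dir = "t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base"
marker = os.path.join(output_dir, "nfs_check_" + socket.gethostname())

assert tf.gfile.Exists(output_dir), "output_dir is not mounted on this machine"
with tf.gfile.Open(marker, "w") as f:
    f.write("ok")

# After running this on both hosts, each host should list both marker files.
print(sorted(f for f in tf.gfile.ListDirectory(output_dir)
             if f.startswith("nfs_check_")))
```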


tobyyouup commented Nov 4, 2017

@rsepassi Another question: I have not seen any code related to synchronous training with the parameter servers. Shouldn't it use tf.train.SyncReplicasOptimizer (https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer) to synchronize training in a ps+worker setting?
Is there anything I missed?

@rsepassi

We use in-graph sync training, with one master and many workers. SyncReplicasOptimizer is for when you have many masters (usually 1-to-1 master-to-worker) that push updates to shared parameter servers. This issue is not the right place for that discussion though. If you'd like to ask follow-up questions, please open a new issue or ask on Gitter.
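
For contrast only, a minimal sketch of the between-graph pattern that SyncReplicasOptimizer targets, i.e. many masters pushing gradients to shared parameter servers; per the comment above this is not what tensor2tensor does, and the loss and replica counts below are made up:

```python
import tensorflow as tf

# Toy variables/loss just to make the sketch self-contained.
w = tf.get_variable("w", shape=[10])
loss = tf.reduce_mean(tf.square(w))
global_step = tf.train.get_or_create_global_step()

opt = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(1e-4),
    replicas_to_aggregate=2,   # gradients from 2 worker replicas are averaged
    total_num_replicas=2)
train_op = opt.minimize(loss, global_step=global_step)

# Each worker would also pass this hook to its MonitoredTrainingSession.
sync_hook = opt.make_session_run_hook(is_chief=True)
```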
