checkpoint cannot be found when running distributed mode with 2 ps #394
Comments
TensorFlow expects there to be a single shared output directory. No known
way around that. If you’re using AWS, EFS would work. With GCP I think you
can use a GCS bucket as the save directory.
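For illustration, a minimal sketch of what a single shared output directory can look like in practice; the bucket name and mount point here are hypothetical, and the remaining flags stay as in the launch scripts quoted below:

```bash
# Hypothetical shared locations for --output_dir; every ps and master task
# must be able to read and write the same path.

# GCP: a GCS bucket (placeholder bucket name, requires GCS credentials on each machine):
TRAIN_DIR=gs://my-t2t-checkpoints/wmt_ende_tokens_32k/transformer-transformer_base

# AWS or on-prem: a directory on EFS/NFS mounted at the same path on every machine:
# TRAIN_DIR=/mnt/shared/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base

# Each t2t-trainer command (ps and master alike) then passes --output_dir=$TRAIN_DIR,
# with the other flags unchanged from the launch scripts quoted below.
echo "Using shared output dir: $TRAIN_DIR"
```

Whether a `gs://` path works directly depends on the TensorFlow build having GCS support, so treat this as a starting point rather than a guarantee.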
On Tue, Oct 31, 2017 at 4:43 AM, tobyyouup wrote:

Hi @rsepassi @lukaszkaiser, this is an old problem mentioned in #45, reopened now because I am trying to run in distributed mode again. When I use 2 ps + 2 workers, with the 2 ps on two different physical machines and no shared file system, each ps stores only part of the checkpoint. For example, there are 4 checkpoint shards in total: the ps on machine A stores shards 00 and 01, while the ps on machine B stores shards 02 and 03. The worker then reports that it cannot find "model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index".
Is there anything wrong? How can I run ps tasks on different physical machines without a shared file system?
The details of the problem are as follows.
When I run a ps and a worker on each of two machines (2 ps and 2 workers in total, one ps and one worker per machine), the worker with id=0 (the first worker) shows these errors:
`
INFO:tensorflow:Saving checkpoints for 0 into t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
Traceback (most recent call last):
  File "/usr/anaconda2/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
    experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 960, in run
    run_metadata=run_metadata))
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 368, in after_run
    self._save(global_step, run_context.session)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 384, in _save
    self._get_saver().save(session, self._save_path, global_step=step)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1488, in save
    raise exc
tensorflow.python.framework.errors_impl.NotFoundError: t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
    [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]

Caused by op u'save/MergeV2Checkpoints', defined at:
  File "/usr/anaconda2/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
    experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session
    self._scaffold.finalize()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 205, in finalize
    self._saver.build()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 685, in build
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 361, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 185, in merge_v2_checkpoints
    delete_old_dirs=delete_old_dirs, name=name)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
    [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]`
It says the checkpoint shard "part-00000-of-00004.index" does not exist, but this file actually exists. The file status is:
-rw-rw-r-- 1 root root 12 Jul 15 18:41 part-00000-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 177 Jul 15 18:41 part-00000-of-00004.index
-rw-rw-r-- 1 root root 107839496 Jul 15 18:41 part-00001-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 15176 Jul 15 18:41 part-00001-of-00004.index
Parts 00002 and 00003 are on the second machine:
-rw-rw-r-- 1 root root 12 Jul 15 03:38 part-00002-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 217 Jul 15 03:38 part-00002-of-00004.index
-rw-rw-r-- 1 root root 635314176 Jul 15 03:38 part-00003-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 18960 Jul 15 03:38 part-00003-of-00004.index
Both ps tasks and the second worker run normally; only the first worker fails.
The scripts used to start the ps and worker tasks are:
`PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=./t2t_data_new
model_name=t2t_train_dis
TRAIN_DIR=$model_name/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $TRAIN_DIR
worker:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "master"}}' t2t-trainer --master=grpc://10.150.147.74:5002 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=0 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "master"}}' t2t-trainer --master=grpc://10.150.144.48:5000 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=1 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
ps:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "ps"}}' t2t-trainer --master=grpc://10.150.147.74:10002 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "ps"}}' t2t-trainer --master=grpc://10.150.144.48:10000 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
`
@rsepassi Thanks for your reply. If I want to run this on multiple physical machines without AWS or GCP, just a local cluster, how can I use the same output dir on different machines? With NFS?
Yes, NFS is probably your best bet.
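For concreteness, a rough sketch of the kind of NFS setup this implies, assuming one machine exports a directory and the others mount it at the same path; the host address, export path, and export options below are placeholders, not taken from this issue:

```bash
# Placeholder host/paths: 10.0.0.1 exports /export/t2t_train over NFS and the
# other machines mount it at the identical location.

# On the exporting machine: add the line below to /etc/exports, then re-export.
#   /export/t2t_train  10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On every other machine: mount the export at the same path.
sudo mkdir -p /export/t2t_train
sudo mount -t nfs 10.0.0.1:/export/t2t_train /export/t2t_train

# All ps and master jobs then share one output directory:
TRAIN_DIR=/export/t2t_train/$PROBLEM/$MODEL-$HPARAMS
```

With the mount in place, the launch scripts quoted above only need TRAIN_DIR to resolve to the shared path on every machine.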
@rsepassi Yes, I have solved this with NFS. Thanks for your help.
@rsepassi Another question: I have not seen any code related to sync mode with the parameter servers. Shouldn't it use the class defined at https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer to synchronize the training process in a ps+worker setting?
We use in-graph sync training, with one master and many workers. SyncReplicasOptimizer is for when you have many masters (usually 1-to-1 master-to-worker) that push updates to shared parameter servers. This issue is not the right place for that discussion though. If you'd like to ask follow-up questions, please open a new issue or ask on Gitter.