checkpoint cannot be found when running distributed mode with 2 ps #394
Comments
TensorFlow expects there to be a single shared output directory. No known
way around that. If you’re using AWS, EFS would work. With GCP I think you
can use a GCS bucket as the save directory.
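For illustration, a minimal sketch of what a single shared output directory can look like in practice; the bucket name and mount point here are hypothetical, and the remaining flags stay as in the launch scripts quoted below:

```bash
# Hypothetical shared locations for --output_dir; every ps and master task
# must be able to read and write the same path.

# GCP: a GCS bucket (placeholder bucket name, requires GCS credentials on each machine):
TRAIN_DIR=gs://my-t2t-checkpoints/wmt_ende_tokens_32k/transformer-transformer_base

# AWS or on-prem: a directory on EFS/NFS mounted at the same path on every machine:
# TRAIN_DIR=/mnt/shared/t2t_train/wmt_ende_tokens_32k/transformer-transformer_base

# Each t2t-trainer command (ps and master alike) then passes --output_dir=$TRAIN_DIR,
# with the other flags unchanged from the launch scripts quoted below.
echo "Using shared output dir: $TRAIN_DIR"
```

Whether a `gs://` path works directly depends on the TensorFlow build having GCS support, so treat this as a starting point rather than a guarantee.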
On Tue, Oct 31, 2017 at 4:43 AM, tobyyouup wrote:

Hi @rsepassi @lukaszkaiser, this is an old problem mentioned in #45, reopened now because I am trying to run in distributed mode again. When I use 2 ps + 2 workers, with the 2 ps on two different physical machines and no shared file system, each ps stores only part of the checkpoint. For example, there are 4 checkpoint shards in total: the ps on machine A stores shards 00 and 01, while the ps on machine B stores shards 02 and 03. The worker then reports that it cannot find "model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index".
Is there anything wrong? How can I run ps tasks on different physical machines without a shared file system?
The details of the problem are as follows.
When I run a ps and a worker on each of two machines (2 ps and 2 workers in total, one ps and one worker per machine), the worker with id=0 (the first worker) shows these errors:
`
INFO:tensorflow:Saving checkpoints for 0 into t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt.
Traceback (most recent call last):
  File "/usr/anaconda2/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
    experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1007, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 505, in run
    run_metadata=run_metadata)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 842, in run
    run_metadata=run_metadata)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 798, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 960, in run
    run_metadata=run_metadata))
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 368, in after_run
    self._save(global_step, run_context.session)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 384, in _save
    self._get_saver().save(session, self._save_path, global_step=step)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1488, in save
    raise exc
tensorflow.python.framework.errors_impl.NotFoundError: t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
    [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]

Caused by op u'save/MergeV2Checkpoints', defined at:
  File "/usr/anaconda2/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/usr/anaconda2/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensor2tensor/utils/trainer_utils.py", line 270, in run
    experiment_fn=exp_fn, schedule=schedule, output_dir=output_dir)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
    return task()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session
    self._scaffold.finalize()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 205, in finalize
    self._saver.build()
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 685, in build
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 361, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 185, in merge_v2_checkpoints
    delete_old_dirs=delete_old_dirs, name=name)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): t2t_train_dis/wmt_ende_tokens_32k/transformer-transformer_base/model.ckpt-0_temp_02a6a6cf87a94843b039bf54f3fbf449/part-00000-of-00004.index
    [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S3639)]]`
It says the checkpoint shard "part-00000-of-00004.index" does not exist, but this file actually exists. The file status is:
-rw-rw-r-- 1 root root 12 Jul 15 18:41 part-00000-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 177 Jul 15 18:41 part-00000-of-00004.index
-rw-rw-r-- 1 root root 107839496 Jul 15 18:41 part-00001-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 15176 Jul 15 18:41 part-00001-of-00004.index
Parts 00002 and 00003 are on the second machine:
-rw-rw-r-- 1 root root 12 Jul 15 03:38 part-00002-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 217 Jul 15 03:38 part-00002-of-00004.index
-rw-rw-r-- 1 root root 635314176 Jul 15 03:38 part-00003-of-00004.data-00000-of-00001
-rw-rw-r-- 1 root root 18960 Jul 15 03:38 part-00003-of-00004.index
Both ps tasks and the second worker run normally; only the first worker fails.
The scripts used to start the ps and worker tasks are:
`PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base
DATA_DIR=./t2t_data_new
model_name=t2t_train_dis
TRAIN_DIR=$model_name/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $TRAIN_DIR
worker:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "master"}}' t2t-trainer --master=grpc://10.150.147.74:5002 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=0 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "master"}}' t2t-trainer --master=grpc://10.150.144.48:5000 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=1 --worker_job='/job:master' --ps_gpu=1 --schedule=train --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
ps:
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 0, "type": "ps"}}' t2t-trainer --master=grpc://10.150.147.74:10002 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["10.150.147.74:10002", "10.150.144.48:10000"], "master": ["10.150.147.74:5002", "10.150.144.48:5000"]}, "task": {"index": 1, "type": "ps"}}' t2t-trainer --master=grpc://10.150.144.48:10000 --schedule=run_std_server --data_dir=$DATA_DIR --problems=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
`
@rsepassi Thanks for your reply. If I want to run this on multiple physical machines without AWS or GCP, just a local cluster, how can I use the same output dir on different machines? With NFS?
Yes, NFS is probably your best bet.
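For concreteness, a rough sketch of the kind of NFS setup this implies, assuming one machine exports a directory and the others mount it at the same path; the host address, export path, and export options below are placeholders, not taken from this issue:

```bash
# Placeholder host/paths: 10.0.0.1 exports /export/t2t_train over NFS and the
# other machines mount it at the identical location.

# On the exporting machine: add the line below to /etc/exports, then re-export.
#   /export/t2t_train  10.0.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On every other machine: mount the export at the same path.
sudo mkdir -p /export/t2t_train
sudo mount -t nfs 10.0.0.1:/export/t2t_train /export/t2t_train

# All ps and master jobs then share one output directory:
TRAIN_DIR=/export/t2t_train/$PROBLEM/$MODEL-$HPARAMS
```

With the mount in place, the launch scripts quoted above only need TRAIN_DIR to resolve to the shared path on every machine.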
@rsepassi Yes, I have solved this with NFS. Thanks for your help.
@rsepassi Another question: I have not seen any code related to sync mode with the parameter servers. Shouldn't it use the class defined at https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer to synchronize the training process in a ps+worker setting?
We use in-graph sync training, with one master and many workers. SyncReplicasOptimizer is for when you have many masters (usually 1-to-1 master-to-worker) that push updates to shared parameter servers. This issue is not the right place for that discussion though. If you'd like to ask follow-up questions, please open a new issue or ask on Gitter.