
train with multi-gpu with MirroredStrategy will hang-up #22889

@honeytidy

Description


System information

Have I written custom code: N/A
OS Platform and Distribution: CentOS Linux release 7.3.1611
TensorFlow installed from: pip install tf-nightly-gpu
TensorFlow version: v1.9.0-rc2-5345-g57d31aa599 (1.12.0-dev20181005)
Bazel version: N/A
CUDA/cuDNN version: CUDA 9.0 with cuDNN 7.1.4
GPU model and memory: Tesla P40, 24 GB
Mobile device: N/A
Exact command to reproduce: N/A

I am training a multi-GPU model with TensorFlow using MirroredStrategy and Estimator, and I hit the following problem:
when I enable distributed training with the code below, training gets stuck after running some steps:

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
        config=config)

but when I run without distribute mode like this (the strategy is still created, just not passed to RunConfig):

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig()
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
        config=config)

it runs fine. Why?
Is this a bug in MirroredStrategy?
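For anyone trying to reproduce this, here is a minimal self-contained sketch of the two cases. It assumes TF 1.x with tf.contrib available (as in the 1.12.0-dev20181005 nightly above); mymodel_fn and input_fn here are hypothetical placeholders standing in for the reporter's real model and input pipeline:

```python
import tensorflow as tf

USE_DISTRIBUTE = True  # True -> reportedly hangs after some steps; False -> runs fine

def mymodel_fn(features, labels, mode):
    # Trivial linear model; a stand-in for the real model_fn.
    logits = tf.layers.dense(features['x'], 1)
    loss = tf.losses.mean_squared_error(labels, logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Tiny synthetic dataset; a stand-in for the real input pipeline.
    ds = tf.data.Dataset.from_tensors(({'x': [1.0]}, [2.0])).repeat()
    return ds.batch(32)

if USE_DISTRIBUTE:
    # MirroredStrategy() with no arguments mirrors across all visible GPUs.
    distribution = tf.contrib.distribute.MirroredStrategy()
    config = tf.estimator.RunConfig(train_distribute=distribution)
else:
    config = tf.estimator.RunConfig()

estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
                                   config=config)
estimator.train(input_fn, max_steps=1000)
```

This is a repro sketch for a multi-GPU TF 1.x environment, not a definitive test case; the hang only manifests with more than one GPU visible.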

Metadata

Labels

TF 1.12 (Issues related to TF 1.12), comp:dist-strat (Distribution Strategy related issues), stale (to be closed automatically if no activity), stat:awaiting response (Status - Awaiting response from author)
