This repository was archived by the owner on Dec 9, 2024. It is now read-only.

OSError on 128 GPUs for distributed_replicated on AWS P3 #165

@richardliaw

Description

Hi,

I'm trying to run a distributed_replicated benchmark with 128 V100s, and I'm getting an OSError.

Some more details:

  • Using 16 AWS P3 instances (8 V100s each)
  • Batch size is 64
  • Running resnet101

Does anyone know how I can get around this issue, or whether there are any obvious mistakes I'm making? The same commands work fine with 8 machines (64 GPUs).
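
For context, the per-host commands were generated with a small helper script; a rough sketch of that kind of generator is below (the host list, ports, and flags match this run, but it is an approximation rather than the exact script).

```python
# Rough sketch of the command generator (not the exact script used).
# Each of the 16 machines runs one CPU parameter server (port 50000)
# and one 8-GPU worker (port 50001).
hosts = [
    "172.31.89.130", "172.31.92.187", "172.31.95.87", "172.31.91.114",
    "172.31.81.43", "172.31.90.229", "172.31.91.125", "172.31.85.199",
    "172.31.93.20", "172.31.87.145", "172.31.93.84", "172.31.89.237",
    "172.31.83.145", "172.31.82.121", "172.31.80.160", "172.31.85.86",
]
worker_hosts = ",".join("%s:50001" % h for h in hosts)
ps_hosts = ",".join("%s:50000" % h for h in hosts)

base = ("python tf_cnn_benchmarks.py --worker_hosts=%s --num_gpus=8 "
        "--ps_hosts=%s --task_index=%d --batch_size=64 --model=resnet101 "
        "--variable_update=distributed_replicated "
        "--local_parameter_device=%s --job_name=%s")

for task_index, host in enumerate(hosts):
    print("Run the following commands on %s" % host)
    # Parameter server: hide the GPUs so it stays on the CPU.
    print("CUDA_VISIBLE_DEVICES= " +
          base % (worker_hosts, ps_hosts, task_index, "cpu", "ps"))
    # Worker: uses all 8 local GPUs on this machine.
    print(base % (worker_hosts, ps_hosts, task_index, "gpu", "worker"))
```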

I've pasted the commands run below:

##########
Run the following commands on 172.31.89.130
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.92.187
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.95.87
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.91.114
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.81.43
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.90.229
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.91.125
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.85.199
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.93.20
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.87.145
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.93.84
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.89.237
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.83.145
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.82.121
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.80.160
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.85.86
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker
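
As I understand it (this is a sketch of my reading of the setup, not the benchmark's actual code), the flags above define a single TF 1.x cluster in which every machine hosts a CPU-only ps task on port 50000 and an 8-GPU worker task on port 50001:

```python
# Sketch of the cluster the flags above describe (TF 1.x API); this is my
# reading of the setup, not the exact code in tf_cnn_benchmarks.
import tensorflow as tf

hosts = ["172.31.89.130", "172.31.92.187"]  # ..., all 16 instance IPs
cluster = tf.train.ClusterSpec({
    "ps":     ["%s:50000" % h for h in hosts],  # CPU-only parameter servers
    "worker": ["%s:50001" % h for h in hosts],  # 8-GPU workers
})

# Each process starts a gRPC server for its own slot, e.g. the worker with
# --task_index=0 on the first machine:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```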

Here is the stderr from one of the workers:

/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #2
  (fname, cnt))
/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #3
  (fname, cnt))
2018-04-17 00:27:36.779965: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:27:37.724333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.725452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:37.949377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.950506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1d.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.200838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.202080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1c.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.410280: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.411397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1b.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.633574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.634718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 4 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1a.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.833131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.834389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 5 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:19.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.027737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.029552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 6 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:18.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.261369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.262446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 7 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:17.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.262746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-04-17 00:27:42.069669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:27:42.069719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 4 5 6 7
2018-04-17 00:27:42.069731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y Y N N N
2018-04-17 00:27:42.069738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y N Y N N
2018-04-17 00:27:42.069745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y N N Y N
2018-04-17 00:27:42.069752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N N N N Y
2018-04-17 00:27:42.069758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 4:   Y N N N N Y Y Y
2018-04-17 00:27:42.069766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 5:   N Y N N Y N Y Y
2018-04-17 00:27:42.069772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 6:   N N Y N Y Y N Y
2018-04-17 00:27:42.069779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 7:   N N N Y Y Y Y N
2018-04-17 00:27:42.072586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 14867 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-04-17 00:27:42.248215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 14867 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-04-17 00:27:42.400232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:2 with 14867 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-04-17 00:27:42.582713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:3 with 14867 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-04-17 00:27:42.775726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:4 with 14867 MB memory) -> physical GPU (device: 4, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1a.0, compute capability: 7.0)
2018-04-17 00:27:42.933943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:5 with 14867 MB memory) -> physical GPU (device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0)
2018-04-17 00:27:43.115514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:6 with 14867 MB memory) -> physical GPU (device: 6, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:18.0, compute capability: 7.0)
2018-04-17 00:27:43.309956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:7 with 14867 MB memory) -> physical GPU (device: 7, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0)
2018-04-17 00:27:43.501367: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 172.31.89.130:50000, 1 -> 172.31.92.187:50000, 2 -> 172.31.95.87:50000, 3 -> 172.31.91.114:50000, 4 -> 172.31.81.43:50000, 5 -> 172.31.90.229:50000, 6 -> 172.31.91.125:50000, 7 -> 172.31.85.199:50000, 8 -> 172.31.93.20:50000, 9 -> 172.31.87.145:50000, 10 -> 172.31.93.84:50000, 11 -> 172.31.89.237:50000, 12 -> 172.31.83.145:50000, 13 -> 172.31.82.121:50000, 14 -> 172.31.80.160:50000, 15 -> 172.31.85.86:50000}
2018-04-17 00:27:43.501438: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:50001, 1 -> 172.31.92.187:50001, 2 -> 172.31.95.87:50001, 3 -> 172.31.91.114:50001, 4 -> 172.31.81.43:50001, 5 -> 172.31.90.229:50001, 6 -> 172.31.91.125:50001, 7 -> 172.31.85.199:50001, 8 -> 172.31.93.20:50001, 9 -> 172.31.87.145:50001, 10 -> 172.31.93.84:50001, 11 -> 172.31.89.237:50001, 12 -> 172.31.83.145:50001, 13 -> 172.31.82.121:50001, 14 -> 172.31.80.160:50001, 15 -> 172.31.85.86:50001}
2018-04-17 00:27:43.512903: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:50001
W0417 00:29:08.790456 139800790255360 tf_logging.py:126] From /home/ubuntu/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1504: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-04-17 00:29:19.709271: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unknown: Call dropped by load balancing policy
2018-04-17 00:29:25.709634: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:26.707462: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:27.711554: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:28.707978: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.704912: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.708827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.711819: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:29.711866: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:29.711883: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:3
2018-04-17 00:29:29.711896: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:29.711912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:29.711925: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:29.711940: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:29.711956: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:29.711970: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:10
2018-04-17 00:29:29.711985: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:29.712000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:29.712044: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:29.712067: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:29.712081: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:29.712093: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
2018-04-17 00:29:29.712107: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:4
2018-04-17 00:29:29.712119: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:29.712132: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:8
2018-04-17 00:29:29.712144: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:29.712157: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:29.712170: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:29.712182: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:29.712194: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:32.701156: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:32.704758: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:34.707827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:37.707906: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.708875: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.712416: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:39.712459: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:39.712474: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:39.712484: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:39.712495: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:39.712506: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:39.712517: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:39.712528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:39.712545: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:39.712573: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:39.712586: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:39.712597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:39.712610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:39.712623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:39.712635: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:39.712695: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:39.712711: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:39.712722: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:41.699911: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:49.712912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:49.712973: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:49.712988: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:49.713000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:49.713012: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:49.713023: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:49.713035: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:49.713047: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:49.713059: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:49.713074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:49.713085: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:49.713099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:49.713112: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:49.713125: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:49.713137: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:49.713150: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:49.713195: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:59.713392: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:59.713449: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:59.713466: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:59.713479: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:59.713491: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:59.713503: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:59.713515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:59.713527: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:59.713544: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:59.713556: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:59.713570: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:59.713583: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:59.713597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:59.713610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:59.713623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:59.713636: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:59.713650: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:05.709934: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:30:09.713870: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:09.713939: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:09.713960: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:09.713972: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:09.714000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:09.714015: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:09.714027: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:09.714060: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:09.714074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:09.714087: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:09.714099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:09.714113: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:09.714129: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:09.714141: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:09.714156: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:09.714168: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:19.714395: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:19.714461: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:19.714477: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:19.714490: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:19.714502: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:19.714515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:19.714528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:19.714540: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:19.714555: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:19.714568: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:19.714582: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:19.714595: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:19.714620: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:19.714634: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:19.714649: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:19.714662: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
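
For what it's worth, basic TCP reachability between the instances can be ruled out with a quick probe along the lines of the (hypothetical) stdlib-only script below, run from each machine against the ps/worker ports used above:

```python
# Hypothetical reachability probe: check that every ps (50000) and worker
# (50001) port answers from this machine. Standard library only.
import socket

hosts = ["172.31.89.130", "172.31.92.187"]  # ..., all 16 instance IPs
for host in hosts:
    for port in (50000, 50001):
        try:
            conn = socket.create_connection((host, port), timeout=5)
            conn.close()
            print("OK   %s:%d" % (host, port))
        except socket.error as exc:
            print("FAIL %s:%d -> %s" % (host, port, exc))
```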
