This repository was archived by the owner on Dec 9, 2024. It is now read-only.

OSError on 128 GPUs for distributed_replicated on AWS P3 #165

@richardliaw

Description

Hi,

I'm trying to run a distributed_replicated benchmark with 128 V100s, and I'm getting an OSError.

Some more details:

  • Using 16 AWS P3 instances (8 V100s each)
  • Batch size is 64
  • Running resnet101

Does anyone know how I can get around this issue, or whether there are any obvious mistakes I'm making? The same commands work fine with 8 machines (64 GPUs).
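
For context, the per-host commands were generated with a small helper script; a rough sketch of that kind of generator is below (the host list, ports, and flags match this run, but it is an approximation rather than the exact script).

```python
# Rough sketch of the command generator (not the exact script used).
# Each of the 16 machines runs one CPU parameter server (port 50000)
# and one 8-GPU worker (port 50001).
hosts = [
    "172.31.89.130", "172.31.92.187", "172.31.95.87", "172.31.91.114",
    "172.31.81.43", "172.31.90.229", "172.31.91.125", "172.31.85.199",
    "172.31.93.20", "172.31.87.145", "172.31.93.84", "172.31.89.237",
    "172.31.83.145", "172.31.82.121", "172.31.80.160", "172.31.85.86",
]
worker_hosts = ",".join("%s:50001" % h for h in hosts)
ps_hosts = ",".join("%s:50000" % h for h in hosts)

base = ("python tf_cnn_benchmarks.py --worker_hosts=%s --num_gpus=8 "
        "--ps_hosts=%s --task_index=%d --batch_size=64 --model=resnet101 "
        "--variable_update=distributed_replicated "
        "--local_parameter_device=%s --job_name=%s")

for task_index, host in enumerate(hosts):
    print("Run the following commands on %s" % host)
    # Parameter server: hide the GPUs so it stays on the CPU.
    print("CUDA_VISIBLE_DEVICES= " +
          base % (worker_hosts, ps_hosts, task_index, "cpu", "ps"))
    # Worker: uses all 8 local GPUs on this machine.
    print(base % (worker_hosts, ps_hosts, task_index, "gpu", "worker"))
```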

I've pasted the commands run below:

##########
Run the following commands on 172.31.89.130
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.92.187
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.95.87
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.91.114
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.81.43
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.90.229
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.91.125
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.85.199
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.93.20
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.87.145
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.93.84
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.89.237
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.83.145
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.82.121
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.80.160
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
Run the following commands on 172.31.85.86
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker
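
As I understand it (this is a sketch of my reading of the setup, not the benchmark's actual code), the flags above define a single TF 1.x cluster in which every machine hosts a CPU-only ps task on port 50000 and an 8-GPU worker task on port 50001:

```python
# Sketch of the cluster the flags above describe (TF 1.x API); this is my
# reading of the setup, not the exact code in tf_cnn_benchmarks.
import tensorflow as tf

hosts = ["172.31.89.130", "172.31.92.187"]  # ..., all 16 instance IPs
cluster = tf.train.ClusterSpec({
    "ps":     ["%s:50000" % h for h in hosts],  # CPU-only parameter servers
    "worker": ["%s:50001" % h for h in hosts],  # 8-GPU workers
})

# Each process starts a gRPC server for its own slot, e.g. the worker with
# --task_index=0 on the first machine:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```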

Here is the stderr from one of the workers:

/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #2
  (fname, cnt))
/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #3
  (fname, cnt))
2018-04-17 00:27:36.779965: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:27:37.724333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.725452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:37.949377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.950506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1d.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.200838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.202080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1c.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.410280: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.411397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1b.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.633574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.634718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 4 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1a.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.833131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.834389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 5 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:19.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.027737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.029552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 6 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:18.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.261369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.262446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 7 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:17.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.262746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-04-17 00:27:42.069669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:27:42.069719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 4 5 6 7
2018-04-17 00:27:42.069731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y Y N N N
2018-04-17 00:27:42.069738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y N Y N N
2018-04-17 00:27:42.069745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y N N Y N
2018-04-17 00:27:42.069752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N N N N Y
2018-04-17 00:27:42.069758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 4:   Y N N N N Y Y Y
2018-04-17 00:27:42.069766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 5:   N Y N N Y N Y Y
2018-04-17 00:27:42.069772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 6:   N N Y N Y Y N Y
2018-04-17 00:27:42.069779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 7:   N N N Y Y Y Y N
2018-04-17 00:27:42.072586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 14867 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-04-17 00:27:42.248215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 14867 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-04-17 00:27:42.400232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:2 with 14867 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-04-17 00:27:42.582713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:3 with 14867 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-04-17 00:27:42.775726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:4 with 14867 MB memory) -> physical GPU (device: 4, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1a.0, compute capability: 7.0)
2018-04-17 00:27:42.933943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:5 with 14867 MB memory) -> physical GPU (device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0)
2018-04-17 00:27:43.115514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:6 with 14867 MB memory) -> physical GPU (device: 6, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:18.0, compute capability: 7.0)
2018-04-17 00:27:43.309956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:7 with 14867 MB memory) -> physical GPU (device: 7, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0)
2018-04-17 00:27:43.501367: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 172.31.89.130:50000, 1 -> 172.31.92.187:50000, 2 -> 172.31.95.87:50000, 3 -> 172.31.91.114:50000, 4 -> 172.31.81.43:50000, 5 -> 172.31.90.229:50000, 6 -> 172.31.91.125:50000, 7 -> 172.31.85.199:50000, 8 -> 172.31.93.20:50000, 9 -> 172.31.87.145:50000, 10 -> 172.31.93.84:50000, 11 -> 172.31.89.237:50000, 12 -> 172.31.83.145:50000, 13 -> 172.31.82.121:50000, 14 -> 172.31.80.160:50000, 15 -> 172.31.85.86:50000}
2018-04-17 00:27:43.501438: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:50001, 1 -> 172.31.92.187:50001, 2 -> 172.31.95.87:50001, 3 -> 172.31.91.114:50001, 4 -> 172.31.81.43:50001, 5 -> 172.31.90.229:50001, 6 -> 172.31.91.125:50001, 7 -> 172.31.85.199:50001, 8 -> 172.31.93.20:50001, 9 -> 172.31.87.145:50001, 10 -> 172.31.93.84:50001, 11 -> 172.31.89.237:50001, 12 -> 172.31.83.145:50001, 13 -> 172.31.82.121:50001, 14 -> 172.31.80.160:50001, 15 -> 172.31.85.86:50001}
2018-04-17 00:27:43.512903: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:50001
W0417 00:29:08.790456 139800790255360 tf_logging.py:126] From /home/ubuntu/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1504: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-04-17 00:29:19.709271: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unknown: Call dropped by load balancing policy
2018-04-17 00:29:25.709634: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:26.707462: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:27.711554: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:28.707978: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.704912: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.708827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.711819: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:29.711866: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:29.711883: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:3
2018-04-17 00:29:29.711896: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:29.711912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:29.711925: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:29.711940: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:29.711956: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:29.711970: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:10
2018-04-17 00:29:29.711985: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:29.712000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:29.712044: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:29.712067: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:29.712081: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:29.712093: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
2018-04-17 00:29:29.712107: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:4
2018-04-17 00:29:29.712119: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:29.712132: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:8
2018-04-17 00:29:29.712144: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:29.712157: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:29.712170: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:29.712182: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:29.712194: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:32.701156: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:32.704758: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:34.707827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:37.707906: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.708875: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.712416: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:39.712459: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:39.712474: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:39.712484: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:39.712495: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:39.712506: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:39.712517: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:39.712528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:39.712545: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:39.712573: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:39.712586: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:39.712597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:39.712610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:39.712623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:39.712635: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:39.712695: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:39.712711: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:39.712722: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:41.699911: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:49.712912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:49.712973: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:49.712988: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:49.713000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:49.713012: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:49.713023: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:49.713035: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:49.713047: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:49.713059: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:49.713074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:49.713085: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:49.713099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:49.713112: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:49.713125: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:49.713137: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:49.713150: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:49.713195: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:59.713392: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:59.713449: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:59.713466: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:59.713479: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:59.713491: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:59.713503: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:59.713515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:59.713527: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:59.713544: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:59.713556: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:59.713570: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:59.713583: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:59.713597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:59.713610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:59.713623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:59.713636: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:59.713650: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:05.709934: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:30:09.713870: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:09.713939: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:09.713960: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:09.713972: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:09.714000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:09.714015: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:09.714027: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:09.714060: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:09.714074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:09.714087: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:09.714099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:09.714113: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:09.714129: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:09.714141: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:09.714156: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:09.714168: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:19.714395: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:19.714461: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:19.714477: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:19.714490: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:19.714502: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:19.714515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:19.714528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:19.714540: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:19.714555: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:19.714568: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:19.714582: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:19.714595: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:19.714620: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:19.714634: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:19.714649: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:19.714662: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
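
For what it's worth, basic TCP reachability between the instances can be ruled out with a quick probe along the lines of the (hypothetical) stdlib-only script below, run from each machine against the ps/worker ports used above:

```python
# Hypothetical reachability probe: check that every ps (50000) and worker
# (50001) port answers from this machine. Standard library only.
import socket

hosts = ["172.31.89.130", "172.31.92.187"]  # ..., all 16 instance IPs
for host in hosts:
    for port in (50000, 50001):
        try:
            conn = socket.create_connection((host, port), timeout=5)
            conn.close()
            print("OK   %s:%d" % (host, port))
        except socket.error as exc:
            print("FAIL %s:%d -> %s" % (host, port, exc))
```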
