
WARNING: One or more tensors were submitted to be reduced, gathered #403

Closed
winwinJJiang opened this issue Jul 26, 2018 · 56 comments

@winwinJJiang

Hi all,

I got a warning like this, and I believe it slows down the GPU computation.
I am using 4 nodes and 4 GPUs.
Any suggestions would be highly welcome!

Thank you!

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_AddN_2_0 [missing ranks: 0]

@alsrgv
Member

alsrgv commented Jul 26, 2018

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.
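
A minimal sketch of this distributed-evaluation idea (not code from this issue; evaluate_my_shard() is a hypothetical placeholder): each rank evaluates its own shard of the validation data and the per-rank metric is averaged with hvd.allreduce, so no rank sits idle while rank 0 evaluates alone.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# hvd.allreduce averages across ranks by default, so running this op yields
# the mean validation metric over all ranks.
local_metric = tf.placeholder(tf.float32, shape=[])
global_metric = hvd.allreduce(local_metric)

def evaluate_my_shard():
    # Placeholder for evaluating this rank's shard of the validation set.
    return 0.0

with tf.Session() as sess:
    metric = evaluate_my_shard()
    print(hvd.rank(), sess.run(global_metric, feed_dict={local_metric: metric}))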

@winwinJJiang
Author

It happened during training. What I have done is remove the evaluation during training.
I hope the warning will be gone.

But I do feel the compute is not scaling up, though.

@winwinJJiang
Author

I have 4 nodes with 1 GPU each, so 4 GPUs in total.
Do you know if there are any special configurations I need to use to speed up the computation?

@meteortwinkle

meteortwinkle commented Jul 27, 2018

@alsrgv I have encountered this problem in my training, too.
My training environment is:
4 nodes (each with 8 V100 GPUs)
ResNet-50
mpirun --allow-run-as-root -np 32 -H node1:8,node2:8,node3:8,node4:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -mca plm_rsh_args "-p 13579" python train.py

I always get this log (mostly rank 0):
HVDAllReduce/HorovodAllreduce_gradients_group2_block0_conv1_bn_FusedBatchNormV2_grad_FusedBatchNormGradV2_1 [missing ranks: 0]
HVDAllReduce/HorovodAllreduce_gradients_group2_block0_conv1_bn_FusedBatchNormV2_grad_FusedBatchNormGradV2_2 [missing ranks: 0]

And it will hang and keep showing this waiting log, just like a deadlock.
On the 16-GPU, 4-node training (4 V100 GPUs per node), it runs, but very slowly.
Could you give me some advice? Thanks a lot.

@bapriddy
Contributor

bapriddy commented Jul 27, 2018

@winwinJJiang @alsrgv @meteortwinkle

I have a similar problem on 4 or 8 nodes...

Train Epoch #6: 29% 367/1252 [12:15<13:53, 1.06it/s, loss=2.8, accuracy=39.7]WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: allreduce.conv1.weight [missing ranks: 6] allreduce.bn1.weight [missing ranks: 6] allreduce.bn1.bias [missing ranks: 6] ...

I think we had the problem of hanging on 16 nodes as well - 16 P100 nodes. I posted this earlier with Alex and Travis. I've re-installed everything with the latest versions but still get the error. I think it's some sort of MPI issue. I'm not running Open MPI 3.0 but 2.0.

@bapriddy
Contributor

@meteortwinkle

Are you using NCCL 2.0?

@winwinJJiang
Author

@bapriddy I am using NCCL 2.1 and Open MPI 3.0.0.

@winwinJJiang
Author

@bapriddy Actually, I do not see the computation accelerating, and I am using only 4 GPUs.

@bapriddy
Contributor

@winwinJJiang I was asking @meteortwinkle because their system has 4 nodes (each with 8 V100 GPUs).

You have one GPU per node, so NCCL 2 might not help as much, since it works with multiple GPUs on multiple nodes (along with NVLink).

@alsrgv
Member

alsrgv commented Jul 28, 2018

@bapriddy, @winwinJJiang, @meteortwinkle, can you share what kind of model you're training, how many batches per second you see with 1 node and with 4 nodes, and what kind of network hardware you have?

@winwinJJiang, did the warning go away after you removed evaluation from your script?

@meteortwinkle, in a heterogeneous situation like yours, computation will be bound by the slowest GPU. If it's practical, I would separate the training jobs on the V100s and the other GPUs and run them in parallel for hyperparameter search. If not, you can try a larger batch on the V100s and a smaller batch on the other GPUs, but if you use batch norm that may mess things up quite a bit.

@bapriddy, do you run distributed evaluation in your training? It seems that in the case of your warning, rank 6 took longer to run its part of the evaluation and was late to the party.

@meteortwinkle

@winwinJJiang
NCCL 2.12.2
Open MPI 3.1.1
and GDR enabled

@winwinJJiang
Author

@alsrgv I am training the Glow model, as in https://github.com/openai/glow.

I am training with a batch size of 64, and the warning is still there after I removed the evaluation script.

@alsrgv
Member

alsrgv commented Jul 28, 2018

@winwinJJiang, how often does it happen? Does training progress after the warning? Can you share answers to the other questions - performance with 1 GPU and with multiple GPUs, and what network hardware are you using?

@meteortwinkle

meteortwinkle commented Jul 30, 2018

@alsrgv @bapriddy @winwinJJiang
Our model is ResNet-50, and the mini-batch size is 32 (real data). But the waiting problem comes up often.
What do you mean by the other GPUs? Our GPUs are all V100s.
Also, sometimes the training does run, but the scaling efficiency is bad: a step may even cost several seconds.
During the training process, I found the occupancy of some CPUs was 100%, even 200%.
Is the rank 0 process too busy to process the MPI requests in Horovod? Or does Open MPI lose some MPI messages? Or is there something I am missing?
Is there any Open MPI performance tool that can be used to measure MPI performance?
Looking forward to your reply.

@shenqixiaojiang

@meteortwinkle Hi, you can try the following command:

mpirun -np 16 \
    -H server1:8,server2:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME=eth01 \
    -mca pml ob1 \
    -mca btl_tcp_if_include eth01 \
    -mca btl ^openib -x NCCL_IB_DISABLE=1 \

@alsrgv
Member

alsrgv commented Aug 3, 2018

@meteortwinkle, how do you feed the input data - is it coming from local disk, or from some network storage? I see that you were able to run the tests in #416. Do you still see hangs? What if you try with synthetic data?

@iRmantou

@winwinJJiang Hi, I am training Glow too, and I hit the same problem with multiple Docker containers. How did you fix it?

@LuBingtan

LuBingtan commented Sep 5, 2018

I hit the same problem using Docker with multiple machines and multiple GPUs. The program hangs at the very beginning. This is my shell command:

mpirun -np 3 \
    -H localhost:1,server:2 \
    --allow-run-as-root \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME=^docker0 \
    -mca pml ob1 -mca btl ^openib \
    -mca btl_tcp_if_exclude lo,docker0 \
    -mca plm_rsh_args "-p 12345" \
    python tensorflow_mnist.py

Are there any solutions?

@iRmantou

iRmantou commented Sep 5, 2018

@LuBingtan In my case it was caused by some of my code using hvd.size(). If you have a similar problem, you could check your Horovod-related code blocks, such as calls to hvd.size().

@denisovlev

denisovlev commented Sep 7, 2018

I see the same thing; it happens pretty regularly. The data is coming from network storage.

Horovod is running under Open MPI 3.1.2, on 2 servers with 4 GPUs (Tesla V100) each, without Docker.

The training does not progress after the warning.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_43_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_44_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_38_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_36_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_34_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_33_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_45_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_32_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_25_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_23_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_21_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_19_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_12_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_10_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_8_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_7_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_20_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_6_0 [missing ranks: 6]

After the warning, nvidia-smi shows 0% utilization on both machines, even though the processes are still running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   38C    P0    51W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   41C    P0    51W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   43C    P0    52W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   45C    P0    55W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    106386      C   python                                      9952MiB |
|    1    106387      C   python                                      9952MiB |
|    2    106388      C   python                                     10976MiB |
|    3    106389      C   python                                      9952MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    52W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   40C    P0    54W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   37C    P0    49W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |  10090MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     70399      C   python                                     10976MiB |
|    1     70400      C   python                                     10976MiB |
|    2     70402      C   python                                      9952MiB |
|    3     70403      C   python                                     10080MiB |
+-----------------------------------------------------------------------------+

I also get a warning from Open MPI:

[warn] event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.

Not sure if it is related, but the data is read from HDF5 with h5py (not MPI-parallel HDF5).

@ppwwyyxx
Contributor

ppwwyyxx commented Jan 8, 2019

This may be related, but I found that Horovod can get stuck like this if there is not enough GPU memory.
I have a job which ran well last week, but yesterday when I tried to run it, it always got stuck at the first allreduce, with warning messages like:

HVDAllReduce/HorovodAllreduce_gradients_linear_BiasAdd_grad_BiasAddGrad_0 [missing ranks:[1,0]

together with TensorFlow OOM messages like:

2019-01-08 02:00:37.838302: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.62G (8187171072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

It got stuck only when I ran it on 16 machines, not locally.

At first I ignored the TensorFlow OOM messages, because I know that TensorFlow does not necessarily fail after such messages -- it can often still run. My successful local run also has such messages. And when TensorFlow really hits an unrecoverable OOM, it throws a much larger error with a stack trace and allocation status, and exits. However, none of the jobs exited.
However, it turned out that memory does seem to be the reason for the hang. All my failing runs had 16 GB of GPU memory, and the job succeeded after I switched to GPUs with 32 GB of memory. My successful run last week was on a combination of 16 GB and 32 GB GPUs (with rank 0 on a 32 GB GPU). (However, my successful local runs also used 16 GB of memory; perhaps different allreduce implementations cost different amounts of memory?)

@nicolefinnie

I hit the very same problem, and I realized one thing: if we always make the same worker do the heavy lifting, e.g. evaluating the current model after training for a certain number of epochs, the other workers are going to complain about stalled ops and missing ranks: 0, and they end up stuck in this situation without being able to recover (looks like a Horovod bug in my eyes), or the recovery slows down the whole process.

My workaround is to make all workers do the same heavy lifting; for example, all workers have to do evaluation even if we only take the results from the first worker. I mean, if the other workers have to wait for the first one anyway, it won't hurt too much for all workers to do the same job.

If I find a better solution, I'll update my answer.
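
A minimal sketch of this workaround (evaluate() is a hypothetical placeholder): every rank runs the same evaluation pass so all ranks issue the same collective ops, and only rank 0 reports the result.

import horovod.tensorflow as hvd

hvd.init()

def evaluate():
    # Placeholder for the real validation pass, identical on every rank.
    return 0.0

val_metric = evaluate()            # executed on every rank, not just rank 0
if hvd.rank() == 0:
    print("validation metric:", val_metric)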

@tgaddair
Collaborator

tgaddair commented Apr 1, 2019

@nicolefinnie, in your case the behavior you're seeing is what we would expect. If one of your workers is doing evaluation, but the other workers have moved on to train in the next epoch, then they will necessarily get stuck because rank 0 is busy doing evaluation.

In the ring-allreduce algorithm, the other workers cannot proceed while one of the workers is behind. You wouldn't want this behavior either, because it would effectively alter the batch size between batches, which invalidates your learning rate.

Your workaround of having all workers perform validation is a good approach. Another option is to use an MPI Barrier to wait for rank 0 to finish the evaluation step before the other workers proceed to the next epoch.

Hope that makes sense.
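
A minimal sketch of the barrier option, assuming mpi4py is installed alongside Horovod (train_one_epoch() and evaluate() are placeholders): rank 0 runs the per-epoch evaluation while the other ranks wait at the barrier instead of racing ahead into the next epoch's allreduces.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def train_one_epoch():
    pass   # placeholder for the real per-epoch training loop (runs on every rank)

def evaluate():
    pass   # placeholder for the real validation pass

for epoch in range(3):
    train_one_epoch()
    if comm.Get_rank() == 0:
        evaluate()       # only rank 0 evaluates
    comm.Barrier()       # everyone waits for rank 0 before the next epoch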

@nicolefinnie

nicolefinnie commented Apr 1, 2019

@tgaddair Thanks for your explanation. However, I still hit this problem after a certain amount of time. I wonder if stretching memory usage as far as possible is a cause too. My Horovod processes have been stuck for hours without consuming any CPU/GPU. Reducing the batch size didn't help, and the deadlock still happens. It seems the other workers are still waiting for worker 0, and then worker 0 is waiting for the other workers. I'll investigate more and keep you updated.

(with 4 GPUs in the log)

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:
HorovodBroadcast_conv6_weights_0_7 [missing ranks: 0]
...
...
DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_tuple_control_dependency_28_0 [missing ranks: 1, 2, 3]

@alsrgv
Member

alsrgv commented Apr 1, 2019

@nicolefinnie, is it possible that you have an unequal number of iterations per epoch on different workers?

@nicolefinnie

@alsrgv Sorry for getting back to you late. No, they're equally sharded (at least the log shows it that way). Another underlying issue could be that one GPU is on one socket and the other 3 GPUs are on another socket, so the slow one has to rely on the CPU to talk to the other 3. It's just my assumption after having looked into the GPU topology of the server the job was run on. However, the logs show the other 3 workers are always waiting for the lonely one during training. Then again, good things come to those GPU workers who wait.

@alsrgv
Member

alsrgv commented Apr 5, 2019

@nicolefinnie, your stall warning looks interesting. It basically says that all ranks except 0 ran hvd.broadcast() while rank 0 went ahead and started training. Maybe you have logic that broadcasts only on non-zero ranks somewhere in your code? Broadcast should be executed on all ranks.

@nicolefinnie

@alsrgv, thanks for your quick response, I appreciate it. I did broadcast on all ranks by passing root rank = 0 as follows:
session.run(hvd.broadcast_global_variables(0)) (this is run on all ranks)
However, interestingly, I happened to call broadcast multiple times in the same session on all ranks, and that's when I hit severe deadlocks that couldn't resolve themselves, as you saw in the very first trace. After I "fixed" the problem by only broadcasting once on all ranks, the symptom changed - the other ranks always wait for rank 0. That's why I suspect the GPU topology is the culprit: rank 0 is physically located on another socket, whereas ranks 1, 2, and 3 are located on the same socket.
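
A minimal sketch of broadcasting exactly once, for TF1 graph mode with horovod.tensorflow (the single variable here is just an illustrative stand-in for a real model): the broadcast op is built and run a single time, on every rank, right after variable initialization, rather than being run repeatedly in the same session.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

w = tf.get_variable("w", shape=[10], initializer=tf.random_normal_initializer())
init_op = tf.global_variables_initializer()
bcast_op = hvd.broadcast_global_variables(0)   # root rank 0

with tf.Session() as sess:
    sess.run(init_op)
    sess.run(bcast_op)   # executed exactly once, on all ranks
    # ... training loop with an hvd.DistributedOptimizer-wrapped optimizer ...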

@alsrgv
Member

alsrgv commented Apr 9, 2019

@nicolefinnie, is it possible to publish a minimal repro of your problem on GitHub?

@Himscipy

Himscipy commented Sep 13, 2019

@nicolefinnie I am facing similar issues while training a large network such as VGG16, as you mentioned in your last comment. Any ideas or suggestions? @alsrgv I am trying to run the example posted on TFP with Horovod, but the training gets stuck with the warning shown below. I understand that Horovod has been used successfully for large networks in tf_benchmark; I even tried porting the VGG-TFP model into the benchmark, but there are some limitations with some of the TFP layers, so I was trying to simply add Horovod to the TFP code. It would be really helpful if you could suggest things to try to fix this. I also tried @tgaddair's suggestion to check the model consistency on all ranks:
for i, v in enumerate(tf.global_variables()): print(hvd.rank(), i, v)
I also tried to look into the Horovod timeline, but I am receiving the following error:
horovodrun: error: unrecognized arguments: --timeline-filename
My command for submission is:
horovodrun -np 2 --timeline-filename ~/timeline.json -H localhost:2 python cifar10_bnn.py

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops:HorovodBroadcast_global_step_0 [missing ranks: 0] [2019-09-12 21:25:18.467757: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_1_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.467935: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468123: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.468308: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468493: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_bias_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468680: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_kernel_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468864: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.469051: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_8_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469244: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469430: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_beta_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469609: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469835: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_gamma_0 [missing ranks: 0] [2019-09-12 21:25:18.470024: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_1_kernel_posterior_untransformed_scale_0 [missing ranks: 0] [2019-09-12 21:25:18.470221: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_5_gamma_0 [missing ranks: 0] [2019-09-12 21:25:18.470404: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_bias_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.470594: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.470781: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.470968: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471155: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_kernel_posterior_untransformed_scale_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471341: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.471527: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471715: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.471900: W horovod/common/operations.cc:588] 
HorovodBroadcast_batch_normalization_v1_2_beta_0 [missing ranks: 0] [2019-09-12 21:25:18.472089: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472279: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.472467: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472656: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_3_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472842: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473029: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_3_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.473214: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473403: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473589: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_2_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473792: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_2_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.473985: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_beta_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.474171: W horovod/common/operations.cc:588] HorovodBroadcast_Dense_I_4_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.474361: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_3_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.474533: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_kernel_posterior_loc_0 [missing ranks: 0] [2019-09-12 21:25:18.474730: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_2_bias_posterior_loc_Adam_0 [missing ranks: 0]

@tgaddair
Collaborator

tgaddair commented Sep 13, 2019

Hey @Himscipy, looks like your version of Horovod is out of date. Can you try upgrading to the latest version? If for whatever reason you're unable to upgrade, you can generate a timeline by using the following environment variable:

HOROVOD_TIMELINE=/path/to/timeline.json horovodrun ...

@Himscipy

Himscipy commented Sep 14, 2019

Hey @tgaddair, thank you for the reply. Your earlier recommendation fixed the Horovod timeline. What are your thoughts on the other error? I made trace plots for a 2-rank run and a 1-rank run; as you can see, for the 1-rank run the broadcast negotiation is negligible, but for 2 ranks it takes a lot of time. Any thoughts on that? The 1-rank run completes successfully, while the other hangs on the broadcast negotiation.

Screenshot from 2019-09-13 14-11-07

Screenshot from 2019-09-13 12-14-35

@tgaddair
Collaborator

Hey @Himscipy, can you create a gist of the code with your modifications? I'll see if I can reproduce the hang in my environment.

Also, to clarify, you were able to successfully run the Horovod examples like tensorflow_mnist.py, correct?

@Himscipy

Himscipy commented Sep 16, 2019

Hey @tgaddair, I cannot share the code publicly. I can create a gist, but in that case I still cannot post the link here, since pasting the URL of a secret gist would make the code accessible to anyone with the URL. So let me know how to share the gist with you. Thank you.

@Himscipy

Hey @tgaddair, I created a minimal code gist for you; my TensorFlow and Horovod versions are TensorFlow 1.13.1 and Horovod 0.16.3.
Also, tensorflow_mnist.py works fine for me. The broadcast is taking a lot of time, and I get stalled ranks even with 2 ranks. I am using CPUs for the run. Let me know if you can reproduce the issue.
Thank you.

@Himscipy

Hello @alsrgv @tgaddair, were you able to reproduce the issue from the gist I shared with you earlier? I am still facing this issue: the code stalls on allreduce operations. Below is a snippet of the issue again.

[2019-09-24 16:48:59.102525: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:48:59.102792: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:48:59.103007: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:49:59.104858: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:49:59.105124: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:49:59.105323: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:50:59.106116: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:50:59.106378: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:50:59.106583: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:51:59.111789: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 


@tgaddair
Collaborator

Hey @Himscipy, thanks for putting the gist together. Looking it over, there are two possible issues I see. The first is that it looks like rank 0 is doing a lot of extra work, including validation, which may be slowing it down compared to the other workers. That would be consistent with the first error message you reported, where rank 0 was the one missing from the broadcast.

The second possible issue is that you're somehow training with uneven sample counts per worker. Though it looks like that shouldn't be the case, given that you have a hook to stop after a pre-determined step. Can you verify that every worker has more than that number of steps available to process?

I would suggest, as a first test, commenting out much of the if hvd.rank() == 0 code to see if that fixes some of the issues with the long broadcast negotiations.

@Himscipy

Hi @tgaddair, thank you for the response. I have tried commenting out most of the hvd.rank() == 0 operations in the code, but it didn't help; I still find the run stalling. I also changed the data pipeline to a simple batch generator and modified the network to have fewer parameters to train. The actual number of parameters is ~19 million; I modified the architecture to have 17M parameters. With the reduced parameter count, the distributed training works fine with 2 ranks even without putting in barriers, but it stalls with a higher number of MPI ranks (4, 8, 16), and for the 19M-parameter case it stalls for every rank count. I did some profiling for the 2-rank, 17M-parameter case; the profiler reported that the run is back-end bound, which means the following:
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. During each cycle, the front-end generates up to two of these operations, places them into pipeline slots and moves them through the back-end. The actual number of retired pipeline slots containing useful work rarely equals this maximum. This can be because the back-end was not prepared to accept more operations of a certain kind (back-end bound execution). Back-end bound execution may be due to long-latency operations or other contention for execution resources, such as too many operations being directed to a single execution port.
What do you think: is it because rank 0 is doing the allreduce operations for large tensors (the parameters to train), leading to too many operations, and could this be aggravated by increasing the number of workers and the number of parameters?

@Himscipy

Hi @tgaddair, does HOROVOD_STALL_CHECK_TIME_SECONDS also affect the allreduce stall check? I am not seeing any change in the stalled-ops warning, which still reports 60 seconds, even after setting HOROVOD_STALL_CHECK_TIME_SECONDS=600.

@Himscipy

Himscipy commented Oct 3, 2019

Hi @tgaddair,
I am still receiving the stalled-ops error for train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_Add. I checked the total number of FLOPs needed for my network (~18M parameters): about 564.46 million FLOPs. The operation train/DistributedAdamOptimizer_Allreduce/ is called 34 times in the graph. As you mentioned in your earlier comment that rank 0 is doing extra work, could that be the case here, given how high the op count is? Also, is there any way to call MPI_FINALIZE within my scripts so that I can collect the full MPI_TRACE and get more insight into the problem?
I also ran my case with HOROVOD_LOG_LEVEL=DEBUG to get more insight; the last message sizes it prints are the following:

[2019-10-01 20:44:07.685793: D horovod/common/operations.cc:1190] Created response of size 42213376
[2019-10-01 20:44:07.686039: D horovod/common/operations.cc:1229] [0]: Processing 42 tensors
[2019-10-01 20:44:07.686394: D horovod/common/operations.cc:1288] [1]: Processing 42 tensors

Considering the message size of about 42 MB, I don't think there is an issue, since we have a high-bandwidth network, but I still don't know whether this could cause a problem. I tried profiling the run as well, but couldn't get much information since MPI_FINALIZE was never reached. Let me know if there is any way to trigger MPI_FINALIZE when there is a stall in the ops. I have also limited the number of hvd.rank() == 0 operations, and I have tried several HOROVOD_FUSION_THRESHOLD settings (8, 16, 64, 128, 256, 512) while keeping the cycle time fixed, but with no change in the result.
Thank you.

@Himscipy

Himscipy commented Oct 9, 2019

Hello @tgaddair @alsrgv,
I made slight changes in my program that could affect the distributed optimizer; the changes brought success with a 2-rank run, but I still face the same stalled-ops issue for train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 with higher rank counts. I have tried changing HOROVOD_STALL_CHECK_TIME_SECONDS, but nothing has affected the 60-second timeout in the following message:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0

I read the source code behind this message and found that the timeout is predefined. Is there any way to change it?

@Himscipy

Himscipy commented Nov 11, 2019

For future users coming to this thread: the issue was resolved by using the latest Horovod 0.18.2 release.
The limitations of Horovod were also reported in the following paper (link below), and the new release has the fix for them. My rank stall has been fixed by using the latest version.
https://arxiv.org/pdf/1909.11150.pdf

@apeforest
Member

For future users coming to this thread: the issue was resolved by using the latest Horovod 0.18.2 release.
The limitations of Horovod were also reported in the following paper (link below), and the new release has the fix for them. My rank stall has been fixed by using the latest version.
https://arxiv.org/pdf/1909.11150.pdf

Does that mean Horovod can now support an unequal workload across workers? Meaning one can run evaluation only on rank 0?

@tgaddair
Collaborator

Hey @apeforest, I think @Himscipy's issue was caused by a lagging worker. For training with uneven batches, @kit1980 recently landed #1058, which adds a "Join" operation for PyTorch. Would be awesome to add an implementation for MXNet as well!
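
A minimal sketch of the PyTorch Join operation, assuming a Horovod version that includes #1058 (the tiny model and the deliberately uneven batch counts are illustrative only): a rank that finishes its data early calls hvd.join() and keeps serving the remaining ranks' allreduces until everyone has joined, so the stall warning and deadlock are avoided.

import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Deliberately give each rank a different number of batches.
num_batches = 5 + hvd.rank()
for _ in range(num_batches):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).mean()
    loss.backward()
    optimizer.step()

hvd.join()   # ranks that finish early wait here instead of stalling the others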

@ifed-ucsd

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.

Can someone please comment on whether it is safe to ignore the Horovod warning in the case where it appears because one of the workers is doing something the others aren't (for instance, evaluation)? Does this cause the entire training process to go out of sync?

@tgaddair
Collaborator

Hey @ifed-ucsd, usually it's not a problem. The only time it will cause things to get out of sync is if the workers end up performing different collective operations. For example, worker-1 proceeds to the next batch and calls hvd.allreduce while worker-0 does validation and calls hvd.broadcast. As long as all the workers will eventually meet back up at the same point, it will be okay.

@czs1886

czs1886 commented Feb 19, 2020

I also got this problem when training my PyTorch model using Horovod. I partitioned the validation set by rank; why do I still get this warning? @alsrgv

@hoangcuong2011

This happens to me as well. I hope it will not cause a big problem (i.e. much lower accuracy for the model after training).
I have a quick question: how do I know which version of Horovod I am using? I googled but could not find a command. I am confident my version is newer than Horovod 0.18.2, but I just wanted to make one more check. Thanks.
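
One way to check the installed Horovod version (a general Python check, not something specific to this thread):

import horovod
print(horovod.__version__)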

@iou2much

My version is 0.19.0. It still happens when the training stage finishes and evaluation starts at every epoch.

@ehsab

ehsab commented Apr 20, 2020

I hit the very same problem, and I realized one thing: if we always make the same worker do the heavy lifting, e.g. evaluating the current model after training for a certain number of epochs, the other workers are going to complain about stalled ops and missing ranks: 0, and they end up stuck in this situation without being able to recover (looks like a Horovod bug in my eyes), or the recovery slows down the whole process.

My workaround is to make all workers do the same heavy lifting; for example, all workers have to do evaluation even if we only take the results from the first worker. I mean, if the other workers have to wait for the first one anyway, it won't hurt too much for all workers to do the same job.

If I find a better solution, I'll update my answer.

Did you find a better solution? I am trying to run pytorch_imagenet_resnet50.py in Docker and get the same error:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

Any update?

@ehsab

ehsab commented Apr 20, 2020

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.

@alsrgv I have the exact same problem. I am just trying to run the pytorch_imagenet_resnet50.py sample from Horovod. It doesn't work in some configurations, e.g. 4 servers with 4 GPUs each, but it works on 1 or 2 servers with 1 or 2 GPUs. Any idea? Does the sample work without any changes?

Thanks,

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@rrrongon

rrrongon commented Dec 25, 2023

Hi,
horovodrun --check-build
Horovod v0.28.1:

Available Frameworks:
[ ] TensorFlow
[X] PyTorch
[ ] MXNet

Available Controllers:
[X] MPI
[X] Gloo

Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo

and mpiexec --version reports 4.1.6. After installation, if I run my code in a single process using the command
horovodrun -np 1 python mnist_test.py, it works.
However, if I run with more than 1 process, it does not work anymore. I am trying to run code that used to work with Horovod; now it fails, specifically on broadcast. I have verified that allreduce works across multiple processes:

import horovod.torch as hvd
import torch

hvd.init()

tensor = torch.tensor([1.0, 2.0, 3.0])
scaled_tensor = tensor * hvd.size()
summed_tensor = hvd.allreduce(scaled_tensor)
average_tensor = summed_tensor / hvd.size()

print(f"Rank {hvd.rank()}: Original Tensor: {tensor.numpy()}")
print(f"Rank {hvd.rank()}: Scaled Tensor: {scaled_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Summed Tensor: {summed_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Average Tensor: {average_tensor.numpy()}")

The above code works when run with:
horovodrun -n 4 python hvd_allreduce.py

However, a basic broadcast shows the following error:

/horovod/torch/mpi_ops.py", line 809, in broadcast_async
    output = tensor.new(tensor.shape)
AttributeError: 'NoneType' object has no attribute 'new'

And my existing PyTorch code hangs while the parameters are being broadcast:

print("before hvd bdcast")
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
print("after hvd bdcast params")

"after hvd bdcast params" never gets printed.

Please let me know if you have any ideas.
