WARNING: One or more tensors were submitted to be reduced, gathered #403
Comments
@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while other ranks were waiting for it. In this case, it can be simply ignored - but you would be able to use resources more efficiently if you do a distributed evaluation. You can search Issues for parallel/distributed evaluation to see some examples.
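For reference, "distributed evaluation" here usually means sharding the validation set across ranks and averaging the metric with an allreduce, so no rank sits idle while one rank evaluates. A minimal sketch with Horovod for PyTorch (the toy model and dataset below are placeholders, not code from this thread):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
import horovod.torch as hvd

hvd.init()

# Toy model and validation data, just to keep the sketch self-contained.
model = nn.Linear(10, 2)
val_dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

# Each rank evaluates only its own shard of the validation set.
val_sampler = DistributedSampler(val_dataset, num_replicas=hvd.size(), rank=hvd.rank())
val_loader = DataLoader(val_dataset, batch_size=32, sampler=val_sampler)

correct, total = 0, 0
with torch.no_grad():
    for data, target in val_loader:
        pred = model(data).argmax(dim=1)
        correct += (pred == target).sum().item()
        total += target.size(0)

# Average the per-rank accuracy across ranks; hvd.allreduce averages by default.
accuracy = hvd.allreduce(torch.tensor(correct / total), name='val_accuracy')
if hvd.rank() == 0:
    print(f'distributed validation accuracy: {accuracy.item():.4f}')
```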
It happened during training. What I did was remove the evaluation during training, but I still feel the computing power does not scale up.
I have 4 nodes with 1 GPU in each node, so in total I have 4 GPUs.
@alsrgv I have encountered this problem in my training, too. I always get this log (mostly from rank 0), and it hangs and keeps showing this waiting log, just like a deadlock.
@winwinJJiang @alsrgv @meteortwinkle I have a similar problem on 4 or 8 nodes...
I think we had the problem of hanging on 16 nodes as well... 16 P100 nodes. I posted this earlier with Alex and Travis. I've re-installed everything with the latest but still get the error. I think it's some sort of MPI issue. I'm not running Open MPI 3.0 but 2.0.
Are you using NCCL 2.0?
@bapriddy I am using NCCL 2.1 and Open MPI 3.0.0.
@bapriddy Actually, I do not see the computation accelerating, and I am only using 4 GPUs.
@winwinJJiang I was asking @meteortwinkle because their system has 4 nodes (one with 8 V100 GPUs). You have one GPU per node, so NCCL 2 might not help much, since it is designed for multiple GPUs on multiple nodes (along with NVLink).
@bapriddy, @winwinJJiang, @meteortwinkle, can you share what kind of model you're training, how many batches per second you see with 1 node and with 4 nodes, and what kind of network hardware you have? @winwinJJiang, did the warning go away after you removed evaluation from your script? @meteortwinkle, in a heterogeneous situation like yours, computation will be bounded by the slowest GPU. If it's practical, I would separate training jobs on the V100s and the other GPUs and run them in parallel for hyperparameter search. If not, you can try a larger batch on the V100s and a smaller batch on the other GPUs, but if you use batch norm it may mess things up quite a bit. @bapriddy, do you run distributed evaluation in your training? It seems that in the case of your warning, rank 6 took longer to run its part of the evaluation and was late to the party.
@winwinJJiang
@alsrgv I am training the Glow model, as in https://github.com/openai/glow. I am training with a batch size of 64, and the warning is still there after I removed the evaluation script.
@winwinJJiang, how often does it happen? Does training progress after the warning? Can you share answers to the other questions - performance with 1 GPU and with multiple GPUs, and what network hardware are you using?
@alsrgv @bapriddy @winwinJJiang
@meteortwinkle Hi, you can try the following command:
@meteortwinkle, how do you feed the input data - is it coming from local disk, or from some network storage? I see that you were able to run the tests in #416. Do you still see hangs? What if you try with synthetic data?
@winwinJJiang hi, I am training Glow too and I hit the same problem with many Docker containers. How did you fix it?
I hit the same problem using Docker with multiple machines and multiple GPUs. The program hangs at the beginning. This is my shell command:
Are there any solutions?
@LuBingtan In my case it was caused by some of my code using "hvd.size()". If you have a similar problem, you could check your Horovod-related code blocks, such as calls to hvd.size().
I see the same thing; it happens pretty regularly. The data is coming from network storage. Horovod is running under Open MPI 3.1.2, on 2 servers with 4 GPUs (Tesla V100) each, without Docker. The training does not progress after the warning.
After the warning:
I also get a warning from Open MPI:
Not sure if it is related, but the data is read from HDF5 with h5py (not MPI).
May be related, but I found that Horovod could get stuck like this if there is not enough GPU memory,
together with TensorFlow OOM messages like:
It got stuck only when I ran it with 16 machines, but not locally. At first I ignored the TensorFlow OOM messages, because I know that TensorFlow does not necessarily fail after such messages -- it can often still run. My successful local run also produced such messages. When TensorFlow really hits an unrecoverable OOM, it throws a much larger error with a stack trace and allocation status, and exits. However, none of the jobs exited.
I hit the very same problem and I realized one thing: if we always make the same worker do the heavy lifting, e.g. evaluating the current model after training for a certain number of epochs, the other workers are going to complain with this stall warning. My workaround is to make all workers do the same heavy lifting; for example, all workers have to do evaluation even if we only take the results from the first worker. I mean, if the other workers have to wait for the first one anyway, it won't hurt too much for all workers to do the same job. If I find a better solution, I'll update my answer.
@nicolefinnie, in your case the behavior you're seeing is what we would expect. If one of your workers is doing evaluation, but the other workers have moved on to train in the next epoch, then they will necessarily get stuck because rank 0 is busy doing evaluation. In the ring-allreduce algorithm, the other workers cannot proceed while one of the workers is behind. You wouldn't want this behavior either, because it would effectively alter the batch size between batches, which invalidates your learning rate. Your workaround of having all workers perform validation is a good approach. Another option is to use an MPI Barrier to wait for rank 0 to finish the evaluation step before the other workers proceed to the next epoch. Hope that makes sense.
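As a rough sketch of that second option, assuming Horovod for PyTorch and hypothetical train_one_epoch/evaluate stand-ins: since every rank has to participate in each collective, an allreduce on a dummy tensor behaves like a synchronization point, so the other ranks simply wait there until rank 0 finishes evaluating.

```python
import time
import torch
import horovod.torch as hvd

hvd.init()

def barrier():
    # Every rank must call this; the allreduce only completes once all ranks
    # have submitted the tensor, so it acts like a synchronization barrier.
    hvd.allreduce(torch.tensor(0.0), name='epoch_barrier')

def train_one_epoch(epoch):   # hypothetical stand-in for the real training loop
    time.sleep(0.1)

def evaluate(epoch):          # hypothetical stand-in for the rank-0-only evaluation
    time.sleep(1.0)

for epoch in range(3):
    train_one_epoch(epoch)
    if hvd.rank() == 0:
        evaluate(epoch)
    barrier()                 # other ranks wait here until rank 0 is done evaluating
```

With this pattern the stall warning can still be printed if evaluation takes longer than the 60-second stall check, but the ranks should not deadlock because they all eventually reach the same allreduce.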
@tgaddair Thanks for your explanation. However, I still hit this problem after a certain amount of time. I wonder whether stretching GPU memory as far as possible is a cause too. My Horovod processes have been stuck for hours without consuming any CPUs/GPUs. Reducing the batch size didn't help, and the deadlock still happens. It seems the other workers are still waiting for worker 0, and then worker 0 is waiting for the other workers. I'll investigate more and keep you guys updated. (with 4 GPUs in the log)
@nicolefinnie, is it possible that you have an unequal number of iterations per epoch on different workers?
@alsrgv Sorry for getting back to you late. No, they're equally sharded (at least the log shows that). Another underlying issue could be that one GPU is on one socket and the other 3 GPUs are on another socket, so the lonely one has to rely on the CPU to talk to the other 3. It's just my pure assumption after having looked into the GPU topology of the server that the job was run on. It does show that the other 3 workers are always waiting for the lonely one during training. Still, good things come to those GPU workers who wait.
@nicolefinnie, your stall warning looks interesting. It basically says that all ranks except 0 submitted the broadcast, while rank 0 did not.
@alsrgv, Thanks for your quick response. I appreciate that. I did broadcast on all ranks by passing the root rank = 0 as follows: |
@nicolefinnie, is it possible to publish a minimal repro of your problem on GitHub?
@nicolefinnie I am facing similar issues while training a large network such as VGG16, as you mentioned in your last comment - any ideas or suggestions? @alsrgv I am trying to run the following example posted on TFP with Horovod, but the training gets stuck with the message shown below. I understand that Horovod has been used successfully for large networks in tf_benchmark; I even tried porting the VGG-TFP model into the benchmark, but there are some limitations with some of the TFP layers, so I was trying to simply add Horovod to the TFP code. It would be really helpful if you could suggest things to try to fix this. I also tried @tgaddair's suggestion below to check the model consistency on all the ranks.
Hey @Himscipy, looks like your version of Horovod is out of date. Can you try upgrading to the latest version? If for whatever reason you're unable to upgrade, you can generate a timeline by using the following environment variable:
Hey @tgaddair, thank you for the reply. I applied your recommendation earlier to fix the Horovod timeline; what are your thoughts on the other error? I made trace plots for a 2-rank run and a 1-rank run; as you can see, for the 1-rank run the negotiate-broadcast time is negligible, but for 2 ranks it takes a lot of time. Any thoughts on that? The 1-rank run finishes successfully, while the other hangs on the broadcast negotiation.
Hey @Himscipy, can you create a gist of the code with your modifications? I'll see if I can reproduce the hang in my environment. Also, to clarify, you were able to successfully run the standard Horovod examples?
Hey @tgaddair, I cannot share the code publicly, but I can create a gist. Even then I won't be able to post the link here, since a secret gist is still accessible to anyone who has the URL. So let me know how to share the gist with you. Thank you.
Hey @tgaddair, I created a minimal code gist for you. My TensorFlow and Horovod versions are TensorFlow 1.13.1 and Horovod 0.16.3.
Hello @alsrgv @tgaddair, were you able to reproduce the issue from the gist I shared with you earlier? I am still facing this issue: the code stalls on Allreduce operations. Below is the snippet of the issue again.
Hey @Himscipy, thanks for putting the gist together. Looking it over, there are two possible issues I see. The first is that it looks like rank 0 is doing a lot of extra work, including validation, which may be slowing it down compared to the other workers. That would be consistent with the first error message you reported, where rank 0 was the one missing from the broadcast. The second possible issue is that you're somehow training with an uneven sample count per worker, though it looks like that shouldn't be the case given that you have a hook to stop after a pre-determined step. Can you verify that every worker has more than that number of steps available to process? As a first test, I would suggest trying to comment out much of the …
Hi @tgaddair, thank you for the response. I have tried commenting out most of the …
Hi @tgaddair, does …
Hi @tgaddair,
Considering the message size of 42 MB, I don't think there is an issue, since we have a high-bandwidth network, but I still don't know whether this could cause a problem. I also tried profiling the run, but couldn't get much information since MPI_FINALIZE was never reached. Let me know if there is any way to trigger MPI_FINALIZE when there is a stall in the ops. I have also limited the number of …
Hello @tgaddair @alsrgv,
I read the source code behind the message and found that this time is predefined. Is there any way to change the timeout?
For future users coming to this thread: the issue was resolved by using the latest Horovod version, 0.18.2.
Does that mean Horovod can support an unequal workload across workers now? Meaning one can run evaluation only on rank 0?
Hey @apeforest, I think @Himscipy's issue was caused by a lagging worker. For training with uneven batches, @kit1980 recently landed #1058, which adds a "Join" operation for PyTorch. Would be awesome to add an implementation for MXNet as well!
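If I am reading #1058 right, the PyTorch side exposes this as hvd.join(). A rough sketch of how uneven per-rank batch counts would be handled (toy model and random data, not code from this thread):

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())

# Deliberately uneven workload: each rank processes a different number of batches.
num_batches = 5 + hvd.rank()

for _ in range(num_batches):
    data, target = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = F.mse_loss(model(data), target)
    loss.backward()
    optimizer.step()

# Ranks that run out of data keep serving the collectives of the slower ranks
# until every rank has called join(), so nobody stalls waiting for missing tensors.
hvd.join()
```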
Can someone please comment on whether it is safe to ignore the horovod warning in the case that it appears when one of the workers is doing something that the others aren't (for instance, evaluation)? Does this cause the entire training process to go out of sync?
Hey @ifed-ucsd, usually it's not a problem. The only time it will cause things to get out of sync is if the workers end up performing different collective operations - for example, if worker-1 proceeds to the next batch and submits its collectives while worker-0 is still submitting different ones.
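To make "different collective operations" concrete, here is a toy illustration (not code from this thread) of the pattern that desynchronizes ranks versus the safe one:

```python
import torch
import horovod.torch as hvd

hvd.init()

metric = torch.tensor(float(hvd.rank()))

# Problematic pattern: only rank 0 submits this allreduce, so the other ranks'
# next collective gets matched against it and every rank ends up waiting.
# if hvd.rank() == 0:
#     hvd.allreduce(metric, name='val_metric')

# Safe pattern: every rank submits the same collectives in the same order,
# even if only rank 0 actually uses the result.
avg_metric = hvd.allreduce(metric, name='val_metric')
if hvd.rank() == 0:
    print(f'average metric across ranks: {avg_metric.item()}')
```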
I also got this problem when training my PyTorch model using Horovod. I partitioned the validation set by rank, so why do I still get this warning? @alsrgv
This happens to me as well. I hope it does not cause a big problem (i.e. much lower accuracy for the model after training).
My version is 0.19.0. It still happens when the training stage finishes and evaluation starts for every epoch.
Did you find a better solution? I am trying to run pytorch_imagenet_resnet50.py in docker and get the same error:
Any update?
@alsrgv I have the exact same problem. I am just trying to run the pytorch_imagenet_resnet50.py sample from Horovod. It doesn't work in some situations, e.g. 4 servers and 4 GPUs, but it works with 1 or 2 servers and 1 or 2 GPUs. Any idea? Does the sample work without any changes? Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi,

Available Frameworks:
Available Controllers:
Available Tensor Operations:

and the following snippet:

```python
import torch
import horovod.torch as hvd  # import added; the original snippet omitted it

hvd.init()

tensor = torch.tensor([1.0, 2.0, 3.0])
scaled_tensor = tensor * hvd.size()
# Note: hvd.allreduce() averages across ranks by default, so pass op=hvd.Sum
# if a true sum is intended here.
summed_tensor = hvd.allreduce(scaled_tensor)
average_tensor = summed_tensor / hvd.size()

print(f"Rank {hvd.rank()}: Original Tensor: {tensor.numpy()}")
print(f"Rank {hvd.rank()}: Scaled Tensor: {scaled_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Summed Tensor: {summed_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Average Tensor: {average_tensor.numpy()}")
```
However, a basic broadcast shows the following error:

```
output = tensor.new(tensor.shape)
AttributeError: 'NoneType' object has no attribute 'new'
```

And my existing PyTorch code hangs while the parameters are being broadcast:

```python
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
print("after hvd bdcast params")
```
Please let me know if you have any ideas.
Hi all,
I got a warning like this. I believe it slows down the GPU computation.
I am using 4 nodes and 4 GPUs.
Any suggestions will be highly welcome!
Thank you!
WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_AddN_2_0 [missing ranks: 0]