
WARNING: One or more tensors were submitted to be reduced, gathered #403

Closed
winwinJJiang opened this issue Jul 26, 2018 · 56 comments

@winwinJJiang

Hi all,

I got a warning like this, and I believe it slows down the GPU computation.
I am using 4 nodes and 4 GPUs.
Any suggestions would be highly welcome!

Thank you!

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_last_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_2_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_center_add_1_grad_Reshape_1_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_actnorm_scale_mul_2_grad_Reshape_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_model_0_0_f1_l_1_Conv2D_1_grad_Conv2DBackpropFilter_0 [missing ranks: 0]
HorovodAllreduce_gradients_197_AddN_2_0 [missing ranks: 0]

@alsrgv
Member

alsrgv commented Jul 26, 2018

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.
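
A minimal sketch of this distributed-evaluation idea (not code from this issue; evaluate_my_shard() is a hypothetical placeholder): each rank evaluates its own shard of the validation data and the per-rank metric is averaged with hvd.allreduce, so no rank sits idle while rank 0 evaluates alone.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# hvd.allreduce averages across ranks by default, so running this op yields
# the mean validation metric over all ranks.
local_metric = tf.placeholder(tf.float32, shape=[])
global_metric = hvd.allreduce(local_metric)

def evaluate_my_shard():
    # Placeholder for evaluating this rank's shard of the validation set.
    return 0.0

with tf.Session() as sess:
    metric = evaluate_my_shard()
    print(hvd.rank(), sess.run(global_metric, feed_dict={local_metric: metric}))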

@winwinJJiang
Author

It happened during training. What I have done is remove the evaluation during training.
I hope the warning will be gone.

But I do feel the compute is not scaling up, though.

@winwinJJiang
Author

I have 4 nodes with 1 GPU each, so 4 GPUs in total.
Do you know if there are any special configurations I need to use to speed up the computation?

@meteortwinkle

meteortwinkle commented Jul 27, 2018

@alsrgv I have encountered this problem in my training, too.
My training environment is:
4 nodes (each with 8 V100 GPUs)
ResNet-50
mpirun --allow-run-as-root -np 32 -H node1:8,node2:8,node3:8,node4:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x LD_LIBRARY_PATH -x PATH -mca plm_rsh_args "-p 13579" python train.py

I always get this log (mostly rank 0):
HVDAllReduce/HorovodAllreduce_gradients_group2_block0_conv1_bn_FusedBatchNormV2_grad_FusedBatchNormGradV2_1 [missing ranks: 0]
HVDAllReduce/HorovodAllreduce_gradients_group2_block0_conv1_bn_FusedBatchNormV2_grad_FusedBatchNormGradV2_2 [missing ranks: 0]

And it will hang and keep showing this waiting log, just like a deadlock.
On the 16-GPU, 4-node training (4 V100 GPUs per node), it runs, but very slowly.
Could you give me some advice? Thanks a lot.

@bapriddy
Contributor

bapriddy commented Jul 27, 2018

@winwinJJiang @alsrgv @meteortwinkle

I have a similar problem on 4 or 8 nodes...

Train Epoch #6: 29% 367/1252 [12:15<13:53, 1.06it/s, loss=2.8, accuracy=39.7]WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: allreduce.conv1.weight [missing ranks: 6] allreduce.bn1.weight [missing ranks: 6] allreduce.bn1.bias [missing ranks: 6] ...

I think we had the problem of hanging on 16 nodes as well - 16 P100 nodes. I posted this earlier with Alex and Travis. I've re-installed everything with the latest versions but still get the error. I think it's some sort of MPI issue. I'm not running Open MPI 3.0 but 2.0.

@bapriddy
Contributor

@meteortwinkle

Are you using NCCL 2.0?

@winwinJJiang
Author

@bapriddy I am using NCCL 2.1 and Open MPI 3.0.0.

@winwinJJiang
Author

@bapriddy Actually, I do not see the computation accelerating, and I am using only 4 GPUs.

@bapriddy
Contributor

@winwinJJiang I was asking @meteortwinkle because their system has 4 nodes (each with 8 V100 GPUs).

You have one GPU per node, so NCCL 2 might not help as much, since it works with multiple GPUs on multiple nodes (along with NVLink).

@alsrgv
Member

alsrgv commented Jul 28, 2018

@bapriddy, @winwinJJiang, @meteortwinkle, can you share what kind of model you're training, how many batches per second you see with 1 node and with 4 nodes, and what kind of network hardware you have?

@winwinJJiang, did the warning go away after you removed evaluation from your script?

@meteortwinkle, in a heterogeneous situation like yours, computation will be bound by the slowest GPU. If it's practical, I would separate the training jobs on the V100s and the other GPUs and run them in parallel for hyperparameter search. If not, you can try a larger batch on the V100s and a smaller batch on the other GPUs, but if you use batch norm that may mess things up quite a bit.

@bapriddy, do you run distributed evaluation in your training? It seems that in the case of your warning, rank 6 took longer to run its part of the evaluation and was late to the party.

@meteortwinkle

@winwinJJiang
NCCL 2.12.2
Open MPI 3.1.1
and GDR enabled

@winwinJJiang
Author

@alsrgv I am training the Glow model, as in https://github.com/openai/glow.

I am training with a batch size of 64, and the warning is still there after I removed the evaluation script.

@alsrgv
Member

alsrgv commented Jul 28, 2018

@winwinJJiang, how often does it happen? Does training progress after the warning? Can you share answers to the other questions - performance with 1 GPU and with multiple GPUs, and what network hardware are you using?

@meteortwinkle

meteortwinkle commented Jul 30, 2018

@alsrgv @bapriddy @winwinJJiang
Our model is ResNet-50, and the mini-batch size is 32 (real data). But the waiting problem comes up often.
What do you mean by the other GPUs? Our GPUs are all V100s.
Also, sometimes the training does run, but the scaling efficiency is bad: a step may even cost several seconds.
During the training process, I found the occupancy of some CPUs was 100%, even 200%.
Is the rank 0 process too busy to process the MPI requests in Horovod? Or does Open MPI lose some MPI messages? Or is there something I am missing?
Is there any Open MPI performance tool that can be used to measure MPI performance?
Looking forward to your reply.

@shenqixiaojiang

@meteortwinkle Hi, you can try the following command:

mpirun -np 16 \
    -H server1:8,server2:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME=eth01 \
    -mca pml ob1 \
    -mca btl_tcp_if_include eth01 \
    -mca btl ^openib -x NCCL_IB_DISABLE=1 \

@alsrgv
Member

alsrgv commented Aug 3, 2018

@meteortwinkle, how do you feed the input data - is it coming from local disk, or from some network storage? I see that you were able to run the tests in #416. Do you still see hangs? What if you try with synthetic data?

@iRmantou

@winwinJJiang Hi, I am training Glow too, and I hit the same problem with multiple Docker containers. How did you fix it?

@LuBingtan

LuBingtan commented Sep 5, 2018

I hit the same problem using Docker with multiple machines and multiple GPUs. The program hangs at the very beginning. This is my shell command:

mpirun -np 3 \
    -H localhost:1,server:2 \
    --allow-run-as-root \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME=^docker0 \
    -mca pml ob1 -mca btl ^openib \
    -mca btl_tcp_if_exclude lo,docker0 \
    -mca plm_rsh_args "-p 12345" \
    python tensorflow_mnist.py

Are there any solutions?

@iRmantou

iRmantou commented Sep 5, 2018

@LuBingtan In my case it was caused by some of my code using hvd.size(). If you have a similar problem, you could check your Horovod-related code blocks, such as calls to hvd.size().

@denisovlev

denisovlev commented Sep 7, 2018

I see the same thing; it happens pretty regularly. The data is coming from network storage.

Horovod is running under Open MPI 3.1.2, on 2 servers with 4 GPUs (Tesla V100) each, without Docker.

The training does not progress after the warning.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_43_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_44_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_38_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_36_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_34_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_33_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_45_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_32_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_25_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_23_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_21_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_19_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_12_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_10_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_8_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_7_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_20_0 [missing ranks: 6]
training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_6_0 [missing ranks: 6]

After the warning, nvidia-smi shows 0% utilization on both machines, even though the processes are still running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   38C    P0    51W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   41C    P0    51W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   43C    P0    52W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   45C    P0    55W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    106386      C   python                                      9952MiB |
|    1    106387      C   python                                      9952MiB |
|    2    106388      C   python                                     10976MiB |
|    3    106389      C   python                                      9952MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   36C    P0    52W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   40C    P0    54W / 300W |  10986MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   37C    P0    49W / 300W |   9962MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |  10090MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     70399      C   python                                     10976MiB |
|    1     70400      C   python                                     10976MiB |
|    2     70402      C   python                                      9952MiB |
|    3     70403      C   python                                     10080MiB |
+-----------------------------------------------------------------------------+

I also get a warning from Open MPI:

[warn] event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.

Not sure if it is related, but the data is read from HDF5 with h5py (not MPI-parallel HDF5).

@ppwwyyxx
Contributor

ppwwyyxx commented Jan 8, 2019

This may be related, but I found that Horovod can get stuck like this if there is not enough GPU memory.
I have a job which ran well last week, but yesterday when I tried to run it, it always got stuck at the first allreduce, with warning messages like:

HVDAllReduce/HorovodAllreduce_gradients_linear_BiasAdd_grad_BiasAddGrad_0 [missing ranks:[1,0]

together with TensorFlow OOM messages like:

2019-01-08 02:00:37.838302: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.62G (8187171072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

It got stuck only when I ran it on 16 machines, not locally.

At first I ignored the TensorFlow OOM messages, because I know that TensorFlow does not necessarily fail after such messages -- it can often still run. My successful local run also has such messages. And when TensorFlow really hits an unrecoverable OOM, it throws a much larger error with a stack trace and allocation status, and exits. However, none of the jobs exited.
However, it turned out that memory does seem to be the reason for the hang. All my failing runs had 16 GB of GPU memory, and the job succeeded after I switched to GPUs with 32 GB of memory. My successful run last week was on a combination of 16 GB and 32 GB GPUs (with rank 0 on a 32 GB GPU). (However, my successful local runs also used 16 GB of memory; perhaps different allreduce implementations cost different amounts of memory?)

@nicolefinnie

I hit the very same problem, and I realized one thing: if we always make the same worker do the heavy lifting, e.g. evaluating the current model after training for a certain number of epochs, the other workers are going to complain about stalled ops and missing ranks: 0, and they end up stuck in this situation without being able to recover (looks like a Horovod bug in my eyes), or the recovery slows down the whole process.

My workaround is to make all workers do the same heavy lifting; for example, all workers have to do evaluation even if we only take the results from the first worker. I mean, if the other workers have to wait for the first one anyway, it won't hurt too much for all workers to do the same job.

If I find a better solution, I'll update my answer.
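
A minimal sketch of this workaround (evaluate() is a hypothetical placeholder): every rank runs the same evaluation pass so all ranks issue the same collective ops, and only rank 0 reports the result.

import horovod.tensorflow as hvd

hvd.init()

def evaluate():
    # Placeholder for the real validation pass, identical on every rank.
    return 0.0

val_metric = evaluate()            # executed on every rank, not just rank 0
if hvd.rank() == 0:
    print("validation metric:", val_metric)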

@tgaddair
Collaborator

tgaddair commented Apr 1, 2019

@nicolefinnie, in your case the behavior you're seeing is what we would expect. If one of your workers is doing evaluation, but the other workers have moved on to train in the next epoch, then they will necessarily get stuck because rank 0 is busy doing evaluation.

In the ring-allreduce algorithm, the other workers cannot proceed while one of the workers is behind. You wouldn't want this behavior either, because it would effectively alter the batch size between batches, which invalidates your learning rate.

Your workaround of having all workers perform validation is a good approach. Another option is to use an MPI Barrier to wait for rank 0 to finish the evaluation step before the other workers proceed to the next epoch.

Hope that makes sense.
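
A minimal sketch of the barrier option, assuming mpi4py is installed alongside Horovod (train_one_epoch() and evaluate() are placeholders): rank 0 runs the per-epoch evaluation while the other ranks wait at the barrier instead of racing ahead into the next epoch's allreduces.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def train_one_epoch():
    pass   # placeholder for the real per-epoch training loop (runs on every rank)

def evaluate():
    pass   # placeholder for the real validation pass

for epoch in range(3):
    train_one_epoch()
    if comm.Get_rank() == 0:
        evaluate()       # only rank 0 evaluates
    comm.Barrier()       # everyone waits for rank 0 before the next epoch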

@nicolefinnie

nicolefinnie commented Apr 1, 2019

@tgaddair Thanks for your explanation. However, I still hit this problem after a certain amount of time. I wonder if stretching memory usage as far as possible is a cause too. My Horovod processes have been stuck for hours without consuming any CPU/GPU. Reducing the batch size didn't help, and the deadlock still happens. It seems the other workers are still waiting for worker 0, and then worker 0 is waiting for the other workers. I'll investigate more and keep you updated.

(with 4 GPUs in the log)

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:
HorovodBroadcast_conv6_weights_0_7 [missing ranks: 0]
...
...
DistributedMomentumOptimizer_Allreduce/HorovodAllreduce_tuple_control_dependency_28_0 [missing ranks: 1, 2, 3]

@alsrgv
Member

alsrgv commented Apr 1, 2019

@nicolefinnie, is it possible that you have an unequal number of iterations per epoch on different workers?

@nicolefinnie

@alsrgv Sorry for getting back to you late. No, they're equally sharded (at least the log shows it that way). Another underlying issue could be that one GPU is on one socket and the other 3 GPUs are on another socket, so the slow one has to rely on the CPU to talk to the other 3. It's just my assumption after having looked into the GPU topology of the server the job was run on. However, the logs show the other 3 workers are always waiting for the lonely one during training. Then again, good things come to those GPU workers who wait.

@alsrgv
Member

alsrgv commented Apr 5, 2019

@nicolefinnie, your stall warning looks interesting. It basically says that all ranks except 0 ran hvd.broadcast() while rank 0 went ahead and started training. Maybe you have logic that broadcasts only on non-zero ranks somewhere in your code? Broadcast should be executed on all ranks.

@nicolefinnie

@alsrgv, thanks for your quick response, I appreciate it. I did broadcast on all ranks by passing root rank = 0 as follows:
session.run(hvd.broadcast_global_variables(0)) (this is run on all ranks)
However, interestingly, I happened to call broadcast multiple times in the same session on all ranks, and that's when I hit severe deadlocks that couldn't resolve themselves, as you saw in the very first trace. After I "fixed" the problem by only broadcasting once on all ranks, the symptom changed - the other ranks always wait for rank 0. That's why I suspect the GPU topology is the culprit: rank 0 is physically located on another socket, whereas ranks 1, 2, and 3 are located on the same socket.
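
A minimal sketch of broadcasting exactly once, for TF1 graph mode with horovod.tensorflow (the single variable here is just an illustrative stand-in for a real model): the broadcast op is built and run a single time, on every rank, right after variable initialization, rather than being run repeatedly in the same session.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

w = tf.get_variable("w", shape=[10], initializer=tf.random_normal_initializer())
init_op = tf.global_variables_initializer()
bcast_op = hvd.broadcast_global_variables(0)   # root rank 0

with tf.Session() as sess:
    sess.run(init_op)
    sess.run(bcast_op)   # executed exactly once, on all ranks
    # ... training loop with an hvd.DistributedOptimizer-wrapped optimizer ...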

@alsrgv
Member

alsrgv commented Apr 9, 2019

@nicolefinnie, is it possible to publish a minimal repro of your problem on GitHub?

@Himscipy

Himscipy commented Sep 13, 2019

@nicolefinnie I am facing similar issues while training a large network such as VGG16, as you mentioned in your last comment. Any ideas or suggestions? @alsrgv I am trying to run the example posted on TFP with Horovod, but the training gets stuck with the warning shown below. I understand that Horovod has been used successfully for large networks in tf_benchmark; I even tried porting the VGG-TFP model into the benchmark, but there are some limitations with some of the TFP layers, so I was trying to simply add Horovod to the TFP code. It would be really helpful if you could suggest things to try to fix this. I also tried @tgaddair's suggestion to check the model consistency on all ranks:
for i, v in enumerate(tf.global_variables()): print(hvd.rank(), i, v)
I also tried to look into the Horovod timeline, but I am receiving the following error:
horovodrun: error: unrecognized arguments: --timeline-filename
My command for submission is:
horovodrun -np 2 --timeline-filename ~/timeline.json -H localhost:2 python cifar10_bnn.py

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops:HorovodBroadcast_global_step_0 [missing ranks: 0] [2019-09-12 21:25:18.467757: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_1_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.467935: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468123: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.468308: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_1_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468493: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_bias_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468680: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_kernel_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.468864: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.469051: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_8_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469244: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469430: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_beta_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469609: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.469835: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_gamma_0 [missing ranks: 0] [2019-09-12 21:25:18.470024: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_1_kernel_posterior_untransformed_scale_0 [missing ranks: 0] [2019-09-12 21:25:18.470221: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_5_gamma_0 [missing ranks: 0] [2019-09-12 21:25:18.470404: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_bias_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.470594: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_gamma_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.470781: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_7_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.470968: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471155: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_kernel_posterior_untransformed_scale_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471341: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.471527: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.471715: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.471900: W horovod/common/operations.cc:588] 
HorovodBroadcast_batch_normalization_v1_2_beta_0 [missing ranks: 0] [2019-09-12 21:25:18.472089: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_0_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472279: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.472467: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472656: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_3_bias_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.472842: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473029: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_3_kernel_posterior_loc_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.473214: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_4_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473403: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_3_kernel_posterior_loc_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473589: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_2_beta_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.473792: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_2_moving_variance_0 [missing ranks: 0] [2019-09-12 21:25:18.473985: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_6_beta_Adam_1_0 [missing ranks: 0] [2019-09-12 21:25:18.474171: W horovod/common/operations.cc:588] HorovodBroadcast_Dense_I_4_kernel_posterior_untransformed_scale_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.474361: W horovod/common/operations.cc:588] HorovodBroadcast_batch_normalization_v1_3_gamma_Adam_0 [missing ranks: 0] [2019-09-12 21:25:18.474533: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_I_4_kernel_posterior_loc_0 [missing ranks: 0] [2019-09-12 21:25:18.474730: W horovod/common/operations.cc:588] HorovodBroadcast_ConV_II_2_bias_posterior_loc_Adam_0 [missing ranks: 0]

@tgaddair
Collaborator

tgaddair commented Sep 13, 2019

Hey @Himscipy, looks like your version of Horovod is out of date. Can you try upgrading to the latest version? If for whatever reason you're unable to upgrade, you can generate a timeline by using the following environment variable:

HOROVOD_TIMELINE=/path/to/timeline.json horovodrun ...

@Himscipy

Himscipy commented Sep 14, 2019

Hey @tgaddair, thank you for the reply. Your earlier recommendation fixed the Horovod timeline. What are your thoughts on the other error? I made trace plots for a 2-rank run and a 1-rank run; as you can see, for the 1-rank run the broadcast negotiation is negligible, but for 2 ranks it takes a lot of time. Any thoughts on that? The 1-rank run completes successfully, while the other hangs on the broadcast negotiation.

Screenshot from 2019-09-13 14-11-07

Screenshot from 2019-09-13 12-14-35

@tgaddair
Collaborator

Hey @Himscipy, can you create a gist of the code with your modifications? I'll see if I can reproduce the hang in my environment.

Also, to clarify, you were able to successfully run the Horovod examples like tensorflow_mnist.py, correct?

@Himscipy

Himscipy commented Sep 16, 2019

Hey @tgaddair, I cannot share the code publicly. I can create a gist, but in that case I still cannot post the link here, since pasting the URL of a secret gist would make the code accessible to anyone with the URL. So let me know how to share the gist with you. Thank you.

@Himscipy

Hey @tgaddair, I created a minimal code gist for you; my TensorFlow and Horovod versions are TensorFlow 1.13.1 and Horovod 0.16.3.
Also, tensorflow_mnist.py works fine for me. The broadcast is taking a lot of time, and I get stalled ranks even with 2 ranks. I am using CPUs for the run. Let me know if you can reproduce the issue.
Thank you.

@Himscipy

Hello @alsrgv @tgaddair, were you able to reproduce the issue from the gist I shared with you earlier? I am still facing this issue: the code stalls on allreduce operations. Below is a snippet of the issue again.

[2019-09-24 16:48:59.102525: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:48:59.102792: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:48:59.103007: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:49:59.104858: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:49:59.105124: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:49:59.105323: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:50:59.106116: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0 [missing ranks: 0, 2, 3]
[2019-09-24 16:50:59.106378: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 [missing ranks: 1, 2]
[2019-09-24 16:50:59.106583: W horovod/common/operations.cc:588] train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_2_0 [missing ranks: 0, 1, 3]
[2019-09-24 16:51:59.111789: W horovod/common/operations.cc:588] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 


@tgaddair
Collaborator

Hey @Himscipy, thanks for putting the gist together. Looking it over, there are two possible issues I see. The first is that it looks like rank 0 is doing a lot of extra work, including validation, which may be slowing it down compared to the other workers. That would be consistent with the first error message you reported, where rank 0 was the one missing from the broadcast.

The second possible issue is that you're somehow training with uneven sample counts per worker. Though it looks like that shouldn't be the case, given that you have a hook to stop after a pre-determined step. Can you verify that every worker has more than that number of steps available to process?

I would suggest, as a first test, commenting out much of the if hvd.rank() == 0 code to see if that fixes some of the issues with the long broadcast negotiations.

@Himscipy

Hi @tgaddair, thank you for the response. I have tried commenting out most of the hvd.rank() == 0 operations in the code, but it didn't help; I still find the run stalling. I also changed the data pipeline to a simple batch generator and modified the network to have fewer parameters to train. The actual number of parameters is ~19 million; I modified the architecture to have 17M parameters. With the reduced parameter count, the distributed training works fine with 2 ranks even without putting in barriers, but it stalls with a higher number of MPI ranks (4, 8, 16), and for the 19M-parameter case it stalls for every rank count. I did some profiling for the 2-rank, 17M-parameter case; the profiler reported that the run is back-end bound, which means the following:
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. During each cycle, the front-end generates up to two of these operations, places them into pipeline slots and moves them through the back-end. The actual number of retired pipeline slots containing useful work rarely equals this maximum. This can be because the back-end was not prepared to accept more operations of a certain kind (back-end bound execution). Back-end bound execution may be due to long-latency operations or other contention for execution resources, such as too many operations being directed to a single execution port.
What do you think: is it because rank 0 is doing the allreduce operations for large tensors (the parameters to train), leading to too many operations, and could this be aggravated by increasing the number of workers and the number of parameters?

@Himscipy

Hi @tgaddair, does HOROVOD_STALL_CHECK_TIME_SECONDS also affect the allreduce stall check? I am not seeing any change in the stalled-ops warning, which still reports 60 seconds, even after setting HOROVOD_STALL_CHECK_TIME_SECONDS=600.

@Himscipy

Himscipy commented Oct 3, 2019

Hi @tgaddair,
I am still receiving the stalled-ops error for train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_Add. I checked the total number of FLOPs needed for my network (~18M parameters): about 564.46 million FLOPs. The operation train/DistributedAdamOptimizer_Allreduce/ is called 34 times in the graph. As you mentioned in your earlier comment that rank 0 is doing extra work, could that be the case here, given how high the op count is? Also, is there any way to call MPI_FINALIZE within my scripts so that I can collect the full MPI_TRACE and get more insight into the problem?
I also ran my case with HOROVOD_LOG_LEVEL=DEBUG to get more insight; the last message sizes it prints are the following:

[2019-10-01 20:44:07.685793: D horovod/common/operations.cc:1190] Created response of size 42213376
[2019-10-01 20:44:07.686039: D horovod/common/operations.cc:1229] [0]: Processing 42 tensors
[2019-10-01 20:44:07.686394: D horovod/common/operations.cc:1288] [1]: Processing 42 tensors

Considering the message size of about 42 MB, I don't think there is an issue, since we have a high-bandwidth network, but I still don't know whether this could cause a problem. I tried profiling the run as well, but couldn't get much information since MPI_FINALIZE was never reached. Let me know if there is any way to trigger MPI_FINALIZE when there is a stall in the ops. I have also limited the number of hvd.rank() == 0 operations, and I have tried several HOROVOD_FUSION_THRESHOLD settings (8, 16, 64, 128, 256, 512) while keeping the cycle time fixed, but with no change in the result.
Thank you.

@Himscipy

Himscipy commented Oct 9, 2019

Hello @tgaddair @alsrgv,
I made slight changes in my program that could affect the distributed optimizer; the changes brought success with a 2-rank run, but I still face the same stalled-ops issue for train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_4_0 with higher rank counts. I have tried changing HOROVOD_STALL_CHECK_TIME_SECONDS, but nothing has affected the 60-second timeout in the following message:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops:train/DistributedAdamOptimizer_Allreduce/HorovodAllreduce_train_gradients_AddN_1_0

I read the source code behind this message and found that the timeout is predefined. Is there any way to change it?

@Himscipy

Himscipy commented Nov 11, 2019

For future users coming to this thread: the issue was resolved by using the latest Horovod 0.18.2 release.
The limitations of Horovod were also reported in the following paper (link below), and the new release has the fix for them. My rank stall has been fixed by using the latest version.
https://arxiv.org/pdf/1909.11150.pdf

@apeforest
Member

For future users coming to this thread: the issue was resolved by using the latest Horovod 0.18.2 release.
The limitations of Horovod were also reported in the following paper (link below), and the new release has the fix for them. My rank stall has been fixed by using the latest version.
https://arxiv.org/pdf/1909.11150.pdf

Does that mean Horovod can now support an unequal workload across workers? Meaning one can run evaluation only on rank 0?

@tgaddair
Collaborator

Hey @apeforest, I think @Himscipy's issue was caused by a lagging worker. For training with uneven batches, @kit1980 recently landed #1058, which adds a "Join" operation for PyTorch. Would be awesome to add an implementation for MXNet as well!
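
A minimal sketch of the PyTorch Join operation, assuming a Horovod version that includes #1058 (the tiny model and the deliberately uneven batch counts are illustrative only): a rank that finishes its data early calls hvd.join() and keeps serving the remaining ranks' allreduces until everyone has joined, so the stall warning and deadlock are avoided.

import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 1)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Deliberately give each rank a different number of batches.
num_batches = 5 + hvd.rank()
for _ in range(num_batches):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).mean()
    loss.backward()
    optimizer.step()

hvd.join()   # ranks that finish early wait here instead of stalling the others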

@ifed-ucsd

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.

Can someone please comment on whether it is safe to ignore the Horovod warning in the case where it appears because one of the workers is doing something the others aren't (for instance, evaluation)? Does this cause the entire training process to go out of sync?

@tgaddair
Collaborator

Hey @ifed-ucsd, usually it's not a problem. The only time it will cause things to get out of sync is if the workers end up performing different collective operations. For example, worker-1 proceeds to the next batch and calls hvd.allreduce while worker-0 does validation and calls hvd.broadcast. As long as all the workers will eventually meet back up at the same point, it will be okay.

@czs1886

czs1886 commented Feb 19, 2020

I also got this problem when training my PyTorch model using Horovod. I partitioned the validation set by rank; why do I still get this warning? @alsrgv

@hoangcuong2011

This happens to me as well. I hope it will not cause a big problem (i.e. much lower accuracy for the model after training).
I have a quick question: how do I know which version of Horovod I am using? I googled but could not find a command. I am confident my version is newer than Horovod 0.18.2, but I just wanted to make one more check. Thanks.
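
One way to check the installed Horovod version (a general Python check, not something specific to this thread):

import horovod
print(horovod.__version__)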

@iou2much

My version is 0.19.0. It still happens when the training stage finishes and evaluation starts at every epoch.

@ehsab

ehsab commented Apr 20, 2020

I hit the very same problem, and I realized one thing: if we always make the same worker do the heavy lifting, e.g. evaluating the current model after training for a certain number of epochs, the other workers are going to complain about stalled ops and missing ranks: 0, and they end up stuck in this situation without being able to recover (looks like a Horovod bug in my eyes), or the recovery slows down the whole process.

My workaround is to make all workers do the same heavy lifting; for example, all workers have to do evaluation even if we only take the results from the first worker. I mean, if the other workers have to wait for the first one anyway, it won't hurt too much for all workers to do the same job.

If I find a better solution, I'll update my answer.

Did you find a better solution? I am trying to run pytorch_imagenet_resnet50.py in Docker and get the same error:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

Any update?

@ehsab

ehsab commented Apr 20, 2020

@winwinJJiang, how often do you see it? Does training progress after it? Oftentimes this warning means that one of the ranks (in your case rank 0) was busy running evaluation while the other ranks were waiting for it. In this case, it can simply be ignored - but you would be able to use resources more efficiently if you do distributed evaluation. You can search the Issues for parallel/distributed evaluation to see some examples.

@alsrgv I have the exact same problem. I am just trying to run the pytorch_imagenet_resnet50.py sample from Horovod. It doesn't work in some configurations, e.g. 4 servers with 4 GPUs each, but it works on 1 or 2 servers with 1 or 2 GPUs. Any idea? Does the sample work without any changes?

Thanks,

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@rrrongon

rrrongon commented Dec 25, 2023

Hi,
horovodrun --check-build
Horovod v0.28.1:

Available Frameworks:
[ ] TensorFlow
[X] PyTorch
[ ] MXNet

Available Controllers:
[X] MPI
[X] Gloo

Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo

and mpiexec --version reports 4.1.6. After installation, if I run my code in a single process using the command
horovodrun -np 1 python mnist_test.py, it works.
However, if I run with more than 1 process, it does not work anymore. I am trying to run code that used to work with Horovod; now it fails, specifically on broadcast. I have verified that allreduce works across multiple processes:

import horovod.torch as hvd
import torch

hvd.init()

tensor = torch.tensor([1.0, 2.0, 3.0])
scaled_tensor = tensor * hvd.size()
summed_tensor = hvd.allreduce(scaled_tensor)
average_tensor = summed_tensor / hvd.size()

print(f"Rank {hvd.rank()}: Original Tensor: {tensor.numpy()}")
print(f"Rank {hvd.rank()}: Scaled Tensor: {scaled_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Summed Tensor: {summed_tensor.numpy()}")
print(f"Rank {hvd.rank()}: Average Tensor: {average_tensor.numpy()}")

The above code works when run with:
horovodrun -n 4 python hvd_allreduce.py

However, a basic broadcast shows the following error:

/horovod/torch/mpi_ops.py", line 809, in broadcast_async
    output = tensor.new(tensor.shape)
AttributeError: 'NoneType' object has no attribute 'new'

And my existing PyTorch code hangs while the parameters are being broadcast:

print("before hvd bdcast")
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
print("after hvd bdcast params")

"after hvd bdcast params" never gets printed.

Please let me know if you have any ideas.
