Running Pytorch with Horovod #492
Comments
Did anybody try the PyTorch resnet50 example in a distributed environment? I just want to make sure that PyTorch itself is not giving me this problem.
Does it have something to do with the following message?
A process has executed an operation involving a call to the "fork()" system call ...
The process that invoked fork was: Local host: [[21169,1],1] (PID 62029)
If you are absolutely sure that your application will successfully ...
Hey @abidmalikwaterloo, this issue is similar to another one (#489) we're just now seeing. Which version of Horovod are you running? There were a number of changes made to PyTorch support in the latest version; you might want to try using 0.13.10.
@tgaddair I am using torch 0.4.1.post2 and Horovod 0.14.1. Does 0.13.10 work? I downloaded PyTorch from here:
@tgaddair where can I get 0.13.10? I could not find it on the PyTorch website.
You should be able to get it with pip:
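Presumably something like the following, since 0.13.10 is a Horovod release rather than a PyTorch one:
pip install horovod==0.13.10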
But on the other thread it was suggested to try running with ...
@abidmalikwaterloo, can you check the suggestions from #489 and see if any of them help your use case?
@abidmalikwaterloo, I don't think this issue is related to the Horovod version, so you should be OK using the latest 0.14.1. Can you share the output of ifconfig and ibv_devinfo?
ifconfig:
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eno3d1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
ibv_devinfo:
hca_id: mlx4_0
@abidmalikwaterloo, great, can you try using NCCL_SOCKET_IFNAME=ib0 and NCCL_IB_HCA=mlx4_0?
@alsrgv It is hanging. Here is how I am using it:
mpirun -np 2 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -mca btl_tcp_if_include ib0 python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/
@abidmalikwaterloo, is there any output before the hang?
@alsrgv No. Both the error and output files are empty. I am submitting the job through a batch file. I am also trying to figure out what it is doing. FYI, I am using the resnet50 PyTorch example for Horovod from GitHub.
@alsrgv I am able to run it. How can I confirm that it is using the Horovod ring? I don't see any ring information. Is there any flag similar to the TensorFlow implementation?
err_4258.log.pdf
@alsrgv @tgaddair I am getting a segmentation fault at the end, with the following message:
node02:14238:14262 [0] INFO NET : Using interface ib0:172.31.11.2<0>
Is this an OpenMPI issue?
When I used three nodes (three GPUs on separate nodes):
Train Epoch #1: 19%|#9 | 2567/13346 [08:47<36:32, 4.92it/s, loss=6.87, accuracy=0.168]
I would like to share the following as well:
@tgaddair I tried with 2, 3, 4, and 5 GPUs on five nodes. I can see the formation of the ring. However, I see only two models on two GPUs, and then I get a message that Horovod is crashing.
I would be grateful if you could give some hint or guidance on this so I can move on.
Hey @abidmalikwaterloo, I'm seeing a CUDA out-of-memory error in your log.
@tgaddair I saw this error as well. On a single GPU (K20), a batch size of 32 works well; with two GPUs it completes one epoch and then crashes. I changed it to 16 as well and got the same error. There is a similar situation described here as well:
Looks like the suggestion there was to try rolling back to an older version of PyTorch. Are you able to try that and see if it works?
@tgaddair Yes, I tried it and got new errors with Horovod:
Traceback (most recent call last):
@tgaddair Did you try the resnet50 example on any big cluster? Did you try it with any specific version?
Did anyone try pytorch_mnist.py? It is not working properly. It seems there is a deadlock: the application hangs after the first epoch.
@abidmalikwaterloo, we routinely run these examples. I think the segfault/deadlock you're seeing at the end of the first epoch may be related to malloc hooks. Can you try adding -x UCX_MEM_MALLOC_HOOKS=no to your mpirun command?
@alsrgv I see the following message:
horovod_mnist.py:80: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
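The warning only asks for an explicit dimension argument; assuming the call at horovod_mnist.py line 80 looks something like F.log_softmax(x), the change it suggests is roughly this sketch (x here is a hypothetical (batch, classes) tensor):
import torch
import torch.nn.functional as F

x = torch.randn(4, 10)          # hypothetical (batch, classes) logits
out = F.log_softmax(x)          # deprecated: implicit dimension, triggers the UserWarning
out = F.log_softmax(x, dim=1)   # explicit dim over the class dimension, as the warning asks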
@abidmalikwaterloo, is it happening on one of the servers, or all of them? Sounds like one of the servers may be overloaded or have stale file handles.
@alsrgv It doesn't happen with 3 or fewer GPU nodes; with more, I get this message, and I receive it from each node. I just got the following output from another run with 4 GPU nodes:
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 227, in handler
I am showing only part of the message, but I am getting it from each node. I think the dataloader is also responsible for my resnet50 crashing. I am trying to find a solution but have been unable to find one.
@abidmalikwaterloo, the data loader issue is likely caused by malloc hooks. What's your full mpirun command?
Here is the full script:
module load anaconda2
cd /home/amalik/Pytorch_virtual_enviornment/Examples/examples/mnist
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/amalik/nccl_2.1.15-1+cuda9.0_x86_64/lib:/software/cuda/9.0/extras/CUPTI/lib64/
source activate /home/amalik/Pytorch_virtual_enviornment/
mpirun -np 4 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -x UCX_MEM_MALLOC_HOOKS=no -mca btl_tcp_if_include ib0 python horovod_mnist.py
source deactivate
@alsrgv Does it have something to do with the Python version?
I changed the following part of the code:
train_sampler = torch.utils.data.distributed.DistributedSampler(...
train_loader = torch.utils.data.DataLoader(...
test_dataset = ...
test_loader = torch.utils.data.DataLoader(test_dataset, ...
When I used the following for loading:
train_sampler = torch.utils.data.distributed.DistributedSampler(...
it gave me an error. However, when I replaced it with the following:
train_loader = torch.utils.data.DataLoader(...
it works without any error.
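For reference, a minimal sketch of the two variants being compared (names such as train_dataset and batch_size are assumed from the Horovod MNIST example, not my exact code):
import torch
import horovod.torch as hvd

# variant that errored: shard the data across workers with a DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=train_sampler)

# variant that ran without errors: plain DataLoader, no sampler
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True)
Note that without the sampler every rank iterates over the full dataset, so an epoch is no longer split across the GPUs.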
However, the above trick didn't work for the ResNet50 example :(
For ResNet I used the following:
mpirun -np 4 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -x UCX_MEM_MALLOC_HOOKS=no -mca btl_tcp_if_include ib0 python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/
I can see the ring of four in the output log file:
node07:2451:2473 [0] INFO Using internal Network IB
node01:62540:62564 [0] INFO Ring 00 : 0 1 2 3
node06:13488:13510 [0] INFO 1 -> 2 via NET/IB/0
Then the application runs for a while and then crashes. I am pasting part of the output:
Train Epoch #1: 25%|##4 | 6567/26691 [22:30<1:08:06, 4.92it/s, loss=6.78, accuracy=0.271]
mpirun detected that one or more processes exited with non-zero status, thus causing
Process name: [[57108,1],1]
There should be four models active during the run. However, I see only two. I am using a batch size of 12, which is pretty small. Even then I am getting a CUDA out-of-memory error.
@abidmalikwaterloo, what if you replace ...
@alsrgv OK, it is running now. I am reproducing part of the output file:
Train Epoch #1: 72%|#######1 | 19185/26691 [4:01:11<1:42:27, 1.22it/s, loss=5.79, accuracy=4.2]
Why do I see only two models in action? I ran it with 4 nodes and I can see a ring with 4 nodes, so there should be four models. The number of iterations (26691) is consistent with 4 models: 1.2 million / (12 * 4).
@abidmalikwaterloo, sounds like that change helped. As for the progress bar, you should actually see just one: https://github.com/uber/horovod/blob/master/examples/pytorch_imagenet_resnet50.py#L146 guards progress bar visibility so it is shown only on rank 0. Do you have that code unchanged?
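The guard in the example is roughly of this shape (a sketch with assumed names like train_loader and epoch, not the verbatim example code):
import horovod.torch as hvd
from tqdm import tqdm

verbose = hvd.rank() == 0   # only rank 0 should draw a progress bar
with tqdm(total=len(train_loader),
          desc='Train Epoch #{}'.format(epoch + 1),
          disable=not verbose) as t:
    for batch_idx, (data, target) in enumerate(train_loader):
        # ... forward/backward/step ...
        t.update(1)
If you see more than one bar, it may mean that more than one process considers itself rank 0.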
@alsrgv it looks like it is working. I haven't changed the code yet; I am testing it by just varying the variables at the top level. I just had another breakdown:
Validate Epoch #1: 100%|#########9| 1039/1042 [09:54<00:01, 1.73it/s, loss=4.21, accuracy=18.5]
Train Epoch #2: 0%| | 0/26691 [00:00<?, ?it/s]
However, it broke down at 85% of the second epoch due to some CUDA/GPU error:
Train Epoch #2: 85%|########5 | 22726/26691 [4:54:20<52:08, 1.27it/s, loss=3.93, accuracy=22.2]
Traceback (most recent call last):
I see a similar issue at #404, but no solution.
@alsrgv Does this have to do with the GPU architecture? I am going through various options to solve it and found the following:
Were you able to resolve the issue? FYI, this is my setup:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am trying to run the resnet50 example with PyTorch and Horovod on a cluster. I used the following command in a Slurm script:
mpirun -np 2 -npernode 1 -x NCCL_DEBUG=INFO python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/
I am trying to run it on two nodes, each having one GPU.
I am getting the following message:
Train Epoch #1: 0%| | 0/20019 [00:00<?, ?it/s][node03:62076] *** Process received signal ***
[node03:62076] Signal: Segmentation fault (11)
[node03:62076] Signal code: Address not mapped (1)
[node03:62076] Failing at address: 0x55d9ad6308a8
[node03:62076] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fd3c65f35e0]
[node03:62076] [ 1] /usr/lib64/libc.so.6(+0x7d4a6)[0x7fd3c5b954a6]
[node03:62076] [ 2] /usr/lib64/libc.so.6(__libc_malloc+0x4c)[0x7fd3c5b9810c]
[node03:62076] [ 3] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_allocate_lock+0x16)[0x7fd3c690f9f6]
[node03:62076] [ 4] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_ReInitTLS+0x1f)[0x7fd3c690feef]
[node03:62076] [ 5] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyOS_AfterFork+0x45)[0x7fd3c6916025]
[node03:62076] [ 6] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x11b1b9)[0x7fd3c691b1b9]
[node03:62076] [ 7] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x774c)[0x7fd3c68e26fc]
[node03:62076] [ 8] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [ 9] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [10] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [11] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d]
[node03:62076] [12] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [13] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584]
[node03:62076] [14] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b]
[node03:62076] [15] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [16] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69]
[node03:62076] [17] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x7fee)[0x7fd3c68e2f9e]
[node03:62076] [18] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [19] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [20] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [21] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d]
[node03:62076] [22] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [23] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584]
[node03:62076] [24] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b]
[node03:62076] [25] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [26] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69]
[node03:62076] [27] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [28] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [29] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] *** End of error message ***
[node02:20957] *** Process received signal ***
[node02:20957] Signal: Segmentation fault (11)
Exception in thread Thread-2 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 120, in _worker_manager_loop
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'loads'
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
I had a very good experience with Horovod and TF. I am trying PyTorch now and used the same script that I used to run AlexNet in TF on 256 GPUs. Do I need special flags for PyTorch+Horovod?