
Running Pytorch with Horovod #492

Closed
abidmalikwaterloo opened this issue Sep 13, 2018 · 49 comments

@abidmalikwaterloo

I am trying to run the resnet50 example with PyTorch and Horovod on a cluster. I used the following command in my Slurm script:

mpirun -np 2 -npernode 1 -x NCCL_DEBUG=INFO python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/

I am trying to run it on two nodes, each having one GPU.

I am getting the following message:

Train Epoch #1: 0%| | 0/20019 [00:00<?, ?it/s][node03:62076] *** Process received signal ***
[node03:62076] Signal: Segmentation fault (11)
[node03:62076] Signal code: Address not mapped (1)
[node03:62076] Failing at address: 0x55d9ad6308a8
[node03:62076] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fd3c65f35e0]
[node03:62076] [ 1] /usr/lib64/libc.so.6(+0x7d4a6)[0x7fd3c5b954a6]
[node03:62076] [ 2] /usr/lib64/libc.so.6(__libc_malloc+0x4c)[0x7fd3c5b9810c]
[node03:62076] [ 3] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_allocate_lock+0x16)[0x7fd3c690f9f6]
[node03:62076] [ 4] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyThread_ReInitTLS+0x1f)[0x7fd3c690feef]
[node03:62076] [ 5] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyOS_AfterFork+0x45)[0x7fd3c6916025]
[node03:62076] [ 6] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x11b1b9)[0x7fd3c691b1b9]
[node03:62076] [ 7] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x774c)[0x7fd3c68e26fc]
[node03:62076] [ 8] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [ 9] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [10] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [11] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d]
[node03:62076] [12] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [13] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584]
[node03:62076] [14] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b]
[node03:62076] [15] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [16] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69]
[node03:62076] [17] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x7fee)[0x7fd3c68e2f9e]
[node03:62076] [18] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [19] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [20] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [21] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x5763d)[0x7fd3c685763d]
[node03:62076] [22] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [23] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0xa1584)[0x7fd3c68a1584]
[node03:62076] [24] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x9de3b)[0x7fd3c689de3b]
[node03:62076] [25] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] [26] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9)[0x7fd3c68deb69]
[node03:62076] [27] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7e9)[0x7fd3c68e44e9]
[node03:62076] [28] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(+0x6d28a)[0x7fd3c686d28a]
[node03:62076] [29] /home/amalik/Pytorch_virtual_enviornment/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43)[0x7fd3c68487a3]
[node03:62076] *** End of error message ***
[node02:20957] *** Process received signal ***
[node02:20957] Signal: Segmentation fault (11)

Exception in thread Thread-2 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 120, in _worker_manager_loop
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'loads'
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Exception in thread QueueFeederThread (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 801, in __bootstrap_inner
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/threading.py", line 754, in run
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 259, in _feed
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

I had a very good experience with Horovod and TF. I am trying PyTorch now and used the same script that I used to run AlexNet in TF on 256 GPUs. Do I need special flags for PyTorch + Horovod?

@abidmalikwaterloo
Author

Did somebody try the PyTorch resnet50 example in a distributed environment? I just want to make sure that PyTorch itself is not giving me this problem.

@abidmalikwaterloo
Author

Does it have something to do with the following message:


A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: [[21169,1],1] (PID 62029)

If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.

I tried using one GPU and the example works, although I still see the above message.
Does it have something to do with Open MPI? I am using OpenMPI-3.0.0-gnu.
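
That warning is Open MPI's standard fork() notice; PyTorch's DataLoader workers are created with fork() (the crash trace above even shows PyOS_AfterFork), which is what triggers it. If you decide the fork is safe in your setup, a minimal sketch of silencing the notice with the mpi_warn_on_fork MCA parameter named in the message, keeping the rest of the command from above, would be:

mpirun -np 2 -npernode 1 -mca mpi_warn_on_fork 0 -x NCCL_DEBUG=INFO python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/

This only hides the message; whether fork() is actually safe depends on the interconnect stack, so the segfault itself still needs a separate fix.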

@tgaddair
Collaborator

Hey @abidmalikwaterloo, this issue is similar to another one (#489) we're just now seeing. Which version of Horovod are you running? There were a number of changes made to the PyTorch support in the latest version, so if that's what you're using you might want to try 0.13.10 and let us know how it goes.

@abidmalikwaterloo
Author

@tgaddair I am using torch 0.4.1.post2 and Horovod 0.14.1. Does 0.13.10 work?

I downloaded PyTorch from here:
https://pytorch.org/

@tgaddair
Collaborator

0.13.10 should be compatible with PyTorch 0.4.1. In theory, they should both work, but this might help us isolate the source of the error (if it's due to a code change or an environment issue).

@abidmalikwaterloo
Author

@tgaddair where can I get 0.13.10? I could not find it on the PyTorch website.

@tgaddair
Collaborator

You should be able to get it with pip:

pip install horovod==0.13.10

But on the other thread it was suggested to try running with -mca btl ^openib. Can you try that first and let us know if that solves the issue?
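
For context, a minimal sketch of the first command in this thread with that flag added (^openib excludes Open MPI's openib BTL, so MPI point-to-point traffic falls back to TCP):

mpirun -np 2 -npernode 1 -mca btl ^openib -x NCCL_DEBUG=INFO python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/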

@NO2-yh

NO2-yh commented Sep 14, 2018

@tgaddair Hi, #489 is exactly what I have encountered. I have tried to reinstall Horovod 0.13.10 with PyTorch 0.4.1,
but it cannot be compiled successfully against PyTorch.
By the way, Horovod can be compiled successfully with PyTorch 0.4.0.

Do you have any suggestions for this problem? Thanks

@alsrgv
Member

alsrgv commented Sep 14, 2018

@abidmalikwaterloo, can you check suggestions from #489 and see if any of them help your use case?

@abidmalikwaterloo
Author

@tgaddair @alsrgv When I tried -mca btl ^openib, the processes just hang. I can see the log file, but nothing is happening in it. I tried to install horovod==0.13.0 and got an installation error; information attached. However, I didn't get any error when I installed the 0.14.0 version again.

error.pdf

@alsrgv
Member

alsrgv commented Sep 14, 2018

@abidmalikwaterloo, I don't think this issue is related to the Horovod version, so you should be OK using the latest 0.14.1.

Can you share the output of ifconfig and ibv_devinfo, so we can check whether you have any non-routed interfaces?

@abidmalikwaterloo
Author

@alsrgv

ifconfig:

docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
ether 02:42:2f:f7:0f:73 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.20.19.4 netmask 255.255.0.0 broadcast 10.20.255.255
inet6 fe80::3aea:a7ff:fea2:f8f6 prefixlen 64 scopeid 0x20
ether 38:ea:a7:a2:f8:f6 txqueuelen 1000 (Ethernet)
RX packets 17229978 bytes 15957802380 (14.8 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8344087 bytes 2397363016 (2.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xf6d00000-f6dfffff

eno2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 38:ea:a7:a2:f8:f7 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xf6b00000-f6bfffff

eno3d1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 24:be:05:9e:c6:22 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 172.31.11.4 netmask 255.255.255.0 broadcast 172.31.11.255
inet6 fe80::26be:5ff:ff9e:c621 prefixlen 64 scopeid 0x20
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 80:00:02:40:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 3289957881 bytes 5288382140586 (4.8 TiB)
RX errors 0 dropped 2764 overruns 0 frame 0
TX packets 1353891746 bytes 15566692513079 (14.1 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 1 (Local Loopback)
RX packets 219851 bytes 281353515 (268.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 219851 bytes 281353515 (268.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ibv_devinfo:

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.1250
node_guid: 24be:05ff:ff9e:c620
sys_image_guid: 24be:05ff:ff9e:c623
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: HP_0230240019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 20
port_lmc: 0x00
link_layer: InfiniBand

	port:	2
		state:			PORT_DOWN (1)
		max_mtu:		4096 (5)
		active_mtu:		1024 (3)
		sm_lid:			0
		port_lid:		0
		port_lmc:		0x00
		link_layer:		Ethernet

@alsrgv
Member

alsrgv commented Sep 14, 2018

@abidmalikwaterloo, great, can you try using -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 and share the results?

@abidmalikwaterloo
Author

@alsrgv It is hanging. Here is how I am using it:

mpirun -np 2 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -mca btl_tcp_if_include ib0 python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/

@alsrgv
Member

alsrgv commented Sep 14, 2018

@abidmalikwaterloo, is there any output before the hang?

@abidmalikwaterloo
Author

@alsrgv No, both the error and output files are empty. I am submitting the job through a batch file. I am also trying to figure out what it is doing. FYI, I am using the ResNet50 PyTorch example for Horovod from GitHub.

@abidmalikwaterloo
Author

@alsrgv I am able to run it. How can I confirm that it is using the Horovod ring? I don't see any ring information. Is there a flag similar to the one in the TensorFlow implementation?

@abidmalikwaterloo
Author

err_4258.log.pdf
@alsrgv I am attaching the run log. There is a problem with the validation, which I can figure out; the rest seems to have gone well.

@abidmalikwaterloo
Author

@alsrgv @tgaddair I am getting a segmentation fault at the end.

I am getting the following message:

node02:14238:14262 [0] INFO NET : Using interface ib0:172.31.11.2<0>
node02:14238:14262 [0] INFO NET/IB : Using interface ib0 for sideband communication
node02:14238:14262 [0] INFO NET/IB: [0] mlx4_0:1/IB
node02:14238:14262 [0] INFO Using internal Network IB
node02:14238:14262 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.1.15+cuda9.0
node03:55381:55404 [0] INFO NET : Using interface ib0:172.31.11.3<0>
node03:55381:55404 [0] INFO NET/IB : Using interface ib0 for sideband communication
node03:55381:55404 [0] INFO NET/IB: [0] mlx4_0:1/IB
node03:55381:55404 [0] INFO Using internal Network IB
node03:55381:55404 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
node03:55381:55404 [0] INFO NET/IB: Dev 0 Port 1 qpn 7208 mtu 5 LID 18
node02:14238:14296 [0] INFO NET/IB: Dev 0 Port 1 qpn 14397 mtu 5 LID 6
node03:55381:55404 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node02:14238:14262 [0] INFO NET/IB: Dev 0 Port 1 qpn 14398 mtu 5 LID 6
node02:14238:14296 [0] INFO NET/IB: Dev 0 Port 1 qpn 14400 mtu 5 LID 6
node02:14238:14262 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node02:14238:14262 [0] INFO Using 512 threads
node02:14238:14262 [0] INFO Min Comp Cap 3
node02:14238:14262 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
node02:14238:14262 [0] INFO Ring 00 : 0 1
node02:14238:14262 [0] INFO 1 -> 0 via NET/IB/0
node03:55381:55404 [0] INFO 0 -> 1 via NET/IB/0
node02:14238:14262 [0] INFO NET/IB: Dev 0 Port 1 qpn 14402 mtu 5 LID 6
node03:55381:55404 [0] INFO NET/IB: Dev 0 Port 1 qpn 7210 mtu 5 LID 18
node02:14238:14262 [0] INFO Launch mode Parallel

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

@abidmalikwaterloo
Author

Is this an OpenMPI issue?

@abidmalikwaterloo
Author

When I used three nodes ( three GPUs on separate nodes):

Train Epoch #1: 19%|#9 | 2567/13346 [08:47<36:32, 4.92it/s, loss=6.87, accuracy=0.168]
Train Epoch #1: 19%|#9 | 2568/13346 [08:47<38:41, 4.64it/s, loss=6.87, accuracy=0.168]
Train Epoch #1: 19%|#9 | 2568/13346 [08:48<38:41, 4.64it/s, loss=6.87, accuracy=0.167]
Train Epoch #1: 19%|#9 | 2569/13346 [08:48<40:07, 4.48it/s, loss=6.87, accuracy=0.167]
Traceback (most recent call last):
File "horovod_main_testing.py", line 248, in
train(epoch)
File "horovod_main_testing.py", line 156, in train
loss.backward()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/autograd/init.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "horovod_main_testing.py", line 248, in
train(epoch)
File "horovod_main_testing.py", line 157, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Traceback (most recent call last):
File "horovod_main_testing.py", line 248, in
train(epoch)
File "horovod_main_testing.py", line 157, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[37034,1],2]
Exit code: 1

@abidmalikwaterloo
Author

@tgaddair I tried with 2, 3, 4, and 5 GPUs on five nodes. I can see the formation of the ring. However, I can see only two models on two GPUs, and then I get the message that Horovod crashed.

@abidmalikwaterloo
Author

I would be grateful if you can give some hint or guidance on this so I can move on.

@tgaddair
Collaborator

Hey @abidmalikwaterloo, I'm seeing RuntimeError: CUDA error: out of memory, which suggests that you're running out of GPU memory. Can you try reducing the batch size and trying again?
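
As a sketch, assuming horovod_main_testing.py kept the --batch-size argument from the stock pytorch_imagenet_resnet50.py example, the batch size can be lowered on the command line while keeping the other mpirun flags unchanged:

mpirun -np 4 -npernode 1 -x NCCL_DEBUG=INFO python horovod_main_testing.py --batch-size 16 --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/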

@abidmalikwaterloo
Author

@tgaddair I saw this error as well. On a single GPU (K20) a batch size of 32 works well, and with two GPUs it completes one epoch and then crashes. I changed it to 16 as well and got the same error.

There is a similar situation here as well:
SeanNaren/deepspeech.pytorch#304

@tgaddair
Collaborator

Looks like the suggestion there was to try rolling back to an older version of PyTorch. Are you able to try that and see if it works?
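
A minimal sketch of the rollback inside the virtualenv, assuming a pip wheel of the older release exists for your CUDA version (the exact command per CUDA build is listed on https://pytorch.org/). Horovod's PyTorch extension is compiled against the installed torch, so it generally needs to be rebuilt afterwards:

pip uninstall -y torch
pip install torch==0.4.0
pip uninstall -y horovod
pip install --no-cache-dir horovod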

@abidmalikwaterloo
Author

@tgaddair Yes, I tried it and got new errors with Horovod:

Traceback (most recent call last):
File "horovod_main_testing.py", line 69, in
resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
TypeError: 'module' object is not callable
Traceback (most recent call last):
File "horovod_main_testing.py", line 69, in
resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
TypeError: 'module' object is not callable
Traceback (most recent call last):
File "horovod_main_testing.py", line 69, in
Traceback (most recent call last):
File "horovod_main_testing.py", line 69, in
resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
resume_from_epoch = hvd.broadcast(torch.tensor(resume_from_epoch), root_rank=0,
TypeError: 'module' object is not callable
TypeError: 'module' object is not callable

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[62097,1],0]
Exit code: 1
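
The TypeError above is consistent with a PyTorch release older than 0.4: in those releases torch.tensor is a submodule rather than the factory function added in 0.4, so calling it raises 'module' object is not callable. A hypothetical workaround sketch that sticks to the older tensor constructors, with the variable and broadcast name as in the stock example:

import torch
import horovod.torch as hvd

hvd.init()
resume_from_epoch = 0
# torch.LongTensor([...]) exists on both old and new PyTorch releases,
# unlike the torch.tensor() factory introduced in 0.4.
epoch_tensor = hvd.broadcast(torch.LongTensor([resume_from_epoch]), root_rank=0,
                             name='resume_from_epoch')
resume_from_epoch = int(epoch_tensor[0])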

@abidmalikwaterloo
Author

@tgaddair Did you try the resnet50 example on any big cluster? Did you try it with any specific version?

@abidmalikwaterloo
Author

Did anyone try pytorch_mnist.py? It is not working properly; it seems there is a deadlock. The application hangs after the first epoch.

@alsrgv
Member

alsrgv commented Sep 19, 2018

@abidmalikwaterloo, we routinely run pytorch_resnet50.py and pytorch_mnist.py as part of the acceptance test for any Horovod release on ~128 GPUs.

I think the segfault/deadlock you're seeing at the end of the first epoch may be related to malloc hooks.

Can you try using -mca btl ^openib -x UCX_MEM_MALLOC_HOOKS=no in your mpirun command if you're using Open MPI?

@abidmalikwaterloo
Author

@alsrgv I see the following message:

horovod_mnist.py:80: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
return F.log_softmax(x)
horovod_mnist.py:80: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
return F.log_softmax(x)
horovod_mnist.py:116: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
100. * batch_idx / len(train_loader), loss.data[0]))
horovod_mnist.py:116: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
100. * batch_idx / len(train_loader), loss.data[0]))
Traceback (most recent call last):
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 268, in _feed
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 224, in dump
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 286, in save
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 554, in save_tuple
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 286, in save
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 606, in save_list
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 639, in _batch_appends
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 286, in save
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 401, in save_reduce
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 286, in save
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 554, in save_tuple
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/pickle.py", line 286, in save
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 192, in reduce_storage
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/reduction.py", line 145, in reduce_handle
OSError: [Errno 24] Too many open files
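
The OSError comes from the DataLoader worker processes sharing tensors through open file descriptors, so many in-flight batches can exhaust the per-process limit. Independent of the Horovod side, two common mitigations (a sketch, not specific to this example) are raising the descriptor limit in the batch script (e.g. ulimit -n 4096) or switching PyTorch's sharing strategy:

import torch.multiprocessing as mp

# 'file_system' shares tensors via named files in shared memory rather than
# holding one open file descriptor per shared tensor in each process.
mp.set_sharing_strategy('file_system')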

@alsrgv
Member

alsrgv commented Sep 19, 2018

@abidmalikwaterloo, is it happening on one of the servers, or all of them? Sounds like one of the servers may be overloaded or have stale file handles.

@abidmalikwaterloo
Author

@alsrgv It doesn't happen with 3 or fewer GPU nodes; for more than that, I get this error. I received this message from each node. I just got the following message from another run with 4 GPU nodes:

buf = self.recv_bytes()

File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 227, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30441) is killed by signal: Segmentation fault. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Traceback (most recent call last):
File "horovod_mnist.py", line 159, in
test()
File "horovod_mnist.py", line 149, in test
test_loss = metric_average(test_loss, 'avg_loss')
File "horovod_mnist.py", line 127, in metric_average
Trat_loss += F.nll_loss(output, target, size_average=False).data[0]
Traceback (most recent call last):
File "horovod_mnist.py", line 159, in
test()
File "horovod_mnist.py", line 135, in test
for data, target in test_loader:
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 330, in next
idx, batch = self._get_batch()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
return self.data_queue.get()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/multiprocessing/queues.py", line 376, in get
return recv()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 21, in recv
buf = self.recv_bytes()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 227, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 30441) is killed by signal: Segmentation fault. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Traceback (most recent call last):
File "horovod_mnist.py", line 159, in

I am showing part of the message, but I am getting it from each node. I think the DataLoader is also responsible for my resnet50 crash. I am trying to find a solution but have been unable to find one so far.

@alsrgv
Member

alsrgv commented Sep 19, 2018

@abidmalikwaterloo, the data loader issue is likely caused by malloc hooks. What's your full mpirun command?

@abidmalikwaterloo
Author

@alsrgv

Here is the full script

module load anaconda2
module load openmpi/3.0.0-gnu
module load cuda/9.0
module load gcc/7.2.0

cd /home/amalik/Pytorch_virtual_enviornment/Examples/examples/mnist

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/amalik/nccl_2.1.15-1+cuda9.0_x86_64/lib:/software/cuda/9.0/extras/CUPTI/lib64/

source activate /home/amalik/Pytorch_virtual_enviornment/

mpirun -np 4 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -x UCX_MEM_MALLOC_HOOKS=no -mca btl_tcp_if_include ib0 python horovod_mnist.py

source deactivate

@abidmalikwaterloo
Author

@alsrgv Does it have something to do with the python version?

@abidmalikwaterloo
Author

abidmalikwaterloo commented Sep 20, 2018

I changed the following part of the code:

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=1)

test_dataset = \
    datasets.MNIST('data-%d' % hvd.rank(), train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ]))
test_sampler = torch.utils.data.distributed.DistributedSampler(
    test_dataset, num_replicas=hvd.size(), rank=hvd.rank())

test_loader = torch.utils.data.DataLoader(test_dataset,
    batch_size=args.test_batch_size, shuffle=True,
    num_workers=1)

When I used the following for the loading:

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, num_workers=1)

It gave me an error. However, when I replaced it with the following:

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=True, num_workers=1)

It works without any error.
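
For reference, the stock pytorch_mnist.py builds the loader with the DistributedSampler and without shuffle=True, since DataLoader rejects shuffle=True together with a sampler and the DistributedSampler already shuffles and partitions the data per rank; roughly:

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
# Passing the sampler (and no shuffle flag) keeps each rank on its own shard.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, num_workers=1)

Note that dropping the sampler, as in the workaround above, means every rank iterates over the full dataset instead of its 1/hvd.size() shard, so each epoch takes correspondingly more steps per rank.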

@abidmalikwaterloo
Author

However, the above trick didn't work for the ResNet50 example :(

@abidmalikwaterloo
Author

For ResNet I used the following:

mpirun -np 4 -npernode 1 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx4_0 -x LD_LIBRARY_PATH -mca btl ^openib -x UCX_MEM_MALLOC_HOOKS=no -mca btl_tcp_if_include ib0 python horovod_main_testing.py --train-dir=/home/amalik/NEWIMAGENETDATA/raw-data/train/ --val-dir=/home/amalik/NEWIMAGENETDATA/raw-data/val/

I can see the ring of four in the output log file:

node07:2451:2473 [0] INFO Using internal Network IB
node07:2451:2473 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
node05:1992:2014 [0] INFO NET/IB: [0] mlx4_0:1/IB
node05:1992:2014 [0] INFO Using internal Network IB
node05:1992:2014 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
node06:13488:13510 [0] INFO NET/IB: [0] mlx4_0:1/IB
node06:13488:13510 [0] INFO Using internal Network IB
node06:13488:13510 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
node07:2451:2473 [0] INFO NET/IB: Dev 0 Port 1 qpn 3904 mtu 5 LID 4
node01:62540:62586 [0] INFO NET/IB: Dev 0 Port 1 qpn 11706 mtu 5 LID 3
node07:2451:2473 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node05:1992:2014 [0] INFO NET/IB: Dev 0 Port 1 qpn 7929 mtu 5 LID 5
node01:62540:62586 [0] INFO NET/IB: Dev 0 Port 1 qpn 11708 mtu 5 LID 3
node01:62540:62564 [0] INFO NET/IB: Dev 0 Port 1 qpn 11709 mtu 5 LID 3
node05:1992:2014 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node01:62540:62586 [0] INFO NET/IB: Dev 0 Port 1 qpn 11711 mtu 5 LID 3
node01:62540:62564 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node06:13488:13510 [0] INFO NET/IB: Dev 0 Port 1 qpn 7487 mtu 5 LID 21
node01:62540:62586 [0] INFO NET/IB: Dev 0 Port 1 qpn 11714 mtu 5 LID 3
node06:13488:13510 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PHB)
node01:62540:62564 [0] INFO Using 512 threads
node01:62540:62564 [0] INFO Min Comp Cap 3
node01:62540:62564 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072

node01:62540:62564 [0] INFO Ring 00 : 0 1 2 3

node06:13488:13510 [0] INFO 1 -> 2 via NET/IB/0
node01:62540:62564 [0] INFO 3 -> 0 via NET/IB/0
node07:2451:2473 [0] INFO 2 -> 3 via NET/IB/0
node05:1992:2014 [0] INFO 0 -> 1 via NET/IB/0
node01:62540:62564 [0] INFO NET/IB: Dev 0 Port 1 qpn 11715 mtu 5 LID 3
node05:1992:2014 [0] INFO NET/IB: Dev 0 Port 1 qpn 7931 mtu 5 LID 5
node06:13488:13510 [0] INFO NET/IB: Dev 0 Port 1 qpn 7489 mtu 5 LID 21
node07:2451:2473 [0] INFO NET/IB: Dev 0 Port 1 qpn 3906 mtu 5 LID 4
node01:62540:62564 [0] INFO Launch mode Parallel

Then the application runs for a while and then crashes. I am pasting part of the output:

Train Epoch #1: 25%|##4 | 6567/26691 [22:30<1:08:06, 4.92it/s, loss=6.78, accuracy=0.271]
Train Epoch #1: 25%|##4 | 6567/26691 [22:30<1:08:06, 4.92it/s, loss=6.78, accuracy=0.271]
Train Epoch #1: 25%|##4 | 6568/26691 [22:30<1:08:42, 4.88it/s, loss=6.78, accuracy=0.271]
Train Epoch #1: 25%|##4 | 6568/26691 [22:30<1:08:42, 4.88it/s, loss=6.78, accuracy=0.271]
Train Epoch #1: 25%|##4 | 6569/26691 [22:30<1:08:36, 4.89it/s, loss=6.78, accuracy=0.271]
Train Epoch #1: 25%|##4 | 6569/26691 [22:30<1:08:36, 4.89it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6570/26691 [22:30<1:09:05, 4.85it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6570/26691 [22:31<1:09:05, 4.85it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6571/26691 [22:31<1:08:14, 4.91it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6571/26691 [22:31<1:08:14, 4.91it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6572/26691 [22:31<1:08:54, 4.87it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6572/26691 [22:31<1:08:54, 4.87it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6573/26691 [22:31<1:10:13, 4.78it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6573/26691 [22:31<1:10:13, 4.78it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6574/26691 [22:31<1:09:02, 4.86it/s, loss=6.78, accuracy=0.272]
Train Epoch #1: 25%|##4 | 6574/26691 [22:31<1:09:02, 4.86it/s, loss=6.78, accuracy=0.273]
Train Epoch #1: 25%|##4 | 6575/26691 [22:31<1:09:36, 4.82it/s, loss=6.78, accuracy=0.273]
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 166, in train
loss.backward()
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/autograd/init.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[57108,1],1]
Exit code: 1

There should be four models active during the run. However, I see only two. I am using a batch size of 12, which is pretty small. Even then I am getting the CUDA out of memory error.

@alsrgv
Member

alsrgv commented Sep 21, 2018

@abidmalikwaterloo, what if you replace num_workers=1 with num_workers=0 in the original code, and set pin_memory=False if it's set to True in the ResNet50 example?
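
Concretely, a sketch of the training loader with those two settings, keeping whatever sampler/shuffle arguments are already passed in your copy of the example:

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler,
    num_workers=0,      # load batches in the launching process, no forked workers
    pin_memory=False)   # skip page-locked staging buffers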

@abidmalikwaterloo
Author

@alsrgv OK, it is running now. I am reproducing part of the output file:

Train Epoch #1: 72%|#######1 | 19185/26691 [4:01:11<1:42:27, 1.22it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19186/26691 [4:01:11<1:51:55, 1.12it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19186/26691 [4:01:12<1:51:55, 1.12it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19187/26691 [4:01:12<1:46:53, 1.17it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19187/26691 [4:01:12<1:46:53, 1.17it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19188/26691 [4:01:12<1:40:31, 1.24it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19188/26691 [4:01:13<1:40:31, 1.24it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19189/26691 [4:01:13<1:35:51, 1.30it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19189/26691 [4:01:14<1:35:51, 1.30it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19190/26691 [4:01:14<1:46:30, 1.17it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19190/26691 [4:01:15<1:46:30, 1.17it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19191/26691 [4:01:15<2:00:59, 1.03it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19191/26691 [4:01:16<2:00:59, 1.03it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19192/26691 [4:01:16<1:55:26, 1.08it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19192/26691 [4:01:17<1:55:26, 1.08it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19193/26691 [4:01:17<1:52:32, 1.11it/s, loss=5.79, accuracy=4.2]
Train Epoch #1: 72%|#######1 | 19193/26691 [4:01:18<1:52:32, 1.11it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19194/26691 [4:01:18<1:45:55, 1.18it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19194/26691 [4:01:18<1:45:55, 1.18it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19195/26691 [4:01:18<1:43:33, 1.21it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19195/26691 [4:01:19<1:43:33, 1.21it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19196/26691 [4:01:19<1:39:39, 1.25it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19196/26691 [4:01:20<1:39:39, 1.25it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19197/26691 [4:01:20<1:39:40, 1.25it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19197/26691 [4:01:21<1:39:40, 1.25it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19198/26691 [4:01:21<1:39:15, 1.26it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19198/26691 [4:01:22<1:39:15, 1.26it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19199/26691 [4:01:22<1:38:23, 1.27it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19199/26691 [4:01:22<1:38:23, 1.27it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19200/26691 [4:01:22<1:38:18, 1.27it/s, loss=5.79, accuracy=4.21]
Train Epoch #1: 72%|#######1 | 19200/26691 [4:01:23<1:38:18, 1.27it/s, loss=5.79, accuracy=4.21]

Why do I see only two models in action? I ran it with 4 nodes and I can see a ring with 4 nodes. There should be four models. The number of iterations (26691) is based on 4 models: 1.2 million / (12 * 4).

@alsrgv
Member

alsrgv commented Sep 24, 2018

@abidmalikwaterloo, sounds like the num_workers fix helped in your environment?

As for the progress bar, you should actually see just one: https://github.com/uber/horovod/blob/master/examples/pytorch_imagenet_resnet50.py#L146 guards the progress bar so it is only visible on rank 0. Do you have that code unchanged?
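
Roughly, the linked code does something like the following, so only rank 0 draws a bar (train_loader and epoch as in the example, hvd.init() already called):

import horovod.torch as hvd
from tqdm import tqdm

verbose = 1 if hvd.rank() == 0 else 0

with tqdm(total=len(train_loader),
          desc='Train Epoch #{}'.format(epoch + 1),
          disable=not verbose) as t:
    for batch_idx, (data, target) in enumerate(train_loader):
        # ... training step ...
        t.update(1)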

@abidmalikwaterloo
Author

@alsrgv It looks like it is working. I didn't change the code yet; I am testing it by just varying the variables at the top level. I just had another breakdown.
The first epoch went well, with a smooth transition from epoch 1 to 2:

Validate Epoch #1: 100%|#########9| 1039/1042 [09:54<00:01, 1.73it/s, loss=4.21, accuracy=18.5]
Validate Epoch #1: 100%|#########9| 1040/1042 [09:54<00:01, 1.79it/s, loss=4.21, accuracy=18.5]
Validate Epoch #1: 100%|#########9| 1040/1042 [09:55<00:01, 1.79it/s, loss=4.21, accuracy=18.5]
Validate Epoch #1: 100%|#########9| 1041/1042 [09:55<00:00, 1.59it/s, loss=4.21, accuracy=18.5]
Validate Epoch #1: 100%|#########9| 1041/1042 [09:59<00:00, 1.59it/s, loss=4.21, accuracy=18.5]
Validate Epoch #1: 100%|##########| 1042/1042 [09:59<00:00, 1.62s/it, loss=4.21, accuracy=18.5]

Train Epoch #2: 0%| | 0/26691 [00:00<?, ?it/s]
Train Epoch #2: 0%| | 0/26691 [00:04<?, ?it/s, loss=4.42, accuracy=14.6]
Train Epoch #2: 0%| | 1/26691 [00:04<34:41:28, 4.68s/it, loss=4.42, accuracy=14.6]
Train Epoch #2: 0%| | 1/26691 [00:05<34:41:28, 4.68s/it, loss=4.61, accuracy=14.6]
Train Epoch #2: 0%| | 2/26691 [00:05<25:59:46, 3.51s/it, loss=4.61, accuracy=14.6]
Train Epoch #2: 0%| | 2/26691 [00:06<25:59:46, 3.51s/it, loss=4.53, accuracy=15.3]
Train Epoch #2: 0%| | 3/26691 [00:06<19:49:44, 2.67s/it, loss=4.53, accuracy=15.3]

However, it broke down at 85% of the second epoch due to some CUDA/GPU error:

Train Epoch #2: 85%|########5 | 22726/26691 [4:54:20<52:08, 1.27it/s, loss=3.93, accuracy=22.2]
Train Epoch #2: 85%|########5 | 22726/26691 [4:54:20<52:08, 1.27it/s, loss=3.93, accuracy=22.2]
Train Epoch #2: 85%|########5 | 22727/26691 [4:54:20<52:57, 1.25it/s, loss=3.93, accuracy=22.2]Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync failed: invalid argument
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync failed: invalid argument

Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
Traceback (most recent call last):
File "horovod_main_testing.py", line 258, in
train(epoch)
File "horovod_main_testing.py", line 167, in train
optimizer.step()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 88, in step
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync failed: invalid argument
self.synchronize()
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/init.py", line 84, in synchronize
synchronize(handle)
File "/home/amalik/.local/lib/python2.7/site-packages/horovod/torch/mpi_ops.py", line 417, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
File "/home/amalik/Pytorch_virtual_enviornment/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 202, in safe_call
result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: cudaMemcpyAsync failed: invalid argument

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[59646,1],3]
Exit code: 1

@abidmalikwaterloo
Author

abidmalikwaterloo commented Sep 25, 2018

I see a similar issue at #404, but no solution.

@abidmalikwaterloo
Author

abidmalikwaterloo commented Sep 25, 2018

@alsrgv Does this have to do with the GPU architecture? I have been going through various options to solve it and found the following:

fxia22/stn.pytorch#12

@abidmalikwaterloo
Author

@tgaddair @alsrgv I got a breakdown during the training phase at different epoch numbers on different clusters, as mentioned in #404.

@lento234

lento234 commented Jan 9, 2020

Were you able to resolve the issue?

I also see that with num_workers=0 and pin_memory=False the issue does not appear, which indicates that it is an issue with multiprocessing.

FYI, this is my setup:

PyTorch version: 1.3.0+cu100
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 418.39
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] torch==1.3.0+cu100
[pip3] torchvision==0.4.1+cu100
[conda] Could not collect

@stale

stale bot commented Nov 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 6, 2020
@stale stale bot closed this as completed Nov 13, 2020