Horovod can't work in distributed mode #110
Comments
Hi @BIGBALLON, a couple of questions:
|
hello, @alsrgv
|
@BIGBALLON, how did you install Open MPI? Do you happen to have multiple versions of it installed? |
yep, @alsrgv .
mprotect(0x7f3a8b1c6000, 4096, PROT_READ) = 0
mprotect(0x7f3a8b3d8000, 4096, PROT_READ) = 0
munmap(0x7f3a97212000, 110161) = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 24
lseek(24, 0, SEEK_CUR) = 0
fstat(24, {st_mode=S_IFREG|0644, st_size=2368, ...}) = 0
mmap(NULL, 2368, PROT_READ, MAP_SHARED, 24, 0) = 0x7f3a972dc000
lseek(24, 2368, SEEK_SET) = 2368
munmap(0x7f3a972dc000, 2368) = 0
close(24) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3a972bf9d0) = 28692
setpgid(28692, 28692) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=28692, si_uid=1000, si_status=SIGTTOU, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7f3a96e04320, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, 0x7fff68c62424, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
[ ... the SIGWINCH / restart_syscall pair above repeats many more times before the trace is cut off ... ]
|
I think there is some problem when I run
mpirun -np 2 \
-H 192.168.2.243:1,192.168.3.246:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
python3 keras_mnist_advanced.py
Even in an empty folder it still hangs and doesn't output any error. |
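When mpirun hangs silently like this, a quick way to separate MPI/SSH problems from Horovod problems is to first launch a trivial command across the same hosts. A minimal sketch, reusing the host list from the command above:
# Sanity check: if plain MPI can't run a trivial command across both hosts,
# the hang is an MPI/SSH problem, not a Horovod problem.
mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    hostname
# Both hostnames should print almost immediately; if this also hangs,
# check passwordless SSH and firewalls between the two machines first.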
@BIGBALLON, was that output above from simple
? Can you make a gist with the full output? |
After running the cmd
strace mpirun -np 2 \
-H 192.168.2.243:1,192.168.3.246:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
python3 keras_mnist_advanced.py
the full output is here: https://gist.github.com/BIGBALLON/dc617eeafa0160aa912537088fe82d75 |
@BIGBALLON, one quick thing to try:
|
@alsrgv It still doesn't work, and I have the same issue. See https://gist.github.com/BIGBALLON/3968f94527c5b21381d10f77eccd706c |
@BIGBALLON, no problem. It seems that Can you try this instead and update the gist:
|
Hi, @alsrgv
full output is here: |
@BIGBALLON, one more idea, can you If that doesn't help, can you update the gist with the full output? |
@alsrgv Yes.
dl2017@mtk:~/Desktop/horovod/examples$ ssh 192.168.2.243
The authenticity of host '192.168.2.243 (192.168.2.243)' can't be established.
ECDSA key fingerprint is SHA256:AHuuh0g+DN2n0WYBf9qQxlgQKRA1usuo0N8osgHvH6A.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.2.243' (ECDSA) to the list of known hosts.
dl2017@192.168.2.243's password:
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-40-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
10 packages can be updated.
7 updates are security updates.
|
@BIGBALLON, try saying yes, and then running the |
@alsrgv
dl2017@mtk:~$ mpirun -np 2 \
> -H 192.168.2.243:1,192.168.3.246:1 \
> -bind-to none -map-by slot \
> -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
> python3 keras_mnist_advanced.py
^Cdl2017@192.168.2.243's password:
dl2017@mtk:~$ Permission denied, please try again.
dl2017@192.168.2.243's password:
Permission denied, please try again.
dl2017@192.168.2.243's password:
Permission denied (publickey,password).
^C When I using
|
@BIGBALLON, see this guide for passwordless ssh. |
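For reference, passwordless SSH between the nodes can be set up roughly as follows. This is a minimal sketch; the user name and IPs are taken from the session above and may differ on other setups:
# On the machine that runs mpirun, generate a key pair (accept the defaults):
ssh-keygen -t rsa
# Copy the public key to every worker node:
ssh-copy-id dl2017@192.168.2.243
ssh-copy-id dl2017@192.168.3.246
# Verify that this no longer asks for a password:
ssh dl2017@192.168.2.243 true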
@alsrgv But the output is:
(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
> -H 192.168.2.249:1,192.168.3.246:1 \
> -bind-to none -map-by slot \
> -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
> python keras_mnist_advanced.py
Using TensorFlow backend.
Traceback (most recent call last):
File "keras_mnist_advanced.py", line 10, in <module>
import horovod.keras as hvd
ImportError: No module named horovod.keras
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[30732,1],1]
Exit code: 1
--------------------------------------------------------------------------
But a single machine is fine:
(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 1 \
> -H localhost:1 \
> -bind-to none -map-by slot \
> -x LD_LIBRARY_PATH \
> python3 keras_mnist_advanced.py
2017-11-29 11:57:28.714190: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-29 11:57:28.825021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 11:57:28.825375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.50GiB
2017-11-29 11:57:28.825387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Epoch 1/24
82/468 [====>.........................] - ETA: 7s |
@BIGBALLON, did you install Horovod on the second machine, too? Are you using a virtualenv? |
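The ImportError above means the horovod package is missing (or lives in a different environment) on the remote node. A minimal sketch of fixing it, assuming the same virtualenv path (/home/dl2017/Desktop/dp, as used later in this thread) exists on every machine:
# Run on each node listed in -H:
source /home/dl2017/Desktop/dp/bin/activate
pip install horovod        # plus tensorflow-gpu / keras if they are not already installed
# Then launch with the virtualenv's interpreter so every rank uses the same one:
#   mpirun ... /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py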
@alsrgv |
@BIGBALLON, can you try this (note
|
@alsrgv Another problem... XD 😿
dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
> -H 192.168.2.249:1,192.168.3.246:1 \
> -bind-to none -map-by slot \
> -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
> /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: mtk
Local PID: 28422
Peer hostname: mtk ([[26223,1],0])
Source IP of socket: 192.168.2.1
Known IPs of peer:
192.168.3.246
--------------------------------------------------------------------------
[mtk:04140] 1 more process has sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[mtk:04140] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
|
@BIGBALLON, before you were using |
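The "dropped inbound connection" warning above usually means Open MPI picked traffic up on an extra network interface (the connection arrived from 192.168.2.1, which is not one of the IPs listed for the peer). One common workaround is to tell Open MPI which interface or subnet to use for TCP. A sketch, where the interface name eth0 is an assumption and should be replaced with the interface or CIDR both hosts actually share:
# Restrict Open MPI's TCP transport (and out-of-band channel) to one network.
mpirun -np 2 \
    -H 192.168.2.249:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_include eth0 \
    -mca oob_tcp_if_include eth0 \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py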
@alsrgv I made sure the IP address is correct.
|
@BIGBALLON, can you make |
@alsrgv
mpirun -np 2 \
-H 192.168.2.249:1,192.168.2.243:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
/home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py
It runs, but.....:
|
@BIGBALLON, sounds like NCCL is also confused about hostnames being the same, since it's trying to use IPC to talk between machines:
|
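Since NCCL decides whether two ranks are on the same node from the hostname, two machines that both report the name mtk will make it try shared-memory/IPC transports across the network. A minimal sketch of checking and fixing that on Ubuntu 16.04; the new names are placeholders:
# Check on each machine; the names must differ:
hostname
# If both say "mtk", rename one of them (placeholder names):
sudo hostnamectl set-hostname mtk-node1    # on the first machine
sudo hostnamectl set-hostname mtk-node2    # on the second machine
# Update /etc/hosts accordingly, then re-run the mpirun command.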
@alsrgv BTW, it seems two machines (2 GPUs) are slower than one machine (1 GPU)?? |
I am positive that these two machines have different hostnames. |
Otherwise there would be no message related to p2p or shm, I think. |
@haNa-meister, nothing really stands out in that log. Can you try to upgrade to latest NCCL 2.2? If that doesn't help, you can raise an issue at NVIDIA Dev Zone against NCCL. |
OK, actually I have tried NCCL 2.2.12 before. Thanks for your help! |
Hi alex, |
@haNa-meister, excellent! :-) |
I have the same type of problem: blank screen, nothing happens. I can run remotely from one computer to another:
|
When I run it one by one from one machine to another:
When I run both:
As I mentioned, MPI is working fine with Python on both machines in the cluster; I tested it by:
|
This is my strace... So much is missing. Do you know what library I have to add?
|
@zeroprg, I think the issue is with those docker interfaces. Can you try using |
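If the docker bridge interfaces are indeed the problem, one approach is to exclude them from both Open MPI and NCCL. A hedged sketch; the interface names are the usual defaults and the host list is a placeholder:
# Keep Open MPI's TCP traffic off loopback and the docker bridge,
# and tell NCCL to skip docker0 as well:
mpirun -np 2 -H host1:1,host2:1 \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_SOCKET_IFNAME=^docker0 \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py
# host1/host2 stand for the two machines' addresses.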
Finally fixed it and could run Horovod on 2 OrangePi ARM computers. I had 2 different problems. Both problems were related to MPI.
|
@zeroprg, great, looking forward to your blog post! |
Hi, I have the same type of problem. The output is:
Thank you for your help!!! |
@cdaningWings, feels like you have NCCL 1.x on the second node since NCCL output looks different. |
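A quick way to confirm which NCCL build each node actually loads (a sketch; exact install paths depend on how NCCL was installed):
# Run on each node:
ldconfig -p | grep libnccl                           # which libnccl the dynamic linker finds
ls -l /usr/local/cuda/lib64/libnccl* 2>/dev/null     # one common install location
# With NCCL_DEBUG=INFO set, NCCL 2.x also prints a "NCCL version 2.x.y" line
# at startup, which makes it easy to compare the two nodes.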
@alsrgv Thanks for your help! |
@alsrgv Hi, when I run the code on the second node [gpu16; the above code ran on gpu13], the error is as follows:
lf-solar-gpu13-pm:487531:487559 [1] include/socket.h:345 NCCL WARN Net : Socket creation failed : Address family not supported by protocol
So, does it mean that gpu13 has some problem? |
@cdaningWings, can you try |
@alsrgv I have tried, but the error is the same; details as follows: https://gist.github.com/cdaningWings/df60514312957cb7dfa6b4c99fa7cb45 |
@cdaningWings, a couple of things:
|
@alsrgv And I still have a question. When I run the code on gpu13 and the second node is gpu16, the INFO is: NCCL INFO Could not find real path of /sys/class/net/br0/device. But if I run the code on gpu16 and the second node is gpu13, the INFO is: NCCL WARN Net : Socket creation failed : Address family not supported by protocol... |
@cdaningWings, can you paste the output of |
@alsrgv ok
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eno1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
eno2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
The gpu13:
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eno1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
eno2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 |
@alsrgv Hi, I changed the gpu16 to gpu9, and when I run the code on gpu9 I get the exception as follows:
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET : Using interface br0:10.9.22.33<0>
lf-solar-gpu13-pm:504059:504159 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu9-pm:285594:285607 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu13-pm:504059:504159 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu9-pm:285594:285607 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Caused by op 'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0', defined at:
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Caused by op 'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0', defined at:
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
Primary job terminated normally, but 1 process returned
|
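"Call to connect failed : Connection timed out" from NCCL usually points at a firewall or routing problem between the two nodes rather than at Horovod itself. A hedged sketch of checking basic TCP reachability; the port is an arbitrary placeholder and the address comes from the NCCL log above:
# On gpu9 (whose br0 address is 10.9.22.33 per the NCCL log), listen on a test port:
nc -l 12345
# On gpu13, check that it can reach gpu9 on that port:
nc -zv 10.9.22.33 12345
# If this times out, check iptables/firewalld rules and routing between the two
# nodes; NCCL ranks need to open TCP connections to each other on dynamically
# chosen ports.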
It works when using the following cmd on a single machine:
But it doesn't seem to work when using the following cmd on two machines:
It seems the program hangs and outputs nothing.
#106 said that the reason (mpirun will hang, no error, no output) is that MPI doesn't know which card to use.
So I tried to use
But it still doesn't work (mpirun still hangs, no error, no output, no threads run).
Is something wrong??
Can someone help me?