Horovod can't work in distributed mode #110

Closed
BIGBALLON opened this issue Nov 28, 2017 · 63 comments

@BIGBALLON

BIGBALLON commented Nov 28, 2017

horovod (0.11.1)
Keras (2.1.1)
tensorflow-gpu (1.4.0)
openmpi(3.0.0)

It works when using the following command on a single machine:

mpirun -np 1 \
    -H 192.168.2.243:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Epoch 1/24
467/468 [============================>.] - ETA: 0s - loss: 0.6053 - acc: 0.8081mtk:21122:21129 [0] INFO NET : Using interface enp0s31f6:192.168.2.243<0>
mtk:21122:21129 [0] INFO NET/IB : Using interface enp0s31f6 for sideband communication
mtk:21122:21129 [0] INFO Using internal Network Socket
mtk:21122:21129 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
mtk:21122:21129 [0] INFO NET : Using interface enp0s31f6:192.168.2.243<0>
mtk:21122:21129 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.1.2+cuda8.0
mtk:21122:21129 [0] INFO Using 256 threads
mtk:21122:21129 [0] INFO Min Comp Cap 6
mtk:21122:21129 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
469/468 [==============================] - 8s 16ms/step - loss: 0.6027 - acc: 0.8088 - val_loss: 0.0706 - val_acc: 0.9776
Epoch 2/24
469/468 [==============================] - 7s 15ms/step - loss: 0.2589 - acc: 0.9224 - val_loss: 0.0469 - val_acc: 0.9854
Epoch 3/24
469/468 [==============================] - 7s 14ms/step - loss: 0.2044 - acc: 0.9385 - val_loss: 0.0376 - val_acc: 0.9892
Epoch 4/24
469/468 [==============================] - 7s 14ms/step - loss: 0.1818 - acc: 0.9460 - val_loss: 0.0362 - val_acc: 0.9880
Epoch 5/24
469/468 [==============================] - 7s 14ms/step - loss: 0.1584 - acc: 0.9520 - val_loss: 0.0291 - val_acc: 0.9909

But it doesn't seem to work when using the following command across two machines:

(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
>     -H 192.168.2.243:1,192.168.3.246:1 \
>     -bind-to none -map-by slot \
>     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
>     python3 keras_mnist_advanced.py

It seems the program hangs and outputs nothing.

enp0s31f6 Link encap:Ethernet  HWaddr 10:7b:44:16:20:8b  
          inet addr:192.168.2.243  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::1de5:985:b555:96d1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:58080420 errors:0 dropped:0 overruns:0 frame:0
          TX packets:69461209 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:64689742509 (64.6 GB)  TX bytes:87891226474 (87.8 GB)
          Interrupt:16 Memory:f7100000-f7120000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:26940 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26940 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:26761208 (26.7 MB)  TX bytes:26761208 (26.7 MB)

#106 said that the reason (mpirun hangs with no error and no output) is that MPI doesn't know which network interface to use.
So I tried:

mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    -mca btl_tcp_if_include enp0s31f6 \
    python3 keras_mnist_advanced.py

But it still doesn't work (mpirun still hangs: no error, no output, no processes running).
Is something wrong?
Can someone help me?

@alsrgv
Member

alsrgv commented Nov 29, 2017

Hi @BIGBALLON, a couple of questions:

  1. Do both machines have interface enp0s31f6?
  2. If you wait for a long time, say 10 minutes, is there any output?

@BIGBALLON
Author

hello, @alsrgv

  1. Do both machines have interface enp0s31f6? Yes, all of the Ubuntu machines have interface enp0s31f6.
  2. If you wait for a long time, say 10 minutes, is there any output? Yes, I waited, but there is still no output.

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, how did you install Open MPI? Do you happen to have multiple versions of it installed?
Also, can you try strace mpirun ... and see where it gets stuck?

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

Yep, @alsrgv.

  1. Open MPI was installed following the official website's install instructions.
  2. After running strace mpirun, I get the following messages:
rotect(0x7f3a8b1c6000, 4096, PROT_READ) = 0
mprotect(0x7f3a8b3d8000, 4096, PROT_READ) = 0
munmap(0x7f3a97212000, 110161)          = 0
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 24
lseek(24, 0, SEEK_CUR)                  = 0
fstat(24, {st_mode=S_IFREG|0644, st_size=2368, ...}) = 0
mmap(NULL, 2368, PROT_READ, MAP_SHARED, 24, 0) = 0x7f3a972dc000
lseek(24, 2368, SEEK_SET)               = 2368
munmap(0x7f3a972dc000, 2368)            = 0
close(24)                               = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3a972bf9d0) = 28692
setpgid(28692, 28692)                   = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=28692, si_uid=1000, si_status=SIGTTOU, si_utime=0, si_stime=0} ---
sendto(3, "\21", 1, 0, NULL, 0)         = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
recvfrom(4, 0x7f3a96e04320, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, 0x7fff68c62424, WNOHANG, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}], 4, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
restart_syscall(<... resuming interrupted restart_syscall ...>

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

I think there is some problem when I run mpirun,
because when I run the command

mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py

in an empty folder, it still hangs and doesn't output any error.

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, was that output above from a plain strace mpirun, or from

strace mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py

?

Can you make a gist with the full output?

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv

After running the command

strace mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py

the full output is here:

https://gist.github.com/BIGBALLON/dc617eeafa0160aa912537088fe82d75

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, one quick thing to try:

mpirun --prefix /usr/local \
    -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv It still doesn't work, and has the same issue.

see https://gist.github.com/BIGBALLON/3968f94527c5b21381d10f77eccd706c

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, no problem. It seems that mpirun does the fork, at which point strace stops following it.

Can you try this instead and update the gist:

strace -f -e 'trace=!poll' \
    mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    python3 keras_mnist_advanced.py

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

Hi, @alsrgv

[pid  1818] ioctl(4, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid  1818] --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
[pid  1818] rt_sigreturn({mask=[]})     = -1 EINTR (Interrupted system call)
[pid  1818] rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGPIPE, {SIG_IGN, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGTSTP, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] rt_sigaction(SIGTTOU, {SIG_DFL, [], SA_RESTORER, 0x7f0826c854b0}, NULL, 8) = 0
[pid  1818] close(4)                    = 0
[pid  1818] kill(1818, SIGTTOU)         = 0
[pid  1818] --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_USER, si_pid=1818, si_uid=1000} ---
[pid  1813] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=1818, si_uid=1000, si_status=SIGTTOU, si_utime=0, si_stime=0} ---
[pid  1813] sendto(3, "\21", 1, 0, NULL, 0) = 1
[pid  1813] rt_sigreturn({mask=[]})     = -1 EINTR (Interrupted system call)
[pid  1813] recvfrom(4, "\21", 1024, 0, NULL, NULL) = 1
[pid  1813] recvfrom(4, 0x7f32b07eb320, 1024, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid  1813] wait4(-1, 0x7ffdaecca204, WNOHANG, NULL) = 0
[pid  1818] --- stopped by SIGTTOU ---
[pid  1817] <... select resumed> )      = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)
[pid  1817] select(22, [20 21], NULL, NULL, {2, 0}) = 0 (Timeout)

full output is here:
https://gist.github.com/BIGBALLON/cac3540428696770d0b3f6c7caccf43b

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, one more idea, can you ssh 192.168.3.246 from 192.168.2.243? Does it ask any question, like "do you want to add this host identity to known hosts", etc?

If that doesn't help, can you update the gist with the full output?

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv yes.

dl2017@mtk:~/Desktop/horovod/examples$ ssh 192.168.2.243
The authenticity of host '192.168.2.243 (192.168.2.243)' can't be established.
ECDSA key fingerprint is SHA256:AHuuh0g+DN2n0WYBf9qQxlgQKRA1usuo0N8osgHvH6A.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.2.243' (ECDSA) to the list of known hosts.
dl2017@192.168.2.243's password: 
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-40-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

10 packages can be updated.
7 updates are security updates.

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, try saying yes, and then running the mpirun command again.

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv
It doesn't work, but I have another discovery:

dl2017@mtk:~$ mpirun -np 2 \
>     -H 192.168.2.243:1,192.168.3.246:1 \
>     -bind-to none -map-by slot \
>     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
>     python3 keras_mnist_advanced.py
^Cdl2017@192.168.2.243's password: 
dl2017@mtk:~$ Permission denied, please try again.
dl2017@192.168.2.243's password: 
Permission denied, please try again.
dl2017@192.168.2.243's password: 
Permission denied (publickey,password).
^C

When I use Ctrl+C to stop the command, this occurs:

^Cdl2017@192.168.2.243's password: 
dl2017@mtk:~$ Permission denied, please try again.
dl2017@192.168.2.243's password: 
Permission denied, please try again.
dl2017@192.168.2.243's password: 
Permission denied (publickey,password).

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, see this guide for passwordless ssh.
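
In short, it usually comes down to something like this (just a sketch, assuming the same dl2017 user on both machines and the default key location):

ssh-keygen -t rsa                     # on 192.168.2.243; accept the defaults, empty passphrase
ssh-copy-id dl2017@192.168.3.246      # copy the public key to the other machine
ssh 192.168.3.246                     # should now log in without a password prompt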

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv
It works now!! Thanks!

But it outputs ImportError: No module named horovod.keras, so sad!! 😢

(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
>     -H 192.168.2.249:1,192.168.3.246:1 \
>     -bind-to none -map-by slot \
>     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
>     python keras_mnist_advanced.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "keras_mnist_advanced.py", line 10, in <module>
    import horovod.keras as hvd
ImportError: No module named horovod.keras
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30732,1],1]
  Exit code:    1
--------------------------------------------------------------------------

But a single machine is fine:

(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 1 \
>     -H localhost:1 \
>     -bind-to none -map-by slot \
>     -x LD_LIBRARY_PATH \
>     python3 keras_mnist_advanced.py
2017-11-29 11:57:28.714190: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-29 11:57:28.825021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 11:57:28.825375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.50GiB
2017-11-29 11:57:28.825387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Epoch 1/24
 82/468 [====>.........................] - ETA: 7s 

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, did you install Horovod on the second machine, too? Are you using a virtualenv?

@BIGBALLON
Author

@alsrgv
I installed Horovod on both machines, but in each case it is installed inside a virtualenv.

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, can you try this (note python3 path):

mpirun -np 2 \
    -H 192.168.2.243:1,192.168.3.246:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    /path/to/virtualenv/bin/python3 keras_mnist_advanced.py
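
It may also be worth double-checking that Horovod is importable from that virtualenv on both machines, e.g. (a sketch, assuming the virtualenv lives at ~/Desktop/dp):

~/Desktop/dp/bin/pip install horovod                  # run on each machine
~/Desktop/dp/bin/python3 -c "import horovod.keras"    # should exit silently if the module is found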

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv another problem...XD 😿

dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
>     -H 192.168.2.249:1,192.168.3.246:1 \
>     -bind-to none -map-by slot \
>     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
>     /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          mtk
  Local PID:           28422
  Peer hostname:       mtk ([[26223,1],0])
  Source IP of socket: 192.168.2.1
  Known IPs of peer:   
	192.168.3.246
--------------------------------------------------------------------------
[mtk:04140] 1 more process has sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[mtk:04140] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, before you were using -H 192.168.2.243:1,192.168.3.246:1, did you change the IP address of your server?

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv I am sure the IP addresses are correct.
(There are 8 machines. I used .3.246 & .2.249 to test passwordless ssh, so I changed the command.)

dl2017@mtk:~/.ssh$ ifconfig 
enp0s31f6 Link encap:Ethernet  HWaddr 10:7b:44:16:20:14  
          inet addr:192.168.3.246  Bcast:192.168.3.255  Mask:255.255.255.0

(dp) dl2017@mtk:~/Desktop/horovod/examples$ ifconfig 
enp0s31f6 Link encap:Ethernet  HWaddr 10:7b:44:16:20:e4  
          inet addr:192.168.2.249  Bcast:192.168.2.255  Mask:255.255.255.0

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, can you make the hostname different on each server? Also, which machine has the address 192.168.2.1?
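
For example, on Ubuntu 16.04 that can be done along these lines (a sketch; worker1 and worker2 are just placeholder names):

sudo hostnamectl set-hostname worker1    # on the first machine
sudo hostnamectl set-hostname worker2    # on the second machine
# also update /etc/hosts if it still maps the old hostname to 127.0.1.1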

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv
It seems that 192.168.2.x can only reach 192.168.2.x,
so I changed the command:

mpirun -np 2 \
    -H 192.168.2.249:1,192.168.2.243:1 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py

it runs, but.....:

(dp) dl2017@mtk:~/Desktop/horovod/examples$ mpirun -np 2 \
>     -H 192.168.2.249:1,192.168.2.243:1 \
>     -bind-to none -map-by slot \
>     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
>     /home/dl2017/Desktop/dp/bin/python3 keras_mnist_advanced.py
debug
2017-11-29 12:45:21.683962: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-29 12:45:21.689121: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-29 12:45:21.790597: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 12:45:21.790954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.69GiB
2017-11-29 12:45:21.790965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-11-29 12:45:21.799627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-29 12:45:21.800166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.69GiB
2017-11-29 12:45:21.800180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Epoch 1/24
Epoch 1/24
mtk:2521:2528 [0] INFO NET : Using interface eth0:192.168.2.243<0>
mtk:2521:2528 [0] INFO NET/IB : Using interface eth0 for sideband communication
mtk:2521:2528 [0] INFO Using internal Network Socket
mtk:2521:2528 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
mtk:2521:2528 [0] INFO NET : Using interface eth0:192.168.2.243<0>
mtk:2521:2528 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.1.2+cuda8.0
mtk:29020:29027 [0] INFO NET : Using interface enp0s31f6:192.168.2.249<0>
mtk:29020:29027 [0] INFO NET/IB : Using interface enp0s31f6 for sideband communication
mtk:29020:29027 [0] INFO Using internal Network Socket
mtk:29020:29027 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
mtk:29020:29027 [0] INFO NET : Using interface enp0s31f6:192.168.2.249<0>
mtk:29020:29027 [0] INFO NET/Socket : 1 interfaces found
mtk:2521:2528 [0] INFO Using 256 threads
mtk:2521:2528 [0] INFO Min Comp Cap 6
mtk:2521:2528 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
mtk:2521:2528 [0] INFO [0] Ring 0 :    0   1
mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC
mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC
mtk:29020:29027 [0] INFO 1 -> 0 via P2P/IPC
mtk:29020:29027 [0] INFO 1 -> 0 via P2P/IPC

mtk:2521:2528 [0] transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
mtk:2521:2528 [0] INFO transport/p2p.cu:441 -> 1
mtk:2521:2528 [0] INFO init.cu:462 -> 1
mtk:2521:2528 [0] INFO init.cu:517 -> 1

mtk:29020:29027 [0] transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
mtk:29020:29027 [0] INFO transport/p2p.cu:441 -> 1
mtk:29020:29027 [0] INFO init.cu:462 -> 1
mtk:29020:29027 [0] INFO init.cu:517 -> 1
mtk:2521:2528 [0] INFO Using 256 threads
mtk:2521:2528 [0] INFO Min Comp Cap 6
mtk:2521:2528 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
mtk:2521:2528 [0] INFO [0] Ring 0 :    0   1
mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC
mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC
mtk:29020:29027 [0] INFO 1 -> 0 via P2P/IPC
mtk:29020:29027 [0] INFO 1 -> 0 via P2P/IPC

mtk:2521:2528 [0] transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
mtk:2521:2528 [0] INFO transport/p2p.cu:441 -> 1
mtk:2521:2528 [0] INFO init.cu:462 -> 1
mtk:2521:2528 [0] INFO init.cu:517 -> 1

mtk:29020:29027 [0] transport/p2p.cu:431 WARN failed to open CUDA IPC handle : 30 unknown error
mtk:29020:29027 [0] INFO transport/p2p.cu:441 -> 1
mtk:29020:29027 [0] INFO init.cu:462 -> 1
mtk:29020:29027 [0] INFO init.cu:517 -> 1
Using TensorFlow backend.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "keras_mnist_advanced.py", line 122, in <module>
    validation_steps=3 * test_batches // hvd.size())
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 1223, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 2114, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1832, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2352, in __call__
    **self.session_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

Caused by op 'training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0', defined at:
  File "keras_mnist_advanced.py", line 122, in <module>
    validation_steps=3 * test_batches // hvd.size())
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 1223, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1996, in fit_generator
    self._make_train_function()
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/optimizers.py", line 344, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/keras/__init__.py", line 57, in get_gradients
    device_sparse=self._device_sparse)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
    summed_tensor = _allreduce(tensor)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 37, in horovod_allreduce
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
    op_type_name, name, **keywords)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

UnknownError (see above for traceback): ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "keras_mnist_advanced.py", line 121, in <module>
    validation_steps=3 * test_batches // hvd.size())
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 1223, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 2114, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1832, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2352, in __call__
    **self.session_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

Caused by op 'training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0', defined at:
  File "keras_mnist_advanced.py", line 121, in <module>
    validation_steps=3 * test_batches // hvd.size())
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 1223, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1996, in fit_generator
    self._make_train_function()
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/optimizers.py", line 344, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/keras/__init__.py", line 57, in get_gradients
    device_sparse=self._device_sparse)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
    summed_tensor = _allreduce(tensor)
  File "/home/dl2017/Desktop/dp/lib/python3.5/site-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 37, in horovod_allreduce
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
    op_type_name, name, **keywords)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

UnknownError (see above for traceback): ncclCommInitRank failed: unhandled cuda error
	 [[Node: training/Adadelta/DistributedAdadelta_Allreduce/HorovodAllreduce_training_Adadelta_gradients_dense_2_MatMul_grad_MatMul_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adadelta/gradients/dense_2/MatMul_grad/MatMul_1)]]

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32656,1],1]
  Exit code:    1
--------------------------------------------------------------------------

@alsrgv
Member

alsrgv commented Nov 29, 2017

@BIGBALLON, sounds like NCCL is also confused about hostnames being the same, since it's trying to use IPC to talk between machines:

mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC
mtk:2521:2528 [0] INFO 0 -> 1 via P2P/IPC

@BIGBALLON
Author

BIGBALLON commented Nov 29, 2017

@alsrgv
Cool!!! I changed the hostname on one machine!!!!
It works!!! Perfect!!

BTW, it seems two machines (2 GPUs) are slower than one machine (1 GPU)??
Does that make sense?

@haNa-meister

I am positive that these two machines have different hostnames.
And when I enable P2P, dmesg shows this:
[504615.873198] CPU31: Package power limit normal
[513819.673902] traps: titan_for_serve[58922] general protection ip:7fae46ab0208 sp:7ffc3ac67ec0 error:0 in libc-2.17.so[7fae46a34000+1b8000]
[524968.829720] traps: titan_for_serve[59461] general protection ip:7fc9ead15208 sp:7ffd82a59e30 error:0 in libc-2.17.so[7fc9eac99000+1b8000]
[535878.188978] traps: titan_for_serve[59061] general protection ip:7f14c6c5f208 sp:7ffe9d5b6890 error:0 in libc-2.17.so[7f14c6be3000+1b8000]
[547386.987218] traps: titan_for_serve[81300] general protection ip:7f1fe47e3208 sp:7ffd702b05e0 error:0 in libc-2.17.so[7f1fe4767000+1b8000]
[558056.602689] traps: titan_for_serve[76123] general protection ip:7f5e3f191208 sp:7ffdd592ca50 error:0 in libc-2.17.so[7f5e3f115000+1b8000]
[569325.624885] traps: titan_for_serve[85118] general protection ip:7f4a33e23208 sp:7ffced400c20 error:0 in libc-2.17.so[7f4a33da7000+1b8000]
[580235.000651] traps: titan_for_serve[83622] general protection ip:7f81b8826208 sp:7ffcf8644030 error:0 in libc-2.17.so[7f81b87aa000+1b8000]
[586011.336229] perf samples too long (86752 > 76923), lowering kernel.perf_event_max_sample_rate to 1750
[590904.613382] traps: titan_for_serve[81755] general protection ip:7fead876d208 sp:7ffd9c5d4540 error:0 in libc-2.17.so[7fead86f1000+1b8000]
[602293.514782] traps: titan_for_serve[84665] general protection ip:7f6171b2c208 sp:7fff10594080 error:0 in libc-2.17.so[7f6171ab0000+1b8000]
[608126.771182] all_reduce_perf[45280]: segfault at 1 ip 00007fd6585a9d8c sp 00007ffe22dd5898 error 4 in libc-2.17.so[7fd658523000+1b8000]
[608126.772180] all_reduce_perf[45281]: segfault at 1 ip 00007f01f6901d8c sp 00007ffe8e3ba718 error 4 in libc-2.17.so[7f01f687b000+1b8000]
[608126.772640] all_reduce_perf[45282]: segfault at 1 ip 00007fcc786cdd8c sp 00007fffe7b06158 error 4 in libc-2.17.so[7fcc78647000+1b8000]
[608126.773493] all_reduce_perf[45283]: segfault at 1 ip 00007f103e48ed8c sp 00007ffd5dfc1398 error 4 in libc-2.17.so[7f103e408000+1b8000]
[608126.774334] all_reduce_perf[45284]: segfault at 1 ip 00007f1c08751d8c sp 00007fff795dc538 error 4 in libc-2.17.so[7f1c086cb000+1b8000]
[608126.775259] all_reduce_perf[45286]: segfault at 1 ip 00007f5ab2fb3d8c sp 00007ffd601783c8 error 4 in libc-2.17.so[7f5ab2f2d000+1b8000]
[613802.313372] traps: titan_for_serve[2228] general protection ip:7fc80fa2b208 sp:7ffc28aa2ae0 error:0 in libc-2.17.so[7fc80f9af000+1b8000]
[625191.218123] traps: titan_for_serve[6637] general protection ip:7fe218e0d208 sp:7ffff669d3f0 error:0 in libc-2.17.so[7fe218d91000+1b8000]
[636701.015923] traps: titan_for_serve[26191] general protection ip:7fb37e852208 sp:7ffd8044e570 error:0 in libc-2.17.so[7fb37e7d6000+1b8000]
[648088.931338] traps: titan_for_serve[27946] general protection ip:7fcd5089a208 sp:7fff2d049fb0 error:0 in libc-2.17.so[7fcd5081e000+1b8000]
[659717.602733] traps: titan_for_serve[39951] general protection ip:7fba07d53208 sp:7ffdd5355050 error:0 in libc-2.17.so[7fba07cd7000+1b8000]
[670147.445286] traps: titan_for_serve[35682] general protection ip:7f3f0d4ca208 sp:7fff4e2d1210 error:0 in libc-2.17.so[7f3f0d44e000+1b8000]
[678326.969792] mlx5_core 0000:0b:00.1 eth0: Link down
[678326.971040] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[678326.971042] 8021q: adding VLAN 0 to HW filter on device eth0
[678327.817344] mlx5_core 0000:0b:00.0 eth1: Link up
[678327.817991] 8021q: adding VLAN 0 to HW filter on device eth1
[680577.289339] traps: titan_for_serve[29152] general protection ip:7f5603934208 sp:7fffedc1dc70 error:0 in libc-2.17.so[7f56038b8000+1b8000]
[692086.079912] traps: titan_for_serve[28810] general protection ip:7fcd3242e208 sp:7fffbb942b60 error:0 in libc-2.17.so[7fcd323b2000+1b8000]
[698943.146678] perf samples too long (148922 > 142857), lowering kernel.perf_event_max_sample_rate to 1000
[702755.697323] traps: titan_for_serve[28685] general protection ip:7ff12f2c6208 sp:7ffd421d6cb0 error:0 in libc-2.17.so[7ff12f24a000+1b8000]
[713784.956391] traps: titan_for_serve[26652] general protection ip:7f1f318bc208 sp:7ffef5159f70 error:0 in libc-2.17.so[7f1f31840000+1b8000]
[723976.030508] traps: titan_for_serve[39335] general protection ip:7f3b08585208 sp:7ffd25771680 error:0 in libc-2.17.so[7f3b08509000+1b8000]
[734765.521975] traps: titan_for_serve[34016] general protection ip:7fc36ad91208 sp:7ffef34d9870 error:0 in libc-2.17.so[7fc36ad15000+1b8000]
[745194.363376] traps: titan_for_serve[34957] general protection ip:7f2783dd3208 sp:7fffdf69f630 error:0 in libc-2.17.so[7f2783d57000+1b8000]
[756583.280369] traps: titan_for_serve[36125] general protection ip:7f58b2633208 sp:7fff04edb0e0 error:0 in libc-2.17.so[7f58b25b7000+1b8000]
[767732.431507] traps: titan_for_serve[36580] general protection ip:7fce98700208 sp:7ffc40f32b50 error:0 in libc-2.17.so[7fce98684000+1b8000]
[770445.069595] CPU65: Core power limit notification (total events = 22)
[770445.069597] CPU21: Core power limit notification (total events = 22)
[770445.069599] CPU7: Core power limit notification (total events = 22)
[770445.069601] CPU46: Core power limit notification (total events = 22)

@haNa-meister

Otherwise there is no message related to p2p or shm, I think.

@alsrgv
Member

alsrgv commented May 31, 2018

@haNa-meister, nothing really stands out in that log. Can you try to upgrade to latest NCCL 2.2? If that doesn't help, you can raise an issue at NVIDIA Dev Zone against NCCL.

@haNa-meister

OK, actually I have tried NCCL 2.2.12 before. Thanks for your help!

@haNa-meister

Hi Alex,
Thanks for your help!
I finally figured it out; it is a problem related to the hostname.
We use '.' as a separator in our server names, so the prefix gets treated as the hostname, which is the same on both machines.
I realized this today and changed the hostname, and now it works!
Thanks a lot!

@alsrgv
Member

alsrgv commented Jun 1, 2018

@haNa-meister, excellent! :-)

@zeroprg

zeroprg commented Sep 27, 2018

I have the same type of problem: blank screen, nothing happens.
I only run on CPU.

I can run remotely from one computer to another:
mpirun -np 1 -H 192.168.1.78 python3 horovod_exmp/keras_mnist.py
and from the other one:
mpirun -np 1 -H 192.168.1.70 python3 horovod_exmp/keras_mnist.py
But when I combine the 2 machines in one cluster, it fails:
mpirun -np 1 -H 192.168.1.70,192.168.1.78 python3 horovod_exmp/keras_mnist.py gives me a blank screen, while an MPI helloworld.py works:
mpirun -np 1 -H 192.168.1.70,192.168.1.78 python3 helloworld.py
I checked my network interfaces: I don't have an enp0s31f6 interface, I have only:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:81:05:1e:eb:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.78/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 62723sec preferred_lft 62723sec
    inet6 2001:56a:77e0:fd00:46be:3ed3:8c87:d3d8/64 scope global noprefixroute dynamic
       valid_lft 14648sec preferred_lft 14348sec
    inet6 fe80::bf63:3907:cab2:f275/64 scope link
       valid_lft forever preferred_lft forever
3: docker_gwbridge: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:bd:eb:b1:5b brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:98:8d:47:8a brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

@zeroprg

zeroprg commented Sep 27, 2018

When I run them one at a time from one machine to the other:

mpiuser@opi1:~$ mpirun -bind-to none -np 1 -H 192.168.1.78  python3 horovod_exmp/keras_mnist.py
Data loaded
horovod initialised
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples

When I run on both machines:
hvd.init() hangs:

Data loaded
Data loaded

As I mentioned, MPI works fine with Python in the cluster on both machines; I tested it with:

mpiuser@opi1:~$ mpirun -bind-to none -np 2 -H 192.168.1.78,192.168.1.70  python3 helloworld.py
Hello, World! I am process 0 of 2 on opi1.
Hello, World! I am process 1 of 2 on rock64_1.

@zeroprg

zeroprg commented Sep 27, 2018

This is my strace... so many libraries are reported missing. Do you know what library I have to add?

mpiuser@opi1:~$ strace mpirun -bind-to none -np 2 -H 192.168.1.78,192.168.1.70  python3 horovod_exmp/keras_mnist.py
execve("/usr/bin/mpirun", ["mpirun", "-bind-to", "none", "-np", "2", "-H", "192.168.1.78,192.168.1.70", "python3", "horovod_exmp/keras_mnist.py"], [/* 22 vars */]) = 0
brk(NULL)                               = 0x2308000
uname({sysname="Linux", nodename="opi1", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6fb7000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/neon/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/neon", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/v7l", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/neon/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/neon", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/tls", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/neon/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/neon", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/v7l", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/neon/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/neon", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib/vfp", 0xbe92ce98) = -1 ENOENT (No such file or directory)
open("/usr/lib/arm-linux-gnueabihf/openmpi/lib/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/arm-linux-gnueabihf/openmpi/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("tls/v7l/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/v7l/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/v7l/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/v7l/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("tls/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("v7l/neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("v7l/neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("v7l/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("v7l/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("neon/vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("neon/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("vfp/libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("libopen-rte.so.20", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=30356, ...}) = 0
mmap2(NULL, 30356, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb6faf000
close(3) 

@alsrgv
Member

alsrgv commented Sep 27, 2018

@zeroprg, I think the issue is with those docker interfaces. Can you try using -mca btl_tcp_if_include eth0 -x NCCL_SOCKET_IFNAME=eth0 in your mpirun command?
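
Folded into your cluster command, that would look roughly like this (just a sketch; eth0 is the interface name from your ip output, and -np 2 assumes one rank per machine):

mpirun -np 2 -H 192.168.1.70,192.168.1.78 \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_include eth0 \
    -x NCCL_SOCKET_IFNAME=eth0 \
    python3 horovod_exmp/keras_mnist.py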

@zeroprg

zeroprg commented Sep 28, 2018

Finally fixed it and could run Horovod on 2 OrangePi ARM computers. I had a couple of different problems, all related to MPI.

  1. MPI must be the same version on all machines (see the quick version check after this list). In particular, I used OpenMPI 2.0.2 from the Debian repository.
  2. All machines in the cluster must be the same processor type, 64-bit or 32-bit accordingly, because the libraries are loaded in a cross-machine manner. I had errors (see my post above) about 64/32-bit libraries not being found, even though they exist.
  3. Another reason it doesn't work by default is that the network card must be explicitly specified with this option:
    -mca btl_tcp_if_include eth0
    So finally my command looks like this:
    mpirun -np 2 -H opi1,opi2 -bind-to none -map-by slot -mca btl_tcp_if_include eth0 python3 horovod_exmp/keras_mnist.py
    Again, thank you very much for this thread, it's very useful. Don't forget to strace the command if a problem happens. I will publish a blog post about running Horovod on an OrangePi cluster of 8 machines. No Docker needed; it's amazing performance for such small electricity consumption.
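
A quick way to verify point 1 is to compare the Open MPI version on each node (a sketch, assuming the second node is reachable as opi2):

mpirun --version              # run on each node; the reported versions should match
ssh opi2 mpirun --version     # or check the remote node over ssh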

@alsrgv
Member

alsrgv commented Sep 28, 2018

@zeroprg, great, looking forward to your blog post!

@cdaningWings

cdaningWings commented Oct 25, 2018

Hi, I have the same type of problem.
My environment is CentOS 7, NCCL 2.3.1, and Open MPI 3.1.2. When I run:
strace mpirun -np 4 -H lf-solar-gpu13-pm:2,lf-solar-gpu16-pm:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -mca pml ob1 -mca btl ^openib python3 tensorflow_mnist.py

the output is :
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO Using internal Network Socket
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO NET : Using interface bond0:fe80::ae1f:6bff:fe20:f588%bond0<0>
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO NET/Socket : 2 interfaces found
NCCL version 2.3.5+cuda9.0
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO rank 0 nranks 4
INFO NCCL debug level set to INFO
INFO NCCL debug level set to INFO
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO Using internal Network Socket
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO rank 1 nranks 4
INFO rank 3 using buffSize = 2097152
INFO rank 2 using buffSize = 2097152
INFO rank 3 using device 1 (0000:03:00.0)
WARN src/core.cu:784 rank 3 failed to build comm maps
INFO rank 2 using device 0 (0000:02:00.0)
WARN src/core.cu:784 rank 2 failed to build comm maps
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO comm 0x7f7c00c4b4e0 rank 0 nranks 4
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu13-pm:473785:473811 [0] NCCL INFO CUDA Dev 0, IP Interfaces : br0(SOC) bond0(SOC)
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO comm 0x7f60c848b6a0 rank 1 nranks 4
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO NET : Using interface bond0:fe80::ae1f:6bff:fe20:f588%bond0<0>
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO NET/Socket : 2 interfaces found
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu13-pm:473786:473813 [1] NCCL INFO CUDA Dev 1, IP Interfaces : br0(SOC) bond0(SOC)
and it seems the program hangs and outputs nothing after that.
The whole file is here: log

Thank you for your help!

@alsrgv
Member

alsrgv commented Oct 25, 2018

@cdaningWings, it feels like you have NCCL 1.x on the second node, since the NCCL output looks different.
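
One way to check which NCCL each node actually has (a sketch; the header path is an assumption and may differ on your systems):

ldconfig -p | grep libnccl                  # shared libraries the loader knows about
grep NCCL_MAJOR /usr/local/include/nccl.h   # major version defined in the installed header

NCCL 2.x should show up as libnccl.so.2 with NCCL_MAJOR set to 2; compare the output on both nodes.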

@cdaningWings

@alsrgv Thanks for your help!
But how can I determine whether my second node is on NCCL 1.x? My LD_LIBRARY_PATH does not contain any NCCL 1.x paths.

@cdaningWings

@alsrgv Hi, when I run the code on the second node [gpu16; the code above ran on gpu13], the error is as follows:
INFO NCCL debug level set to INFO
INFO NCCL debug level set to INFO
INFO rank 1 using buffSize = 2097152
INFO rank 0 using buffSize = 2097152
INFO rank 1 using device 1 (0000:03:00.0)
INFO rank 0 using device 0 (0000:02:00.0)
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO Using internal Network Socket
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO rank 3 nranks 4
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO Using internal Network Socket
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO rank 2 nranks 4
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO comm 0x7f2a20413010 rank 2 nranks 4
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO comm 0x7ffba445e890 rank 3 nranks 4
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO NET : Using interface bond0:fe80::ae1f:6bff:fe20:f588%bond0<0>
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO NET : Using interface bond0:fe80::ae1f:6bff:fe20:f588%bond0<0>
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO NET/Socket : 2 interfaces found
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO NET/Socket : 2 interfaces found

lf-solar-gpu13-pm:487531:487559 [1] include/socket.h:345 NCCL WARN Net : Socket creation failed : Address family not supported by protocol
lf-solar-gpu13-pm:487530:487556 [0] include/socket.h:345 NCCL WARN Net : Socket creation failed : Address family not supported by protocol
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO bootstrap.cu:19 -> 2
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO bootstrap.cu:225 -> 2
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO init.cu:420 -> 2
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO bootstrap.cu:19 -> 2
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO bootstrap.cu:225 -> 2
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO init.cu:420 -> 2
lf-solar-gpu13-pm:487530:487556 [0] NCCL INFO init.cu:557 -> 2
lf-solar-gpu13-pm:487531:487559 [1] NCCL INFO init.cu:557 -> 2

So, does this mean gpu13 has some problem?

@alsrgv
Member

alsrgv commented Oct 25, 2018

@cdaningWings, can you try -x NCCL_SOCKET_IFNAME=br0 to select the IPv4 br0 device instead of the IPv6 bond0 device?
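
Applied to the command from earlier in this thread, it would look roughly like this (a sketch; restricting MPI itself with -mca btl_tcp_if_include br0 is an extra assumption, added by analogy with the eth0 suggestion earlier in this thread):

mpirun -np 4 -H lf-solar-gpu13-pm:2,lf-solar-gpu16-pm:2 \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_include br0 -mca pml ob1 -mca btl ^openib \
    -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=br0 \
    python3 tensorflow_mnist.py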

@cdaningWings

cdaningWings commented Oct 25, 2018

@alsrgv I have tried that, but the error is the same; details here: https://gist.github.com/cdaningWings/df60514312957cb7dfa6b4c99fa7cb45

@alsrgv
Member

alsrgv commented Oct 25, 2018

@cdaningWings, a couple of things:

  1. Can you uninstall horovod, and reinstall with HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/path/to/nccl2.3.5 pip install -v --no-cache-dir horovod? (Full commands are sketched below this list.)
  2. If (1) still does not help, can you try ssh-ing to 10.9.22.45 from the other server?
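
Spelled out as shell commands (a sketch; /path/to/nccl2.3.5 is a placeholder for the actual NCCL 2.3.5 install prefix):

pip uninstall -y horovod
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/path/to/nccl2.3.5 \
    pip install -v --no-cache-dir horovod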

@cdaningWings

@alsrgv
1. I have reinstalled horovod, but it still does not help.
2. Yes, I can ssh to 10.9.22.45 from the other server.

And I still have a question: when I run the code on gpu13 with gpu16 as the second node, the message is: NCCL INFO Could not find real path of /sys/class/net/br0/device. But if I run the code on gpu16 with gpu13 as the second node, the message is: NCCL WARN Net : Socket creation failed : Address family not supported by protocol...
Is this a network problem or a hardware problem? My server configurations are the same.

@alsrgv
Member

alsrgv commented Oct 25, 2018

@cdaningWings, can you paste the output of ifconfig from both servers?

@cdaningWings

@alsrgv ok
The gpu16:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet6 fe80::ae1f:6bff:fe21:d4fc prefixlen 64 scopeid 0x20
ether ac:1f:6b:21:d4:fc txqueuelen 0 (Ethernet)
RX packets 4138422 bytes 3349556769 (3.1 GiB)
RX errors 0 dropped 8625 overruns 0 frame 0
TX packets 2693773 bytes 1648991110 (1.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.22.51 netmask 255.255.255.0 broadcast 10.9.22.255
inet6 fe80::ae1f:6bff:fe21:d4fc prefixlen 64 scopeid 0x20
ether ac:1f:6b:21:d4:fc txqueuelen 0 (Ethernet)
RX packets 1800828 bytes 2785223618 (2.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2693767 bytes 1648991254 (1.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether ac:1f:6b:21:d4:fc txqueuelen 1000 (Ethernet)
RX packets 4138038 bytes 3349532865 (3.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2693780 bytes 1648992512 (1.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfb320000-fb33ffff

eno2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether ac:1f:6b:21:d4:fc txqueuelen 1000 (Ethernet)
RX packets 386 bytes 24036 (23.4 KiB)
RX errors 0 dropped 386 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfb300000-fb31ffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 0 (Local Loopback)
RX packets 177655 bytes 48921269 (46.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 177655 bytes 48921269 (46.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

The gpu13:
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet6 fe80::ae1f:6bff:fe20:f588 prefixlen 64 scopeid 0x20
ether ac:1f:6b:20:f5:88 txqueuelen 0 (Ethernet)
RX packets 17978543 bytes 18010902851 (16.7 GiB)
RX errors 0 dropped 3444 overruns 169 frame 0
TX packets 16557309 bytes 15827401301 (14.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.22.45 netmask 255.255.255.0 broadcast 10.9.22.255
inet6 fe80::ae1f:6bff:fe20:f588 prefixlen 64 scopeid 0x20
ether ac:1f:6b:20:f5:88 txqueuelen 0 (Ethernet)
RX packets 7608185 bytes 17216922249 (16.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 16557270 bytes 15827398607 (14.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether ac:1f:6b:20:f5:88 txqueuelen 1000 (Ethernet)
RX packets 3446 bytes 307525 (300.3 KiB)
RX errors 0 dropped 3444 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfb320000-fb33ffff

eno2: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
ether ac:1f:6b:20:f5:88 txqueuelen 1000 (Ethernet)
RX packets 17975104 bytes 18010595788 (16.7 GiB)
RX errors 0 dropped 0 overruns 169 frame 0
TX packets 16557324 bytes 15827404255 (14.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device memory 0xfb300000-fb31ffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 0 (Local Loopback)
RX packets 17371080 bytes 371725587331 (346.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 17371080 bytes 371725587331 (346.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

@cdaningWings

@alsrgv Hi, I changed gpu16 to gpu9, and when I run the code on gpu9 I get the following exception:

lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET : Using interface br0:10.9.22.33<0>
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Using internal Network Socket
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET : Using interface br0:10.9.22.33<0>
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET : Using interface bond0:fe80::ec4:7aff:feb3:29d6%bond0<0>
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO NET/Socket : 2 interfaces found
NCCL version 2.3.5+cuda9.0
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO rank 0 nranks 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO NET/IB : Using interface br0 for sideband communication
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Using internal Network Socket
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO rank 1 nranks 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO comm 0x7fc6803285d0 rank 0 nranks 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO CUDA Dev 0, IP Interfaces : br0(SOC) bond0(SOC)
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO comm 0x7f2c6844d470 rank 1 nranks 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO NET : Using interface br0:10.9.22.45<0>
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO NET : Using interface bond0:fe80::ae1f:6bff:fe20:f588%bond0<0>
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO NET/Socket : 2 interfaces found
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO CUDA Dev 0, IP Interfaces : br0(SOC) bond0(SOC)
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Using 256 threads
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Min Comp Cap 5
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 00 : 0 1
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 01 : 0 1
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 00 : 1 -> 0 via NET/Socket/0
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Ring 00 : 0 -> 1 via NET/Socket/0
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 01 : 1 -> 0 via NET/Socket/1
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Ring 01 : 0 -> 1 via NET/Socket/1

lf-solar-gpu13-pm:504059:504159 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO include/net.h:25 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO transport/net.cu:290 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO init.cu:490 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO init.cu:557 -> 2

lf-solar-gpu9-pm:285594:285607 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO include/net.h:25 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO transport/net.cu:290 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO init.cu:490 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO init.cu:557 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO rank 0 nranks 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO rank 1 nranks 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO comm 0x7f2c684434a0 rank 1 nranks 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO comm 0x7fc68032af50 rank 0 nranks 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO CUDA Dev 0, IP Interfaces : br0(SOC) bond0(SOC)
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Could not find real path of /sys/class/net/br0/device
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Could not find real path of /sys/class/net/bond0/device
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO CUDA Dev 0, IP Interfaces : br0(SOC) bond0(SOC)
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Using 256 threads
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Min Comp Cap 5
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 00 : 0 1
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 01 : 0 1
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Ring 00 : 0 -> 1 via NET/Socket/0
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 00 : 1 -> 0 via NET/Socket/0
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO Ring 01 : 1 -> 0 via NET/Socket/1
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO Ring 01 : 0 -> 1 via NET/Socket/1

lf-solar-gpu13-pm:504059:504159 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO include/net.h:25 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO transport/net.cu:290 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO init.cu:490 -> 2
lf-solar-gpu13-pm:504059:504159 [0] NCCL INFO init.cu:557 -> 2

lf-solar-gpu9-pm:285594:285607 [0] include/socket.h:361 NCCL WARN Call to connect failed : Connection timed out
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO transport/net_socket.cu:139 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO include/net.h:25 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO transport/net.cu:290 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO init.cu:490 -> 2
lf-solar-gpu9-pm:285594:285607 [0] NCCL INFO init.cu:557 -> 2
Traceback (most recent call last):
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tensorflow_mnist.py", line 129, in
tf.app.run()
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
sys.exit(main(argv))
File "tensorflow_mnist.py", line 125, in main
, step = mon_sess.run([train_op, global_step], feed_dict={image: image, label: label
})
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 981, in run
return self._sess.run(*args, **kwargs)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 129, in
tf.app.run()
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "tensorflow_mnist.py", line 88, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 399, in minimize
grad_loss=grad_loss)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/horovod/tensorflow/init.py", line 194, in compute_gradients
compression=self._compression)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/horovod/tensorflow/init.py", line 83, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 90, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "", line 50, in horovod_allreduce
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/home/research/kangcp/opt/Anaconda3-5.2.0/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Traceback (most recent call last):
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tensorflow_mnist.py", line 129, in
tf.app.run()
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
sys.exit(main(argv))
File "tensorflow_mnist.py", line 125, in main
, step = mon_sess.run([train_op, global_step], feed_dict={image: image, label: label
})
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
run_metadata=run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 981, in run
return self._sess.run(*args, **kwargs)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 129, in
tf.app.run()
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "tensorflow_mnist.py", line 88, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 399, in minimize
grad_loss=grad_loss)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/horovod/tensorflow/init.py", line 194, in compute_gradients
compression=self._compression)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/horovod/tensorflow/init.py", line 83, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 90, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "", line 50, in horovod_allreduce
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/home/research/kangcp/opt/Anaconda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_fully_connected_1_BiasAdd_grad_tuple_control_dependency_1_0 = HorovodAllreduceT=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[41334,1],1]
Exit code: 1

gpu9's graphics card is an M40, while gpu13 and gpu16 have P40s.

@cdaningWings

@alsrgv Hi, I finally fixed it and could run Horovod; it turned out gpu16 had PyTorch installed, see #91.
Thank you very much!
