Some error while running pytorch_mnist.py #489
Comments
Hey @yh673025667, which version of Horovod are you running? There were a number of changes made to PyTorch support in the latest version; you might want to try using 0.13.10.
@yh673025667, do you still have this issue if you add -mca btl ^openib?
@alsrgv If I add -mca btl ^openib to my shell command, it hangs forever, even when running TensorFlow or Keras.
@alsrgv ERROR LOG when adding -mca btl ^openib:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This should not happen.
Your Open MPI job may now fail.
Local host: nmyjs836
@tgaddair :( I have reinstalled horovod==0.13.10, but I even get errors while running TensorFlow and Keras.
@yh673025667, to resolve the hangs, can you make sure you exclude all the non-routed interfaces (or explicitly include the routed ones) via this doc? Your error message with 0.13.10 looks suspicious. Can you make sure you reinstalled on both nodes and used …
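For reference, a minimal sketch of what excluding non-routed interfaces looks like at the Open MPI level; the interface names are taken from the ifconfig output later in this thread, and the rest of the command is elided:

# Keep MPI's TCP traffic off the loopback, container, overlay, and IB interfaces
# (btl_tcp_if_exclude and oob_tcp_if_exclude are standard Open MPI MCA parameters)
mpirun -mca btl_tcp_if_exclude lo,docker0,flannel.1,ib0 \
       -mca oob_tcp_if_exclude lo,docker0,flannel.1,ib0 \
       ...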
@alsrgv Yes, I have tested excluding all the non-routed interfaces, and TensorFlow is OK, but PyTorch still has errors.
@alsrgv I have installed 0.13.10 via pip with …
@yh673025667, to clarify, what are the latest full mpirun commands that work with TensorFlow and crash with PyTorch?
-x NCCL_IB_DISABLE=1 is required because TensorFlow encounters errors when it is not added.
@yh673025667, if you exclude the non-routed interfaces and add that flag, does it still hang?
@alsrgv It will hang.
@yh673025667, thanks! Can you share the output of ifconfig?
@alsrgv Thanks very much for the quick reply.
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
@yh673025667, great! Can you exclude flannel.1 and ib0 as well as docker0 and lo?
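On the NCCL side, the usual way to do this is NCCL_SOCKET_IFNAME; a sketch assuming the interface names above (the ^ prefix excludes the listed interfaces):

# exclude loopback, docker, the flannel overlay, and the IB interface
# from NCCL's socket traffic
-x NCCL_SOCKET_IFNAME=^lo,docker0,flannel.1,ib0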
@alsrgv OK, I will try it now, thanks again!
It reports errors. ERROR LOG:
INFO:tensorflow:Running local_init_op.
nmyjs-202-75:24499:24701 [0] transport/net_ib.cu:218 WARN No module present for GPU Direct RDMA.
nmyjs836:14846:19892 [0] misc/ibvwrap.cu:247 WARN Call to ibv_reg_mr failed
@alsrgv By the way, this is exactly the error I meant when I said "-x NCCL_IB_DISABLE=1 is required because TensorFlow encounters errors when it is not added".
@yh673025667 are you using InfiniBand for communication?
@yh673025667, what about PyTorch with the correctly excluded interfaces, …?
@abidmalikwaterloo Yes, when my shell script is as follows, it works well: …
@alsrgv As I said above, only with a shell script like that do TensorFlow and Keras work well.
@yh673025667, can you copy-paste the latest PyTorch log, including the mpirun command?
@alsrgv command line: …
ERROR LOG:
The process that invoked fork was:
Local host: [[58116,1],7] (PID 40367)
If you are absolutely sure that your application will successfully …
@yh673025667, in this command you did not add … Can you add those flags when you run the PyTorch example?
@alsrgv OK, I will try again. Do I have to add …?
@yh673025667, for now, yes. Once you get the basics running, we can figure out how to make IB work.
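Putting the suggestions so far together, a hedged sketch of the "basics first" command with InfiniBand disabled (flags assembled from this thread, not the exact command the poster ran):

mpirun -np 8 \
    -H 10.141.202.75:8 \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0,flannel.1,ib0 \
    -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME=^lo,docker0,flannel.1,ib0 \
    -x LD_LIBRARY_PATH -x PATH \
    python examples/pytorch_mnist.py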
@alsrgv ERROR LOG: https://gist.github.com/yh673025667/14dafb1f70ed7c72d28fd82ce62548de
@yh673025667, thanks for the log, it's very helpful. I noticed that you have UCX installed as part of the Mellanox driver package. Can you try adding -mca pml ob1?
@alsrgv Hi, ERROR LOG: https://gist.github.com/yh673025667/f3afbd685b2f1ffc2f620332c35b297d
@yh673025667, excellent. Can you replace NCCL_SOCKET_IFNAME=^docker0 with NCCL_SOCKET_IFNAME=eth, just to include eth0?
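NCCL matches NCCL_SOCKET_IFNAME values as name prefixes, so eth matches eth0 (and any other ethN). The suggested change is just:

# before: exclude docker0 only
-x NCCL_SOCKET_IFNAME=^docker0
# after: restrict NCCL's socket traffic to interfaces whose names start with "eth"
-x NCCL_SOCKET_IFNAME=eth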
@alsrgv Great, it works now!!! LOG: https://gist.github.com/yh673025667/5c3b8655e52025850c73f27f336fd961
@yh673025667, great! Can you share the output of …?
@alsrgv Thanks again!!!
@yh673025667, are they connected to the same IB switch?
@alsrgv Sorry, I don't know how to tell whether the two machines are connected to the same IB switch, but I have pasted the output of …
@yh673025667, can you run the same command with TensorFlow to compare the speed?
@alsrgv I have tried the same command line with TensorFlow. It's really slow compared with InfiniBand communication.
@yh673025667, were you able to use InfiniBand with TensorFlow or any other application between these two servers before? I'm not sure it's configured correctly.
@alsrgv Yes, when my shell script is as follows: …
@yh673025667, you have -x NCCL_IB_DISABLE=1 in that script, so it was not actually using InfiniBand.
@alsrgv Yes... so I am very confused too.
@yh673025667, I see. Can you try -x NCCL_SOCKET_IFNAME=ib0 while keeping NCCL_IB_DISABLE=1?
@alsrgv Thanks! When I try it, it's very strange... I have disabled IB, so why can they communicate over ib0?
@yh673025667, excellent. They use sockets over the faster interface (IP over InfiniBand), but not RDMA. Can you check with the team that set up the InfiniBand network whether the two servers you are using support RDMA to each other?
@alsrgv Hi, when I use ibping between the two servers, here are the results:
--- (Lid 49) ibping statistics ---
Does this mean my two servers do not support RDMA, since there is 100% loss?
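For reference, ibping needs a responder running on one side before the other side can ping it; a typical check looks roughly like this (assuming Lid 49 from the statistics above is the responder's port LID):

# on the first server: start the ibping responder
ibping -S
# on the second server: ping the responder's port by LID
ibping 49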
@yh673025667, I suspect there's some issue with RDMA. I'd recommend checking with your network admin.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
machine configuration:
OS: CentOS 7
Python: 3.6
PyTorch: 0.4.0
NCCL: 2.2.13
Open MPI: 3.1.2
GPUs: 8 Titan Xp per machine
my run shell script:
mpirun -np 8 \
    -H 10.141.202.75:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_HCA=mlx4_0 -x NCCL_IB_DISABLE=1 \
    -mca pml ob1 \
    python examples/pytorch_mnist.py
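For the 16-GPU, two-machine runs mentioned below, the host list would typically be extended like this; the second hostname (nmyjs836) is inferred from the logs in this thread and is an assumption, and it may need to be an IP address reachable from the first node:

mpirun -np 16 \
    -H 10.141.202.75:8,nmyjs836:8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_DISABLE=1 \
    -mca pml ob1 \
    python examples/pytorch_mnist.py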
ERROR I have encountered:
When I test pytorch_mnist.py with Horovod, I hit a fatal error. The same error appears whether I use 8 GPUs on a single machine or 16 GPUs across two machines.
The error log is as follows:
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[6927,1],4] (PID 26036)
If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
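The fork in question most likely comes from PyTorch's DataLoader spawning worker processes. If the application really does survive the fork, the warning can be silenced with the MCA parameter the message itself names; a sketch:

# silence the fork warning (parameter name taken from the help text above)
mpirun --mca mpi_warn_on_fork 0 ... python examples/pytorch_mnist.py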
[nmyjs-202-75:26027] [[6927,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
*** Error in `python': munmap_chunk(): invalid pointer: 0x0000558febe409e0 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x7ab54)[0x7f3678d3fb54]
/usr/lib64/libcuda.so.1(+0x1e3835)[0x7f3664928835]
/usr/lib64/libcuda.so.1(+0x248aaf)[0x7f366498daaf]
/usr/lib64/libcuda.so.1(+0x1e5180)[0x7f366492a180]
/usr/lib64/libpthread.so.0(+0x7e25)[0x7f367908fe25]
/usr/lib64/libc.so.6(clone+0x6d)[0x7f3678dbd34d]
======= Memory map: ========
200000000-200200000 rw-s 00000000 00:05 60924 /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0
200400000-200404000 rw-s 00000000 00:05 60924 /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0
200600000-200a00000 rw-s 00000000 00:05 60924 /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0
201800000-201804000 rw-s 00000000 00:05 60924 /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0
201a00000-201e00000 rw-s 00000000 00:05 60924 /dev/nvidiactl
201e00000-202c00000 ---p 00000000 00:00 0
202c00000-202c04000 rw-s 00000000 00:05 60924 /dev/nvidiactl
202c04000-202e00000 ---p 00000000 00:00 0
202e00000-203200000 rw-s 00000000 00:05 60924 /dev/nvidiactl
203200000-204000000 ---p 00000000 00:00 0
204000000-204004000 rw-s 00000000 00:05 60924 /dev/nvidiactl
204004000-204200000 ---p 00000000 00:00 0
204200000-204600000 rw-s 00000000 00:05 60924 /dev/nvidiactl
204600000-205400000 ---p 00000000 00:00 0
205400000-205404000 rw-s 00000000 00:05 60924 /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0
205600000-205a00000 rw-s 00000000 00:05 60924 /dev/nvidiactl
205a00000-206800000 ---p 00000000 00:00 0
206800000-206804000 rw-s 00000000 00:05 60924 /dev/nvidiactl
206804000-206a00000 ---p 00000000 00:00 0
206a00000-206e00000 rw-s 00000000 00:05 60924 /dev/nvidiactl
206e00000-207c00000 ---p 00000000 00:00 0
207c00000-207c04000 rw-s 00000000 00:05 60924 /dev/nvidiactl
207c04000-207e00000 ---p 00000000 00:00 0
207e00000-208200000 rw-s 00000000 00:05 60924 /dev/nvidiactl
208200000-209000000 ---p 00000000 00:00 0
209000000-209004000 rw-s 00000000 00:05 60924 /dev/nvidiactl
209004000-209200000 ---p 00000000 00:00 0
209200000-209600000 rw-s 00000000 00:05 60924 /dev/nvidiactl
209600000-20a400000 ---p 00000000 00:00 0
20a400000-20a404000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20a404000-20a600000 ---p 00000000 00:00 0
20a600000-20aa00000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20aa00000-20aa04000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20aa04000-20ac00000 ---p 00000000 00:00 0
20ac00000-20b000000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20b000000-20b004000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20b004000-20b200000 ---p 00000000 00:00 0
20b200000-20b600000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20b600000-20b604000 rw-s 00000000 00:05 60924 /dev/nvidiactl
20b604000-20b800000 ---p 00000000 00:00 0
20b800000-20bc00000 rw-s 00000000 00:05 60924 [nmyjs-202-75:26039] *** Process received signal ***
[nmyjs-202-75:26039] Signal: Aborted (6)
[nmyjs-202-75:26039] Signal code: (-6)
[nmyjs-202-75:26037] *** Process received signal ***
[nmyjs-202-75:26037] Signal: Aborted (6)
[nmyjs-202-75:26037] Signal code: (-6)
[nmyjs-202-75:26037] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f36790975e0]
[nmyjs-202-75:26037] [ 1] [nmyjs-202-75:26039] [ 0] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f3678cfa1f7]
[nmyjs-202-75:26037] [ 2] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f0e8d18f5e0]
[nmyjs-202-75:26039] [ 1] /usr/lib64/libc.so.6(abort+0x148)[0x7f3678cfb8e8]
[nmyjs-202-75:26037] [ 3] /usr/lib64/libc.so.6(+0x74f47)[0x7f3678d39f47]
[nmyjs-202-75:26037] [ 4] /usr/lib64/libc.so.6(gsignal+0x37)[0x7f0e8cdf21f7]
[nmyjs-202-75:26039] [ 2] /usr/lib64/libc.so.6(+0x7ab54)[0x7f3678d3fb54]
[nmyjs-202-75:26037] [ 5] /usr/lib64/libcuda.so.1(+0x1e3835)[0x7f3664928835]
[nmyjs-202-75:26037] [ 6] /usr/lib64/libc.so.6(abort+0x148)[0x7f0e8cdf38e8]
[nmyjs-202-75:26039] [ 3] /usr/lib64/libcuda.so.1(+0x248aaf)[0x7f366498daaf]
[nmyjs-202-75:26037] [ 7] /usr/lib64/libcuda.so.1(+0x1e5180)[0x7f366492a180]
[nmyjs-202-75:26037] /usr/lib64/libc.so.6(+0x74f47)[0x7f0e8ce31f47]
[nmyjs-202-75:26039] [ 4] [ 8] /usr/lib64/libpthread.so.0(+0x7e25)[0x7f367908fe25]
[nmyjs-202-75:26037] [ 9] /usr/lib64/libc.so.6(clone+0x6d)[0x7f3678dbd34d]
[nmyjs-202-75:26037] *** End of error message ***
/usr/lib64/libc.so.6(+0x7ab54)[0x7f0e8ce37b54]
[nmyjs-202-75:26039] [ 5] /usr/lib64/libcuda.so.1(+0x1e3835)[0x7f0e78a20835]
[nmyjs-202-75:26039] [ 6] /usr/lib64/libcuda.so.1(+0x248aaf)[0x7f0e78a85aaf]
[nmyjs-202-75:26039] [ 7] /usr/lib64/libcuda.so.1(+0x1e5180)[0x7f0e78a22180]
[nmyjs-202-75:26039] [ 8] /usr/lib64/libpthread.so.0(+0x7e25)[0x7f0e8d187e25]
[nmyjs-202-75:26039] [ 9] /usr/lib64/libc.so.6(clone+0x6d)[0x7f0e8ceb534d]
[nmyjs-202-75:26039] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 5 with PID 0 on node nmyjs-202-75 exited on signal 6 (Aborted).
[nmyjs-202-75:26027] 6 more processes have sent help message help-opal-runtime.txt / opal_init:warn-fork
[nmyjs-202-75:26027] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
So why does only PyTorch with Horovod have problems, while TensorFlow and Keras run OK?