
Launching on arm64 with Fast-RTPS with fat archive from 2018-06-21 never quits #89

Closed
clalancette opened this issue Jun 21, 2018 · 13 comments · Fixed by #124
Labels: bug (Something isn't working)
@clalancette (Contributor)

Bug report

Required Info:

Steps to reproduce issue

$ ros2 launch share/demo_nodes_cpp/launch/topics/talker_listener.launch.py

Let it run for a few seconds, then Ctrl-C:

root@changeling:~/ros2-linux# ros2 launch share/demo_nodes_cpp/launch/topics/talker_listener.launch.py
[INFO] [launch]: process[talker-1]: started with pid [21936]
[INFO] [launch]: process[listener-2]: started with pid [21937]
[INFO] [talker]: Publishing: 'Hello World: 1'
[INFO] [listener]: I heard: [Hello World: 1]
[INFO] [talker]: Publishing: 'Hello World: 2'
[INFO] [listener]: I heard: [Hello World: 2]
[INFO] [talker]: Publishing: 'Hello World: 3'
[INFO] [listener]: I heard: [Hello World: 3]
^C[WARNING] [launch.LaunchService]: user interrupted with ctrl-c (SIGINT)
signal_handler(2)
signal_handler(2)
[INFO] [launch]: process[talker-1]: process has finished cleanly
[INFO] [launch]: process[listener-2]: process has finished cleanly

It hangs there forever, until the user hits Ctrl-C a second time.

Expected behavior

Launch file terminates cleanly on first Ctrl-C

Actual behavior

Launch hangs around "forever".

@dhood (Member) commented Jun 21, 2018

For cross-reference: similar behaviour was reported on amd64 in #80, which was addressed by ros2/rclpy#191.

@clalancette (Contributor, Author)

And here's a run with debugging enabled:

[DEBUG] [launch]: emitting event synchronously: 'launch.events.IncludeLaunchDescription'
[DEBUG] [launch]: emitting event synchronously: 'launch.events.IncludeLaunchDescription'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.include_launch_description.IncludeLaunchDescription object at 0x7fa3a062e8>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.include_launch_description.IncludeLaunchDescription object at 0x7fa3a062e8>' ✓ '<launch.event_handlers.on_include_launch_description.OnIncludeLaunchDescription object at 0x7fa3a06cf8>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.include_launch_description.IncludeLaunchDescription object at 0x7fa3a06320>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.include_launch_description.IncludeLaunchDescription object at 0x7fa3a06320>' ✓ '<launch.event_handlers.on_include_launch_description.OnIncludeLaunchDescription object at 0x7fa3a06cf8>'
[INFO] [launch]: process[talker-1]: started with pid [31098]
[INFO] [asyncio]: %s: %r
[DEBUG] [launch]: emitting event: 'launch.events.process.ProcessStarted'
[INFO] [launch]: process[listener-2]: started with pid [31099]
[INFO] [asyncio]: %s: %r
[DEBUG] [launch]: emitting event: 'launch.events.process.ProcessStarted'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_started.ProcessStarted object at 0x7fa0589128>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_started.ProcessStarted object at 0x7fa0589128>' ✓ '<launch.event_handler.EventHandler object at 0x7fa3a06a90>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_started.ProcessStarted object at 0x7fa0589208>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_started.ProcessStarted object at 0x7fa0589208>' ✓ '<launch.event_handler.EventHandler object at 0x7fa3a06a90>'
[INFO] [asyncio]: poll took %.3f ms: %s events
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589240>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589240>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [talker]: Publishing: 'Hello World: 1'
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35b9b70>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35b9b70>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [listener]: I heard: [Hello World: 1]
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589128>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589128>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [talker]: Publishing: 'Hello World: 2'
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589128>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa0589128>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [listener]: I heard: [Hello World: 2]
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa05892b0>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa05892b0>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [talker]: Publishing: 'Hello World: 3'
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa05892b0>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa05892b0>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
[INFO] [listener]: I heard: [Hello World: 3]
^C[WARNING] [launch.LaunchService]: user interrupted with ctrl-c (SIGINT)
[DEBUG] [launch]: emitting event: 'launch.events.Shutdown'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.shutdown.Shutdown object at 0x7fa35a7898>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.shutdown.Shutdown object at 0x7fa35a7898>' ✓ '<launch.event_handlers.on_shutdown.OnShutdown object at 0x7fa35b9b00>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.shutdown.Shutdown object at 0x7fa35a7898>' ✓ '<launch.event_handlers.on_shutdown.OnShutdown object at 0x7fa35b9438>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.shutdown.Shutdown object at 0x7fa35a7898>' ✓ '<launch.event_handlers.on_shutdown.OnShutdown object at 0x7fa3a0dc88>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.shutdown.Shutdown object at 0x7fa35a7898>' ✓ '<launch.event_handlers.on_shutdown.OnShutdown object at 0x7fa3a06c50>'
[WARNING] [asyncio]: Executing %s took %.3f seconds
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[DEBUG] [launch]: emitting event synchronously: 'launch.events.process.ProcessStdout'
[INFO] [asyncio]: %r was closed by peer
[INFO] [asyncio]: %r was closed by peer
[INFO] [asyncio]: %r was closed by peer
[INFO] [asyncio]: %r was closed by peer
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35a79e8>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35a79e8>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
signal_handler(2)
[INFO] [asyncio]: %r exited with return code %r
[INFO] [asyncio]: %r exited with return code %r
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35b9470>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_stdout.ProcessStdout object at 0x7fa35b9470>' ✓ '<launch.event_handlers.on_process_io.OnProcessIO object at 0x7fa3a06588>'
signal_handler(2)
[INFO] [launch]: process[talker-1]: process has finished cleanly
[DEBUG] [launch]: emitting event: 'launch.events.process.ProcessExited'
[INFO] [launch]: process[listener-2]: process has finished cleanly
[DEBUG] [launch]: emitting event: 'launch.events.process.ProcessExited'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_exited.ProcessExited object at 0x7fa35b9b70>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_exited.ProcessExited object at 0x7fa35b9b70>' ✓ '<launch.event_handler.EventHandler object at 0x7fa3a06898>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_exited.ProcessExited object at 0x7fa35b94e0>'
[DEBUG] [launch.LaunchService]: processing event: '<launch.events.process.process_exited.ProcessExited object at 0x7fa35b94e0>' ✓ '<launch.event_handler.EventHandler object at 0x7fa3a06898>'

@wjwwood (Member) commented Jun 21, 2018

Since the signal handler of launch is running and the shutdown event goes out and gets processed, I think it must be different from #80, but I'll have to try and reproduce it here to debug it.

@clalancette what does python -V report?

@wjwwood (Member) commented Jun 21, 2018

Nvm, it's bionic. I was thinking it might be Debian or something.

@clalancette (Contributor, Author)

Yeah, it's bionic. Just for reference, python3 -V gives me: 3.6.5

@wjwwood (Member) commented Jun 22, 2018

I've been able to figure out that it's related to rclpy. If I don't use the ROS part of launch it works fine. Still investigating though.

@wjwwood (Member) commented Jun 29, 2018

This is now happening on x86 Linux as well (though perhaps not always).

@wjwwood wjwwood added the bug Something isn't working label Jun 29, 2018
@mikaelarguedas (Member) commented Jun 29, 2018

Happens to me consistently on my bionic amd64 testing

Edit: happens with fastrtps but not opensplice or connext

@nuclearsandwich (Member) commented Jul 16, 2018

I've only started looking into this on a native arm64 host. It's a host where I had previously installed .deb packages and I wasn't able to reproduce the issue with debs. What occurred instead after the first ^C was:

^C[WARNING] [launch.LaunchService]: user interrupted with ctrl-c (SIGINT)
root@ros2-bouncy-testing:~#signal_handler(2)
signal_handler(2)
[INFO] [launch]: process[talker-1]: process has finished cleanly
[INFO] [launch]: process[listener-2]: process has finished cleanly

And pressing enter, sending a clear, or any other input re-printed the command prompt. However, after downloading and running the fat archive a few times on the same host I now get the issue behavior using launch from either the debs or the fat archive. Haven't yet determined if this is some kind of bizarre contamination or just bad luck with a race condition. I straced launch to see what's going on when the first ^C is sent and it looks like it's getting stuck in a futex

--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
futex(0xffff78003ef8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xffff78004a50, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 53958
write(5, "\2", 1)                       = 1
rt_sigreturn({mask=[PIPE]})             = -1 EINTR (Interrupted system call)
rt_sigaction(SIGINT, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=0}, {sa_handler=0xffff8a5e8e20, sa_mask=[INT], sa_flags=SA_RESTART}, 8) = 0
getpid()                                = 53958
write(2, "[WARNING] [launch.LaunchService]"..., 72) = 72
getpid()                                = 53958
rt_sigaction(SIGINT, {sa_handler=0x446ef8, sa_mask=[], sa_flags=0}, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=0}, 8) = 0
epoll_pwait(3, [{EPOLLIN, {u32=19, u64=19}}, {EPOLLIN, {u32=21, u64=21}}, {EPOLLIN, {u32=4, u64=4}}], 5, -1000, NULL, 8) = 3
read(19, "signal_handler(2)\n", 262144) = 18
read(21, "signal_handler(2)\n", 262144) = 18
recvfrom(4, "\2", 4096, 0, NULL, NULL)  = 1
recvfrom(4, 0x38ba0e40, 4096, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [], 5, 0, NULL, 8)       = 0
futex(0xffff78000e70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=54066, si_uid=0, si_status=0, si_utime=8, si_stime=6} ---
getpid()                                = 53958
write(5, "\21", 1)                      = 1
rt_sigreturn({mask=[PIPE]})             = 281472695012976
futex(0xffff78000e70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=54067, si_uid=0, si_status=0, si_utime=8, si_stime=6} ---
getpid()                                = 53958
write(5, "\21", 1)                      = 1
rt_sigreturn({mask=[PIPE]})             = 281472695012976
futex(0xffff78000e70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
epoll_pwait(3, [{EPOLLHUP, {u32=19, u64=19}}, {EPOLLIN|EPOLLHUP, {u32=18, u64=18}}, {EPOLLIN, {u32=4, u64=4}}, {EPOLLHUP, {u32=21, u64=21}}, {EPOLLIN|EPOLLHUP, {u32=20, u64=20}}], 5, 0, NULL, 8) = 5
read(19, "", 262144)                    = 0
epoll_ctl(3, EPOLL_CTL_DEL, 19, 0xffffc68965c8) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 18, 0xffffc6896318) = 0
recvfrom(4, "\21\21", 4096, 0, NULL, NULL) = 2
sendto(5, "\0", 1, 0, NULL, 0)          = 1
sendto(5, "\0", 1, 0, NULL, 0)          = 1
recvfrom(4, "\0\0", 4096, 0, NULL, NULL) = 2
recvfrom(4, 0x38ba17c0, 4096, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
read(21, "", 262144)                    = 0
epoll_ctl(3, EPOLL_CTL_DEL, 21, 0xffffc68965c8) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 20, 0xffffc6896318) = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
write(1, "signal_handler(2)\n", 18)     = 18
close(19)                               = 0
close(18)                               = 0
wait4(54066, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 54066
sendto(5, "\0", 1, 0, NULL, 0)          = 1
wait4(54067, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 54067
sendto(5, "\0", 1, 0, NULL, 0)          = 1
close(21)                               = 0
close(20)                               = 0
epoll_pwait(3, [{EPOLLIN, {u32=4, u64=4}}], 1, 0, NULL, 8) = 1
recvfrom(4, "\0\0", 4096, 0, NULL, NULL) = 2
recvfrom(4, 0x38ba19f0, 4096, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
write(1, "signal_handler(2)\n", 18)     = 18
getpid()                                = 53958
write(2, "[INFO] [launch]: process[talker-"..., 65) = 65
getpid()                                = 53958
write(2, "[INFO] [launch]: process[listene"..., 67) = 67
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
epoll_pwait(3, [], 1, 0, NULL, 8)       = 0
getpid()                                = 53958
epoll_ctl(3, EPOLL_CTL_DEL, 4, 0xffffc6897308) = 0
close(4)                                = 0
close(5)                                = 0
close(3)                                = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x446ef8, sa_mask=[], sa_flags=SA_RESTART}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x446ef8, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x446ef8, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, {sa_handler=0x446ef8, sa_mask=[], sa_flags=0}, 8) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff86276000
munmap(0xffff8990c000, 262144)          = 0
munmap(0xffff86276000, 262144)          = 0
munmap(0xffff8b196000, 262144)          = 0
munmap(0xffff89a4c000, 262144)          = 0
sigaltstack(NULL, {ss_sp=0x37d3ee30, ss_flags=0, ss_size=16384}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
munmap(0xffff89b4c000, 262144)          = 0
futex(0xffff89695734, FUTEX_WAIT_PRIVATE, 8, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xffff89695734, FUTEX_WAIT_PRIVATE, 12, NULL

Here's the gdb backtrace and thread info

(gdb) bt
#0  0x0000ffffb2ba29c8 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0xffffa0000e70)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0xffffa0000e70, abstime=0x0) at sem_waitcommon.c:111
#2  0x0000ffffb2ba2ae8 in __new_sem_wait_slow (sem=0xffffa0000e70, abstime=0x0) at sem_waitcommon.c:181
#3  0x00000000004411c0 in PyThread_acquire_lock_timed ()
#4  0x000000000059cb7c in ?? ()
#5  0x000000000079b000 in ?? ()
Backtrace stopped: not enough registers or memory available to unwind further
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0xffffb2c05010 (LWP 41430) "python3" 0x0000ffffb2ba29c8 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
    futex_word=0xffffa0000e70) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  2    Thread 0xffffb05051f0 (LWP 41433) "python3" 0x0000ffffb29bc600 in __GI_epoll_pwait (epfd=<optimized out>, events=events@entry=0xffffb05040b8,
    maxevents=maxevents@entry=128, timeout=1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
  3    Thread 0xffffafd041f0 (LWP 41434) "python3" 0x0000ffffb2ba510c in __libc_recvmsg (fd=fd@entry=9, msg=msg@entry=0xffffafd038a0,
    flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
  4    Thread 0xffffaf5031f0 (LWP 41435) "python3" 0x0000ffffb2ba510c in __libc_recvmsg (fd=fd@entry=10, msg=msg@entry=0xffffaf5028a0,
    flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
  5    Thread 0xffffaed021f0 (LWP 41436) "python3" 0x0000ffffb2ba510c in __libc_recvmsg (fd=fd@entry=11, msg=msg@entry=0xffffaed018a0,
    flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
  6    Thread 0xffffae5011f0 (LWP 41437) "python3" 0x0000ffffb2ba022c in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0xffffb08df738 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7    Thread 0xffffadd001f0 (LWP 41438) "python3" 0x0000ffffb2ba05d4 in futex_abstimed_wait_cancelable (private=<optimized out>,
    abstime=0xffffadcfe428, expected=0, futex_word=0xffffa0003ef8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205

Still to do: check the same issue from source (and see if I can get more info out of that backtrace as a result).

edit: regarding reproducibility, I can reproduce it on both amd64 and arm64 from debs. The reason my initial runs weren't able to reproduce it was that I was still using OpenSplice rather than FastRTPS.

@nuclearsandwich (Member)

After much back-and-forth between gdb and pdb, I am pretty sure this is an issue with the fastrtps AsyncWriterThread blocking indefinitely. I got turned around between two blocked futexes (threads 1 and 7 in the output above). When Python debugging finally started functioning, I got the following at the final "hanging" state.

(gdb) py-list
1068            # called.  That sets ._is_stopped to True, and ._tstate_lock to None.
1069            lock = self._tstate_lock
1070            if lock is None:  # already determined that the C code is done
1071                assert self._is_stopped
1072            else:
>1073                if lock.acquire(block, timeout):
1074                    lock.release()
1075                    self._stop()
1076    
1077        @property
1078        def name(self):
(gdb) py-bt
Traceback (most recent call first):
  <built-in method acquire of _thread.lock object at remote 0x7f3da3375440>
  File "/usr/lib/python3.6/threading.py", line 1073, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
  File "/usr/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.6/threading.py", line 1295, in _shutdown
    t.join()

But when I set a pdb breakpoint in _wait_for_tstate_lock and step, I'm able to acquire this lock no problem and step through to further shutdown code, so I'm not entirely sure why it blocks here. If I wait to attach gdb until the process is in its hanging state, there's one fewer thread, and none of them yields any Python data readable by gdb. The remaining threads are:

(gdb) info threads
  Id   Target Id         Frame 
  1    Thread 0x7f50ed4c6740 (LWP 5155) "ros2" 0x00007f50ed0a2449 in futex_wait (private=<optimized out>, expected=12, futex_word=0x7f50e79415a4 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+36>) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
  2    Thread 0x7f50e57ac700 (LWP 5158) "ros2" 0x00007f50ec1d1bb7 in epoll_wait (epfd=14, events=0x7f50e57ab780, maxevents=128, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  3    Thread 0x7f50e4fab700 (LWP 5159) "ros2" 0x00007f50ed0a7567 in __libc_recvmsg (fd=9, msg=0x7f50e4faada0, flags=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
* 4    Thread 0x7f50dffff700 (LWP 5160) "ros2" 0x00007f50ed0a7567 in __libc_recvmsg (fd=10, msg=0x7f50dfffeda0, flags=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
  5    Thread 0x7f50df7fe700 (LWP 5161) "ros2" 0x00007f50ed0a7567 in __libc_recvmsg (fd=11, msg=0x7f50df7fdda0, flags=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
  6    Thread 0x7f50deffd700 (LWP 5162) "ros2" 0x00007f50ed0a29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f50e79415a8 <eprosima::fastrtps::rtps::AsyncWriterThread::cv_+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88

I'm at a loss as to what to look into next, as I'm not sure what's responsible for the AsyncWriterThreads, but it seems this may not be an issue with launch or even launch_ros. Based on time constraints, I have to put this one down for now.
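For context on why the interpreter hangs at that py-bt: on exit, CPython's threading._shutdown() joins every non-daemon thread, and Thread.join() blocks on the thread's _tstate_lock, which is only released when the thread's C-level code actually finishes. A minimal stdlib-only sketch (hypothetical; no rclpy or Fast-RTPS involved) of a thread stuck in a blocking call, as a stand-in for the AsyncWriterThread parked on its condition variable:

```python
import threading

# Stand-in for a native thread waiting on a condition variable
# (like the Fast-RTPS AsyncWriterThread in the backtraces above).
stuck = threading.Event()
t = threading.Thread(target=stuck.wait, daemon=True)
t.start()

# join() honors its timeout, but the thread is still blocked afterward:
t.join(timeout=0.1)
assert t.is_alive()

# Because this thread is a daemon, interpreter exit will not wait on it.
# A *non-daemon* thread stuck like this would block threading._shutdown()
# inside _wait_for_tstate_lock(), matching the traceback above.
stuck.set()  # release it so this sketch exits cleanly
t.join()
assert not t.is_alive()
```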

@wjwwood (Member) commented Jul 23, 2018

Thanks @nuclearsandwich for taking some time to look into it. This information might be useful to me, so thanks for writing it up.

My current state on this is that this line:

executor.spin_once(timeout_sec=1.0)

never returns after the first Ctrl-C (the second Ctrl-C seems to wake it up). Which is very strange, because you'd expect it to return once the timeout period expired (even if it did not return immediately after Ctrl-C).

This odd behavior might be related to where @nuclearsandwich ended up: a thread getting deadlocked or something within Fast-RTPS (which might eventually affect the rmw_wait function).
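rclpy isn't needed to illustrate the shape of a defensive guard around a blocking call that ignores its own timeout. In this sketch (hypothetical, not the fix that actually landed; blocking_call stands in for executor.spin_once), the caller enforces its own deadline by running the call in a worker thread:

```python
import concurrent.futures
import time

def blocking_call():
    # Stand-in for executor.spin_once(timeout_sec=1.0); imagine the
    # underlying wait sometimes never returning after SIGINT.
    time.sleep(0.2)
    return "woke"

# Run the potentially-stuck call in a worker so the caller can enforce
# its own deadline even if the callee's timeout is never honored.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(blocking_call)
    try:
        result = future.result(timeout=1.0)
    except concurrent.futures.TimeoutError:
        result = None  # give up on this iteration; worker may still be stuck
print(result)
```

Note that if the worker really were stuck, the pool's shutdown would still wait on it, so a real workaround would also need the stuck thread to be detachable (e.g. a daemon thread).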

@wjwwood (Member) commented Jul 24, 2018

I cannot reproduce this on my 16.04 VM. I'm going to try my 18.04 VM to corroborate @mikaelarguedas's comment #89 (comment). Otherwise, I'll have to drop back into packet.net aarch64 machines.

@wjwwood (Member) commented Aug 7, 2018

So, I've narrowed down the issue: it only happens when all of the following hold:

  • on bionic, and
  • when using FastRTPS, and
  • when using the rclpy part of launch.

So, on xenial it seems to never hang (though there is some unconfirmed evidence that it might, depending on timing); it never hangs with Connext that I've seen (or OpenSplice, though I haven't tried that myself); and it never hangs if you temporarily comment out the rclpy part of ros2 launch. Architecture also doesn't seem to matter.

I've been trying to recreate this without launch, in a script that just uses rclpy, threads, and Python signal handlers, but so far I've not been successful. I'm going to keep working on that, but at this point I'm not sure of the root cause and still not sure how to address it.
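For reference, a skeleton of the kind of rclpy-free script described above (all names hypothetical): install a SIGINT handler in the main thread, park a worker on an event, and deliver the signal. With only the stdlib in play this exits cleanly, which is consistent with the observation that the hang needs rclpy plus Fast-RTPS in the mix:

```python
import os
import signal
import threading

shutdown_requested = threading.Event()

def on_sigint(signum, frame):
    # Mirror what launch does on Ctrl-C: note the signal and ask
    # background work to wind down.
    shutdown_requested.set()

signal.signal(signal.SIGINT, on_sigint)

worker = threading.Thread(target=shutdown_requested.wait)
worker.start()

# Simulate the user's Ctrl-C by signalling this process (POSIX only).
os.kill(os.getpid(), signal.SIGINT)

worker.join(timeout=2.0)
assert not worker.is_alive()  # clean shutdown, no hang
```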

@wjwwood wjwwood self-assigned this Aug 16, 2018
@wjwwood wjwwood added in progress Actively being worked on (Kanban column) in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Aug 16, 2018
@wjwwood wjwwood removed the in review Waiting for review (Kanban column) label Aug 17, 2018