Launching on arm64 with Fast-RTPS with fat archive from 2018-06-21 never quits #89
Steps to reproduce issue
Let it run for a few seconds, then Ctrl-C:
It hangs there forever, until the user hits Ctrl-C a second time.
Launch file terminates cleanly on first Ctrl-C
Launch hangs around "forever".
And here's a run with debugging enabled:
I've only started looking into this on a native arm64 host. It's a host where I had previously installed .deb packages and I wasn't able to reproduce the issue with debs. What occurred instead after the first ^C was:
And pressing enter, sending a clear, or any other input re-printed the command prompt. However, after downloading and running the fat archive a few times on the same host, I now get the issue behavior using launch from either the debs or the fat archive. I haven't yet determined whether this is some kind of bizarre contamination or just bad luck with a race condition. I straced launch to see what's going on when the first ^C is sent, and it looks like it's getting stuck in a futex
Here's the gdb backtrace and thread info
Still to do: check for the same issue when running from source (and see if I can get more info out of that backtrace as a result).
edit: regarding reproducibility, I can reproduce it on both amd64 and arm64 from debs. The reason my initial runs weren't able to reproduce it was that I was still using OpenSplice rather than FastRTPS.
After much back-and-forth between gdb and pdb, I am pretty sure this is an issue with the fastrtps AsyncWriterThread blocking indefinitely. I got turned around between two blocked futexes (threads 1 and 7 in the output above). Once Python debugging finally started working, I got the following at the final "hanging" state.
But when I set a pdb breakpoint in
I'm at a loss as to what to look into next, as I'm not sure what's responsible for the AsyncWriterThreads, but it seems like this may not be an issue with launch or even launch_ros. Given time constraints, I think I have to put this one down for now.
Thanks @nuclearsandwich for taking some time to look into it. This information might be useful to me, so thanks for writing it up.
My current state on this is that this line:
Never returns after ctrl-c is done (the second ctrl-c seems to wake it up). Which is very strange because you'd think that it would eventually return after the timeout period expired (even if it did not return immediately after ctrl-c).
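For comparison, the expectation described here can be checked in isolation with plain Python threading. This is only a minimal sketch, not the launch code: `wait_with_timeout` is a hypothetical helper, and `threading.Event` is an assumed stand-in for whatever the real line waits on. In a healthy process, the timed wait returns once the timeout expires even if nothing ever wakes it.

```python
import threading
import time


def wait_with_timeout(event, timeout):
    """Block on `event` for at most `timeout` seconds.

    Returns (was_set, elapsed_seconds). Hypothetical helper for illustration.
    """
    start = time.monotonic()
    was_set = event.wait(timeout)  # returns False if the timeout expired
    return was_set, time.monotonic() - start


if __name__ == "__main__":
    never_set = threading.Event()  # nothing ever calls set() on this
    was_set, elapsed = wait_with_timeout(never_set, 0.5)
    print(f"was_set={was_set}, elapsed={elapsed:.2f}s")
```

If the equivalent wait in the hanging process does not come back after its timeout, that suggests the thread is blocked somewhere other than the timed wait itself.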
This odd behavior might be related to where @nuclearsandwich ended up, that a thread is getting deadlocked or something within Fast-RTPS (might be affecting the
So, I've narrowed down the issue so that it only happens when:
So, on xenial it seems to never hang (though there is some unconfirmed evidence that it might depending on timing), and it never hangs with Connext that I've seen (or OpenSplice, though I haven't tried that myself), and it never hangs if you comment out the rclpy part of
I've been trying to recreate this without launch, in a script that just uses rclpy, threads, and Python signal handlers, but so far I've not been successful. I'm going to keep working on that, but at this point I'm not sure of the root cause and still not sure how to address it.
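A rough idea of the shape such a script might take, using only the standard library, is sketched below. It is an assumption-laden scaffold rather than the actual reproduction attempt: `handle_sigint`, `worker`, `main`, and the `run_for` parameter are all hypothetical names, and the rclpy calls under suspicion are deliberately left out and marked with a comment.

```python
import signal
import threading
import time

shutdown = threading.Event()


def handle_sigint(signum, frame):
    # Only set a flag; the real shutdown work happens outside the handler.
    shutdown.set()


def worker(results):
    # Placeholder loop. In an actual reproduction attempt, the rclpy calls
    # under suspicion would go here (not shown; they depend on the setup).
    while not shutdown.is_set():
        shutdown.wait(timeout=0.1)
    results.append("worker exited")


def main(run_for=None):
    """Run until SIGINT, or until `run_for` seconds pass (for testing)."""
    # Python only allows installing signal handlers from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGINT, handle_sigint)
    results = []
    thread = threading.Thread(target=worker, args=(results,))
    thread.start()
    deadline = None if run_for is None else time.monotonic() + run_for
    # Wait in short slices so the main thread stays responsive to signals.
    while not shutdown.is_set():
        if deadline is not None and time.monotonic() >= deadline:
            shutdown.set()
        shutdown.wait(timeout=0.1)
    thread.join(timeout=5.0)
    return results, thread.is_alive()


if __name__ == "__main__":
    print(main())
```

With this scaffold, a single Ctrl-C should make `main()` return with the worker thread no longer alive. A hang would show up as the join timing out and `thread.is_alive()` still being true.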