Launching on arm64 with Fast-RTPS with fat archive from 2018-06-21 never quits #89
Comments
For cross-reference, similar behaviour has been reported on amd in #80 which was addressed by ros2/rclpy#191
And here's a run with debugging enabled:
Since the signal handler of launch is running and the shutdown event goes out and gets processed, I think it must be different from #80, but I'll have to try and reproduce it here to debug it. @clalancette what's the
Nvm, it's bionic. I was thinking it might be Debian or something.
Yeah, it's bionic. Just for reference,
I've been able to figure out that it's related to rclpy. If I don't use the ROS part of launch it works fine. Still investigating though.
This is also happening on x86 Linux now as well (though perhaps not always).
Happens to me consistently on my bionic amd64 testing.
Edit: happens with fastrtps but not opensplice or connext.
I've only started looking into this on a native arm64 host. It's a host where I had previously installed .deb packages and I wasn't able to reproduce the issue with debs. What occurred instead after the first ^C was:
And pressing enter, sending a clear, or any other input re-printed the command prompt.
However, after downloading and running the fat archive a few times on the same host I now get the issue behavior using launch from either the debs or the fat archive. Haven't yet determined if this is some kind of bizarre contamination or just bad luck with a race condition.
I straced launch to see what's going on when the first ^C is sent and it looks like it's getting stuck in a futex
Here's the gdb backtrace and thread info
Still to do: check the same issue when built from source (and see if I can get more info out of that backtrace as a result).
Edit: regarding reproducibility, I can reproduce it on both amd64 and arm64 from debs. My initial runs did not reproduce it because I was still using OpenSplice rather than FastRTPS.
After much back-and-forth between gdb and pdb I am pretty sure this is an issue with the fastrtps AsyncWriterThread blocking indefinitely. I got turned around between two blocked futexes (thread 1 and 7 in the output above). When Python debugging finally started functioning I got the following at the final "hanging" state.
But when I set a pdb breakpoint in
I'm at a loss as to what to look into next, as I'm not sure what's responsible for the AsyncWriterThreads, but it seems like this may not be an issue with launch or even launch_ros. Given time constraints, I have to put this one down for now.
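For the Python side of this kind of investigation, one option that avoids attaching pdb to an already hung process is the standard library's `faulthandler` module. The sketch below is not something used in this thread: the helper name `enable_stack_dump_on_signal` and the choice of `SIGUSR1` are assumptions for illustration. It only shows Python frames, so native threads such as the Fast-RTPS AsyncWriterThread would still need gdb.

```python
import faulthandler
import signal


def enable_stack_dump_on_signal(stream, signum=signal.SIGUSR1):
    """Dump the Python stack of every thread to `stream` when `signum` arrives.

    Hypothetical helper: the name and the default signal are assumptions.
    `stream` must be a real file (it needs a file descriptor) and must stay
    open for as long as the handler is registered.
    """
    # all_threads=True dumps every Python thread, not only the one that
    # received the signal.
    # chain=False means any previously installed handler is not called.
    faulthandler.register(signum, file=stream, all_threads=True, chain=False)
```

Called early in the process (for example with `sys.stderr` as the stream), this lets `kill -USR1 <pid>` from another terminal print where each Python thread is blocked without stopping the process.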
Thanks @nuclearsandwich for taking some time to look into it. This information might be useful to me, so thanks for writing it up. My current state on this is that this line:
Never returns after ctrl-c is pressed (the second ctrl-c seems to wake it up). This is very strange, because you'd expect it to return eventually once the timeout period expired (even if it did not return immediately after ctrl-c). This odd behavior might be related to where @nuclearsandwich ended up: a thread getting deadlocked or something within Fast-RTPS (might be affecting the
I cannot reproduce this on my 16.04 VM, so I'm going to try my 18.04 VM to corroborate @mikaelarguedas's comment #89 (comment). Otherwise, I'll have to drop back to packet.net aarch64 machines.
So, I've narrowed down the issue so that it only happens when:
So, on xenial it seems to never hang (though there is some unconfirmed evidence that it might depending on timing), and it never hangs with Connext that I've seen (or OpenSplice, though I haven't tried that myself), and it never hangs if you comment out the rclpy part of
I've been trying to recreate this without launch, in a script that just uses rclpy, threads, and Python signal handlers, but so far I've not been successful. I'm going to keep working on that, but at this point I'm not sure of the root cause and still not sure how to address it.
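The shape of such a reduced script can be sketched with the standard library alone. This is not the script from this thread and it does not reproduce the hang: `threading.Event.wait(timeout=...)` is a stand-in for the blocking rclpy wait, and the function name `run_until_sigint` is hypothetical. What it does show is the expected behaviour: the wait returns at least once per timeout period, and the first SIGINT ends the loop.

```python
import signal
import threading


def run_until_sigint(wait_timeout=0.1):
    """Run a worker thread until the first SIGINT; return its wakeup count.

    Must be called from the main thread, because Python only allows
    signal handlers to be installed there.
    """
    shutdown = threading.Event()
    wakeups = 0

    def worker():
        nonlocal wakeups
        while not shutdown.is_set():
            # Stand-in (assumption) for the blocking rclpy wait: returns
            # when the event is set or when the timeout expires.
            shutdown.wait(timeout=wait_timeout)
            wakeups += 1

    def on_sigint(signum, frame):
        # Python-level handlers run in the main thread; just request shutdown.
        shutdown.set()

    previous = signal.signal(signal.SIGINT, on_sigint)
    thread = threading.Thread(target=worker)
    thread.start()
    try:
        # Join in short slices so the main thread keeps servicing signals.
        while thread.is_alive():
            thread.join(timeout=0.05)
    finally:
        signal.signal(signal.SIGINT, previous)
    return wakeups
```

Swapping the stand-in wait for the real rclpy call would be the next step toward a reproduction; a wait that fails to return even after its timeout would match the behaviour described above.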
Bug report
Required Info:
Steps to reproduce issue
Let it run for a few seconds, then Ctrl-C:
It hangs there forever, until the user hits Ctrl-C a second time.
Expected behavior
Launch file terminates cleanly on first Ctrl-C
Actual behavior
Launch hangs around "forever".
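The expected-versus-actual check can be automated with a small harness. This is a sketch under stated assumptions: the function name `exits_on_first_sigint` and its parameters are hypothetical, the command to run is supplied by the caller, and the signal is sent only to the child process itself, whereas Ctrl-C in a terminal signals the whole foreground process group.

```python
import signal
import subprocess
import time


def exits_on_first_sigint(cmd, run_for=5.0, grace=10.0):
    """Start `cmd`, send one SIGINT after `run_for` seconds, and report
    whether the process exits within `grace` seconds.

    Returns True on a timely exit. Returns False if the process is still
    running after the grace period, in which case it is killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Let the process run for a few seconds, as in the steps above.
        time.sleep(run_for)
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=grace)
            return True
        except subprocess.TimeoutExpired:
            return False
    finally:
        # Never leave a hung child behind.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

Passing the launch command from the steps above as a list of arguments gives a True/False answer per run, which makes it easier to compare RMW implementations or platforms over many repetitions.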