ros1_bridge silently stops working and cannot be restarted #129
Comments
Did you call …?
This could be related to Fast-RTPS configuration on the other end, as it is an IHMC implementation of a ROS 2 node, not using rmw.
From https://github.com/ros2/ros2/wiki/Linux-Install-Debians:
@calvertdw I believe the daemon restarts itself the first time you launch a ROS 2 process; I've never had to manually start it any other time. I also don't believe it's an issue with our stuff, because our comms remain alive (you can still talk to our stuff via ROS 2 directly). You can publish and subscribe to our topics without issue. Debug symbols for the bridge won't be valid because we've recompiled the bridge. I can rebuild ours with debug flags and use GDB to start a C++ debugger session, but I don't know what I'm looking for; that's what I'm hoping to figure out.
Can you try to reproduce the problem while running a debug build of the bridge in …?
So upon continued investigation I'm actually starting to believe that this is related to an interaction between the bridge and the ROS 1 …

@dirk-thomas I can work on getting something like that set up, but it might be a day or two because the place we're currently investigating this is "production" on a robot, so I'll have to make a test setup when I get back to Florida (I'm in Houston at Johnson working on Valkyrie right now).

For a bit of context on the setup and an overview of the topology: as I mentioned above, this is a deployment being tested on the NASA Valkyrie humanoid. It has two on-board computers, one for real-time feedback control and one for non-realtime perception and out-of-band management of the asynchronous API. The real-time machine also has a custom PCI appliance for talking to the motor amps and other embedded systems like the power management board. The real-time machine runs our control stack (which has a DDS API implemented via Fast-RTPS and implementing the ROS 2 partition/namespace standards) as well as the NASA management stack, which is ROS 1 based. So …

The non-realtime machine runs the vision stuff (multisense drivers) and the …

I think it'll be easy enough to attach GDB to the bridge, but I'm not so sure the indicative information will be in the bridge itself (though it might at a minimum give us a stack trace showing what interaction with ROS 1 is hanging and why). That's why I'll have to try to reproduce this off the robot: if we attach GDB to any of the processes on the real-time machine, we'll probably miss deadlines and the whole thing won't be able to run anyway.
I also need access to a build machine back in Pensacola to make a debug build of the bridge, because our message package OOMs the bridge compile process on machines that don't have more than 16 GB of RAM, even when building with a single job thread, and my laptop can't cut it.
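As an aside on the OOM problem above, a build step that exhausts memory can at least be made to fail fast rather than stall the machine or trigger the kernel OOM killer. A minimal self-contained sketch of the idea: cap virtual memory with `ulimit -v` in a subshell so an over-hungry step dies with a visible allocation error. The 1 GB cap and the `python3` allocation are just stand-ins for a real compile job.

```shell
# Sketch: run a memory-hungry step under a virtual-memory cap (1 GB here)
# so it fails immediately with an allocation error instead of invoking
# the kernel OOM killer. The python3 allocation stands in for a compile job.
result=$( ( ulimit -v 1048576; python3 -c "b = bytearray(4 * 1024**3)" ) 2>/dev/null \
  && echo "allocation succeeded" || echo "allocation failed under limit" )
echo "$result"
```

The same `ulimit` applies to any command started in that subshell, so wrapping the bridge build this way turns a silent multi-hour OOM stall into an immediate, diagnosable failure.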
If you want to try it early on the robot you could comment in some of the print statements which output various messages based on "progress" / "activity" in the bridge (sorry, that was written before log macros were available). Maybe that will also provide some information on what the bridge is doing when it is "hanging". It might even be enough to rule out a problem in the bridge if that is the case.
Wow, that is pretty extreme. We are aware that the bridge needs quite some memory to build due to the template specializations, but I have only seen 2-4 GB per thread. Independent of the problem in this ticket, if you could provide an example with a similar memory usage in a separate ticket, it would be good to look into it and see what can be done to lower the resource usage.
@dirk-thomas to be fair it's probably our fault for having 162 messages in a single package… we're going to break that up in our next release cycle :)
Oh, I see. I was worried that it was due to some deep nesting or similar. I can totally see how 162 msgs in a single package could get you there 😉
@dirk-thomas I'm still working on creating a debug build of the bridge (the debug symbols take up even more memory and I don't have a machine that can build it… working on getting our packages sorted out first), but something of note is that ever since we started using …
@dirk-thomas I was able to create a bridge with debug symbols but, similar to #131, I can't reproduce this using simulations. I'll see if we can recreate it on the real robot and get you some information, but it's going to be tricky to get you an example you can run.
I will go ahead and close this for now since we can't do anything without further information. Please feel free to comment on the closed ticket with the requested information and the issue can be reopened if necessary.
Bug report

Required Info:
Operating System: Ubuntu 16.04
Installation type: Binaries
Version or commit hash: Ardent
DDS implementation: Fast-RTPS
Client library (if applicable): N/A
Steps to reproduce issue
This is probably not an easy repro; I'm not even sure what causes it myself. Right now I'm looking for information on how to get more verbosity out of the bridge so I can figure out what is hanging and why, because the current failure mode is entirely silent.

We are using a version of the ros1_bridge with our messages built into it. We haven't modified the bridge itself, just linked our messages into it. You can find a .tar'd install space of it here: https://bintray.com/ihmcrobotics/distributions/ihmc-ros1-bridge

We are able to use the bridge to successfully send and receive topics to and from ROS 1 based software (ROS 1 Kinetic installed from apt-get on Ubuntu 16.04), but occasionally the bridge just stops working. It will no longer create console output when a new talker or listener attempts to participate. The bridge also stops responding to INT signals; the only way to terminate it is to send a SIGKILL. Even more interestingly, once you have sent SIGKILL to the bridge, stopped the ROS 2 daemon with ros2 daemon stop, and then restarted the bridge, it continues to behave in this way. Bridges for new talkers/listeners do not get created and signals are not handled.

I've tried SIGKILL'ing the bridge and then restarting it with --show-introspection, and it prints a few messages about bridging some running ROS 2 publishers that we have, but after just a few seconds it stops printing any introspections as well.

I'd love to be able to provide more useful information; I just don't know where to start, since the failure mode is entirely silent. If I can rebuild the bridge with some debug flags or something to get more verbose logging, I'd love to start there and hopefully get this figured out.
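The failure signature described above, ignoring SIGINT but dying to SIGKILL, is exactly how a process whose SIGINT handling is blocked or stuck behaves, since SIGKILL is the one signal that cannot be caught or ignored. A minimal self-contained shell sketch of that distinction, with a `sleep` under an empty trap standing in for the hung bridge:

```shell
# Demo: a process that ignores SIGINT (as the hung bridge appears to)
# survives "kill -INT" but not "kill -KILL", which cannot be caught
# or ignored -- matching the observed failure mode.
sh -c 'trap "" INT; sleep 60' &
pid=$!
sleep 1                          # let the child install its trap
kill -INT "$pid"                 # delivered but ignored
sleep 1
kill -0 "$pid" 2>/dev/null && echo "still alive after SIGINT"
kill -KILL "$pid"                # cannot be trapped
wait "$pid" 2>/dev/null || true  # reap so the pid is really gone
kill -0 "$pid" 2>/dev/null || echo "gone after SIGKILL"
```

This doesn't explain *why* the bridge stops servicing its signal handler (a stack trace from GDB would), but it confirms that the INT-vs-KILL behavior alone only tells us the process is no longer running its normal event loop.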
Expected behavior
ROS 1 bridge is able to bridge ROS 2 and ROS 1 talkers/listeners
Actual behavior
ROS 1 bridge stops working and cannot be restarted