
ros1_bridge silently stops working and cannot be restarted #129

Closed
dljsjr opened this issue Jul 24, 2018 · 13 comments
Labels
bug (Something isn't working), more-information-needed (Further information is required)

Comments

@dljsjr

dljsjr commented Jul 24, 2018

Bug report

Required Info:

  • Operating System:
    Ubuntu 16.04
  • Installation type:
    Binaries
  • Version or commit hash:
    Ardent
  • DDS implementation:
    Fast-RTPS
  • Client library (if applicable):
    N/A

Steps to reproduce issue

This is probably not an easy repro; I'm not even sure what causes it myself. Right now I'm looking for information on how to get more verbosity out of the bridge so I can figure out what is hanging and why, because the current failure mode is entirely silent.

We are using a version of the ros1_bridge with our messages built into it. We haven't modified the bridge itself, just linked our messages into it. You can find a .tar'd install space of it here: https://bintray.com/ihmcrobotics/distributions/ihmc-ros1-bridge

We are able to use the bridge to successfully send and receive topics to and from ROS 1 based software (ROS 1 Kinetic installed from apt-get on Ubuntu 16.04), but occasionally the bridge just stops working. It no longer produces console output when a new talker or listener attempts to participate, and it stops responding to INT signals; the only way to terminate it is to send SIGKILL. Even more interestingly, once you have sent SIGKILL to the bridge, stopped the ROS 2 daemon with ros2 daemon stop, and then restarted the bridge, it continues to behave this way: bridges for new talkers/listeners do not get created and signals are not handled.

I've tried SIGKILL'ing the bridge and then restarting it with --show-introspection; it prints a few messages about bridging some running ROS 2 publishers that we have, but after just a few seconds it stops printing introspection output as well.
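For reference, the kill/restart sequence looks roughly like this (assuming our install space is sourced and exposes the standard dynamic_bridge executable; exact paths differ on our setup):

```
# the bridge no longer reacts to SIGINT, so SIGKILL it
pkill -9 -f dynamic_bridge
# stop and restart the ROS 2 daemon before bringing the bridge back up
ros2 daemon stop
ros2 daemon start
# relaunch with introspection output enabled
ros2 run ros1_bridge dynamic_bridge --show-introspection
```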

I'd love to be able to provide more useful information, but I just don't know where to start since the failure mode is entirely silent. If I can rebuild the bridge with some debug flags or something to get more verbose logging, I'd love to start there and hopefully get this figured out.

Expected behavior

ROS 1 bridge is able to bridge ROS 2 and ROS 1 talkers/listeners

Actual behavior

ROS 1 bridge stops working and cannot be restarted

@calvertdw

Did you call ros2 daemon start before restarting the bridge?
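E.g. something like this (the status subcommand is just a sanity check; exact subcommands may differ slightly between releases):

```
ros2 daemon status   # is the daemon actually running?
ros2 daemon start    # if not, start it before relaunching the bridge
```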

@calvertdw

This could be related to Fast-RTPS configuration on the other end, as it is an IHMC implementation of a ROS 2 node, not using rmw.

@calvertdw

From https://github.com/ros2/ros2/wiki/Linux-Install-Debians:

  • ros-$ROS_DISTRO-*-dbgsym. These packages provide the debugging symbols stripped from binaries.
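For a binaries-based install that would look something like this (the exact dbgsym package name here is an assumption following the pattern above):

```
# see which debug-symbol packages exist for the installed distro (Ardent here)
apt-cache search ros-ardent | grep dbgsym
# then install the one for the bridge, e.g.:
sudo apt-get install ros-ardent-ros1-bridge-dbgsym
```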

@dljsjr
Author

dljsjr commented Jul 24, 2018

@calvertdw I believe the daemon restarts itself the first time you launch a ROS 2 process; I've never had to manually start it any other time.

I also don't believe it's an issue with our stuff, because our comms remain alive: you can still talk to our stack directly via ROS 2, and publish and subscribe to our topics without issue.

The debug symbols for the packaged bridge won't be valid because we've recompiled the bridge. I can rebuild ours with debug flags and use GDB to start a C++ debugger session, but I don't know what I'm looking for; that's what I'm hoping to figure out.

@dirk-thomas
Member

Can you try to reproduce the problem while running a debug build of the bridge in gdb? Once it "stops" working, maybe the stack trace will provide enough context.
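A rough sketch of what I mean, assuming the standard dynamic_bridge executable name and attaching to the already-running process:

```
# attach gdb to the running bridge once it has stopped responding
gdb -p "$(pgrep -n -f dynamic_bridge)"
# inside gdb, interrupt with Ctrl-C and dump every thread's backtrace:
#   (gdb) thread apply all bt
```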

@sloretz added the bug (Something isn't working) and more-information-needed (Further information is required) labels Jul 26, 2018
@dljsjr
Author

dljsjr commented Jul 27, 2018

Upon continued investigation, I'm actually starting to believe that this is related to an interaction between the bridge and the ROS 1 roscore, and not to something dying in the bridge itself.

@dirk-thomas I can work on getting something like that set up, but it might be a day or two: the current place we're investigating this is in "production" on a robot, so I'll have to make a test setup when I get back to Florida (I'm in Houston at Johnson working on Valkyrie right now).

For a bit of context on the setup and an overview of the topology: as I mentioned above, this is a deployment being tested on the NASA Valkyrie humanoid. It has two on-board computers, one for real-time feedback control and one for non-real-time perception and out-of-band management of the asynchronous API. The real-time machine also has a custom PCI appliance for talking to the motor amps and other embedded systems like the power management board. The real-time machine runs our control stack (which has a DDS API implemented via Fast-RTPS and implementing the ROS 2 partition/namespace standards) as well as the NASA management stack, which is ROS 1 based, so roscore is on this machine.

The non-real-time machine runs the vision stuff (multisense drivers) and the ros1_bridge dynamic_bridge. We were doing some cycle testing yesterday when we got the bridge into the same state I described above, and even after doing a full power cycle of the non-real-time machine we were not able to get the bridge to restart correctly. Additionally, after about one or two minutes in this weird state, the robot power browned out without a load spike or system reboot of the real-time box, meaning the ROS 1 based management stack on the real-time computer had seized for long enough that a bunch of heartbeats got missed.

I think it'll be easy enough to attach GDB to the bridge, but I'm not so sure the indicative information itself will be in the bridge (though it might at a minimum give us a stack trace showing what interaction with ROS 1 is hanging and why). That's why I'll have to try to reproduce this off the robot: if we attach GDB to any of the processes on the real-time machine, we'll probably miss deadlines and the whole thing won't be able to run anyway.

@dljsjr
Author

dljsjr commented Jul 27, 2018

I also need access to a build machine back in Pensacola to make a debug build of the bridge, because our message package OOMs the bridge compile process on machines that don't have more than 16 GB of RAM, even when building with a single job thread, and my laptop can't cut it.
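For reference, the kind of invocation I've been trying looks roughly like this (colcon shown for illustration; the Ardent-era ament build tool can pass CMake arguments similarly):

```
# limit the underlying make to one job to keep peak memory down
export MAKEFLAGS="-j1"
# rebuild only the bridge, with debug info
colcon build --packages-select ros1_bridge \
  --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo
```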

@dirk-thomas
Member

If you want to try it early on the robot, you could comment back in some of the print statements which output various messages based on "progress" / "activity" in the bridge (sorry, that was written before log macros were available). Maybe that will also provide some information on what the bridge is doing when it is "hanging". It might even be enough to rule out a problem in the bridge, if that is the case.

our message package OOMs the bridge compile process on machines that don't have more than 16 GB of RAM, even when building with a single job thread

Wow, that is pretty extreme. We are aware that the bridge needs quite some memory to build due to the template specializations, but I have only seen 2-4 GB per thread. Independent of the problem in this ticket, if you could provide an example with similar memory usage in a separate ticket, it would be good to look into it and see what can be done to lower the resource usage.

@dljsjr
Author

dljsjr commented Jul 27, 2018

@dirk-thomas to be fair it's probably our fault for having 162 messages in a single package… we're going to break that up in our next release cycle :)

@dirk-thomas
Member

to be fair it's probably our fault for having 162 messages in a single package

Oh, I see. I was worried that it was due to some deep nesting or similar. I can totally see how 162 msgs in a single package could get you there 😉

@dljsjr
Author

dljsjr commented Jul 31, 2018

@dirk-thomas I'm still working on creating a debug build of the bridge (the debug symbols take up even more memory and I don't have a machine that can build it… working on getting our packages sorted out first), but something of note: ever since we started using --bridge-all-topics, pursuant to the "issue" I was having in #130, we have not been able to reproduce any crashes or hard hangs.
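For context, the bridge is now being launched like this (same dynamic_bridge executable from our custom build):

```
# bridge every ROS 1 <-> ROS 2 topic up front instead of creating bridges lazily
ros2 run ros1_bridge dynamic_bridge --bridge-all-topics
```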

@dljsjr
Author

dljsjr commented Aug 6, 2018

@dirk-thomas I was able to create a bridge build with debug symbols, but, similar to #131, I can't reproduce this using simulations. I'll see if we can recreate it on the real robot and get you some information, but it's going to be tricky to get you an example you can run.

@dirk-thomas
Member

I will go ahead and close this for now since we can't do anything without further information. Please feel free to comment on the closed ticket with the requested information and the issue can be reopened if necessary.
