AMCL takes a loooooong time to come up on OpenSplice, no error message #1094

Closed
rotu opened this issue Sep 4, 2019 · 14 comments

Comments

@rotu (Contributor) commented Sep 4, 2019

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • ros-dashing-nav2-amcl/bionic,now 0.2.3-1bionic.20190831.172054 amd64 [installed]
  • DDS implementation:
    • OpenSplice (with changes to ros_ospl.xml to set AllowMulticast spdp; see the snippet after this list)
  • Client library (if applicable):
    • N/A
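
For reference, the ros_ospl.xml change mentioned above ends up looking roughly like this. This is a sketch only: the element nesting is assumed from the OpenSplice DDSI2 configuration schema, and everything except the relevant element is omitted.

<OpenSplice>
  <DDSI2Service name="ddsi2">
    <General>
      <!-- use multicast only for initial participant discovery (SPDP) -->
      <AllowMulticast>spdp</AllowMulticast>
    </General>
  </DDSI2Service>
</OpenSplice>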

Steps to reproduce issue

Using the launch files here: https://github.com/RoverRobotics/openrover-ros2/tree/98fa2973b726bc8b867ce1980b4063f210585584/openrover_demo/launch

# on the rover
ros2 launch openrover_demo drive.launch.py
# on a computer
ros2 launch openrover_demo rviz.launch.py
# on either the computer or the rover
ros2 launch openrover_demo amcl.launch.py

Then give an initial pose in RViz with the 2D Pose Estimate tool.

Expected behavior

AMCL will publish the location of the robot, or display an error message explaining why it cannot.

Actual behavior

Occasionally, AMCL will start and I will see the robot location and lidar data in RViz within a few seconds.

Most of the time, AMCL acknowledges the initial pose but then does not publish a tf for a long time (minutes) and does not give any status information.

Here is some example output from the Nav2 stack:

[amcl-3] [INFO] [amcl]: Creating
[lifecycle_manager-1] [INFO] [lifecycle_manager]: Starting the system bringup...
[lifecycle_manager-1] [INFO] [lifecycle_manager]: Creating and initializing lifecycle service clients
[lifecycle_manager-1] [INFO] [lifecycle_manager]: Configuring and activating map_server
[map_server-2] [INFO] [map_server]: Configuring
[map_server-2] [INFO] [map_server]: OccGridLoader: Creating
[map_server-2] [INFO] [map_server]: OccGridLoader: Configuring
[map_server-2] [WARN] [map_server]: Mode parameter not set, using default value (trinary): bad conversion
[map_server-2] [INFO] [map_server]: Activating
[map_server-2] [INFO] [map_server]: OccGridLoader: Activating
[lifecycle_manager-1] [INFO] [lifecycle_manager]: Configuring and activating amcl
[amcl-3] [INFO] [amcl]: Configuring
[amcl-3] [INFO] [amcl]: initTransforms
[amcl-3] [WARN] [rcl.logging_rosout]: Publisher already registered for provided node name. If this is due to multiple nodes with the same name then all logs for that logger name will go out over the existing publisher. As soon as any node with that name is destructed it will unregister the publisher, preventing any further logs for that name from being published on the rosout topic.
[amcl-3] [INFO] [amcl]: initPubSub
[amcl-3] [INFO] [amcl]: Subscribed to map topic.
[amcl-3] [INFO] [amcl]: Activating
[lifecycle_manager-1] [INFO] [lifecycle_manager]: The system is active
[amcl-3] [INFO] [amcl]: initialPoseReceived
[amcl-3] [INFO] [amcl]: Setting pose (1567563024.080354): -0.313 -2.710 -2.384
[amcl-3] [INFO] [amcl]: Received a 567 X 329 map @ 0.050 m/pix
[amcl-3] 
[amcl-3] [INFO] [amcl]: Setting pose (1567563025.430725): -0.313 -2.710 -2.384
[amcl-3] [INFO] [amcl]: initialPoseReceived
[amcl-3] [INFO] [amcl]: Setting pose (1567563033.278710): -0.107 -1.923 -2.367

a minute and a half later:

[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563119.337 for reason(0)

[amcl-3] [INFO] [amcl]: createLaserObject
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563167.865 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563186.525 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563189.549 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563201.945 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563204.466 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563207.645 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563230.842 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563233.145 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563237.897 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563245.674 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563247.618 for reason(0)
[amcl-3] [INFO] [amcl_rclcpp_node]: [signalFailure] Drop message: frame 'lidar_link' at time 1567563250.294 for reason(0)

Only five minutes later does RViz start to show the robot model and the lidar scan.

Additionally, the robot state publisher and state estimation nodes give output like the following, which indicates that a downstream node is blocking publishing from upstream nodes:

[robot_state_publisher-5] ========================================================================================
[robot_state_publisher-5] Report      : WARNING
[robot_state_publisher-5] Date        : 2019-09-03T21:14:48-0500
[robot_state_publisher-5] Description : writer 1494363807:441:1:8963 topic rt/tf_static waiting on high watermark due to reader 1271133040:441:1:8964
[robot_state_publisher-5] 
[robot_state_publisher-5] Node        : gauss
[robot_state_publisher-5] Process     : robot_state_publisher <31747>
[robot_state_publisher-5] Thread      : xmit.user 7fbcddeb9700
[robot_state_publisher-5] Internals   : 6.9.190705OSS///ddsi2/q_transmit.c/845/0/1567563288.757925142/0
[robot_state_publisher-5] ========================================================================================
[robot_state_publisher-5] Report      : API_INFO
[robot_state_publisher-5] Date        : 2019-09-03T21:15:51-0500
[robot_state_publisher-5] Description : The number of samples '5000' has surpassed the warning level of '5000' samples.
[robot_state_publisher-5] Node        : gauss
[robot_state_publisher-5] Process     : robot_state_publisher <31747>
[robot_state_publisher-5] Thread      : dq.builtins 7fbcddefd700
[robot_state_publisher-5] Internals   : 6.9.190705OSS///v_checkMaxSamplesWarningLevel/v_kernel.c/1940/769/1567563351.563910320/0
[se_node-6] ========================================================================================
[se_node-6] Report      : WARNING
[se_node-6] Date        : 2019-09-03T21:17:11-0500
[se_node-6] Description : writer 1779548450:441:1:12035 topic rt/tf waiting on high watermark due to reader 2003778342:441:1:18180
[se_node-6] 
[se_node-6] Node        : gauss
[se_node-6] Process     : se_node <31748>
[se_node-6] Thread      : xmit.user 7f078d486700
[se_node-6] Internals   : 6.9.190705OSS///ddsi2/q_transmit.c/845/0/1567563431.488097452/0
[se_node-6] ========================================================================================
[se_node-6] Report      : WARNING
[se_node-6] Date        : 2019-09-03T21:17:11-0500
[se_node-6] Description : writer 1779548450:441:1:12035 topic rt/tf waiting on high watermark due to reader 1271133040:2687:1:8452
[se_node-6] 
[se_node-6] Node        : gauss
[se_node-6] Process     : se_node <31748>
[se_node-6] Thread      : xmit.user 7f078d486700
[se_node-6] Internals   : 6.9.190705OSS///ddsi2/q_transmit.c/845/0/1567563431.488192481/0
[se_node-6] ========================================================================================
@rotu (Contributor, Author) commented Sep 4, 2019

@mkhansen-intel I suspect this is a cascading failure caused by TF2's QoS reliability=reliable and made worse by the increased use of TFs throughout the nav stack.
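
For context, this is roughly what that QoS distinction looks like in rclcpp. This is a minimal sketch, not the actual tf2_ros code; the topic names and history depth are assumptions.

// Sketch of the QoS setting being blamed: with RELIABLE, a writer must hold
// samples until every matched reliable reader acknowledges them, so a single
// slow subscriber can back-pressure the publisher; with BEST_EFFORT,
// unacknowledged samples are simply dropped instead.
#include "rclcpp/rclcpp.hpp"
#include "tf2_msgs/msg/tf_message.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("tf_qos_demo");

  // reliable, keep-last history (the depth of 100 is an assumption)
  auto reliable_pub = node->create_publisher<tf2_msgs::msg::TFMessage>(
    "tf", rclcpp::QoS(rclcpp::KeepLast(100)).reliable());

  // the alternative that sheds load instead of stalling the writer
  auto best_effort_pub = node->create_publisher<tf2_msgs::msg::TFMessage>(
    "tf_demo", rclcpp::QoS(rclcpp::KeepLast(100)).best_effort());

  // publishers are unused; this only demonstrates the QoS construction
  (void)reliable_pub;
  (void)best_effort_pub;

  rclcpp::shutdown();
  return 0;
}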

@SteveMacenski (Member) commented Sep 4, 2019

I know I'm stating the obvious, but I'm listing it to spark thoughts.

The laser receive callback wouldn't be called if TF wasn't available, nor if the laser scan wasn't received. The first step I would take is enabling both a laserReceived callback behind the message filter and another subscription without it. Try to reproduce this issue. When you do, see if the non-filtered callback is getting messages through (just print "hi" or something). If so, it's TF. If not, then it's the laser not being received from Gazebo.

It also could be from the map not being received, but I'm not seeing the "Waiting for map..." messages.
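
Something along these lines would do for the non-filtered probe. This is a standalone sketch, not the actual nav2_amcl code; the "scan" topic name and the sensor-data QoS are assumptions.

// Debugging probe: subscribe to the laser topic directly, bypassing the
// tf2_ros::MessageFilter, to see which stage stops delivering messages.
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"

class ScanProbe : public rclcpp::Node
{
public:
  ScanProbe() : Node("scan_probe")
  {
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS(),
      [this](sensor_msgs::msg::LaserScan::SharedPtr msg) {
        // If this keeps printing while laserReceived stays silent,
        // the hold-up is TF (the message filter), not the laser data itself.
        RCLCPP_INFO(get_logger(), "raw scan at %d.%09u",
          msg->header.stamp.sec, msg->header.stamp.nanosec);
      });
  }

private:
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanProbe>());
  rclcpp::shutdown();
  return 0;
}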

@rotu (Contributor, Author) commented Sep 4, 2019

I suspect TFs because this is what I see when I run ros2 run tf2_ros tf2_monitor. Still, (1) the stack should recover faster, and (2) I should see some error messages indicating what's wrong.

RESULTS: for all Frames

Frames:
Frame: /base_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /caster_back_left_link, published by <no authority available>, Average Delay: 257.517, Max Delay: 321.816
Frame: /caster_back_right_link, published by <no authority available>, Average Delay: 257.517, Max Delay: 321.816
Frame: /cpu_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /imu_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /lidar_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /plate_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /sheetmetal_link, published by <no authority available>, Average Delay: 217.318, Max Delay: 306.519
Frame: /wheel_left_link, published by <no authority available>, Average Delay: 257.517, Max Delay: 321.816
Frame: /wheel_right_link, published by <no authority available>, Average Delay: 257.517, Max Delay: 321.816
Frame: base_footprint, published by <no authority available>, Average Delay: 257.158, Max Delay: 367.588

All Broadcasters:
Node: <no authority available> 1.89781 Hz, Average Delay: 247.261 Max Delay: 367.588
@rotu (Contributor, Author) commented Sep 4, 2019

@SteveMacenski I added another subscription to laser scans using the same QoS, with a debug message, as well as a debug message at the beginning of AmclNode::laserReceived. It looks like laser scans are trickling through the separate subscription; they just aren't making it through the message filter for a few minutes.

I'm pretty sure the overall performance is significantly worse when bringing up the rest of the nav2 stack (rather than just map_server and amcl), though I don't have numbers to back that up.

@SteveMacenski (Member) commented Sep 4, 2019

Most of those frames are static transforms too, not even stuff on the /tf topic. That's... interesting. Where are the dynamic frames? (map, odom, etc)

Good to know, good to know it's definitely TF then.

Edit: are you on master or the latest Dashing binaries (as the ticket says)? We haven't updated Dashing recently on the build farm. Try master?

@rotu (Contributor, Author) commented Sep 4, 2019

The dynamic frames eventually show up. It just takes for-freaking-ever.

I’m on Dashing on purpose. I’d like to get my rover working against actually released packages instead of the rolling target that is master.

@SteveMacenski (Member) commented Sep 4, 2019

Fair enough, but as a debugging instrument it would be good to know whether the current state of ros2 master fixes the problems. If so, the answer is just to release new debians (or, worst case, the E release). If not, then that's something actionable.

I can say I haven't seen things take minutes to come up in quite a while working off master. With that said, if things don't come up within 10 seconds I kill it and restart, so it may just be my habits, unless what you're describing is consistent and pervasive rather than random/occasional.

@rotu rotu changed the title AMCL takes a loooooong time to come up, no error message AMCL takes a loooooong time to come up on OpenSplice, no error message Sep 5, 2019
@rotu (Contributor, Author) commented Sep 5, 2019

I've found different performance characteristics on other RMW implementations. This particular failure mode seems to mostly affect OpenSplice.

@eboasson commented Sep 5, 2019

I suspect the problem can be reduced by toggling a configuration setting of OpenSplice's DDSI service: the curiously named Internal/AggressiveKeepLastWhc. It defaults to false, which means the service ignores the KEEP_LAST setting at the protocol level for writers, instead treating them as KEEP_ALL. There is method behind that madness.

The basic design of OpenSplice is that applications interact with a "kernel" that maintains all data (ideally in shared memory), which in turn also interacts with a number of services that handle all kinds of background tasks (in that ideal setting, those are processes, but typically it is all squeezed into a single process without shared memory and those services end up as threads in the process). One of these services is the networking service, and data that is published by an application is simply enqueued for the networking service by the kernel.

All this antedates the DDSI standard by a fair bit and OpenSplice has another, proprietary, networking service in which completely different design choices have been made — and so has a sweet spot in different system design choices. In particular, there is no such thing as history in that other protocol, it is simply a (reliable) transport. For KEEP_LAST writers, the kernel decides what data to enqueue and what data to drop. Consequently, if the networking service were to independently drop data as well, there would be two independent processes both dropping data, and that's kinda nasty, even if most of the time, there won't be much difference.

And so, the default is not to drop data within the DDSI service, but the price is that you can get flow control slowing things down where you wouldn't expect it. If that gets bad, messages like "topic rt/tf_static waiting on high watermark" start showing up.

The other message, the "number of samples '5000' has surpassed the warning level", probably doesn't have that much to do with it: notice that it says the thread is "dq.builtin", which is a thread exclusively dealing with (DDSI) built-in data, i.e., discovery data. That means the warning most likely relates to built-in topics, which in turn suggests there are readers/writers/participants being created (and presumably deleted) very often, but that no one is removing the disposed ones from a data reader. But that is a bit of a guess.

@rotu (Contributor, Author) commented Sep 5, 2019

Thanks for the background and recommendations, @eboasson!

Editing my ospl.xml to set Internal/AggressiveKeepLastWhc = true unfortunately did not make the startup issues go away. It did seem to prevent both the "high watermark" warnings and the "number of samples" warning-level messages.

Additionally, setting Internal/RetransmitMerging=never and Internal/Watermarks/WhcHigh=800 kB, which have previously been suggested as performance tweaks, did not help either.
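
For reference, those settings end up in ospl.xml roughly like this (a sketch: only the DDSI2 part is shown, and the element nesting is assumed from the DDSI2 configuration schema):

<OpenSplice>
  <DDSI2Service name="ddsi2">
    <Internal>
      <!-- honour KEEP_LAST at the protocol level instead of treating writers as KEEP_ALL -->
      <AggressiveKeepLastWhc>true</AggressiveKeepLastWhc>
      <!-- previously suggested performance tweaks -->
      <RetransmitMerging>never</RetransmitMerging>
      <Watermarks>
        <WhcHigh>800 kB</WhcHigh>
      </Watermarks>
    </Internal>
  </DDSI2Service>
</OpenSplice>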

I created a Wireshark capture of an attempted bringup. Maybe there's something in there you can make sense of. The file is 78.8 MB compressed or 1.4 GB uncompressed: https://send.firefox.com/download/36fdd8c41f76b42e/#4Phjj_ax11m36nOc0jz7JQ

@eboasson commented Sep 6, 2019

I have had a quick look @rotu ... it ain't pretty. A really detailed study would take more time, but I'm not sure that would really make a difference in practice. What I see:

  • 7 processes on the one node, 11 on the other.
  • Each process tries to learn the topic definitions, readers (+ DCPS subscribers), and writers (+ DCPS publishers) of all other processes. Without multicast other than for initial participant discovery (i.e., as you described), with 18 processes that's necessarily quite a bit of data.
  • The vast majority of the traffic is topic discovery. It seems plausible that the XML parsing & validation by the "spliced" service can't keep up, and as it is the "dq.builtin" thread that dumps this data into a data reader for further processing by that service, this fits with the earlier observation concerning the "number of samples surpassing the warning level".
  • To make matters worse, there are quite a few retransmits that appear to be caused by packet loss. My guess is that with up to 17 processes sending to one at the same time, the UDP receive buffers can easily overflow, leading to packet loss and (even more) retransmits.
  • If the peers are all known when a topic/reader/writer is created, the corresponding discovery data is sent straight to all peers (using multicast if that's allowed). If they are not yet known, they'll request retransmits later, which is even more expensive.

Given the numbers of processes on the two nodes and the dominant traffic being topic discovery, one would naively expect an 11/7 ratio in traffic for the two machines. Wireshark reports 877474 units for 10.0.2.22 and 601316 for 10.0.0.8, a ratio of about 1.46 against the naive 11/7 ≈ 1.57. That is pretty close, so I think nothing more than this is going on.

What to do:

  • Use shared memory, if you can get your hands on a shared memory-enabled build. When you do that, suddenly there are only two nodes that need to exchange topic definitions, rather than 18 processes. Obviously, going from 18 processes all-to-all to 2 nodes all-to-all is a gigantic reduction in traffic.
  • A hack to disable topic discovery. It'll run just fine without it as long as you don't need transient/persistent data. That's literally a hack in the DDSI service source code, but a really small one (though I have done it before, I'd have to dig into the sources to figure it out again).
  • Enlarge the UDP socket receive buffers. There is an option for that (Internal/MinimumSocketReceiveBufferSize), but it would probably also require raising the kernel's limit on the socket receive buffer size. I don't know how much it would help, but I do suspect it'd help a bit. (See the sketch after this list.)
  • ... just use Cyclone DDS 😁
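
For the receive-buffer suggestion, the setting would go roughly here. The element path is assumed from the DDSI2 configuration schema, the 1 MB value is only an example, and on Linux the kernel's own ceiling (net.core.rmem_max) may need to be raised as well.

<OpenSplice>
  <DDSI2Service name="ddsi2">
    <Internal>
      <!-- request larger UDP receive buffers (example value, not a recommendation) -->
      <MinimumSocketReceiveBufferSize>1 MB</MinimumSocketReceiveBufferSize>
    </Internal>
  </DDSI2Service>
</OpenSplice>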
@SteveMacenski (Member) commented Sep 6, 2019

... just use Cyclone DDS

I got off the phone with Joe last week and he convinced me in about 15 minutes that Cyclone was the only real option to move to next.

@eboasson thanks for that very detailed and helpful description. If you have any suggestions for how to modify our subs/pubs to help in this stack, that would be appreciated, but it sounds mostly like issues around the DDS implementation plus not using components (yet).

@mkhansen-intel (Collaborator) commented Sep 11, 2019

@rotu - as this is NOT an AMCL or really a Nav2 bug, can we close it? If OpenSplice needs to be fixed (it doesn't sound like it can be), then maybe it should be an upstream issue in the OpenSplice project.

@rotu (Contributor, Author) commented Sep 11, 2019

@mkhansen-intel agreed
