Improving jitter through the driver and middleware stack for data processors #41

tpanzarella · 2020-05-01T21:27:31Z

In both the pointcloud_processor and scan_processor (the latter is really insignificant) this patch has the driver give up ownership of the message at the time of data publishing. This allows for middleware optimizations to include the elimination of unnecessary copies. Several tests on my machine show order of magnitude jitter improvements through the driver + middleware stack both out-of-process and while running as a component to sub-millisecond levels. Here are some data:

NOTE: msg_stamp here is the time stamped on the message by the driver (so, how consistent is the LiDAR at delivering the data). recv_stamp here is the time received by a subscriber (so, how long spent in the driver and through the ROS2 middleware before a subscriber can work with the data). I'm using TIME_FROM_ROS_RECEPTION as my timestamp mode so I can truly measure time spent in the driver and ROS2 middleware and not worry about in-LiDAR or network latency.

Before this patch, here is what the data looked like for the pointcloud processor (these are for interprocess communication tests over 1,000 messages -- the effect is similar when run as a component):

Here is a quantile plot of the same:

Here are the end-to-end numbers for this same set of data (again, before this patch):

Here is what the data look like with this patch:
(Not perfect, there is much more we can do, but significantly improved in terms of jitter)

In terms of the end-to-end numbers, prior to this patch, the median end-to-end pointcloud delivery latency was 18.164 ms, however, the median absolute deviation (MAD) was 4.256 ms. With this patch, the median latency was 23.706 ms, but the MAD was only .252 ms. So, the patch significantly lessens the jitter which makes the system more stable. We can always tune to lessen absolute latency. Indeed, in my setup, the DDS could be further tuned.

The driver now gives up ownership of the PointCloud message in the pointcloud_processor at publish time to allow for middleware optimizations to include the elimination of unnecessary copies. Several tests on my machine show order of magnitude jitter improvements through the driver + middleware stack both out-of-process and while running as a component to sub-millisecond levels.

SteveMacenski · 2020-05-01T21:47:04Z

This allows for middleware optimizations to include the elimination of unnecessary copies

~~Is this for the case of using the component?~~ (edit: reread the post. That's interesting, do you know why this is?) I'm surprised for using the non-real-time rclcpp API that the inter-process comms would be improved so much that way. I'm just trying to fully understand why that is the case so I can apply that to other places in the ROS ecosystem that could benefit. Excellent analysis, thank you.

This looks good, it'll go in as is.

tpanzarella · 2020-05-01T21:57:26Z

I believe it has to do with cutting down on the number of data copies and allocations once the data leaves the driver. As the code was written prior, the msg was getting passed to publish as a reference. So, for safety, the middleware must make a copy of the data right off the bat and to support that copy there would be an allocation as well. By passing a unique_ptr to publish we are explicitly communicating our ownership intentions to the middleware. Specifically, that we relinquish ownership. I believe, internally, when the middleware takes receipt of this unique_ptr, it immediately promotes it to a shared_ptr (details beyond this comment) but that does not incur the copy or allocation. It's pretty cheap.

I have to play a bit with the intra-process comms. I have not gotten this to give us true zero-copy yet but I haven't looked that hard. I will spend some time with that next week as it is relevant to what we are currently working on.

SteveMacenski · 2020-05-01T22:01:30Z

That begs the question to me then: why does ROS2 even allow you to publish a non-pointer type? It seems like that's a reasonable burden to put on developers for that much increase in stability.

tpanzarella · 2020-05-01T22:04:23Z

I certainly like the smart pointers as a matter of taste as ownership semantics are clearly communicated. It is a facility provided to us by the language and we should exploit it. I suppose the same performance could be achieved by casting a non-pointer type to an r-value reference, however, that can silently degrade to an l-value reference and you would not realize it until the performance became a problem. The smart pointers make the intention clear.

tpanzarella · 2020-05-01T22:06:34Z

I should also note that I have done a bit of tuning on my machine (DDS, UDP buffer sizes, etc.), however, the delta shown in those plots are simply as a result of the change in the way we are publishing through the driver.

tpanzarella · 2020-05-01T22:09:03Z

Let me know if you want me to handle the merge.

SteveMacenski · 2020-05-01T22:55:32Z

@wjwwood maybe something to consider in your API review for rclcpp. I think it would make alot of sense to force the use of smart pointers if this is the kind if impact it can make.

wjwwood · 2020-05-02T00:30:29Z

The reason why we allow a reference is to allow the user to have messages on the stack if they want or to reuse a message loop to loop. Because you give up ownership you need to allocate a new one each time.

You could also potentially get a similar performance improvement by using the middleware publisher allocation feature which lets the middleware preallocate resources and reuse them on each publish call rather than needing to allocate resources each publish. Not all the middleware implementations leverage this optimization but they could in the future. @mjcarroll worked on that feature maybe he could point you to an example, I’m on my phone atm and don’t have an example at hand.

For the zerocopy intraprocess you just need to publish and subscribe with unique ptr and ensure there’s only one publisher and subscription pair.

One further thing you could do is use the loaned message api where you can get a message from the middleware, fill it out, and then publish it (returning ownership to the middleware). This can allow the middleware to do zero copy, even Between processes. Iceoryx does this, @Karsten1987 gave a ROSCon talk about this.

SteveMacenski · 2020-05-02T00:45:01Z

The reason why we allow a reference is to allow the user to have messages on the stack if they want or to reuse a message loop to loop

Ah, I do remember us talking about that. Thanks for the reminder. Now that I'm in front of a computer, I'm scouring all of the navigation repos for instances where we don't do that to update to work with unique_ptrs at the bare minimum. We play around with too many pointclouds/images/maps to not use that boost.

I appreciate the detailed response. I think I've covered the major bases with tickets where updates are required.

tpanzarella · 2020-05-02T12:30:11Z

Hey @wjwwood thanks for your comments. Quick point of clarity on the zero-copy strategy...

For the zerocopy intraprocess you just need to publish and subscribe with unique ptr and ensure there’s only one publisher and subscription pair.

This makes sense that zero-copy would work in this scenario. I assume when there is a 1-N (producer->consumer) topology, subscriptions to a std::shared_ptr<const T> should also get the data in true zero copy, no?

Also, something basic related to this that maybe you can help with. Is it a true statement that to get zero-copy the .use_intra_process_comms(true) flag needs to be set on the NodeOptions passed to the Node ctor?

This AM, I wanted to see if I could get this to work correctly. However, I ran into a problem. The driver is implemented as a managed (lifecycle) node. As the driver was coming up and an external maanger was pushing it through its FSM, I got the following exception:

[ouster_driver-1] [ERROR] []: Caught exception in callback for transition 10
[ouster_driver-1] [ERROR] []: Original error: intraprocess communication allowed only with volatile durability
[ouster_driver-1] [WARN] []: Error occurred while doing error handling.
[ouster_driver-1] [ERROR] [ouster_driver]: Handing error in Ouster driver node.

I got the exact same behavior when 1) running the driver standalone (not in a component container) and driving the FSM via launch and 2) using a custom lifecycle manager to drive the FSM when running the driver in a component container (as there is currently a limitation in launch that we cannot define a managed lifecycle component so we need a custom manager for this).

My direct question is: is it possible to do zero-copy / intraprocess comms with managed nodes? If yes, I think there is a fundamental piece I am missing. Any insights you can lend are really appreciated. Pinging @mjcarroll on this too as I think he is pretty deep on these topics as well -- Hi Mike :-)

The one particular line of the error above that has me puzzled is:

[ERROR] []: Original error: intraprocess communication allowed only with volatile durability

I know what it is trying to say, but, is there a way that I can communicate to the middleware something like: "hey, do intraprocess if you can but for things like messages whose QoS preclude it, just take the interprocess approach." All we really care about are zero-copy semantics for the data, those with rclcpp::SensorDataQoS (i.e., among other things: volatile durability).

One further thing you could do is use the loaned message api where you can get a message from the middleware, fill it out, and then publish it (returning ownership to the middleware). This can allow the middleware to do zero copy, even Between processes. Iceoryx does this, @Karsten1987 gave a ROSCon talk about this.

Yes. I am looking forward to getting Iceoryx set up. It gives us the best of both worlds: Zero-copy / low latency data transformation pipelines with 3D LiDAR scans and the loose coupling / modularity of using out-of-process nodes.

SteveMacenski · 2020-05-13T00:11:59Z

@tpanzarella your charts are very compelling, I've be interested to see what that looks like with ROS1's Ouster Drivers. I think showing a side by side of "this than that" could be a really good propaganda piece for ROS2 docs / a Box Robotics blog post.

tpanzarella added 3 commits May 1, 2020 15:44

Updated scan_processor to release ownership of scan_msg at publish time.

3928b90

Fixed some linter issues

6a4a234

tpanzarella requested a review from SteveMacenski May 1, 2020 21:27

SteveMacenski approved these changes May 1, 2020

View reviewed changes

SteveMacenski mentioned this pull request May 1, 2020

Change all publishers to publish unique ptrs ros-navigation/navigation2#1672

Closed

SteveMacenski merged commit ed188b3 into ros-drivers:eloquent-devel May 1, 2020

This was referenced May 2, 2020

[ROS2] Replace publishers with msg references with unique_ptrs cra-ros-pkg/robot_localization#564

Closed

[ROS2] Replace publishers with msg references with unique_ptrs SteveMacenski/slam_toolbox#201

Closed

tpanzarella mentioned this pull request May 22, 2020

Provides a simple way to selectively turn on/off data processors #46

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving jitter through the driver and middleware stack for data processors #41

Improving jitter through the driver and middleware stack for data processors #41

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020 •

edited

Loading

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020 •

edited

Loading

tpanzarella commented May 1, 2020

tpanzarella commented May 1, 2020

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020

wjwwood commented May 2, 2020

SteveMacenski commented May 2, 2020

tpanzarella commented May 2, 2020

SteveMacenski commented May 13, 2020

Improving jitter through the driver and middleware stack for data processors #41

Improving jitter through the driver and middleware stack for data processors #41

Conversation

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020 • edited Loading

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020 • edited Loading

tpanzarella commented May 1, 2020

tpanzarella commented May 1, 2020

tpanzarella commented May 1, 2020

SteveMacenski commented May 1, 2020

wjwwood commented May 2, 2020

SteveMacenski commented May 2, 2020

tpanzarella commented May 2, 2020

SteveMacenski commented May 13, 2020

SteveMacenski commented May 1, 2020 •

edited

Loading

SteveMacenski commented May 1, 2020 •

edited

Loading