
Automatically match QoS settings across the bridge #5

Merged: jacobperron merged 15 commits into main from match_qos on Mar 24, 2021

Conversation

@jacobperron jacobperron commented Mar 15, 2021

I've updated the design doc to reflect the implementation.

Specifically, this change introduces a WaitForQosHandler class used for deferring topic bridge creation. It creates a thread for each topic bridge that waits for at least one publisher to become available. Once a publisher is available, it signals the domain bridge via a callback and the topic bridge is created.

In the special case of more than one publisher with different QoS settings, I've adopted an approach similar to rosbag2's for selecting a QoS that is compatible with most of the available publishers. Note, we could factor out this logic following ros2/rosbag2#601 and/or ros2/rmw#304.
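To make that arbitration concrete, here is a minimal sketch (the helper name is hypothetical, and this is one simple variant rather than the code in this PR) of how a compatible QoS can be selected from the result of get_publishers_info_by_topic():

```cpp
#include <cstddef>
#include <vector>

#include "rclcpp/qos.hpp"
#include "rclcpp/topic_endpoint_info.hpp"

// Hypothetical helper: pick reliability and durability for the bridged
// subscription/publisher based on what the existing publishers offer.
// Assumes `publishers` is non-empty (the bridge only runs this after at
// least one publisher has been detected).
rclcpp::QoS select_compatible_qos(
  const std::vector<rclcpp::TopicEndpointInfo> & publishers)
{
  rclcpp::QoS qos{rclcpp::KeepLast(10)};
  std::size_t reliable_count = 0u;
  std::size_t transient_local_count = 0u;
  for (const auto & info : publishers) {
    const auto & profile = info.qos_profile().get_rmw_qos_profile();
    if (profile.reliability == RMW_QOS_POLICY_RELIABILITY_RELIABLE) {
      ++reliable_count;
    }
    if (profile.durability == RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL) {
      ++transient_local_count;
    }
  }
  // Only request the stricter policy if every publisher offers it; a stricter
  // request would fail to match the best-effort/volatile publishers.
  if (reliable_count == publishers.size()) {
    qos.reliable();
  } else {
    qos.best_effort();
  }
  if (transient_local_count == publishers.size()) {
    qos.transient_local();
  } else {
    qos.durability_volatile();
  }
  return qos;
}
```

The idea, similar to rosbag2's, is to only request the stricter policy when every available publisher offers it.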

Query endpoint info for publishers to get QoS settings before bridging a topic.

TODO: only create bridge once a publisher is detected (use GraphListener).
TODO: integration test (use launch_testing).
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Poll in a thread until a publisher can be queried for QoS settings.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@ivanpauno (Member) left a comment

Looks like a good start, I've left some comments.

Note to self: review the test cases.

doc/design.md (outdated; resolved)
doc/design.md (resolved)
doc/design.md (outdated; resolved)
src/domain_bridge/domain_bridge.cpp (resolved)
src/domain_bridge/domain_bridge.cpp (outdated; resolved)
src/domain_bridge/wait_for_qos_handler.hpp (outdated; resolved)
}
};
auto waiting_thread = std::make_shared<std::thread>(invoke_callback_when_qos_ready);
waiting_threads_.push_back(waiting_thread);
ivanpauno (Member):

Instead of having many waiting threads, you could have one thread that checks all of the topics. That scales much better.

ivanpauno (Member):

e.g. you can store a std::vector<std::pair<topic, callback>>, and the thread waiting for events can check whether there's a publisher on each topic and call the callbacks accordingly (the vector will need mutual exclusion).
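For illustration, a rough sketch of that single-thread approach (all names are hypothetical); each pass over the vector checks every registered topic and fires the callback once a publisher shows up:

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

#include "rclcpp/rclcpp.hpp"

using QosCallback = std::function<void (const rclcpp::QoS &)>;

std::mutex waiting_mutex;
std::vector<std::pair<std::string, QosCallback>> waiting_topics;

// One pass of the single waiting thread: for every registered topic, fire
// (and drop) its callback as soon as at least one publisher is available.
void check_topics_once(const rclcpp::Node::SharedPtr & node)
{
  std::lock_guard<std::mutex> lock(waiting_mutex);
  for (auto it = waiting_topics.begin(); it != waiting_topics.end(); ) {
    auto publishers = node->get_publishers_info_by_topic(it->first);
    if (!publishers.empty()) {
      // For simplicity this hands over the first publisher's QoS; the real
      // handler would arbitrate across all publishers first.
      it->second(publishers[0].qos_profile());
      it = waiting_topics.erase(it);
    } else {
      ++it;
    }
  }
}
```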

jacobperron (Member, Author):

I considered this and meant to try a refactor to do it. I agree it would scale much better. I'll take a look at doing it.

jacobperron (Member, Author):

So, the problem is that the node used for querying topics is specific to the domain ID. In other words, I think we need at least one thread per domain ID. We could try to be a bit more clever and store a std::map<node, std::vector<std::pair<topic, callback>>> and have exactly one node (domain ID) per thread. The logic's going to get a bit more complex.
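Roughly, the data structure being described (type names hypothetical):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include "rclcpp/rclcpp.hpp"

// One entry per domain: the key is the node created for that domain ID and
// the value is the list of (topic, callback) pairs that the domain's single
// waiting thread services.
using TopicCallbackVector =
  std::vector<std::pair<std::string, std::function<void (const rclcpp::QoS &)>>>;
using WaitingMap = std::map<rclcpp::Node::SharedPtr, TopicCallbackVector>;
```

with one waiting thread servicing each map entry, i.e. one thread per domain ID.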

ivanpauno (Member):

That sounds fine to me. It's a bit tricky, but considering that in the common case (bridging from one domain to another) you only have one thread, it seems worth doing.

jacobperron (Member, Author):

PTAL at 4ae1095

src/domain_bridge/wait_for_qos_handler.hpp (outdated; resolved)
src/domain_bridge/wait_for_qos_handler.hpp (outdated; resolved)
}

// Initialize QoS arbitrarily
QosMatchInfo result_qos(endpoint_info_vec[0].qos_profile());
ivanpauno (Member):

Suggested change
- QosMatchInfo result_qos(endpoint_info_vec[0].qos_profile());
+ QosMatchInfo result_qos;
+ result_qos.qos.reliability(endpoint_info_vec[0].qos_profile().reliability());
+ result_qos.qos.durability(endpoint_info_vec[0].qos_profile().durability());

I think copying the liveliness and deadline from the profile we read isn't a good idea.
Each of the publishers might have a different one, but the good news is that if your subscription has the default liveliness and default deadline, it will match everything.

jacobperron (Member, Author):

Hmmm, yeah, good point. We could also take the largest of all values for deadline and lifespan (note the QoS here is also applied to the publisher). Liveliness is trickier; I'm not sure how to handle the "manual by topic" case since we are republishing to the other side. Maybe it's best to always use "automatic" for the bridge publisher. What do you think?

ivanpauno (Member):

> I'm not sure how to handle the "manual by topic" case since we are republishing to the other side

A subscription with "AUTOMATIC" liveliness will match everything, and the publisher should also be automatic; otherwise you will need code doing manual assert_liveliness() calls.

> We could also take the largest of all values for deadline and lifespan

Sounds fine to me; that's generally the default, but manually setting the durations to "INFINITY" is better.

jacobperron (Member, Author):

> Sounds fine to me; that's generally the default, but manually setting the durations to "INFINITY" is better.

I'd like to try to mimic the QoS as best we can, so I'll avoid setting the durations to infinity by default.

I think we have to make an exception for liveliness, and explicitly use AUTOMATIC.

jacobperron (Member, Author):

See c62b3a1

  • Always use automatic liveliness
  • Use the max of all deadlines
  • Use the max of all lifespans

Updated the design doc and added tests.
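For reference, a condensed sketch of that policy (not the literal code from c62b3a1; the helper name and structure are made up):

```cpp
#include <vector>

#include "rclcpp/qos.hpp"
#include "rclcpp/topic_endpoint_info.hpp"
#include "rmw/types.h"

namespace
{
// Compare two rmw_time_t values (seconds + nanoseconds).
bool time_less_than(const rmw_time_t & lhs, const rmw_time_t & rhs)
{
  return lhs.sec < rhs.sec || (lhs.sec == rhs.sec && lhs.nsec < rhs.nsec);
}
}  // namespace

// Hypothetical helper: always-automatic liveliness, plus the largest deadline
// and lifespan found among the available publishers.
void apply_duration_policies(
  rclcpp::QoS & qos,
  const std::vector<rclcpp::TopicEndpointInfo> & publishers)
{
  qos.liveliness(RMW_QOS_POLICY_LIVELINESS_AUTOMATIC);
  rmw_time_t max_deadline{0u, 0u};
  rmw_time_t max_lifespan{0u, 0u};
  for (const auto & info : publishers) {
    const auto & profile = info.qos_profile().get_rmw_qos_profile();
    if (time_less_than(max_deadline, profile.deadline)) {
      max_deadline = profile.deadline;
    }
    if (time_less_than(max_lifespan, profile.lifespan)) {
      max_lifespan = profile.lifespan;
    }
  }
  qos.deadline(max_deadline);
  qos.lifespan(max_lifespan);
}
```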

ivanpauno (Member):

> Use the max of all deadlines
> Use the max of all lifespans

IMO, those two are better chosen manually; "replicating" them doesn't completely make sense to me.

jacobperron (Member, Author):

They seem to make sense to me. If we have a publisher in domain A with a deadline, then using the same deadline for the bridge publisher on domain B will ensure subscriptions in domain B experience similar behavior to subscriptions in domain A. Ditto for lifespan. What in particular doesn't make sense about this logic?

I will definitely expose a configuration point for users to override QoS values (e.g. via the YAML config). Expect a follow-up PR after this one.

ivanpauno (Member):

> They seem to make sense to me. If we have a publisher in domain A with a deadline, then using the same deadline for the bridge publisher on domain B will ensure subscriptions in domain B experience similar behavior to subscriptions in domain A.

The thing is that the bridge will introduce some delay, so you typically won't be able to meet the same deadline as the original publisher.
If deadlines are important, I would rather think about what a reasonable deadline is for the new domain than let the bridge infer it.

> Ditto for lifespan. What in particular doesn't make sense about this logic?

Lifespan is how long the message lives in the queue, and it matters how that is combined with the history size/kind.
Because the last two cannot be introspected, I don't think that inferring the lifespan will always lead to good results.

Anyway, keeping this logic sounds fine to me.

jacobperron (Member, Author):

> The thing is that the bridge will introduce some delay, so you typically won't be able to meet the same deadline as the original publisher.

Since deadline is a duration between messages, any delay introduced by the bridge should (hopefully) be systematic. Even though the total time from the original publisher to the endpoint in the other domain may be longer, the time between consecutive messages republished by the bridge should ideally remain the same as it was before they entered the bridge. E.g. if the original publisher is publishing at 10 Hz, I would expect the bridge to also publish at 10 Hz into the other domain.

> Lifespan is how long the message lives in the queue, and it matters how that is combined with the history size/kind.

This is a good point. I can add a note about this potential pitfall in the design doc.

jacobperron (Member, Author):

I've reconsidered and changed the default values to be max integers instead of matching available publishers in #13

@jacobperron jacobperron mentioned this pull request Mar 16, 2021
- Always use automatic liveliness
- Use max of deadline policies
- Use max of lifespan policies

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Use a condition variable to notify each thread that a new callback has been registered.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
jacobperron (Member, Author):

So, the logic introduced in c62b3a1 uncovered an integer overflow bug in rclcpp::Duration. This PR depends on ros2/rclcpp#1584 for QoS matching to work properly. Specifically, setting a deadline (or lifespan) with a value larger than INT32_MAX (which is the default value) doesn't work.
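For context, a hypothetical snippet of the kind of value involved; the exact failure mode and fix are tracked in ros2/rclcpp#1584:

```cpp
#include <cstdint>
#include <limits>

#include "rclcpp/qos.hpp"
#include "rmw/types.h"

int main()
{
  // A deadline whose second count exceeds INT32_MAX, similar to the very
  // large default values reported for endpoints. Durations this large
  // reportedly overflowed inside rclcpp::Duration before ros2/rclcpp#1584.
  rmw_time_t huge_deadline;
  huge_deadline.sec = static_cast<uint64_t>(std::numeric_limits<int32_t>::max()) + 1u;
  huge_deadline.nsec = 0u;

  rclcpp::QoS qos{rclcpp::KeepLast(10)};
  qos.deadline(huge_deadline);  // round-trips correctly only with the fix
  return 0;
}
```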

@ivanpauno (Member) left a comment

Looking good; I only found some minor issues.

Comment on lines 151 to 154
cv->wait(
lock,
[this, &topic_callback_vec]
{return topic_callback_vec.size() > 0u || this->shutting_down_.load();});
ivanpauno (Member):

Don't combine condition variables and atomics; this code is equivalent to:

while (!(topic_callback_vec.size() > 0u || this->shutting_down_.load())) {
  // boolean flipped here and notification sent here, deadlock
  cv->wait(lock);
}

this article explains the problem well.

tl;dr: replace the atomic with a normal bool and lock the waiting_map_mutex_ before flipping it (no need to hold the mutex while calling notify_all() though).

ivanpauno (Member):

Actually, because of the logic used here, it seems that topic_callback_vec.size() > 0u is always true.
Maybe just delete the condition variable code(?)

jacobperron (Member, Author):

I think without the condition variable, the thread essentially runs a busy-loop when topic_callback_vec.size() == 0u. There is no work to do, but it will continue looping and waiting for graph events.

jacobperron (Member, Author):

Thanks for catching the deadlock bug! If I remove the atomic from the predicate, I get other shutdown issues; looking into it.

ivanpauno (Member):

> I think without the condition variable, the thread essentially runs a busy-loop when topic_callback_vec.size() == 0u. There is no work to do, but it will continue looping and waiting for graph events.

True, I didn't check that correctly.

ivanpauno (Member):

> Thanks for catching the deadlock bug! If I remove the atomic from the predicate, I get other shutdown issues; looking into it.

Just to be clearer, my recommendation was to replace this line

shutting_down_.store(true);

with

{
  std::lock_guard<std::mutex> lock(waiting_map_mutex_);
  shutting_down_ = true; 
}

and shutting_down_ can be made a bool.
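Putting the two halves together, a condensed sketch of the recommended pattern (member names loosely follow this discussion; everything else is made up):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

class WaitForQosHandlerSketch
{
public:
  // Waiting thread: the predicate reads shutting_down_ while the lock is
  // held, so a notification cannot slip in between the check and the wait.
  void wait_for_work()
  {
    std::unique_lock<std::mutex> lock(waiting_map_mutex_);
    cv_.wait(
      lock,
      [this] {return !topic_callback_vec_.empty() || shutting_down_;});
  }

  // Shutdown path: flip the flag under the same mutex, then notify without
  // holding the lock.
  void shutdown()
  {
    {
      std::lock_guard<std::mutex> lock(waiting_map_mutex_);
      shutting_down_ = true;
    }
    cv_.notify_all();
  }

private:
  std::mutex waiting_map_mutex_;
  std::condition_variable cv_;
  bool shutting_down_{false};  // plain bool instead of std::atomic<bool>
  std::vector<std::pair<std::string, std::function<void ()>>> topic_callback_vec_;
};
```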

jacobperron (Member, Author):

I'm now using a regular bool for shutting_down_, and I fixed some cleanup logic in the event of a SIGINT: eac644d

Comment on lines +164 to +165
std::shared_ptr<rclcpp::Node>,
std::pair<std::shared_ptr<std::thread>, std::shared_ptr<std::condition_variable>>>;
ivanpauno (Member):

I don't mind, but it doesn't seem that you really need to wrap the std::thread and the std::condition_variable in a std::shared_ptr here.

jacobperron (Member, Author):

std::thread and std::condition_variable are not copyable, so I think this is the only way I can store a list of them. Correct me if I'm wrong.

ivanpauno (Member):

> std::thread and std::condition_variable are not copyable, so I think this is the only way I can store a list of them. Correct me if I'm wrong.

You have to use std::unordered_map::emplace() to avoid that issue, e.g.:

waiting_threads_.emplace(
  std::piecewise_construct,
  std::forward_as_tuple(node),
  std::forward_as_tuple(
    std::piecewise_construct, std::forward_as_tuple(invoke_callback_when_qos_ready), std::forward_as_tuple()));

so beautiful 😂

jacobperron (Member, Author):

Though it's perhaps slightly less performant, using shared_ptr seems easier to understand 😅

If we get a SIGINT while querying a topic's QoS, then exit cleanly.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Comment on lines +148 to +156
} catch (const rclcpp::exceptions::RCLError & ex) {
// If the context was shutdown, then exit cleanly
// This can happen if we get a SIGINT
const auto context = node->get_node_options().context();
if (!context->is_valid()) {
return;
}
throw ex;
}
ivanpauno (Member), Mar 23, 2021:

Argh... this doesn't look very nice.

It doesn't matter here, but it would be nice if we could fix this in rclcpp.

jacobperron (Member, Author):

Yeah, I know... I think the only other way around it would be to disable the default SIGINT handler and implement our own, but that wouldn't look pretty either.

src/domain_bridge/wait_for_qos_handler.hpp (outdated; resolved)
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@jacobperron jacobperron merged commit e61b93d into main Mar 24, 2021
@delete-merged-branch delete-merged-branch bot deleted the match_qos branch March 24, 2021 16:14