
add option for controlling the blocking behavior of publish #255

Open
wjwwood opened this issue Jul 6, 2016 · 7 comments
Labels: enhancement

@wjwwood (Member) commented Jul 6, 2016

Currently, the blocking behavior of rcl_publish is not well defined, see:

https://github.com/ros2/rcl/blob/494cfc5f2b119a40b087a09f747423d17f039b5f/rcl/include/rcl/publisher.h#L173-L175

This was also brought up in the large message publishing pull request: ros2/rmw_connext#183

It would be good to define how publish should behave in different situations. The available options are something like:

  • undefined: it may or may not block; which one is unspecified and implementation specific
  • blocking (synchronous), it blocks until some predicate is met
  • non-blocking (asynchronous), it stores the message to be published and returns immediately
    • actual delivery of the message is done in a different thread
  • blocking, but with a timeout

The blocking predicate could be one of "until the data is sent", "until the data is received / acknowledged", or something else.

We can choose to either define the behavior or leave it undefined. Leaving it undefined would basically mean "check the implementation", while defining it would potentially make it more difficult for some implementations to adhere to the interface. I'm personally in favor of specifying the behavior and requiring the rmw implementation to ensure the publish function behaves as expected.

Separately we can decide whether or not to allow control of the behavior. As I see it we could do one of these:

  • do not allow choosing between sync/async publishing
    • this is basically always using an "auto" mode for selecting the publishing mode
    • it would give the implementation a lot of flexibility
  • allow the user to select sync, async, or auto
    • this would force the rmw implementation to support multiple modes
    • this would be burdensome, for example, for implementations that only have sync publishing
  • always use async publishing
    • also hard for implementations that only have sync publishing (they would need to implement queueing and have a publishing thread)
  • always use sync publishing
    • this is the easiest, assuming the async implementations have a "wait for publish to finish" method (Connext does, not sure about Fast-RTPS)

My proposal would be to expose a publish blocking behavior option that is either "sync", "async", or "auto". Also, when using "auto", have a function to detect whether it will be "sync" or "async" and maybe a compile time check (static_assert) for the default auto behavior. Depending on how it interacts with message size the compile time check may not be possible, i.e. perhaps it only works with bounded message types.
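
For concreteness, here is a minimal sketch of what such an option could look like in rcl. rcl_publisher_options_t and rcl_publisher_t are real rcl types, but everything publish-mode related here (the enum, the publish_mode field, the query function) is hypothetical:

```cpp
// Hypothetical sketch only -- none of these publish-mode names exist in rcl.
typedef struct rcl_publisher_t rcl_publisher_t;  // opaque handle, from rcl

typedef enum rcl_publish_mode_t
{
  RCL_PUBLISH_MODE_AUTO,   // implementation picks sync or async (QoS, message type)
  RCL_PUBLISH_MODE_SYNC,   // rcl_publish() blocks until the chosen predicate is met
  RCL_PUBLISH_MODE_ASYNC   // rcl_publish() enqueues the message and returns
} rcl_publish_mode_t;

typedef struct rcl_publisher_options_t
{
  // ... existing options (QoS profile, allocator, etc.) ...
  rcl_publish_mode_t publish_mode;  // proposed new option, default: AUTO
} rcl_publisher_options_t;

// For "auto", a runtime query to detect which mode was actually selected.
rcl_publish_mode_t
rcl_publisher_get_actual_publish_mode(const rcl_publisher_t * publisher);
```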

For implementations that only have sync publishing, they will need to implement a message queue and thread to do the asynchronous publishing. This is burdensome, but I don't see a better way. We could have a rmw_supports_async_publishing() -> bool function, and if the implementation returns false we could have a generic implementation in the client libraries to support this, maybe even in rcl, in order to reduce the burden. But since our current targets both support async publishing this isn't such a problem.
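
A rough sketch of what that generic fallback could look like, assuming it lives above the rmw layer. Everything here is hypothetical: Message and sync_publish() are stand-ins for the rmw implementation's types and blocking publish call, and shutdown handling is omitted for brevity:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Message { /* serialized message payload (stand-in) */ };

bool rmw_supports_async_publishing();        // hypothetical capability query
void sync_publish(const Message & msg) {}    // stand-in for the blocking publish

// Generic async emulation layered above a sync-only rmw implementation:
// publish() enqueues and returns immediately; a worker thread drains the
// queue and performs the actual (blocking) publishes.
class AsyncPublishFallback {
public:
  AsyncPublishFallback() : worker_([this] { run(); }) {}
  void publish(Message msg) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(msg));
    cv_.notify_one();
  }
private:
  void run() {
    for (;;) {  // shutdown handling omitted for brevity
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return !queue_.empty(); });
      Message msg = std::move(queue_.front());
      queue_.pop();
      lock.unlock();
      sync_publish(msg);  // the only call into the sync-only middleware
    }
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<Message> queue_;
  std::thread worker_;
};
```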

For implementations that only support async publishing (or require it for large messages), they would need to emulate synchronous publishing. I don't think that would be too hard, especially if, like Connext, they have a function to wait on the publishing to be complete. Otherwise some custom synchronization may be necessary.

For ideas on how the heuristic that controls whether "auto" selects "sync" or "async" could operate see: ros2/rmw_connext#190

I'd recommend also that "auto" be the default for this proposed option. That allows the implementation to select whichever one is most efficient based on the design of the implementation, the QoS settings, and the message type being published. However, it might also be prudent to select either sync or async as a default for consistency in our interface across implementations.


Some points that should be decided before implementing this:

  • should there be a heuristic based "auto" option, or just "sync" / "async"?
  • should the default be "auto", "sync", or "async"?
  • should the option be allowed to change after creating the publisher?
    • either seamlessly or by destroying and recreating the publisher automatically
  • what should the predicate for unblocking a sync publish be?
    • "data is sent", "data is acknowledged", or an option to select between them?
@gbiggs (Member) commented Jul 6, 2016

Whether or not to block on publish is something that a bunch of people here and at Honda spent quite a bit of time thinking about a couple of years ago. In the end, it was made configurable by the system integrator only - the component developer doesn't get a say.

I think I agree with you that having sync, async and auto options is nice. This gives node developers the most flexibility to decide how to relate their processing to publishing without that behaviour being altered in odd ways at system integration time. The separate QoS for the actual communication provides the necessary hooks for the system integrator to do their work already. If a node specifies whether or not it will wait for a publish to complete, then the system integrator will know the impact of their chosen QoS settings on the publisher as well as the subscriber.

However, this approach would introduce several complications. For example, if the publish is done asynchronously but the QoS settings mean that the data cannot be written immediately, and then a time out occurs, how do you communicate to the publishing node that the data never made it to its destination? If rcl is storing the message internally and waiting for DDS to be ready to write it, does that mean that rcl will then need its own set of QoS properties to control things like internal time outs?

I think this problem alone is probably why the DDS specification says that write() should block if there is a risk of data loss on a RELIABLE channel, and I think that making rcl_publish() have clear and easy to understand behaviour may require that rcl uses a similar approach, despite the benefits of allowing the node developer to choose amongst three options.

Assuming that the three-options approach is taken:

For auto, the heuristic must be easy to understand (the proposed one is about right, I think). It also needs to be determinable at compile time what the result of auto will be. Otherwise it will be too difficult to produce deterministic software using the auto option, potentially forcing the developer to use a lower performance option to gain deterministic behaviour.

I don't think it is too much to ask that implementations using a middleware with only sync or async publishing natively provide the necessary support for the opposite type themselves. rcl is the higher-level entity; it should provide the ideal APIs for the developer and rely on the lower layers to implement them (or internally provide a default, as you suggest), rather than only providing the lowest common denominator.

Making default behaviour "auto" is, I think, not so great an idea. The API should present consistent and easy-to-predict behaviour by default. Having to learn the heuristic and then apply it every time you write publish is too much. "async" seems like a sensible default to me.

Have you considered whether these three options will be parameters, or whether they will be separate functions in the API? Making them separate functions would make the expected behaviour more clear when reading code.
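
For concreteness, the two shapes being compared might look like this; rcl_publish() matches the existing rcl signature, while the _sync/_async variants are hypothetical:

```cpp
// Option A: one function, with the mode carried by the publisher's options
// (a QoS-style parameter fixed at publisher creation time).
rcl_ret_t
rcl_publish(const rcl_publisher_t * publisher, const void * ros_message);

// Option B: separate entry points, making the expected blocking behaviour
// explicit at every call site. Hypothetical names.
rcl_ret_t
rcl_publish_sync(const rcl_publisher_t * publisher, const void * ros_message);
rcl_ret_t
rcl_publish_async(const rcl_publisher_t * publisher, const void * ros_message);
```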

@dirk-thomas (Member) commented Jul 6, 2016

We are designing an asynchronous publish / subscribe system. So I don't think a blocking behavior until the message has arrived (at all subscribers, including the ones which only receive the data and never acknowledge it) is a reasonable expectation. Therefore "sync" should only imply that the rmw implementation has received the passed message and will do something with it. But it doesn't imply anything about the arrival of the message. And imo it also doesn't necessarily imply whether the message has been written to the network buffer or is still somewhere in a local buffer / queue of the rmw implementation.

If, in order to match the requested QoS settings, the rmw call blocks for longer (e.g. to ensure that a message (or a previous one) is not lost), that is reasonable. DDS chose this behavior for a good reason. Otherwise the implementation would potentially need to do a lot more work and also imply additional queuing, threading, allocation, etc.

Requiring rmw implementations to mimic (a)sync publishing if they work the other way around is a significant burden which I don't think is reasonable. It would make the system significantly more complex, with e.g. yet another layer of queuing, which implies another location of allocation, threading, etc.

Regarding the suggested "blocking, but with a timeout": I don't think this goal can be achieved with reasonably consistent behavior. What if it times out when only some of the subscribers have received a message? Are the others then aborted? What if one message is still "in-flight" but received after the timeout? I also don't think it can be implemented with all rmw APIs without additional queueing / threading before calling into the vendor-specific API.

From an application point of view for the case where you do want an acknowledgement that a message has been received you can use a service since it provides exactly those guarantees. And for the reasons stated above a service is usually also a 1-to-1 communication pattern.

@wjwwood (Member, Author) commented Sep 8, 2016

Sorry for the long delay in my response. I've had this half written for a while but I didn't have the time to finish it and post it.


After reading both of your comments and thinking more, I decided my understanding/description of how the "asynchronous publishing mode" of Connext and Fast-RTPS must work was insufficient. I think it's instructive to read about the cases in which write may block on the Connext documentation page:

http://community.rti.com/rti-doc/510/ndds/doc/html/api_cpp/structFooDataWriter.html#abb3770f202340bc819368463987eb055

Just to summarize, DataWriter::write() may block if:

  • QoS is "reliable + keep all" and writing would result in:
    • loss of data (history is full) or
    • a resource limit being exceeded
  • QoS is "best effort + keep all" and the "asynchronous publishing mode" is being used but is behind (a slow flow controller, for example)

The "best effort + keep all" blocking case was confusing at first, but basically it means "it will block if it means the data will never be sent". Best effort, in this case, means no repeated tries, but it does guarantee that the data is attempted to be delivered once.


@gbiggs I'll just try to answer a few of your questions in line:

if the publish is done asynchronously but the QoS settings mean that the data cannot be written immediately, and then a timeout occurs, how do you communicate to the publishing node that the data never made it to its destination? If rcl is storing the message internally and waiting for DDS to be ready to write it, does that mean that rcl will then need its own set of QoS properties to control things like internal time outs?

I believe it's always the case that if either the history cache is full, or the resource limits have been exhausted, and the QoS is "reliable + keep all", then DataWriter::write() will block, even if the asynchronous publishing mode is enabled.

A QoS of "best effort + keep all" would normally not block as the receivers should not be able to stall the consumption of the outgoing data as is the case with "reliable + keep all". However, it can block when you're using asynchronous publishing if the flow controller (which controls how often queued data is written to the network) is setup to write data out slower than it is being published. In that way, the "keep all" part of "best effort + keep all" guarantees that data is written to the network (even if that means blocking behavior), which is sort of supplemental to what "best effort" enforces, which is that it does not guarantee that it is received by the other side before continuing to send data, nor will it resend missed data.

If rcl is storing the message internally and waiting for DDS to be ready to write it, does that mean that rcl will then need its own set of QoS properties to control things like internal time outs?

After more reflection, I think having rcl do its own queueing would be a bad idea. We've explicitly avoided it until now and that's been really nice. This is a fine conclusion to come to because the rmw implementations we're looking at already support async, so we would not be in the position of needing to implement an async interface on top of a sync interface. I still think that a custom queue (with its own timeout settings) would be necessary to do so, but DDS implementations with an async publishing mode already have this (an extra async queue, with timeout set in the reliability QoS settings).

Making default behaviour "auto" is, I think, not so great an idea. The API should present consistent and easy-to-predict behaviour by default. Having to learn the heuristic and then apply it every time you write publish is too much. "async" seems like a sensible default to me.

I'm torn on this, because I agree that having consistent, simple-to-predict behavior is really nice, but the whole point of "auto" is to let the system choose the most performant option given the settings and message type. If the default is "async" and the user has to know to set "auto" to get more performant behavior, then they can probably just make the judgement themselves and set it to "sync", bypassing "auto". The downside to "async" as a default is that it's always the most resource-expensive way to send the data.

Have you considered whether these three options will be parameters, or whether they will be separate functions in the API? Making them separate functions would make the expected behaviour more clear when reading code.

I haven't.


@dirk-thomas I'll respond to your comments inline as well:

We are designing an asynchronous publish / subscribe system. So I don't think a blocking behavior until the message has arrived (at all subscribers, including the ones which are only receiving the data and never acknowledge them) is a reasonable expectation.

So I think I was wrong about my categorization of the possibilities. I think the right breakdown of predicates for synchronous blocking of our publish call would be:

  • "until there is room in the history"
    • was previously described as "until the data is received / acknowledged"
  • "until the data is sent"

The first predicate "until there is room in the history" is, in my opinion, a description of DDS's "reliable" reliability QoS setting paired with the "keep all" history QoS setting. It sends the message and waits for responses from everyone before removing it from history. So, if this is taking a long time, then the history queue may fill up waiting on this to happen, at which point publish would block as anything else would result in "loss of data".

So my incorrectly phrased "until the data is received / acknowledged" could be rewritten as "until the previous message's data is received / acknowledged" iff the history size is 1. This is where my confusion came in, I think. I suppose you could do something like publish(msg); wait_for_history_to_be_empty() to achieve the behavior I described, but it doesn't seem like a good idea. You're better off using a service at that point.
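
With Connext, that pattern maps onto DataWriter::wait_for_acknowledgments(). A sketch (Foo is a placeholder type and `writer` a FooDataWriter* on a reliable topic), shown mostly to illustrate why a service is the better tool:

```cpp
// Sketch: emulate "block until this message is acknowledged" on Connext.
Foo sample;  // placeholder message type
DDS_ReturnCode_t retcode = writer->write(sample, DDS_HANDLE_NIL);
if (retcode == DDS_RETCODE_OK) {
  DDS_Duration_t max_wait = {5, 0};  // 5 seconds, 0 nanoseconds
  // Blocks until every sample in the history has been acknowledged by all
  // reliable readers, or returns DDS_RETCODE_TIMEOUT after max_wait.
  retcode = writer->wait_for_acknowledgments(max_wait);
}
```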

For the "best effort + keep all" QoS, I believe the "until the data is sent" predicate applies. If implemented in a synchronous mode (default for DDS) and using "best effort + keep all" QoS, then I'd expect the DataWriter::write() call might block until the "write" system call on the UDP socket returns for all datagrams, though I'm not certain of that. Even in "asynchronous publishing mode" DataWriter::write may block if the "asynchronous queue" fills up. This is a corollary of "until there is room in the history" for best effort, so you could reword this predicate as "until there is room in the asynchronous publishing queue". But we're talking about synchronous publishing for these predicates, so I think "until the data is sent" or "until the data will be sent" are still accurate descriptions. I believe the Connext docs say that this asynchronous publishing queue's size is set by the history length.

Therefore "sync" should only imply that the rmw implementation has gotten the passed message and will do something with it. But it doesn't imply anything about the arrival of the message. And imo it also doesn't necessarily imply if the message has been written to the network buffer or is still somewhere in a local buffer / queue of the rmw impl.

I think this is right. In the case of "sync + best effort + keep all" I think it could mean data has been written to the network buffer, but that's not really an important detail I think since that's not particularly useful information at that point. It is sufficient to say that it has been received and will be written at some point.

If in order to match the requested QoS settings the rmw call blocks for longer (e.g. to ensure that message (or a previous one) is not lost) that is reasonable. DDS chose this behavior for a good reason. If not the implementation would potentially need to do a lot more work and also imply additional queuing, threading, allocation etc.

DDS does this by default, but the ones we're looking at supporting also have an "asynchronous publishing mode", which does have additional queues and threading, as you suggested.

Requiring rmw impl. to mimic (a)sync publishing if they work the other way around is a significant burden which I don't think is reasonable. It will make the system significantly more complex with e.g. yet another queuing happening which implies another location of allocation, threading etc.

I agree that a completely synchronous system would be burdened to have to provide an async interface. However, ROS 1 provides an async publish interface and the two vendors we're looking at both have an async option and even require it to be used in some cases. So I think there are compelling reasons to have async publishing behavior defined in our API, either as an option alongside "sync" or as the only option.

Regarding the suggested "blocking, but with a timeout": I don't think this goal can be achieved with reasonably consistent behavior. What if it times out when only some of the subscribers have received a message? Are the others then aborted? What if one message is still "in-flight" but received after the timeout? I also don't think it can be implemented with all rmw APIs without additional queueing / threading before calling into the vendor-specific API.

The vendor specific API (in this case DDS) already has support for this. In fact it is impossible to block without a timeout being considered AFAIK. See max_blocking_time:

http://community.rti.com/rti-doc/510/ndds/doc/html/api_cpp/structDDS__ReliabilityQosPolicy.html#afb7c00e8e7734647152883fea621f7e9

If we have a non-DDS implementation then this may be a problem, but I don't believe it to be likely as most calls that would block can block with timeout (whether that be pushing to a queue or writing to a socket or whatever).

If the timeout occurs when calling publish, then the person calling publish knows the message they tried to publish was not sent nor added to the history and they can either try again or ignore that message or something else. In this case the timeout expiring when using "reliable" would not cause a nacked message (which was previously sent) to not be resent, but rather would prevent a new message from being inserted into the history cache.
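
Handling that at the call site is straightforward (same hypothetical writer and sample as in the earlier sketches); the key property is that a timed-out write() means the new sample never entered the history:

```cpp
// Sketch: reacting to a blocking write that hit max_blocking_time.
DDS_ReturnCode_t retcode = writer->write(sample, DDS_HANDLE_NIL);
if (retcode == DDS_RETCODE_TIMEOUT) {
  // The history was still full when max_blocking_time elapsed: the message
  // was neither sent nor queued. The caller can retry, drop it, or escalate.
} else if (retcode != DDS_RETCODE_OK) {
  // Some other error (bad parameter, out of resources, ...).
}
```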

@gbiggs (Member) commented Sep 12, 2016

I'm really feeling the limits of GitHub's comment thread management now.

The "best effort + keep all" blocking case was confusing at first, but basically it means "it will block if it means the data will never be sent". Best effort, in this case, means no repeated tries, but it does guarantee that the data is attempted to be delivered once.

To me, the way DataWriter::write() works sounds like reasonable behaviour, and giving rcl_publish identical behaviour sounds like a good approach.

After more reflection, I think having rcl do its own queueing would be a bad idea.

I agree. If a particular rmw implementation needs to provide queueing because the underlying transport does not, then that can be done within the rmw implementation. This would keep rcl simpler.

I'm torn on this, because I agree that having a consistent, and simple to predict, behavior is really nice, but the whole point of "auto" is to let the system choose the most performant option given the settings and message type. If default is "async" and so the user has to know to set "auto" to get a more performant behavior, then they can probably just do the judgement themselves and set it to "sync", bypassing "auto". The downside to "async" as a default is that always the most resource expensive way to send the data.

In that case, I think that probably following DDS is a good way to go. Make rcl_publish have the same behaviour as DataWriter::write(). I feel quite strongly that relying on externally-configurable QoS settings to control the behaviour of data transport is an important feature for system integrators. The less is hard-coded into the node, the better.


Regarding "blocking with a timeout" potentially leading to some nodes receiving the data and some not: I think that if you are using QoS settings that allow this case, then your application should be designed to be robust against it. Otherwise, if your application cannot handle some nodes not getting a piece of data, you should be using QoS settings that guarantee delivery of data.


It seems to me that the difficulty being dealt with here stems from the interaction between rcl_publish's behaviour and whatever QoS settings may be in effect. It is probably worth enumerating the various combinations of options and how rcl_publish will behave under each. Do we have a current list of the available QoS parameters?

@wjwwood (Member, Author) commented Sep 22, 2016

I feel quite strongly that relying on externally-configurable QoS settings to control the behaviour of data transport is an important feature for system integrators. The less is hard-coded into the node, the better.

I'm actually not convinced by this line of thinking. I do not think it is wise to allow an external integrator to change the blocking behavior of a function call used by existing code. If the blocking behavior of publish is to change, then the code calling it needs to either be designed to handle both cases or must change as well. To this end, I think the blocking behavior must be "hard coded" into the program and if you want the flexibility of changing the behavior then the program should expose an option to change that behavior, where it can change the settings on publish and then has an opportunity to change how it calls that code.

If you consider the matrix of blocking behaviors as programmed versus as configured externally:

programmed \ configured | blocking | non-blocking
blocking                | x        | 1
non-blocking            | 2        | x

Case 1, where it was programmed as blocking but configured to be non-blocking, is the least problematic scenario. It wouldn't be a problem with throughput or threading, but it could inadvertently remove a desired flow control mechanism and cause the program to use much more resources than intended. Even though the consequences are mild, I still think allowing the program to decide how to switch from keep all to keep last is better than changing it externally and hoping it works correctly.

Case 2, where it was programmed as non-blocking but configured to be blocking, could be very problematic. As I assume you can imagine, changing the behavior of a function from non-blocking to blocking could cause pipelines that normally run very fast to get stalled, with adverse effects on the program. Had the developer thought publish could block, they may have called it from a thread rather than from the main thread or from a callback of another reactor system, like Qt or libusb.

I suppose we could say "always assume publish will block", but that keeps people from making use of the non-blocking behavior when keep last is used. In ROS 1 the rule was "publish will never block" (it didn't have a keep all with resource limits option). However, if we allow publish's blocking behavior to be configuration driven, then I maintain it will only work in very carefully crafted programs.

@gbiggs (Member) commented Sep 26, 2016

I can see your point about the impact of allowing external configuration of blocking behaviour. If a node is not explicitly designed for it, then as you say it could make a mess that the system integrator would then have to track down.

I think that implicit in my desire for external configuration of things like this is the ability for a node to allow or disallow certain configurations: if a node can be written to allow some QoS property to be configured, and the developer has written it that way, then someone can use that freedom. But if the node cannot be, or has not been, written in a way that allows configuration of a particular aspect, then it would need to disallow that configuration.

Possibly such a capability is beyond the scope of ROS 2, at least at this point. I'm also not sure how common a use case it would really be, and if it would be worth the effort.

I think that implicit in the design of DDS is the idea that a developer would be writing their code using DDS and selecting the QoS properties themselves. I base this on the fact that the original specification only provides a means for programmatically setting the QoS properties; XML configurations came from the implementers. (It's just as possible that RTI and PrismTech simply couldn't agree on an XML format, of course.)

If we are going to disallow external configuration of blocking behaviour, then a good default is necessary. "Publish will never block" is easy to understand, but I am in favour of more complex behaviour if it is clearly documented and predictable. The set of behaviour @wjwwood has suggested seems sensible to me.

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/rmw-fast-dds-publication-mode-sync-vs-async-and-how-to-change-it/17153/1
