
Store the subscriber, client, service and timer #431

Merged — merged 3 commits on Mar 12, 2018

Conversation

@deng02 (Contributor) commented Jan 16, 2018

objects as shared pointers (#349)

To prevent an object from being deleted while the rcl_wait_set is
using raw pointers to internal members, we store them as shared
pointers.

The subscriptions are stored as a pair because a single
subscription handle can have both an intra and non-intra
process handle being used by the wait set. In order to
remove objects based on null wait set handles, we need both.

@wjwwood (Member) left a comment

Thanks for the pull request, but I think it needs a slight change of approach to work best. I've made some comments inline to that effect.


Whether we use a shared_ptr to a C++ object or to the rcl_* equivalent type, we are still adding the overhead of creating shared_ptrs each time we loop over wait, which will be quite a bit more expensive. There might be a better solution, more complicated but also more performant, which involves preventing the destructor from running until the executor is woken up and in between wait calls. Not doing that correctly will result in deadlocks, and it might be just as bad as the current solution due to lock contention, but we wouldn't know which is better until we tested it in a few different scenarios.

VectorRebind<std::pair<const rcl_subscription_t *, SubscriptionBase::SharedPtr>> sub_pair_ptrs_;
VectorRebind<rclcpp::ServiceBase::SharedPtr> service_ptrs_;
VectorRebind<rclcpp::ClientBase::SharedPtr> client_ptrs_;
VectorRebind<rclcpp::TimerBase::SharedPtr> timer_ptrs_;
Member:

I think these should instead store shared_ptrs to the rcl_* types. Here's my rationale; consider this case (current behavior):

  • user creates pub and sub which talk
  • user spins
  • asynchronously the user resets the sub pointer
  • when a message is received for the deleted subscription:
    • a segfault occurs in rcl_wait() because it uses a "finalized" pointer to an rcl_subscription_t

The existing behavior, described in #349, resulted in segfaults when rcl_wait was still using an rcl_* object, because that object had had its _fini() function called on it from the destructor of one of our C++ classes.

With the implementation in this PR it goes like this:

  • user creates pub and sub which talk
  • user spins
  • asynchronously the user resets the sub pointer
  • when a message is received for the deleted subscription:
    • a callback is called for the subscription the user has already deleted

With the proposal from #349, i.e. using shared_ptr to the rcl_* types, it would go like this (I think):

  • user creates pub and sub which talk
  • user spins
  • asynchronously the user resets the sub pointer
  • when a message is received for the deleted subscription:
    • the message is "dropped" because the executor's weak_ptr to the subscription fails to materialize into a shared_ptr, since the subscription no longer exists (see the sketch below)
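
A minimal sketch (not the actual rclcpp code) of that third behavior: the executor holds only a weak_ptr to the rcl handle and upgrades it at dispatch time, dropping the message when the upgrade fails.

#include <iostream>
#include <memory>

// Stand-in for the real rcl C type, for this sketch only.
struct rcl_subscription_t {};

void dispatch(const std::weak_ptr<rcl_subscription_t> & weak_sub)
{
  if (auto sub = weak_sub.lock()) {
    // Shared ownership held for the duration of the dispatch: the handle
    // cannot be finalized out from under us while we take the message.
    std::cout << "deliver message for handle " << sub.get() << "\n";
  } else {
    // The user already reset the subscription; the handle is gone, so the
    // message is dropped instead of dereferencing a finalized pointer.
    std::cout << "message dropped\n";
  }
}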

@wjwwood (Member) commented Jan 16, 2018

It would also be great to have an automated test for this case, so we don't regress on this point. Maybe someone on our team has time to do that if @deng02 doesn't.

@deng02 (Contributor, Author) commented Jan 17, 2018

@wjwwood Thanks very much for the feedback. I just want to confirm my understanding of the changes you want.

The PR in its current state artificially delays the destruction of the ROS2 subscriber to ensure that the 'finalized' pointer doesn't become invalidated until we're done with it, which somewhat violates the idea that the user is the owner and is managing the lifecycle of that subscriber.

Instead, if we replace the raw pointers with weak_ptrs, we can safely dereference within rcl_wait when the user resets a sub pointer, while preserving user-driven ownership.

The one thing I want to clarify is about having the SubscriptionBase::get_handle() return a shared_ptr. Isn't SubscriptionBase essentially the owner of the subscription_handle_ and all others are just observers, in which case the member itself would be of type std::unique_ptr<rcl_subscription_t> and get_handle() would return std::weak_ptr<rcl_subscription_t>? I'm not sure I see why we need shared_ptr at all. The rcl_wait_set_t would need to be modified to use the weak_ptr so that when rcl_wait tries to access, it can lock to determine if the pointer is still valid or not.

@deng02 (Contributor, Author) commented Jan 17, 2018

Also I definitely agree with the need for an automated test around this. I will try to write one if the team does not have time to create one.

@wjwwood (Member) commented Jan 17, 2018

Hmm, I think we're not on the same page, let me try to clarify what I mean, but first let me respond to your latest feedback inline.

The PR in its current state artificially delays the destruction of the ROS2 subscriber to ensure that the 'finalized' pointer doesn't become invalidated until we're done with it, which somewhat violates the idea that the user is the owner and is managing the lifecycle of that subscriber.

Right, except we need to differentiate between the C++ class and the C rcl handle for the resource, e.g. rclcpp::Subscription versus rcl_subscription_t. Due to intra-process communication there isn't even a 1-to-1 mapping, e.g. a rclcpp::Subscription typically has shared ownership of two different rcl_subscription_t instances.

But you're right to summarize the issue as violating the contract with the user about the lifecycle of the rclcpp::Subscription.

The one thing I want to clarify is about having the SubscriptionBase::get_handle() return a shared_ptr. Isn't SubscriptionBase essentially the owner of the subscription_handle_ and all others are just observers, in which case the member itself would be of type std::unique_ptr<rcl_subscription_t> and get_handle() would return std::weak_ptr<rcl_subscription_t>?

I would say no; I would word it as: SubscriptionBase starts with unique ownership of the subscription_handle_ but may share that ownership with callers of get_handle(), in which case I think it does need to return a shared_ptr and not a weak_ptr (though the caller could always upgrade a weak_ptr to a shared_ptr).

I'm not sure I see why we need shared_ptr at all. The rcl_wait_set_t would need to be modified to use the weak_ptr so that when rcl_wait tries to access, it can lock to determine if the pointer is still valid or not.

Well, rcl_wait_set_t cannot use weak_ptr or shared_ptr since it's a C API. But the problem is that the executor has to lock the handle to put it into the wait set and pass it to the DDS wait at the bottom. So while waiting it needs to hold a lock on it, and while it holds that lock you cannot call fini on it from the destructor of the C++ class.
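
A hedged sketch of the ownership split being described here (member and method names follow the discussion above, not necessarily the final code):

#include <memory>

// Stand-in for the real rcl C type, for this sketch only.
struct rcl_subscription_t {};

class SubscriptionBase
{
public:
  SubscriptionBase()
  : subscription_handle_(std::make_shared<rcl_subscription_t>())
  {}

  // The class starts with effectively unique ownership, but callers such as
  // the executor's memory strategy gain shared ownership through get_handle(),
  // so the rcl handle can outlive ~SubscriptionBase() while a wait is pending.
  std::shared_ptr<rcl_subscription_t> get_handle() const
  {
    return subscription_handle_;
  }

private:
  std::shared_ptr<rcl_subscription_t> subscription_handle_;
};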

So basically I'm proposing that we decouple the C++ class and the rcl type, such that the user can only control when the C++ class is destroyed. So basically the sequence would go like this:

  • [thread 1] user creates subscription and adds the node to an executor
  • [thread 1] user spins executor
    • each time it enters Executor::wait_for_work() (and calls AllocatorMemoryStrategy::collect_entities()) the executor will gain shared ownership of the rcl_*_t's it is waiting on (but not the rclcpp::* classes as this pr does)
  • [thread 2] user deletes subscription
    • after ~Subscription() returns the rcl_subscription_t will still exist (due to shared ownership with the executor) and therefore it will still appear to exist on the ROS graph, however data received for it will not be delivered
  • [thread 1] at some point (for any reason) the executor loops and lets go of its shared ownership of the rcl_subscription_t, at which point it is passed to rcl_subscription_fini() in a custom destructor for the shared_ptr<rcl_subscription_t>

So the documentation for rclcpp::Subscription<T> (specifically rclcpp::Subscription<T>::~Subscription()) would say something like "on deletion, the subscription is scheduled for removal from the ROS graph and will actually be removed at some undetermined point in the future".

This is obviously not ideal, as the user would like to have some more determinism in how the shutdown of the subscription works. However, this is the simplest solution for the errors/segfaults (and what is implicitly happening right now).
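
A minimal sketch of the custom-destructor idea from the last bullet above, assuming the usual rcl C API (rcl_get_zero_initialized_subscription(), rcl_subscription_fini()); the helper name is illustrative, and init/error handling are omitted:

#include <memory>

#include "rcl/subscription.h"

std::shared_ptr<rcl_subscription_t>
make_shared_subscription_handle(std::shared_ptr<rcl_node_t> node_handle)
{
  auto * sub = new rcl_subscription_t(rcl_get_zero_initialized_subscription());
  // ... rcl_subscription_init(sub, node_handle.get(), ...) would go here ...
  return std::shared_ptr<rcl_subscription_t>(
    sub,
    [node_handle](rcl_subscription_t * s) {
      // Runs only when the *last* owner lets go, whether that is the
      // rclcpp::Subscription destructor or the executor releasing the handles
      // it collected for the previous wait. Capturing node_handle by value
      // also keeps the node alive until then. (Return code ignored here.)
      rcl_subscription_fini(s, node_handle.get());
      delete s;
    });
}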


A separate, but more complete (and more complicated), solution is to make it so that the destructors of our C++ classes force the executor(s) to give up shared ownership of the rcl_*_t types before calling the equivalent rcl_*_fini() function on them and returning. For example, here's a similar sequence of events as above:

  • [thread 1] user creates subscription and adds the node to an executor
  • [thread 1] user spins executor
    • each time it enters Executor::wait_for_work() (and calls AllocatorMemoryStrategy::collect_entities()) the executor will gain shared ownership of the rcl_*_t's it is waiting on (it is still necessary to do this with the rcl_*_t rather than the rclcpp::* class, otherwise the user cannot reliably trigger the destructor)
  • [thread 1 or 2] user deletes subscription
    • if in thread 1, that would be from within another callback
      • special consideration would need to be given to deleting a subscription from within its own callback
    • within ~Subscription() it would try to acquire unique ownership of the rcl_*_t by interrupting the executor and preventing it from entering Executor::wait_for_work() again
    • at this point, the executor is guaranteed to not be using the rcl_*_t
    • then rcl_*_fini() would be called and then the executor would be released so it could continue
    • when the executor continues and it calls get_handle() on this subscription (assuming it still gets the rclcpp::Subscription from the node or callback group) it will get nullptr back and will not add anything to the rcl_wait_set_t for this subscription (even if it continues before this destructor returns)

This would change the documentation of the classes, e.g. rclcpp::Subscription(), to be something more like "after destruction the subscription will have been removed from the ROS graph" which is much more deterministic and easier to understand for the user.

However, it is not a trivial thing to do (in my opinion), since you need to avoid deadlocks and starvation (usually using a pair of mutexes, of which one is a "barrier" mutex, e.g. I use these in the rclcpp::GraphListener:

mutable std::mutex node_graph_interfaces_barrier_mutex_;
mutable std::mutex node_graph_interfaces_mutex_;
). It's possible, but tricky code. However, it might be the right solution long term. One downside is that the single threaded executors basically don't need all of the sophisticated checks, only the multi-threaded executors need everything, so it would be best if the single threaded executors could avoid this more sophisticated logic somehow (they still need to interrupt the executor, but they don't need the extra mutex to ensure they get the next lock since there's only one thread in the executor).
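
A hedged sketch of that two-mutex "barrier" pattern (loosely modeled on GraphListener; interrupt_wait() and keep_running() are hypothetical placeholders):

#include <mutex>

std::mutex barrier_mutex_;
std::mutex data_mutex_;

void interrupt_wait() {}            // placeholder: would trigger a guard condition to wake rcl_wait()
bool keep_running() {return true;}  // placeholder shutdown check

// e.g. ~Subscription() needing exclusive access before calling rcl_*_fini().
void destroy_entity()
{
  // Queue behind other writers, then wake the waiting thread so that it
  // releases data_mutex_ at the end of its current iteration.
  std::lock_guard<std::mutex> barrier(barrier_mutex_);
  interrupt_wait();
  std::lock_guard<std::mutex> lock(data_mutex_);
  // ... safe to call rcl_*_fini() here: the executor is not using the handle ...
}

// e.g. the executor thread repeatedly building a wait set and blocking in rcl_wait().
void wait_loop()
{
  while (keep_running()) {
    {
      // Passing through the barrier first prevents this loop from starving
      // destroy_entity(): a queued writer always gets data_mutex_ next.
      std::lock_guard<std::mutex> barrier(barrier_mutex_);
      data_mutex_.lock();
    }
    std::unique_lock<std::mutex> lock(data_mutex_, std::adopt_lock);
    // ... collect handles, wait on them, and dispatch work while holding lock ...
  }
}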

Hope that clears up what I mean.

@deng02 (Contributor, Author) commented Jan 17, 2018

@wjwwood Yes that clears things up, thanks. I'll move forwards with the changes as described above.

@dirk-thomas dirk-thomas added the in progress Actively being worked on (Kanban column) label Jan 18, 2018
@guillaumeautran (Contributor) commented:

@wjwwood I've re-worked the patch to convert the rcl_*_t to std::shared_ptr. Please take a look and let me know if that works.

Note: I've tested this change quite a bit with topic subscriptions and also used valgrind to track memory leaks. What is the best way for me to test the other components?

@wjwwood (Member) commented Feb 7, 2018

@guillaumeautran awesome, I'll have a look at it as soon as I can, though I can't promise an immediate turnaround right now.

@guillaumeautran (Contributor) commented:

Ping @wjwwood :)

@wjwwood (Member) commented Feb 16, 2018

Sorry @guillaumeautran, I'll get to it as soon as I can, I just have a lot of other stuff on my plate at the moment.

@wjwwood wjwwood self-requested a review February 16, 2018 20:05
@wjwwood wjwwood added enhancement New feature or request in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Feb 16, 2018
@wjwwood (Member) left a comment

The substance of the change lgtm, but I had a few stylistic comments.

@@ -38,7 +40,21 @@ ClientBase::ClientBase(
: node_graph_(node_graph),
node_handle_(node_base->get_shared_rcl_node_handle()),
service_name_(service_name)
{}
{

Member:

nitpick: remove unnecessary leading blank line

@@ -33,6 +35,7 @@
#include "rmw/error_handling.h"
#include "rmw/rmw.h"


Member:

nitpick: minimize vertical whitespace and also avoid pure whitespace changes in PRs

{
if (rcl_service_fini(service, node_handle_.get()) != RCL_RET_OK) {
RCUTILS_LOG_ERROR_NAMED(
"RCLCPP",
Member:

I think the logger names ought to be lowercase.

Member:

Also, this might be a good candidate for using a rclcpp sub logger of the node's logger so that it appears as my_node.rclcpp rather than just rclcpp. Imagine this error comes out on a process with several nodes in it. Having the node name as context would help you narrow it down.
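
A hedged sketch of the sub-logger idea, assuming the node's logger exposes get_child() and that the service keeps a handle to its node (function name and message are illustrative); the logger name would then come out as "<node_name>.rclcpp":

#include "rcl/error_handling.h"
#include "rclcpp/logging.hpp"
#include "rclcpp/node.hpp"

void log_fini_error_with_node_context(rclcpp::Node & node)
{
  // "my_node.rclcpp" instead of a bare "rclcpp", so the error can be traced
  // to a particular node when several nodes live in one process.
  RCLCPP_ERROR(
    node.get_logger().get_child("rclcpp"),
    "Error in destruction of rcl service handle: %s",
    rcl_get_error_string_safe());
}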

Member:

Here and elsewhere in the PR.

Contributor:

Ok, I've fixed the casing for RCLCPP => rclcpp.

Regarding the node sublogger, shouldn't this be an rcutils concern (as opposed to encoding the node name when writing the log)?

Member:

Well, not if you use the RCLCPP_* logging macros; then it's rclcpp's issue. Also, how would rcutils know anything about node subloggers? It seems pretty obvious to me that the name needs to be injected by the user.

Contributor:

I've done it for this message only. But the code looks ugly to me. I wonder if there is a smarter way to simplify that...

Member:

If the service had a handle to the C++ node, then it could call rclcpp::Node::get_logger(), which would be slightly cleaner looking, but it does exactly the same thing. I'll try to re-review this asap.

// check if service handle was initialized
// TODO(karsten1987): Take this verification
// directly in rcl_*_t
// see: https://github.com/ros2/rcl/issues/81
Member:

Looks like ros2/rcl#81 is closed, can you try replacing this TODO and code snippet with rcl_service_is_valid()? http://docs.ros2.org/ardent/api/rcl/service_8h.html#ae3f8159e4c6c43f9f2cc10dd4c5f5f3f
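
A hedged sketch of what that replacement might look like (names are illustrative; the rcl_service_is_valid() signature is taken from the current rcl API and may have differed in the rcl release of the day):

#include <memory>

#include "rcl/service.h"

bool service_handle_is_usable(const std::shared_ptr<rcl_service_t> & service_handle)
{
  // rcl_service_is_valid() covers both the "never initialized" and the
  // "already finalized" cases, replacing the manual check behind the TODO.
  return service_handle && rcl_service_is_valid(service_handle.get());
}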

Contributor:

done.

}


Member:

nitpick: minimize vertical whitespace (for reference, this is part of the Google Style guide which our style is based on, see: https://google.github.io/styleguide/cppguide.html#Vertical_Whitespace)


auto custom_deletor = [=](rcl_subscription_t *rcl_subs)
{
std::cout << "Deleting subscription handle\n";
Member:

Looks like vestigial debug output.

Contributor:

oops...

@@ -13,6 +13,7 @@
// limitations under the License.

#include "rclcpp/timer.hpp"
#include "rcutils/logging_macros.h"
Member:

This should be included after the system headers.

@@ -21,11 +22,29 @@ using rclcpp::TimerBase;

TimerBase::TimerBase(std::chrono::nanoseconds period)
{

Member:

nitpick: unnecessary vertical whitespace

std::string("Timer could not get time until next call: ") +
rcl_get_error_string_safe());
std::string("Timer could not get time until next call: ") +
rcl_get_error_string_safe());
Member:

Did you run ament_uncrustify or the test suite on this? This is a known oddity of how uncrustify makes us format the code. I would actually expect it to fail the tests with this diff.

new rcl_service_t, [=](rcl_service_t *service)
{
if (rcl_service_fini(service, node_handle_.get()) != RCL_RET_OK) {
RCUTILS_LOG_ERROR_NAMED(
Member:

You should use the RCLCPP version of these macros now, see:

http://docs.ros2.org/ardent/api/rclcpp/logging_8hpp.html

This also removes the need for the new rcutils includes (and therefore implicitly addresses several other style comments I made).
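
A hedged sketch of that swap (function name and message are illustrative; the real change lives inside the service handle's custom deleter):

#include "rcl/error_handling.h"
#include "rclcpp/logger.hpp"
#include "rclcpp/logging.hpp"

void log_service_fini_failure()
{
  // RCLCPP_ERROR goes through rclcpp's logging layer directly, so the extra
  // rcutils/logging_macros.h include (and the style nits around it) can go away.
  RCLCPP_ERROR(
    rclcpp::get_logger("rclcpp"),
    "Error in destruction of rcl service handle: %s",
    rcl_get_error_string_safe());
}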

Contributor:

Looking into it now...

Converts all rcl_*_t types in the memory allocation strategy to shared pointers to prevent a crash from happening when a subscriber is reset.

Issue: ros2#349
We always put the { on a new line for function definitions and class declarations.
@wjwwood (Member) left a comment

lgtm

Sorry for the delay in reviewing, I needed a block of time to wrap my head around the whole thing.

I fixed one style thing, which surprisingly uncrustify didn't complain about; I'll have to look into that later.

@wjwwood (Member) commented Mar 12, 2018

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status (failure unrelated)
  • Windows Build Status

@wjwwood (Member) commented Mar 12, 2018

Looks good on CI too, thanks for your contribution and patience @guillaumeautran and @deng02!

@wjwwood wjwwood merged commit 3652646 into ros2:master Mar 12, 2018
@wjwwood wjwwood removed the in review Waiting for review (Kanban column) label Mar 12, 2018
@wjwwood (Member) commented Mar 12, 2018

@guillaumeautran if you have time to contribute a test for this that would be great too. Realistically, I'll not have time to do it, nor will anyone on our team most likely.

wjwwood added three commits referencing this pull request on Mar 14, 2018, and one on Mar 20, 2018
* Revert "Revert "Store the subscriber, client, service and timer (#431)" (#448)"

This reverts commit 168d75c.

* Convert all rcl_*_t types to shared pointers

Converts all rcl_*_t types in the memory allocation strategy to shared pointers to prevent a crash from happening when a subscriber is reset.

Issue: #349

* fixups