Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asynchronously shutting down while using rclcpp::spin rclcpp:spin_some in another thread throws an exception #1139

Open
ivanpauno opened this issue May 27, 2020 · 36 comments
Labels

Comments

@ivanpauno
Copy link
Member

This nightly failure is an example of the problem:

What is happening is that one thread is creating an executor, while other is trying to shutdown.
If shutdown is called before the creation of the Executor, an exception is thrown:

void TearDown() override
{
rclcpp::shutdown();
spinner_.join();
}
void spin()
{
while (rclcpp::ok()) {
rclcpp::spin_some(lifecycle_node_->get_node_base_interface());
rclcpp::spin_some(lifecycle_client_);
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
}

@ivanpauno
Copy link
Member Author

@hidmic @wjwwood @jacobperron does this look like an user error (i.e. the test is not well written) or should be improve how shutdown works?

@hidmic
Copy link
Contributor

hidmic commented May 27, 2020

I'd say user error.

@jacobperron
Copy link
Member

Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?

@ivanpauno
Copy link
Member Author

Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?

The thing is that spin_some is called in a loop, "protected" in a rclcpp::ok check.
If the context is shutdown in the middle of the call to rclcpp::ok and the construction of the executor in rclcpp::spin_some, then the error happens.

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

IMO shutdown shouldn't affect executors/publishers/subscriptions/nodes/... etc, it should only make rclcpp::ok() return false and wake up executors.

In that way, we could shutdown asynchronously without this error (or other bunch of possible of similar errors).

@wjwwood
Copy link
Member

wjwwood commented May 27, 2020

I also would classify it as "user error", but not necessarily the user's fault. It's confusing and so the existing story could be improved.

This has been discussed in the past:

And perhaps others, though I couldn't find them. I know @sloretz and I spoke about it in person at length around the time ros2/rmw#154 was merged.

And this might be a duplicate?


Originally I thought about shutdown like KeyboardInterrupt in Python, but since this is C++ and the user isn't always inside on of our functions we cannot ensure it is raised everywhere. Instead we have (some of ) our functions raise and then the user can check rclcpp::ok() to know when to interrupt their own work.

The thinking was that if the user wrote something like this:

while (true) {
  rclcpp::spin_some(...);
}

Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok(). I think this fail fast approach will result in the fewest unexplained hangs for users.

However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.

The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

@wjwwood
Copy link
Member

wjwwood commented May 27, 2020

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

This is a bad assumption, imo. It's like acquiring a lock, releasing it, and then assuming afterwards the resource is safe. I understand how it's an easy mistake to make, but imo it is a mistake. rclcpp::ok() just checks a state, it doesn't hold a resource or block other actions, and it doesn't indicate that to be the case.

@ivanpauno ivanpauno transferred this issue from ros2/rmw May 28, 2020
@ivanpauno
Copy link
Member Author

Transferred the issue to rclcpp, as I opened it in rmw by mistake.

@ivanpauno
Copy link
Member Author

This has been discussed in the past:

And this might be a duplicate?

Thanks for pointing out the duplicates.
I will try to consolidate all of them in a new issue --or edit an old one--, and close the others.

Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok(). I think this fail fast approach will result in the fewest unexplained hangs for users.

However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.

The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

I think that both "exception like" and "notification like" behaviors are acceptable and have their pro/cons. But IMO, we're doing neither of both correctly.
If we prefer throwing an exception when the user tries "doing something" after shutdown, I think we should everywhere throw the same exception.
e.g.: In the case I pointed out, we're just throwing an RCLError with a custom message, how do I distinguish that from an actual error?
I also think that we should add the try/catch in our examples/demos.

@fujitatomoya
Copy link
Collaborator

@ivanpauno

I also think that we should add the try/catch in our examples/demos.

Just FYI, I think that I already did it with ros2/examples#270.

@ivanpauno
Copy link
Member Author

@ivanpauno

I also think that we should add the try/catch in our examples/demos.

Just FYI, I think that I already did it with ros2/examples#270.

That's exactly what I don't like of how shutdown works right now, you only can try/catch a generic RCLError. There should be a rclcpp::exceptions::Shutdown error (or similar).

@sloretz sloretz added this to To do in Galactic via automation Jun 11, 2020
@ivanpauno
Copy link
Member Author

We already have in rcl return values like RCL_RET_NODE_INVALID, RCL_RET_PUBLISHER_INVALID, etc.

After shutdown, you usually get one of those errors (e.g. https://github.com/ros2/rcl/blob/069d1f07290019b62a8297c5f95a80b58e63682a/rcl/src/rcl/publisher.c#L286).
Can we add a RCL_RET_IS_SHUTDOWN or RCL_RET_CONTEXT_INVALID in rcl and return that when appropriate?

We could then add in rclcpp a ShutdownError derived from RCLError and make sure we throw that one in throw_from_rcl_error. In rclpy, we should make sure to always raise KeybordInterrupt in those cases.

I think that with those changes, we will be able to handle an asynchronous shutdown gracefully.

This approach requires careful reviewing of all functions in rcl layer, to make sure they are returning the appropriate error (and careful handling of those errors in the upper client layer).

Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.

@fujitatomoya
Copy link
Collaborator

i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID? i was thinking RCL_RET_CONTEXT_INVALID is equal to RCL_RET_IS_SHUTDOWN. we should add specific state that tells us if shutdown is called or not? could you enlighten me a little bit?

@ivanpauno
Copy link
Member Author

i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID?

Sorry, I think my comment wasn't clear.

Those are two names suggestions for the same new return code.
RCL_RET_CONTEXT_INVALID is named similarly to the other return codes.
RCL_RET_IS_SHUTDOWN is a more clear name for this case IMO.

@fujitatomoya
Copy link
Collaborator

@ivanpauno

thanks for the clarification, we will take care of this.

@ivanpauno
Copy link
Member Author

@ivanpauno

thanks for the clarification, we will take care of this.

Sorry, but up to now I'm the only one in favor of that change, we need to settle in what we want first.
I would like to hear other opinions before starting a change that may be rejected.

@wjwwood @hidmic @jacobperron Does a custom return code in rcl and a custom exception in rclcpp for shutdown sound like a good idea to solve the issue? Do you have any other alternative?

@jacobperron
Copy link
Member

jacobperron commented Jun 24, 2020

Introducing a new common rclcpp exception to make things consistent sounds like an incremental improvement to me +1

As a drastically different alternative, we could remove the exception and have a notification based approach by having spin return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.

while (rclcpp::spin_some()) {
  // ...
}

And existing code written like the following should still work (but without exceptions):

while (rclcpp::ok()) {
  rclcpp::spin_some(...);
}

Of course, as @wjwwood points out, users can still make the error of writing something like

while (true) {
  rclcpp::spin_some(...);
}

But, we can warn them about not using the return code at compile time at least.

IMO, having spin return silently (or with a return code) on a shutdown would match user expectation more than throwing an exception. Especially, as our examples are written using rclcpp::ok().

@ivanpauno
Copy link
Member Author

As a drastically different alternative, we could remove the exception and have a notification based approach by having spin return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.

I think that the notification based approach is fine.
It could even be configurable in InitOptions (I'm not sure if that's a good idea or not).

IMHO, changing spin* signatures won't help. Our loops currently look like:

while (rclcpp::ok()) {
  ...
  rclcpp::spin*(...);
  ...
}

That is perfectly suitable for a notification based approach, having a return value in spin* functions won't add much value.
If spin* functions get notified and stop handling new work, at some point you will reach rclcpp::ok again.
You can even have extra rclcpp::ok checks in the middle of the loop to break it if that's suitable to your application.

@wjwwood
Copy link
Member

wjwwood commented Jun 25, 2020

We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.

@ivanpauno
Copy link
Member Author

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.

👍 I like that option.

@jacobperron
Copy link
Member

SGTM

@fujitatomoya
Copy link
Collaborator

i guess we got to the consensus.

@ivanpauno
Copy link
Member Author

i guess we got to the consensus.

Yeap, I will check in tomorrow's meeting if everybody agrees.

Doing this change will involve coordinated work in rcl, rclpy and rclcpp to make sure that shutdown doesn't do more than triggering guard conditions once, invalidating the context, and calling shutdown callbacks. The only place where we should check if a context is valid or not is in rclcpp::ok(), rclpy.ok() (e.g. publish should continue working, etc).

Getting the refactor right might be a bit tricky, if you're planning to do I can help with the reviews.

@dirk-thomas
Copy link
Member

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly? Atm the shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

@fujitatomoya
Copy link
Collaborator

shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

i think that we can, and shutdown() call blocks with

std::recursive_mutex init_mutex_;

maybe i am misunderstanding, could you elaborate?

@ivanpauno
Copy link
Member Author

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?

It should be safe to do it after rclcpp::ok() returns false.

Atm the shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

We're currently finishing loggers at shutdown (see here), and I think that part of the idea is to stop doing that and doing it when the context is destructed.
If we do that change, init should make sure to handle the problem correctly (it's possible with some extra lines here).

We could also provide a reinit method that takes no arguments. It will make sure that the context is valid again, and it uses the same init options passed before (i.e. it only switches a flag).

examples:

rclcpp::init(argc, argv, init_options);
while (rclcpp::ok()) {
  executor.spin_all(max_duration);
  ... handle more work here
}
rclcpp::reinit();
auto context = rclcpp::Context::make_shared();
context->init(argc, argv, init_options);
while (context->is_valid()) {
  executor.spin_all(max_duration);
  ... handle more work here
}
context->init(other_argc, other_argv, other_init_options);

We should make sure both examples above work.


If the user is trying to call rclcpp::Context::init from more than one thread, it will be a problem.
But I don't think we should provide such thread safety guarantee.

@wjwwood
Copy link
Member

wjwwood commented Jul 1, 2020

It should be safe to do it after rclcpp::ok() returns false.

It's not though... It's only safe to do that after the context is destructed right?

I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.

@ivanpauno
Copy link
Member Author

It's not though... It's only safe to do that after the context is destructed right?

I think it's currently safe, as shutdown is finishing the logger and init will be able to setup it again correctly (supposing only one context).
If we move logging finishing to context destruction, we will need to handle that case in init.

I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.

It should work if one context initializes the logging system and the other doesn't.
I think it would be great to move the global context configuration to one per Context, but that doesn't seem to be exactly related (I will create an issue and add it to the Galactic project board).


I think we should add test for all these cases when doing the change, so it will be clear what works and what doesn't.
Both cases can conceptually work correctly with the change in shutdown behavior.

@jacobperron
Copy link
Member

Although not required, it would be nice if the solution to this issue (or version of it) could be backported to Foxy.

@hidmic
Copy link
Contributor

hidmic commented Jul 20, 2020

A few comments here and there:

If the context is shutdown in the middle of the call to rclcpp::ok and the construction of the executor in rclcpp::spin_some, then the error happens.

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

I see. It's a valid point. Perhaps spinning a entity associated with a shutdown context shouldn't throw an exception?

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

I guess the question is, what's the point of having a third shutdown state besides the usual init-fini cycle? If we only intend it to be a generic notification with no side-effects of its own, why making it a separate state? Why restricting it to a single event? Why do we call it shutdown?

As I see it, to shutdown means a context and all entities associated with it are soon to be finalized. That certainly bans entity creation and restricts the amount of work you can do (spinning is a good example of what should be allowed, or otherwise the added state is somewhat useless). And while we don't do anything meaningful about it right now, if anyone ever wants to:

  • Hint the system that an entity is on its way out
    • To avoid spurious service requests (way faster than a timeout).
    • To exclude them from discovery entirely (in DDS, or whatever transport protocol rmw uses)
  • Reschedule entity tasks to speed up termination.
  • Free/redistribute resource pools early

we'll need the concept.

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?

It should be safe to do it after rclcpp::ok() returns false.

It's not though... It's only safe to do that after the context is destructed right?

I think it's currently safe, as shutdown is finishing the logger and init will be able to setup it again correctly (supposing only one context).

I will say that's because rclcpp::Context::init silently destroys the underlying rcl context. Below rclcpp, it's init-shutdown-fini and only then a context can be reused.

Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.

This is true because rcl uses an atomic flag to invalidate contexts, instead of relying on rmw which also has to track shutdown contexts (to comply with rmw_context_fini() API). Finalized and shutdown context states could be perfectly distinguishable.

@wjwwood
Copy link
Member

wjwwood commented Jul 20, 2020

As I see it, to shutdown means a context and all entities associated with it are soon to be finalized.

That's what I imagined as well, but I think there's a need for another use case too, but whether or not that should be part of the same mechanism, I'm not sure. For example, we need a "you should stop, cleanup, and exit your node's main functions". That's what shutdown() meant in ROS 1 for the most part.

@hidmic
Copy link
Contributor

hidmic commented Jul 21, 2020

Yeah, I see it. I'm just not convinced that's Context::shutdown() purpose. And for the record, I'm not advocating for having both if we don't think it's worth the effort. Perhaps what we're looking for is some Context::interrupt() primitive?

@ivanpauno
Copy link
Member Author

That's what I imagined as well, but I think there's a need for another use case too, but whether or not that should be part of the same mechanism, I'm not sure. For example, we need a "you should stop, cleanup, and exit your node's main functions". That's what shutdown() meant in ROS 1 for the most part.

In ROS 1, calling shutdown from a different thread was safe.
A lots of the functions were a noop after shutdown happen, and didn't raise an exception.
e.g:

  • publish
  • calling a service
  • If you called spin after shutdown, it returned immediately.
  • I'm not completely sure, but I think you could even create a "Publisher" after shutdown (it wasn't going to really advertise anything, but trying to create one wouldn't fail).

Based on that, I think that the current ROS 2 shutdown behavior isn't quite the same.

As I see it, to shutdown means a context and all entities associated with it are soon to be finalized.

Sure, but that doesn't mean that shutdown should destroy things or make things raise an unknown exception IMO.

Yeah, I see it. I'm just not convinced that's Context::shutdown() purpose. And for the record, I'm not advocating for having both if we don't think it's worth the effort. Perhaps what we're looking for is some Context::interrupt() primitive?

IMHO, it's not a good idea to add a different method that it's similar and doesn't do exactly the same that shutdown (except if we have a strong use case for that).
The problem is that shutdown isn't thread safe, and it was thread safe in ROS 1.
I think that we should fix that first. We can add a different mechanism later if we find the need.

@hidmic
Copy link
Contributor

hidmic commented Jul 21, 2020

Sure, but that doesn't mean that shutdown should destroy things or make things raise an unknown exception IMO.

Agree about not destroying things, and raising unknown exceptions is never an option. But that happens because we don't deal with shutdowns properly yet.

IMHO, it's not a good idea to add a different method that it's similar and doesn't do exactly the same that shutdown (except if we have a strong use case for that).
The problem is that shutdown isn't thread safe, and it was thread safe in ROS 1.

I can't argue about the need for use cases. And I agree that it should be made thread-safe. What I'm calling out, based on the preceding discussion, is the attempt to turn shutdown into something conceptually different in the process. If the plan to make it thread-safe is to make it a flag and leave side-effects to calling code, that's fair, but don't call it shutdown (just like KeyboardInterrupt isn't ShutdownRequest even though that's the typical use case). I don't even see the need for a separate context state really if that's all we want from it.

@ivanpauno
Copy link
Member Author

But that happens because we don't deal with shutdowns properly yet.

What does "dealing with shutdowns properly" mean?
I think that's the main question here, and I'm failing to get a consensus of what it means.

If the plan to make it thread-safe is to make it a flag and leave side-effects to calling code, that's fair, but don't call it shutdown (just like KeyboardInterrupt isn't ShutdownRequest even though that's the typical use case). I don't even see the need for a separate context state really if that's all we want from it.

AFAIS, in ROS 1 it was actually a flag.
After that flag was true, some things start behaving as a noop but never retrieved an error.

I don't even see the need for a separate context state really if that's all we want from it.

I'm not sure what "separate context state" is referring to.

just like KeyboardInterrupt isn't ShutdownRequest even though that's the typical use case

I would say that a KeyboardInterrupt can be considered a ShutdownRequest, but not the opposite way around.


We will probably need to set up a meeting to discuss this, as this discussion is getting quite long.

@hidmic
Copy link
Contributor

hidmic commented Jul 21, 2020

We will probably need to set up a meeting to discuss this, as this discussion is getting quite long.

Agreed.


But to entertain the discussion:

What does "dealing with shutdowns properly" mean?

IMHO to have APIs behave in a well-defined, thread-safe, and documented way when a shutdown occurs (no random exception, no UB, etc.).

AFAIS, in ROS 1 it was actually a flag.
After that flag was true, some things start behaving as a noop but never retrieved an error.

I'm not onboard with how ROS 1 did lots of things. Whether ROS 2 should mimic for the sake of familiarity or depart in an attempt to improve is a somewhat philosophical discussion.

I'm not sure what "separate context state" is referring to.

To have an invalidated context as a notification, instead of simply sending a notification.

@wjwwood
Copy link
Member

wjwwood commented Sep 22, 2020

We should definitely have some meetings to nail this down.


With respect to the topic of emulating ROS 1 or not in this case, I think we shouldn't. This is mostly based on my experience with clients like Apex, where consistent behavior is key, and having a variety of reactions from the API when shutdown is called makes it hard to know (without 100% familiarity with the entire API) what will happen.

To that end, I think the best idea is just making shutdown a flag, and have none of our API change its behavior when that occurs, instead, allowing the user to decide how to detect/check for it and handle it.

If there is then cases which are common place and are becoming boilerplate, we can add convenience to reduce code duplication, but should ideally be opt-in so the default behavior is predictable and unsurprising, even if they're a bit inconvenient.

This is just all my opinion at this point, and I definitely not sure it's the correct path, but we should discuss it and come to some consensus. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants