Asynchronously shutting down while using rclcpp::spin or rclcpp::spin_some in another thread throws an exception #1139
Comments
@hidmic @wjwwood @jacobperron does this look like a user error (i.e. the test is not well written) or should we improve how shutdown works? |
I'd say user error. |
Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid? |
The thing is that IMO, it's a behavior error, and not a user one (I would expect that if I do something after checking for IMO shutdown shouldn't affect executors/publishers/subscriptions/nodes/... etc, it should only make In that way, we could shut down asynchronously without this error (or a bunch of other similar errors). |
I also would classify it as "user error", but not necessarily the user's fault. It's confusing and so the existing story could be improved. This has been discussed in the past:
And perhaps others, though I couldn't find them. I know @sloretz and I spoke about it in person at length around the time ros2/rmw#154 was merged. And this might be a duplicate?
Originally I thought about shutdown like KeyboardInterrupt in Python, but since this is C++ and the user isn't always inside one of our functions, we cannot ensure it is raised everywhere. Instead we have (some of) our functions raise and then the user can check The thinking was that if the user wrote something like this:
while (true) {
  rclcpp::spin_some(...);
}
Then they wouldn't get stuck in a loop due to the mistake (imo) of not checking However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.
The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it. |
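A minimal standalone sketch of the "exception like" compromise described above. It deliberately does not use rclcpp: `ThrowingContext`, `spin_some_like` and `run_unchecked_loop` are hypothetical stand-ins, and the shutdown is simulated inside the loop rather than coming from another thread.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a context; this is NOT the rclcpp API.
struct ThrowingContext {
  std::atomic<bool> valid{true};
  void shutdown() { valid.store(false); }
  bool ok() const { return valid.load(); }
};

// Hypothetical spin function that raises once the context is shut down,
// mimicking the behavior discussed in the comment above.
inline void spin_some_like(const ThrowingContext & ctx) {
  if (!ctx.ok()) {
    throw std::runtime_error("context is shut down");
  }
  // ... ready work would be processed here ...
}

// A loop that never checks ok(): only the exception breaks it out.
// Returns the number of iterations that completed before shutdown.
inline int run_unchecked_loop(ThrowingContext & ctx, int shutdown_after) {
  assert(shutdown_after >= 0);
  int iterations = 0;
  try {
    while (true) {
      if (iterations == shutdown_after) {
        ctx.shutdown();  // simulates an asynchronous shutdown request
      }
      spin_some_like(ctx);
      ++iterations;
    }
  } catch (const std::runtime_error &) {
    // Treated like KeyboardInterrupt in Python: clean up and leave.
  }
  return iterations;
}
```

The loop cannot get stuck, but only because the caller wrote the `try`/`catch`; without it the same shutdown would terminate the program.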
This is a bad assumption, imo. It's like acquiring a lock, releasing it, and then assuming afterwards the resource is safe. I understand how it's an easy mistake to make, but imo it is a mistake. |
Transferred the issue to |
Thanks for pointing out the duplicates.
I think that both "exception like" and "notification like" behaviors are acceptable and have their pros/cons. But IMO, we're doing neither of them correctly. |
Just FYI, I think that I already did it with ros2/examples#270. |
That's exactly what I don't like of how |
We already have in After shutdown, you usually get one of those errors (e.g. https://github.com/ros2/rcl/blob/069d1f07290019b62a8297c5f95a80b58e63682a/rcl/src/rcl/publisher.c#L286). We could then add in rclcpp a I think that with those changes, we will be able to handle an asynchronous shutdown gracefully. This approach requires careful reviewing of all functions in Note: It's impossible to distinguish an already shut down context from one that was never initialized; that should be clear in error messages. |
i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID? i was thinking RCL_RET_CONTEXT_INVALID is equal to RCL_RET_IS_SHUTDOWN. should we add a specific state that tells us whether shutdown was called or not? could you enlighten me a little bit? |
Sorry, I think my comment wasn't clear. Those are two names suggestions for the same new return code. |
thanks for the clarification, we will take care of this. |
Sorry, but up to now I'm the only one in favor of that change, we need to settle on what we want first. @wjwwood @hidmic @jacobperron Does a custom return code in rcl and a custom exception in rclcpp for shutdown sound like a good idea to solve the issue? Do you have any other alternative? |
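The "custom return code plus custom exception" idea can be sketched roughly as follows. This is a standalone illustration under stated assumptions: `Ret`, `ShutdownException`, `throw_from_ret` and `publish_or_stop` are hypothetical names invented here, and the enum values do not correspond to rcl's real return codes.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical return codes; they do not match rcl's real rcl_ret_t values.
enum class Ret { Ok, Error, IsShutdown };

// Hypothetical dedicated exception meaning "the context was shut down".
class ShutdownException : public std::runtime_error {
public:
  explicit ShutdownException(const std::string & what)
  : std::runtime_error(what) {}
};

// Translate a C-style return code into a C++ exception, keeping the
// shutdown case distinguishable from every other failure.
inline void throw_from_ret(Ret ret, const std::string & prefix) {
  switch (ret) {
    case Ret::Ok:
      return;
    case Ret::IsShutdown:
      throw ShutdownException(prefix + ": context is shut down");
    case Ret::Error:
      break;
  }
  throw std::runtime_error(prefix + ": unspecified error");
}

// A caller that treats shutdown as an expected way to stop, while any
// other error still propagates. Returns true if the call succeeded.
inline bool publish_or_stop(Ret ret_from_publish) {
  try {
    throw_from_ret(ret_from_publish, "publish");
    return true;
  } catch (const ShutdownException &) {
    return false;  // expected during an asynchronous shutdown
  }
}
```

With a single well-known exception type, user code can catch the shutdown case specifically instead of guessing which generic error an API happens to raise after shutdown.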
Introducing a new common rclcpp exception to make things consistent sounds like an incremental improvement to me +1
As a drastically different alternative, we could remove the exception and have a notification based approach by having
while (rclcpp::spin_some()) {
  // ...
}
And existing code written like the following should still work (but without exceptions):
while (rclcpp::ok()) {
  rclcpp::spin_some(...);
}
Of course, as @wjwwood points out, users can still make the error of writing something like
while (true) {
  rclcpp::spin_some(...);
}
But, we can warn them about not using the return code at compile time at least. IMO, having |
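One way the compile-time warning mentioned above could be obtained is the C++17 `[[nodiscard]]` attribute; that choice is an assumption of this sketch, not something the thread specifies. `NotifyingContext`, `spin_some_notifying` and `run_until_shutdown` are hypothetical stand-ins and not rclcpp functions.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a context; this is NOT the rclcpp API.
struct NotifyingContext {
  std::atomic<bool> valid{true};
};

// Hypothetical spin function for the notification based approach:
// it returns false after shutdown instead of throwing.
// [[nodiscard]] asks the compiler to warn when the result is ignored.
[[nodiscard]] inline bool spin_some_notifying(const NotifyingContext & ctx) {
  if (!ctx.valid.load()) {
    return false;
  }
  // ... ready work would be processed here ...
  return true;
}

// The loop shape from the comment above: spin until told to stop.
// Returns the number of iterations that completed before shutdown.
inline int run_until_shutdown(NotifyingContext & ctx, int shutdown_after) {
  assert(shutdown_after > 0);
  int iterations = 0;
  while (spin_some_notifying(ctx)) {
    ++iterations;
    if (iterations == shutdown_after) {
      ctx.valid.store(false);  // simulates an asynchronous shutdown request
    }
  }
  return iterations;
}
```

A `while (true) { spin_some_notifying(ctx); }` loop would still compile, but a conforming compiler is encouraged to emit a warning for the discarded result.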
I think that the notification based approach is fine. IMHO, changing
while (rclcpp::ok()) {
  ...
  rclcpp::spin*(...);
  ...
}
That is perfectly suitable for a notification based approach, having a return value in |
Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls. |
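A minimal sketch of the "shutdown is only a flag" option, assuming a single-threaded stand-in for brevity. `FlagOnlyContext`, `FakePublisher` and `run_node` are hypothetical and unrelated to the real rclcpp classes.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical context whose shutdown() only flips a flag.
class FlagOnlyContext {
public:
  void shutdown() { shutdown_requested_.store(true); }
  bool ok() const { return !shutdown_requested_.load(); }

private:
  std::atomic<bool> shutdown_requested_{false};
};

// Hypothetical publisher that keeps working regardless of the flag.
class FakePublisher {
public:
  void publish(const std::string & msg) { sent_.push_back(msg); }
  const std::vector<std::string> & sent() const { return sent_; }

private:
  std::vector<std::string> sent_;
};

// It is up to the user to notice the flag and react to it.
inline void run_node(FlagOnlyContext & ctx, FakePublisher & pub, int shutdown_after) {
  assert(shutdown_after > 0);
  int count = 0;
  while (ctx.ok()) {
    pub.publish("tick");
    if (++count == shutdown_after) {
      ctx.shutdown();  // simulates a shutdown request arriving
    }
  }
  pub.publish("goodbye");  // still allowed: shutdown changed nothing else
}
```

Nothing throws and nothing is destroyed here; the cost is that a loop which never consults `ok()` would simply keep running.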
👍 I like that option. |
SGTM |
i guess we got to the consensus. |
Yeap, I will check in tomorrow's meeting if everybody agrees. Doing this change will involve coordinated work in Getting the refactor right might be a bit tricky, if you're planning to do I can help with the reviews. |
I think we should consider and address the bigger picture: how should a user perform multiple |
i think that we can, and rclcpp/rclcpp/include/rclcpp/context.hpp Line 345 in a8cd936
maybe i am misunderstanding, could you elaborate? |
It should be safe to do it after
We're currently finishing loggers at shutdown (see here), and I think that part of the idea is to stop doing that and do it when the context is destructed. We could also provide a examples:
rclcpp::init(argc, argv, init_options);
while (rclcpp::ok()) {
  executor.spin_all(max_duration);
  // ... handle more work here
}
rclcpp::reinit();

auto context = rclcpp::Context::make_shared();
context->init(argc, argv, init_options);
while (context->is_valid()) {
  executor.spin_all(max_duration);
  // ... handle more work here
}
context->init(other_argc, other_argv, other_init_options);

We should make sure both examples above work. If the user is trying to call |
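The init, spin, shutdown, init again cycle in the two examples above can be modeled with a small state machine. This is only a sketch: `ReusableContext` and `run_cycles` are hypothetical, and rejecting `init()` on a context that is still valid is an assumption of the sketch rather than something the thread settles.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical context that may be initialized again after shutdown.
class ReusableContext {
public:
  void init() {
    if (valid_) {
      // Assumption of this sketch: double init without shutdown is an error.
      throw std::logic_error("context already initialized");
    }
    valid_ = true;
    ++init_count_;
  }
  void shutdown() { valid_ = false; }
  bool is_valid() const { return valid_; }
  int init_count() const { return init_count_; }

private:
  bool valid_ = false;
  int init_count_ = 0;
};

// Runs several init/spin/shutdown cycles on the same context object.
// Returns how many times the loop body ran in total.
inline int run_cycles(ReusableContext & ctx, int cycles) {
  assert(cycles >= 0);
  int spins = 0;
  for (int i = 0; i < cycles; ++i) {
    ctx.init();
    while (ctx.is_valid()) {
      ++spins;         // stands in for executor.spin_all(max_duration)
      ctx.shutdown();  // stands in for a shutdown request arriving
    }
  }
  return spins;
}
```

Tests for the real change would presumably exercise exactly these transitions, including the rejected ones, so it is clear what works and what does not.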
It's not though... It's only safe to do that after the context is destructed right? I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore. |
I think it's currently safe, as
It should work if one context initializes the logging system and the other doesn't. I think we should add tests for all these cases when doing the change, so it will be clear what works and what doesn't. |
Although not required, it would be nice if the solution to this issue (or version of it) could be backported to Foxy. |
A few comments here and there:
I see. It's a valid point. Perhaps spinning an entity associated with a shutdown context shouldn't throw an exception?
I guess the question is, what's the point of having a third shutdown state besides the usual init-fini cycle? If we only intend it to be a generic notification with no side-effects of its own, why make it a separate state? Why restrict it to a single event? Why do we call it As I see it, to shut down means a context and all entities associated with it are soon to be finalized. That certainly bans entity creation and restricts the amount of work you can do (spinning is a good example of what should be allowed, or otherwise the added state is somewhat useless). And while we don't do anything meaningful about it right now, if anyone ever wants to:
we'll need the concept.
I will say that's because
This is true because |
That's what I imagined as well, but I think there's a need for another use case too, but whether or not that should be part of the same mechanism, I'm not sure. For example, we need a "you should stop, cleanup, and exit your node's main functions". That's what |
Yeah, I see it. I'm just not convinced that's |
In ROS 1, calling shutdown from a different thread was safe.
Based on that, I think that the current ROS 2 shutdown behavior isn't quite the same.
Sure, but that doesn't mean that shutdown should destroy things or make things raise an unknown exception IMO.
IMHO, it's not a good idea to add a different method that is similar to shutdown but doesn't do exactly the same thing (unless we have a strong use case for that). |
Agree about not destroying things, and raising unknown exceptions is never an option. But that happens because we don't deal with shutdowns properly yet.
I can't argue about the need for use cases. And I agree that it should be made thread-safe. What I'm calling out, based on the preceding discussion, is the attempt to turn |
What does "dealing with shutdowns properly" mean?
AFAIS, in ROS 1 it was actually a flag.
I'm not sure what "separate context state" is referring to.
I would say that a We will probably need to set up a meeting to discuss this, as this discussion is getting quite long. |
Agreed. But to entertain the discussion:
IMHO to have APIs behave in a well-defined, thread-safe, and documented way when a shutdown occurs (no random exception, no UB, etc.).
I'm not on board with how ROS 1 did lots of things. Whether ROS 2 should mimic it for the sake of familiarity or depart from it in an attempt to improve is a somewhat philosophical discussion.
To have an invalidated context as a notification, instead of simply sending a notification. |
We should definitely have some meetings to nail this down. With respect to the topic of emulating ROS 1 or not in this case, I think we shouldn't. This is mostly based on my experience with clients like Apex, where consistent behavior is key, and having a variety of reactions from the API when shutdown is called makes it hard to know (without 100% familiarity with the entire API) what will happen. To that end, I think the best idea is just making shutdown a flag, and having none of our API change its behavior when that occurs, instead allowing the user to decide how to detect/check for it and handle it. If there are then cases which are commonplace and are becoming boilerplate, we can add conveniences to reduce code duplication, but they should ideally be opt-in so the default behavior is predictable and unsurprising, even if it's a bit inconvenient. This is all just my opinion at this point, and I'm definitely not sure it's the correct path, but we should discuss it and come to some consensus. 👍 |
This nightly failure is an example of the problem:
What is happening is that one thread is creating an executor, while another is trying to shut down.
If shutdown is called before the creation of the Executor, an exception is thrown:
rclcpp/rclcpp_lifecycle/test/test_lifecycle_service_client.cpp
Lines 218 to 231 in 87bb9f9
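The ordering problem reported here can be reproduced deterministically in a standalone sketch, assuming the shutdown has already won the race. `SharedContext`, `FakeExecutor` and `try_create_executor` are hypothetical stand-ins; the real failure involves two threads and the actual rclcpp executor.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical context shared between the two threads in the report.
struct SharedContext {
  std::atomic<bool> valid{true};
};

// Hypothetical executor whose constructor refuses a shut down context,
// which is the behavior the failing test runs into.
class FakeExecutor {
public:
  explicit FakeExecutor(const SharedContext & ctx) {
    if (!ctx.valid.load()) {
      throw std::runtime_error("failed to create executor: context is shut down");
    }
  }
};

// Returns true if the executor could be created, false if shutdown
// happened first. Whether it succeeds depends only on ordering.
inline bool try_create_executor(const SharedContext & ctx) {
  try {
    FakeExecutor exec(ctx);
    (void)exec;
    return true;
  } catch (const std::runtime_error &) {
    return false;
  }
}
```

In the threaded case neither outcome is guaranteed, which is why a test that does not guard the construction fails only intermittently.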