Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug that prevented Executor::spin_some() from exiting. #472

Closed
wants to merge 2 commits into from

Conversation

tj10200
Copy link

@tj10200 tj10200 commented May 4, 2018

Added batching functionality where all pending executables are
collected prior to being executed. This should prevent
long running executables from backing up the processing when calling
spin_some().

Connects to #471

Added batching functionality where all pending executables are
collected prior to being executed. This should prevent
long running executables from backing up the processing.
@tfoote tfoote added the in review Waiting for review (Kanban column) label May 4, 2018
Copy link
Member

@wjwwood wjwwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic approach is good in my opinion. There are some code-style issues which our tests and tools can help you work out (you can use the ament_uncrustify path/to/rclcpp --reformat and ament_cpplint path/to/cpplint tools to help you).

I'd also like to see a test which tests this case and ensures the desired behavior is happening. @tj10200 are you interested in doing that too?

If so, I think you could create a test case which does something like this:

  • create a node
  • create three condition variables: A, B, C
  • create bool test_finished = false
  • create a callback_count = 0
  • create a subscription and matching publisher
    • ensure the subscription has a queue of at least 10 (either keep all or keep last >=10)
    • in the callback for the subscription:
      • have it first increment callback_count
      • then notify the condition variable A
      • then if !test_finished:
        • wait on B
      • these steps allow you to known when the callback has started and then asynchronously control when the callback finishes
  • add node to executor (single threaded should be ok)
  • publish twice, to "queue up" two pieces of work
  • create a thread that:
    • calls spin_some
    • notify C
  • wait for A
    • with a predicate that callback_count == 1
    • this will let us know when spin_some has started executing the callback for the first time
  • publish again (queuing more work during spin_some which should not be handled)
  • notify B
  • wait for A
    • with a predicate that callback_count == 2, and
    • with a high timeout (~10s?)
    • if timeout (callback never called again)
      • FAIL test
      • cancel executor
      • join thread
  • test_finished = true (to prevent callback from waiting in the future)
  • notify B (in case already waiting)
  • wait for C
    • with a high timeout (~10s?)
    • if timeout (spin_some never exited)
      • FAIL test
      • cancel executor
      • join thread
  • gtest ASSERT_EQ(callback_count, 2), if 3 then the spin_some did too much work
  • join thread

Sorry, I know that was a lot... lol.

I put "good first issue", but it's deceptively complex. I think if you fix the style issues and make a test then this will be a good incremental improvement.

However, reviewing this code I found three more issues with our implementation. So I'd merge this and make new issues for those.

Briefly I'll describe them here:

  • spinning should be checked between execution of each piece of work
    • otherwise, it might take a long time for Executor::cancel to stop an in-progress spin_some

  • if spin_some stops before finishing all work, claimed work is not returned to the executor and may be lost

  • if spin_some is changed to consider spinning more often, then there is a race condition on spinning
    • consider two threads both doing if (spinning.load()) executor.cancel(); executor.spin_some();
    • consider this sequence:
      • >> enter thread 1
      • start spin_some()
      • spin_some() sets spinning = true
      • >> context switch to thread 2
      • calls executor.cancel() because spinning is true
      • start second spin_some()
      • second spin_some() sets spinning = true again
      • >> context switch to thread 1
      • spin_some()'s "on scope exit" sets spinning to false again
      • >> thread 1 ends, switch back to thread 2
      • second spin_some() exits because subsequent spinning check finds it false (yet no cancel has been called)
    • expectation is that second spin_some() runs uninterrupted to completion
    • actually second spin_some() sometimes exits early as-if canceled, despite not being canceled

Those don't need to be addressed in this pr or by you, but I will make issues for those cases when this is finished.

In summary, I'm "requesting changes (red X)" for style reasons and because I think it needs a test.

Thanks for the contribution so far!

Reformatted executor.cpp using ament_uncrustify and ament_cpplint
@tj10200
Copy link
Author

tj10200 commented May 8, 2018

Hi @wjwwood - Yes I'd like to write a test before getting this in. I assume you want to commit the test on this same pull request?

@wjwwood
Copy link
Member

wjwwood commented May 8, 2018

Hi @wjwwood - Yes I'd like to write a test before getting this in. I assume you want to commit the test on this same pull request?

Yes, please.

@mikaelarguedas mikaelarguedas added requires-changes in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels May 10, 2018
@tj10200
Copy link
Author

tj10200 commented May 11, 2018

hi @wjwwood, I was having some trouble getting this all to hook together in the unit test. I'm hoping you might be able to point out where I've messed up.

So I have something that looks like this (trimmed down version):

  rclcpp::init(0, nullptr);
  rclcpp::executors::SingleThreadedExecutor executor;
  rclcpp::Node::SharedPtr subscriber =  std::make_shared<SpinSomeTestSubscriber>();
  rclcpp::Node::SharedPtr publisher =  std::make_shared<SpinSomeTestPublisher>();
  executor.add_node(publisher);
  executor.add_node(subscriber);

  publisher->publish_test_data();

  executor.spin_some();

The problem I'm having is that when calling spin_some(), the function get_next_executable() always returns false and the subscriber callback is never called. The publisher and subscriber are the same ones under the minimal_publisher/subscriber lambda examples, I've basically just renamed them and changed publisher to use an explicit callback rather than use a timer.

Did I miss something here in the initialization steps? Looking through the other examples, I can't really see anything that's different in my UT.

@tj10200
Copy link
Author

tj10200 commented May 17, 2018

Started looking at this again today. I noticed through gdb that occasionally the executor would pick up the published messages. Through some debugging I found that it was related to no having the intra process flag switched on when I created my nodes. Saying that, I ran into another problem immediately after initializing the nodes correctly.

In the executor on the collection of the second published message: It aborts on this line here - https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L595

I think I can make a change to the publisher and subscribers in the test to make the groups Reentrant types, but does this seem like the right behavior? It would mean that the application would abort any time a pub/sub was still mutually exclusive callback group types.

It seems that I might need to take a different approach to my original solution. I'm thinking of trying to work out a method to count the number of pending messages, and using that as a way to break out of the spin_some loop. This would be very similar to the current pull request, with the exception that the AnyExecutable objects would be collected and executed as a single operation.

Sudo Code:

Executor::spin_some()
...
  int pending_executables = get_pending_executables();
  while (spinning.load() && pending_executables) {
    //Same as https://github.com/ros2/rclcpp/blob/master/rclcpp/src/rclcpp/executor.cpp#L210
    --pending_executables;
  }
...

Thoughts?

@mikaelarguedas mikaelarguedas added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jul 5, 2018
@cottsay
Copy link
Member

cottsay commented Feb 14, 2019

@tj10200, it looks like there hasn't been much progress on this in a while. Have you been able to make any progress, or do you have any questions we can help answer?

@cottsay
Copy link
Member

cottsay commented Feb 21, 2019

I'm going to move this to the backlog for now, since there doesn't appear to be much activity. @tj10200, if you find time to revisit this, please feel free to let us know.

@cottsay cottsay added backlog and removed in review Waiting for review (Kanban column) labels Feb 21, 2019
@ivanpauno
Copy link
Member

#844 addressed the same problem this PR was trying to solve.
Closing as #844 has been merged.

@tj10200 Feel free to reopen or open a new issue/PR if you think that something has to be changed.

@ivanpauno ivanpauno closed this Nov 22, 2019
@wuffle-ros wuffle-ros bot removed the backlog label Nov 22, 2019
DensoADAS pushed a commit to DensoADAS/rclcpp that referenced this pull request Aug 5, 2022
Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants