executor could take more than once incorrectly #383

Merged (7 commits) on Jun 19, 2018

Conversation

@wjwwood (Member) commented Oct 4, 2017

@sloretz pointed this issue out to me when he was working on the Python executor. This is an issue that only occurs when the multi-threaded executor and reentrant callback groups are used together.

If you imagine each thread in the multi-threaded executor doing this (see the sketch after this list):

  • acquire lock
  • wait for work
  • release lock
  • take (or "claim") work (rcl_take for subscribers, call for timers)
  • pass work to user callback
  • loop
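
In code, the window looks roughly like this. This is a minimal sketch only; `Executor`, `wait_for_work`, `take_work`, and `execute` are illustrative names standing in for the real rclcpp internals:

```cpp
#include <mutex>

// Sketch of one worker thread in the multi-threaded executor.
// All names here are illustrative, not the actual rclcpp API.
void worker_thread(Executor & executor)
{
  while (executor.is_spinning()) {
    AnyExecutable work;
    {
      std::lock_guard<std::mutex> lock(executor.wait_mutex());
      work = executor.wait_for_work();  // build wait set, wait, pick a ready item
    }
    // Race window: if a context switch lands here, another thread can run
    // wait_for_work() and be handed the same ready item before we claim it.
    executor.take_work(work);   // rcl_take for subscriptions, call for timers
    executor.execute(work);     // pass the work to the user callback
  }
}
```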

If a context switch happens directly after releasing the lock, it's possible another thread can take the same work. This is because of how the "wait for work" part works, which is to:

  • for each callback group, if it can be taken from, loop over its items
    • items include subscriptions, timers, and client/servers for services
  • add each item to the wait set
  • wait
  • return the next thing that is ready when wait wakes up

So if all of this happens again in another thread, before the original thread can take or claim the work to be done, then the second thread will try to take or claim the same work.

For subscriptions and clients/servers of services this is protected by some form of rcl_take, which fails if it is called twice on the same subscription when only one piece of data is available. However, timers can be called multiple times, because checking whether a timer should be called and actually calling it are decoupled, as the snippet below shows.
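
Concretely, with the rcl timer API the readiness check and the call are separate operations, so two threads can interleave them. A sketch, assuming an already initialized `rcl_timer_t timer` and omitting error handling:

```cpp
#include <rcl/timer.h>

// Both threads can observe the timer as ready before either one calls it:
bool is_ready = false;
rcl_timer_is_ready(&timer, &is_ready);  // thread A sees true; thread B too
if (is_ready) {
  rcl_timer_call(&timer);  // so both threads end up invoking the callback
}
```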

In this PR I was able to expose the issue by adding an option for the multi-threaded executor to yield after releasing the lock, together with a test that uses a reentrant callback group and a timer. Reproducing it with a subscription would be a good bit harder, I think, due to the asynchronous nature of the process.

I'm opening this as a work in progress for visibility, but I'm on the right track to having a reproducible test failure.

As for the fix, which has not been made yet: it will require taking or claiming the work to be done inside of the multithreading lock, which might require a change to the executor API.

@wjwwood wjwwood added in progress Actively being worked on (Kanban column) bug Something isn't working labels Oct 4, 2017
@mikaelarguedas mikaelarguedas added ready Work is about to start (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Oct 26, 2017
@mjcarroll mjcarroll self-assigned this Mar 27, 2018
@mjcarroll mjcarroll added in progress Actively being worked on (Kanban column) and removed ready Work is about to start (Kanban column) labels Mar 27, 2018
@mjcarroll mjcarroll force-pushed the fix_executor_extra_take branch 2 times, most recently from a3a15ca to 1488317 on March 28, 2018
@mjcarroll mjcarroll changed the title executor could take more than once incorrectly wip: executor could take more than once incorrectly Mar 28, 2018
@mjcarroll mjcarroll changed the title wip: executor could take more than once incorrectly executor could take more than once incorrectly Mar 28, 2018
@mjcarroll (Member) commented:

@wjwwood and @mikaelarguedas This is a fix, but I don't know if it's the one that we want.

As described above, the bug occurs if Executor::get_next_timer gets called multiple times before the timer actually gets executed in Executor::execute_any_executable. To prevent this, I've added a set of timers to track if a timer is currently 'in-flight' or scheduled for execution.

The timer is added to the set in get_next_timer and removed in execute_any_executable.
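
A condensed sketch of that bookkeeping. The member names `scheduled_` and `scheduled_mutex_` appear in the diff hunks below; the helper class here is purely for illustration, since the PR keeps the set directly in the executor:

```cpp
#include <memory>
#include <mutex>
#include <set>

struct AnyExecutable;  // stand-in for rclcpp::executor::AnyExecutable
using AnyExecutablePtr = std::shared_ptr<AnyExecutable>;

class ScheduledSet
{
public:
  // Called when get_next_timer() hands out work: returns false if the
  // executable is already in flight on another thread.
  bool mark_scheduled(const AnyExecutablePtr & exec)
  {
    std::lock_guard<std::mutex> lock(scheduled_mutex_);
    return scheduled_.insert(exec).second;
  }

  // Called from execute_any_executable() once the callback has finished.
  void mark_done(const AnyExecutablePtr & exec)
  {
    std::lock_guard<std::mutex> lock(scheduled_mutex_);
    scheduled_.erase(exec);
  }

private:
  std::set<AnyExecutablePtr> scheduled_;  // executables currently in flight
  std::mutex scheduled_mutex_;
};
```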

The alternative would be to include this functionality in the timer itself, but I'm not sure if that would be a violation of encapsulation?

@wjwwood (Member, Author) commented Mar 28, 2018

I spoke with @mjcarroll off-line and we discussed putting the "tracking" information in the MultiThreadedExecutor class since it is not needed in the SingleThreadedExecutor, and also because future executors might want to make similar, but different constraints on what should and should not be returned by the get_next_executable() function.

@mjcarroll (Member) commented:

Shallow CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz (Contributor) left a comment:

One CMake thing I think needs to be fixed, and some code I don't understand, noted in another comment.

@@ -0,0 +1,91 @@
// Copyright 2017 Open Source Robotics Foundation, Inc.
Contributor: 2018 (the new file's copyright year should be 2018).

@@ -387,6 +387,14 @@ if(BUILD_TESTING)
"rcl")
target_link_libraries(test_utilities ${PROJECT_NAME})
endif()

ament_add_gtest(test_multi_threaded_executor test/executors/test_multi_threaded_executor.cpp
Contributor: Does this need to be wrapped in if(UNIX) due to sched_setscheduler() in the test?

Member reply: I added an #ifdef around the corresponding sched_setscheduler() call, because the bug originally manifested on OSX and we wanted to test against that. The trick also isn't necessary on OSX, because there the bug shows up with the default scheduler.
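
The guard might look like this. This is a sketch of the idea, not the exact test code; the Linux-only branch is the one that needs the real-time scheduler:

```cpp
#ifdef __linux__
#include <sched.h>
#include <cstdio>

// Raise the thread's priority so yielding reliably triggers a context
// switch; sched_setscheduler() usually needs elevated privileges.
static void use_fifo_scheduler()
{
  struct sched_param param;
  param.sched_priority = sched_get_priority_max(SCHED_FIFO);
  if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
    perror("sched_setscheduler");
  }
}
#endif  // on OSX the bug reproduces with the default scheduler
```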

double diff = labs((now - last).nanoseconds()) / 1.0e9;
last = now;

if (diff < 0.009 || diff > 0.011) {
Contributor: Recommend replacing the bare numbers with something like diff < PERIOD - TOLERANCE || diff > PERIOD + TOLERANCE to make it clear where they come from.
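
Spelled out, assuming the 10 ms period implied by the 0.009/0.011 bounds above:

```cpp
constexpr double PERIOD = 0.01;      // expected timer period, in seconds
constexpr double TOLERANCE = 0.001;  // allowed jitter, in seconds

bool spacing_ok(double diff)
{
  return diff >= PERIOD - TOLERANCE && diff <= PERIOD + TOLERANCE;
}
```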

}

rclcpp::executors::MultiThreadedExecutor executor(
rclcpp::executor::create_default_executor_arguments(), 0, true);
Contributor: Personal preference is yield_before_execute = true; and then passing that in. I'll approve either way.

{
std::lock_guard<std::mutex> lock(scheduled_mutex_);
auto it = scheduled_.find(any_exec);
if (it != scheduled_.end()) {
Contributor: When is an executable not found in scheduled_? Is it a case worth logging a warning?

Member reply: It should never happen, afaict.

continue;
}
{
std::lock_guard<std::mutex> lock(scheduled_mutex_);
if (scheduled_.count(any_exec) != 0) {
Contributor: I'm a bit confused by this line. AFAIK count() uses operator==() of any_exec, which is a std::shared_ptr<executor::AnyExecutable>. The docs say that compares the address of the pointer held by the shared_ptr. The address is that of a newly default-constructed object from the std::make_shared() above. Won't this count() always return 0?
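
A self-contained illustration of the shared_ptr semantics behind this question (the struct name is just a stand-in):

```cpp
#include <cassert>
#include <memory>

struct AnyExecutable {};

int main()
{
  // operator== on shared_ptr compares the stored pointers, not the pointees,
  // so two separate allocations never compare equal.
  auto a = std::make_shared<AnyExecutable>();
  auto b = std::make_shared<AnyExecutable>();
  assert(!(a == b));

  // Only a copy of the same shared_ptr (same managed object) matches, which
  // is what a std::set keyed on the pointer relies on.
  auto c = a;
  assert(a == c);
  return 0;
}
```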

@mjcarroll (Member):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@mjcarroll (Member) commented:

Now with expanded test tolerance:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@wjwwood (Member, Author) left a comment: lgtm

@wjwwood (Member, Author) commented Jun 19, 2018

@dhood is going to give a review as well.

@mjcarroll mjcarroll added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jun 19, 2018
Otherwise it seemed to me like it would yield twice.