Nav2 lifecycle manager won't startup nodes #2917

Closed
AlexeyMerzlyakov opened this issue Apr 15, 2022 · 21 comments

@AlexeyMerzlyakov (Collaborator) commented Apr 15, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu focal 20.04.4
  • ROS2 Version:
    • ROS2 rolling built with sources from Apr 13 2022
  • Version or commit hash:
  • DDS implementation:
    • Fast-RTPS

Steps to reproduce issue

Run nav2_lifecycle_manager with any lifecycle node, e.g.:

ros2 launch nav2_bringup bringup_launch.py map:=${HOME}/nav2_ws/src/navigation2/nav2_bringup/maps/turtlebot3_world.yaml | grep -e Configuring -e Activating

Expected behavior

Lifecycle nodes transition to the "Configured" and "Active" states:

$ ros2 launch nav2_bringup bringup_launch.py map:=${HOME}/nav2_ws/src/navigation2/nav2_bringup/maps/turtlebot3_world.yaml | grep -e Configuring -e Activating

[component_container_isolated-1] [INFO] [1650040180.873389080] [lifecycle_manager_localization]: Configuring map_server
[component_container_isolated-1] [INFO] [1650040180.874164522] [map_server]: Configuring
[component_container_isolated-1] [INFO] [1650040180.919901945] [lifecycle_manager_localization]: Configuring amcl
[component_container_isolated-1] [INFO] [1650040180.920120541] [amcl]: Configuring
[component_container_isolated-1] [INFO] [1650040180.974971983] [lifecycle_manager_localization]: Activating map_server
[component_container_isolated-1] [INFO] [1650040180.975233144] [map_server]: Activating
[component_container_isolated-1] [INFO] [1650040181.081511733] [lifecycle_manager_localization]: Activating amcl
[component_container_isolated-1] [INFO] [1650040181.081762152] [amcl]: Activating
[component_container_isolated-1] [INFO] [1650040181.249707477] [lifecycle_manager_navigation]: Configuring controller_server
[component_container_isolated-1] [INFO] [1650040181.249960165] [controller_server]: Configuring controller interface
[component_container_isolated-1] [INFO] [1650040181.251433253] [local_costmap.local_costmap]: Configuring
[component_container_isolated-1] [INFO] [1650040181.370370000] [lifecycle_manager_navigation]: Configuring smoother_server
[component_container_isolated-1] [INFO] [1650040181.370620758] [smoother_server]: Configuring smoother server
[component_container_isolated-1] [INFO] [1650040181.395300276] [lifecycle_manager_navigation]: Configuring planner_server
[component_container_isolated-1] [INFO] [1650040181.395925354] [planner_server]: Configuring
[component_container_isolated-1] [INFO] [1650040181.395997608] [global_costmap.global_costmap]: Configuring
[component_container_isolated-1] [INFO] [1650040181.443931586] [planner_server]: Configuring plugin GridBased of type NavfnPlanner
[component_container_isolated-1] [INFO] [1650040181.457533277] [lifecycle_manager_navigation]: Configuring recoveries_server
[component_container_isolated-1] [INFO] [1650040181.457913917] [recoveries_server]: Configuring
[component_container_isolated-1] [INFO] [1650040181.475915198] [recoveries_server]: Configuring spin
[component_container_isolated-1] [INFO] [1650040181.487101451] [recoveries_server]: Configuring backup
[component_container_isolated-1] [INFO] [1650040181.495568349] [recoveries_server]: Configuring wait
[component_container_isolated-1] [INFO] [1650040181.506974136] [lifecycle_manager_navigation]: Configuring bt_navigator
[component_container_isolated-1] [INFO] [1650040181.507486913] [bt_navigator]: Configuring
[component_container_isolated-1] [INFO] [1650040181.594729494] [lifecycle_manager_navigation]: Configuring waypoint_follower
[component_container_isolated-1] [INFO] [1650040181.594987080] [waypoint_follower]: Configuring
[component_container_isolated-1] [INFO] [1650040181.610006747] [lifecycle_manager_navigation]: Activating controller_server
[component_container_isolated-1] [INFO] [1650040181.610493051] [controller_server]: Activating
[component_container_isolated-1] [INFO] [1650040181.610563532] [local_costmap.local_costmap]: Activating

Actual behavior

Nothing appears in the console: the Nav2 lifecycle manager never switches the nodes to the configured or active state.

Problem analysis

The nodes do not reach the configured/active states because the LifecycleManager::startup() routine is never executed. This happens because the outer and inner wall timers belong to different callback groups:

$ cat nav2_lifecycle_manager/src/lifecycle_manager.cpp
...
  init_timer_ = this->create_wall_timer(
    0s,
    [this]() -> void {
      init_timer_->cancel();
      createLifecycleServiceClients();
      if (autostart_) {
        init_timer_ = this->create_wall_timer(
          0s,
          [this]() -> void {
            init_timer_->cancel(); <- this part of the code is not executed on my PC
            startup(); <- this part of the code is not executed on my PC
          },
          callback_group_); <- this callback group is spun manually in the separate service_thread_
      }
    }); <- runs in the default_callback_group_ of the rclcpp::Node

It looks like, depending on which thread runs first (the Node's default one or service_thread_), the inner part of the code will or won't be executed. I've attached a simple testcase (dev_ws.zip) containing a snippet similar to the code above so this situation can easily be checked:

$ cat dev_ws/src/timer/src/main.cpp
...
    timer_ = this->create_wall_timer(
      0s,
      [this]() -> void {
        timer_->cancel();
        RCLCPP_INFO(get_logger(), "Hey from outer wall timer");
        timer_ = this->create_wall_timer(
          0s,
          [this]() -> void {
            timer_->cancel();
            RCLCPP_INFO(get_logger(), "Hey from nested wall timer");
          },
          callback_group_);
      });

There are two timers: outer and nested. In the correct case both of them are executed:

[amerzlyakov@amerzlyakov-PC dev_ws]$ ./build/timer/timer_node 
[INFO] [1650042005.459642477] [timer_node]: Hey from outer wall timer
[INFO] [1650042005.460093744] [timer_node]: Hey from nested wall timer

In the failing case only the outer wall timer lambda is executed:

[amerzlyakov@amerzlyakov-PC dev_ws]$ ./build/timer/timer_node 
[INFO] [1650041959.195186584] [timer_node]: Hey from outer wall timer

Possible solutions

Removing the inner lambda callback from callback_group_ automatically adds it to the Node's default callback group, which resolves the situation. However, it looks like the separate callback group was introduced intentionally in 89499ab to cover some cases (@robotechvision, could I ask you to share more details about which situation it resolves?).
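
For reference, a minimal sketch of that variant (not the adopted fix; it just drops the callback_group_ argument from the current code so both timers end up in the node's default callback group):

  init_timer_ = this->create_wall_timer(
    0s,
    [this]() -> void {
      init_timer_->cancel();
      createLifecycleServiceClients();
      if (autostart_) {
        init_timer_ = this->create_wall_timer(
          0s,
          [this]() -> void {
            init_timer_->cancel();
            startup();
          });  // no callback_group_ argument: the inner timer stays in the node's default group
      }
    });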

If we can't move both the outer and inner wall timers into one group, another solution is to add lock synchronization between the LifecycleManager constructor and the init_timer_ callback in order to prevent the initial crash from #2689. Something like this:

diff --git a/nav2_lifecycle_manager/include/nav2_lifecycle_manager/lifecycle_manager.hpp b/nav2_lifecycle_manager/include/nav2_lifecycle_manager/lifecycle_manager.hpp
index e218871c..0f9ec4cd 100644
--- a/nav2_lifecycle_manager/include/nav2_lifecycle_manager/lifecycle_manager.hpp
+++ b/nav2_lifecycle_manager/include/nav2_lifecycle_manager/lifecycle_manager.hpp
@@ -191,6 +191,10 @@ protected:
   // Whether to automatically start up the system
   bool autostart_;
 
+  // Indicates whether LifecycleManager is constructed. Used for synchronization.
+  bool created_{false};
+  std::mutex created_mutex_;
+
   bool system_active_{false};
 };
 
diff --git a/nav2_lifecycle_manager/src/lifecycle_manager.cpp b/nav2_lifecycle_manager/src/lifecycle_manager.cpp
index 8474969f..816fe46f 100644
--- a/nav2_lifecycle_manager/src/lifecycle_manager.cpp
+++ b/nav2_lifecycle_manager/src/lifecycle_manager.cpp
@@ -77,23 +77,24 @@ LifecycleManager::LifecycleManager(const rclcpp::NodeOptions & options)
     std::string("Shutting down ");
 
   init_timer_ = this->create_wall_timer(
-    0s,
+    std::chrono::nanoseconds(10),
     [this]() -> void {
-      init_timer_->cancel();
-      createLifecycleServiceClients();
-      if (autostart_) {
-        init_timer_ = this->create_wall_timer(
-          0s,
-          [this]() -> void {
-            init_timer_->cancel();
-            startup();
-          },
-          callback_group_);
+      std::lock_guard<std::mutex> lock(created_mutex_);
+      if (created_) {
+        init_timer_->cancel();
+        createLifecycleServiceClients();
+        if (autostart_) {
+          startup();
+        }
       }
-    });
+    },
+    callback_group_);
   auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
   executor->add_callback_group(callback_group_, get_node_base_interface());
   service_thread_ = std::make_unique<nav2_util::NodeThread>(executor);
+
+  std::lock_guard<std::mutex> lock(created_mutex_);
+  created_ = true;
 }
 
 LifecycleManager::~LifecycleManager()

However, I am not totally sure that this is the right solution. For example, I could be missing something, or we could play with callback groups in a way that allows a more elegant solution. @robotechvision, @SteveMacenski, do you have any other thoughts on that?

@SteveMacenski (Member)

I have not run into this, but I'm also using Rolling binaries, not source. The binaries stopped getting updates a bit ago (not super long, but it's been a few weeks) since Rolling has moved to supporting 22.04 and releases binaries there for the Humble release in late May.

More context #2689

@AlexeyMerzlyakov (Collaborator, Author)

Yes, this is most probably a problem caused by a newer ROS2 API change. I understood what was resolved in #2689, and I am not sure that having nested timers belong to different callback groups will keep working correctly going forward. That is why this ticket was opened with the lock-synchronization approach proposed: to cover the problem initially raised in #2689 (I've got the same testcase and played with it).

Since it is currently not reproducible with binary-distributed ROS2 Rolling, we do not need to care about it too much. If this problem appears again in the near (or far) future, we can return to it.
I also plan to spend some time in WW18-19 analyzing the ROS2 changes and trying to figure out what caused the basic example from this ticket header to stop working. If I have any update on it, I will share it here.
Until then, you can postpone or close this issue.

@SteveMacenski (Member) commented Apr 26, 2022

Yes, this is most probably a problem caused by a newer ROS2 API change ... we do not need to care about it too much. If this problem appears again in the near (or far) future, we can return to it.

Might be worth spending an hour figuring out what changed in Rolling to cause that issue and finding out whether it's a regression or not. If it is a regression, we need to point it out ASAP to OR as part of the Humble testing to see if it was intentional. The released Rolling version will be updated to what you built from source, so it will be an issue in the near-term future.

https://discourse.ros.org/t/branched-humble-call-for-beta-testing/25284 they're starting testing now; I'd try to see if you can reproduce it in the Humble setup that is intended for release.

@clalancette

@clalancette (Contributor)

I can reproduce the issue, but I'm not that familiar with what is supposed to be happening. That said, do feel free to open a new bug against https://github.com/ros2/rclcpp, and we can triage it during the test phase here to see if that is intentional behavior.

@SteveMacenski (Member)

@AlexeyMerzlyakov can you do this and see if this is intentional behavior? Time is a little short during the testing phase

@afrixs (Contributor) commented May 5, 2022

The lock solution doesn't guarantee that the shared pointer has been created when startup() is called (it could still be called between releasing the lock at the end of the node constructor and the creation of the shared pointer in std::make_shared), so I'm afraid it's not crash-safe.
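
To illustrate why that matters (a self-contained sketch, not the nav2 code; it only demonstrates the general C++ behavior that anything relying on the node's shared pointer, e.g. shared_from_this(), fails while std::make_shared is still running):

#include <iostream>
#include <memory>
#include <thread>

struct Node : std::enable_shared_from_this<Node>
{
  Node()
  {
    // Simulates a callback firing on another thread while the constructor
    // is still running, i.e. before std::make_shared has returned.
    std::thread([this]() {
      try {
        auto self = shared_from_this();  // no shared_ptr owns *this yet
      } catch (const std::bad_weak_ptr &) {
        std::cout << "bad_weak_ptr: the shared_ptr has not been created yet\n";
      }
    }).join();
  }
};

int main()
{
  auto node = std::make_shared<Node>();  // the shared_ptr exists only after the ctor returns
}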

But I've recently seen cases where an executor didn't check for new timers/subscriptions/etc. if they were added from a different thread. That could be the cause of the problem. Fortunately, rclcpp has a solution for that: you just need to notify the executor. @AlexeyMerzlyakov, could you try resetting to the original code, adding these lines after creating the inner timer, and testing if it works?

try {
  get_node_base_interface()->get_notify_guard_condition().trigger();
} catch (const rclcpp::exceptions::RCLError & ex) {
  throw std::runtime_error("Failed to notify service executor");
}

@AlexeyMerzlyakov (Collaborator, Author) commented May 5, 2022

I have a small update on this problem today. Rolling back rclcpp to January's version did not bring any result: the attached testcase still fails. I continued checking what happened between Jan-May 2022 in the ros2_rolling/src/ros2 packages and found out that an update of FastRTPS uncovers the issue. Moreover, if we switch to CycloneDDS, the problem also disappears on the latest ROS2 sources.

It was discovered that the issue appears starting with the ros2/rmw_fastrtps@66926c7 commit, where the new RMW_EVENT_OFFERED_QOS_INCOMPATIBLE/RMW_EVENT_REQUESTED_QOS_INCOMPATIBLE publisher/subscriber event handlers were introduced. Before this commit, all these types of events were marked as Unknown in FastRTPS and not handled by RMW. However, Cyclone DDS already has these event handlers implemented and everything works fine there, which brings to mind that the problem is in FastDDS. I need to look closer at this issue and at the chain between incompatible-QoS events and nested timer execution to better understand whether the problem is in FastDDS or somewhere above it.

The lock solution doesn't guarantee that the shared pointer has been created when startup() is called (it could still be called between releasing the lock at the end of the node constructor and the creation of the shared pointer in std::make_shared), so I'm afraid it's not crash-safe.

Yes, that's right. Thanks for noting this!
To handle it, we could use something like a std::atomic_bool shared_ptr_initialized_; variable for synchronization instead of the locking solution (just in case we don't find a better solution by playing with callback groups and/or it turns out to be a problem in RMW/FastRTPS).

Fortunately, rclcpp has a solution for that: you just need to notify the executor. @AlexeyMerzlyakov, could you try resetting to the original code, adding these lines after creating the inner timer, and testing if it works?

I've checked the above lines of code with the testcase from the issue header. Unfortunately, it does not work. Maybe we need to notify the inner timer's executor belonging to the custom callback_group_ instead of the outer one given by get_node_base_interface()?

@afrixs (Contributor) commented May 5, 2022

Maybe we need to notify the inner timer's executor belonging to the custom callback_group_

I was hoping that the node base interface notifies both callback groups (since it is the node base for both of them), but apparently it does not... A separate node for service_thread_ could be used, but that seems too complicated.
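
For completeness, a rough sketch of what that "separate node" alternative might look like (hypothetical names such as service_node and startup_timer_, not a proposal; the idea is that the timer lives on a node spun by exactly one executor, so the automatic guard-condition trigger reaches the right one):

  // Hypothetical helper node served only by the service executor.
  auto service_node = std::make_shared<rclcpp::Node>("lifecycle_manager_service");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(service_node);  // the only executor associated with service_node
  service_thread_ = std::make_unique<nav2_util::NodeThread>(executor);

  // A timer created on service_node later wakes this executor through
  // service_node's own notify guard condition.
  startup_timer_ = service_node->create_wall_timer(
    0s,
    [this]() -> void {
      startup_timer_->cancel();
      startup();
    });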

To handle it, we could use something like a std::atomic_bool shared_ptr_initialized_; variable

Yes, that could solve it - the variable could be set to true in the "outer" timer (since as far as I know there is no "callback" which is fired when the shared_ptr gets created) and the "inner" timer can be extracted out and its callback can be repeated until shared_ptr_initialized_ is true, i.e. something like this:

  shared_ptr_init_timer_ = this->create_wall_timer(
      0s,
      [this]() -> void {
        shared_ptr_init_timer_->cancel();
        shared_ptr_initialized_.store(true);
      });
  if (autostart_) {
    service_startup_timer_ = this->create_wall_timer(
        10ns,
        [this]() -> void {
          if (shared_ptr_initialized_.load()) {
            service_startup_timer_->cancel();
            startup();
          }
        },
        callback_group_);
  }
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_callback_group(callback_group_, get_node_base_interface());
  service_thread_ = std::make_unique<rtv_lifecycle::NodeThread>(executor);

@afrixs (Contributor) commented May 5, 2022

But now it comes to mind: we could create the service_thread_ inside the outer timer (i.e. at a time when we already know that the shared_ptr has been initialized and the inner timer has been created), and thus we don't need to notify anything (this looks simpler to me):

  init_timer_ = this->create_wall_timer(
    0s,
    [this]() -> void {
      init_timer_->cancel();
      createLifecycleServiceClients();
      if (autostart_) {
        init_timer_ = this->create_wall_timer(
          0s,
          [this]() -> void {
            init_timer_->cancel();
            startup();
          },
          callback_group_);
      }
      auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
      executor->add_callback_group(callback_group_, get_node_base_interface());
      service_thread_ = std::make_unique<rtv_lifecycle::NodeThread>(executor);
    });

@AlexeyMerzlyakov (Collaborator, Author)

But now it comes to mind: we could create the service_thread_ inside the outer timer

I think this should be the best solution: we start the second executor's thread directly from a callback running on the first one. I've checked it on the local testcase and in nav2_lifecycle_manager; in both cases it works fine. I also verified that this solution does not reintroduce the regression from #2689, so that should work.
Thank you for the proposed solution!
@SteveMacenski, are you OK with such a solution in the mainline?

Even though it looks like the fix was found, I still don't fully understand the root cause of the problem. Working on it.

@SteveMacenski (Member) commented May 6, 2022

That seems like a fine solution to me.

Though I would like to notify folks ASAP about this, since Humble's full release is in 2 weeks.

Rolling back rclcpp to January's version did not bring any result: the attached testcase still fails. I continued checking what happened between Jan-May 2022 in the ros2_rolling/src/ros2 packages and found out that an update of FastRTPS uncovers the issue. Moreover, if we switch to CycloneDDS, the problem also disappears on the latest ROS2 sources.

It was discovered that the issue appears starting with the ros2/rmw_fastrtps@66926c7 commit, where the new RMW_EVENT_OFFERED_QOS_INCOMPATIBLE/RMW_EVENT_REQUESTED_QOS_INCOMPATIBLE publisher/subscriber event handlers were introduced. Before this commit, all these types of events were marked as Unknown in FastRTPS and not handled by RMW. However, Cyclone DDS already has these event handlers implemented and everything works fine there, which brings to mind that the problem is in FastDDS. I need to look closer at this issue and at the chain between incompatible-QoS events and nested timer execution to better understand whether the problem is in FastDDS or somewhere above it.

That sounds like a bug/change in behavior to me that we should notify folks about, @clalancette agree?

CC @JaimeMartin @EduPonz

@AlexeyMerzlyakov (Collaborator, Author) commented May 12, 2022

Some info on the issue. The ros2/rmw_fastrtps@66926c7 commit just uncovered the hidden problem; it is not its root cause. This commit contains the handling logic for the RMW_EVENT_OFFERED_QOS_INCOMPATIBLE/RMW_EVENT_REQUESTED_QOS_INCOMPATIBLE events which appear during the testcase run, so there are additional delays at the end of the LifecycleManager (or testcase) constructor when the executor and service thread are being created. Before this commit, all events of the "incompatible QoS" type were considered "Unknown": they were deallocated and never handled. So in that case the constructor "flew" through faster and for some reason managed to register the inner timer with the custom executor.

This can easily be confirmed by adding a sleep_for(1000) milliseconds at the end of the testcase (or LifecycleManager) constructor. With this change the testcase fails even before ros2/rmw_fastrtps@66926c7 (which proves that this commit is not the root cause). Moreover, the problem appears on binary versions of Rolling as well as on Galactic with FastRTPS. With CycloneDDS the testcase passes on all of these ROS2 distributions:

[leha@leha-PC dev_ws]$ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/timer/timer_node 
[INFO] [1652375869.114717023] [timer_node]: Outer timer created
[INFO] [1652375870.116447548] [timer_node]: Hey from outer wall timer
[INFO] [1652375870.116686324] [timer_node]: Inner timer created
^C[INFO] [1652375872.401716100] [rclcpp]: signal_handler(signum=2)
[leha@leha-PC dev_ws]$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/timer/timer_node
[INFO] [1652375876.817119600] [timer_node]: Outer timer created
[INFO] [1652375877.818420836] [timer_node]: Hey from outer wall timer
[INFO] [1652375877.818571282] [timer_node]: Inner timer created
[INFO] [1652375877.818786694] [timer_node]: Hey from nested wall timer
^C[INFO] [1652375880.014118853] [rclcpp]: signal_handler(signum=2)

Attaching the updated testcase with the nav2_util dependency removed and the sleep added at the end of the testcase constructor: dev_ws_v2.zip.
Should I submit a ticket to rmw_fastrtps and ask whether this is normal behavior or not?

@SteveMacenski (Member)

Yes, that would be good!

@EduPonz commented May 13, 2022

If I may chip in: as @AlexeyMerzlyakov pointed out, it can only be that ros2/rmw_fastrtps@66926c7 has surfaced an already existing problem; whether that problem lies in nav, rmw_fastrtps, or elsewhere I do not know. I think it'd be great if you could describe more precisely what is not working according to your expectations, what those expectations are, and probably provide a minimal reproducer to make understanding and debugging easier.

Thanks guys!

@AlexeyMerzlyakov (Collaborator, Author) commented May 13, 2022

@EduPonz, I'll try to provide the information you are requesting:
In the following snippet of code both wall timers are executed when running with CycloneDDS, while only the outer one is executed with FastRTPS:

  TimerNode(std::string timer_node)
  : Node(timer_node)
  {
    callback_group_ = create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive, false);

    timer1_ = this->create_wall_timer(
      0s,
      [this]() -> void {
        timer1_->cancel();
        RCLCPP_INFO(get_logger(), "Hey from outer wall timer");
        timer2_ = this->create_wall_timer(
          0s,
          [this]() -> void {
            timer2_->cancel();
            RCLCPP_INFO(get_logger(), "Hey from nested wall timer");
          },
          callback_group_);
      });

    executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    executor_->add_callback_group(callback_group_, get_node_base_interface());
    service_thread_ = std::make_unique<std::thread>([&]() {executor_->spin();});

    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
  }

Our expectation is to see both timers, outer and nested, being executed, as happens when running with CycloneDDS.

The simple repro case is attached as a zip archive at the end of this message. To execute the testcase, please build the dev_ws workspace and then run:

$ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/timer/timer_node
or
$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/timer/timer_node

@afrixs (Contributor) commented May 13, 2022

@AlexeyMerzlyakov the reason this code does not work is simply that it is not written correctly, i.e. without regard to how executors work. Inside the spin() method the executor repeatedly enters a wait-for-work state, waiting for events from any timers/subscriptions/... previously assigned to it. So if you assign a timer while the executor is already waiting for work, the executor doesn't stop waiting, because it wasn't waiting for new timers to be assigned, just for already existing timers to fire (which, in the case of executor_, are not present, so it technically waits forever).
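
Roughly, in pseudocode (a simplified paraphrase of the spin loop, not the actual rclcpp source):

// Simplified paraphrase of what SingleThreadedExecutor::spin() does:
while (rclcpp::ok()) {
  // 1. Collect the timers/subscriptions/services currently registered in the
  //    callback groups that have been added to this executor.
  // 2. Block in the wait set on exactly that collection (plus the notify
  //    guard condition, when it is associated with this executor).
  // 3. Execute whatever became ready, then start over.
}
// A timer created while the executor is blocked in step 2 is not part of the
// wait set, so nothing wakes the executor unless the guard condition fires.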

The executor can be woken up externally using the guard condition, as I first proposed, but seemingly, as you tested without success, this works only when there is just one executor per node (otherwise maybe only one of the executors gets notified, which might be an rclcpp issue to be raised).

Another way around is to avoid spinning executors before assigning work to them, which is the solution that finally worked.

I think the reason the code worked with CycloneDDS is that it was a bit faster/slower, and the timer1_ callback got called before executor_ added the handles inside the wait_for_work method.

@AlexeyMerzlyakov (Collaborator, Author) commented May 13, 2022

@afrixs, thank you for the explanation. However, I still have some doubts about it. If the executor never launches timers registered after it was created, then a second timer placed anywhere after the executor creation shouldn't work either, e.g. as in the following part of the code:

    callback_group_ = create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive, false);

    timer1_ = this->create_wall_timer(
      0s,
      [this]() -> void {
        timer1_->cancel();
        RCLCPP_INFO(get_logger(), "Hey from outer wall timer");
      });

    executor_ = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    executor_->add_callback_group(callback_group_, get_node_base_interface());
    service_thread_ = std::make_unique<std::thread>([&]() {executor_->spin();});

    std::this_thread::sleep_for(std::chrono::milliseconds(1000));

    timer2_ = this->create_wall_timer(
      0s,
      [this]() -> void {
        timer2_->cancel();
        RCLCPP_INFO(get_logger(), "Hey from nested wall timer");
      },
      callback_group_);

However, this works on FastRTPS: it only fails when timer2_ is placed inside the timer1_ lambda, which is strange to me.

I think the reason the code worked with CycloneDDS is that it was a bit faster/slower, and the timer1_ callback got called before executor_ added the handles inside the wait_for_work method.

For that I've added a 1-second sleep std::this_thread::sleep_for(std::chrono::milliseconds(1000)); at the end of the constructor in the testcase (before the timer1_ callback executes and the new timer2_ is registered), to exclude the possibility of timing coincidences. It still works on CycloneDDS.

@afrixs (Contributor) commented May 13, 2022

@AlexeyMerzlyakov hmm, that's interesting. This led me to more exploration, and I found that the guard condition actually gets triggered automatically inside the create_wall_timer/add_timer method.

So the difference is that in your last testcase (the one with non-nested timer2_) the main node executor does not exist yet (it's created and assigned after the constructor in main()), so the guard condition works because the node has only one executor assigned at that time. That means that the problem could be formulated as "Node having multiple callback groups in multiple executors is unable to notify all of the executors using its guard condition".

@MiguelCompany

@afrixs

Node having multiple callback groups in multiple executors is unable to notify all of the executors using its guard condition

Sounds exactly like the problem ros2/rclcpp#1640 is trying to fix!

@AlexeyMerzlyakov (Collaborator, Author)

@afrixs, may I ask you to submit a PR with the fix agreed upon in this comment? I think that as the author you have the original right to it.

@dcconner commented May 19, 2022

Just noting that I have this same problem in Galactic (binaries) when I switched to rmw_fastrtps_cpp to test another issue

robotechvision pushed a commit to robotechvision/navigation2 that referenced this issue May 20, 2022
SteveMacenski pushed a commit that referenced this issue May 20, 2022
redvinaa pushed a commit to EnjoyRobotics/navigation2 that referenced this issue Jun 30, 2022