Possible race condition in controller manager on humble #979

Open
ejalaa12 opened this issue Mar 30, 2023 · 12 comments

@ejalaa12
Contributor

Describe the bug
There is a race condition on the call to rclcpp_lifecycle::State::get_state() in the controller manager.

The get_state() method is not thread-safe, as explained in this issue. This was fixed on rolling, but not on humble.
So any call to the get_state() method can throw a std::runtime_error with the following message:

Error in state! Internal state_handle is NULL.

To Reproduce
It is hard to reproduce, because it happens randomly.
One of the controllers that we implemented used the same get_state() method, which triggered the error and crashed the controller_manager.

But basically any call to get_state() can lead to that issue.
There are three "paths" of calls that can lead to this method being called:

  • any service call, but since service calls are mutually exclusive thanks to the service mutex, we can consider them as one path
  • diagnostic updater
  • update method

[diagram: bug-ros2-control-thread, showing the three call paths that lead to get_state()]

We have crafted a node that is a copy of the ros2_control_node, where we purposely call the list_controllers_service_cb in a separate thread, many times to trigger the crash.

Here is a git diff of the changes that you can apply to this repo to reproduce the crash.
reproductible_bug.diff.txt

Note that I had to add the .txt extension for GitHub to accept uploading the file.
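
For readers who cannot apply the diff, here is a minimal, hypothetical sketch (not the attached diff) of the pattern it exercises: one thread repeatedly reads the lifecycle state while another triggers transitions. On humble this can eventually throw the error above, although it may take many iterations to trigger.

    #include <atomic>
    #include <thread>

    #include "rclcpp/rclcpp.hpp"
    #include "rclcpp_lifecycle/lifecycle_node.hpp"

    int main(int argc, char ** argv)
    {
      rclcpp::init(argc, argv);
      auto node = std::make_shared<rclcpp_lifecycle::LifecycleNode>("race_demo");

      std::atomic<bool> done{false};

      // Reader thread: hammers the state accessor, mimicking the diagnostic
      // updater and service callbacks in the controller_manager.
      std::thread reader([&]() {
          while (!done) {
            // On humble this read can race with concurrent state access and
            // throw "Error in state! Internal state_handle is NULL."
            (void)node->get_current_state().id();
          }
        });

      // Writer side: lifecycle transitions rebuild the internal state handle,
      // mimicking controllers being configured/cleaned up while the reader runs.
      for (int i = 0; i < 100000; ++i) {
        node->configure();
        node->cleanup();
      }

      done = true;
      reader.join();
      rclcpp::shutdown();
      return 0;
    }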

Expected behavior
No crash. We might need to protect the call to get_state() with yet another mutex, but the best solution would be for rclcpp_lifecycle to backport their fix to humble.
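
For illustration, a minimal sketch of the "yet another mutex" idea, assuming a hypothetical helper inside controller_manager (not existing API): every code path that needs a controller's lifecycle state (update loop, service callbacks, diagnostic updater) would go through the same lock instead of calling get_state() directly.

    #include <cstdint>
    #include <mutex>

    // Hypothetical helper: serialize all lifecycle-state reads behind one mutex
    // so the unsynchronized internals of rclcpp_lifecycle::State are never
    // accessed concurrently from the controller_manager's threads.
    class GuardedStateAccess
    {
    public:
      template <typename ControllerT>
      uint8_t state_id(ControllerT & controller)
      {
        std::lock_guard<std::mutex> lock(mutex_);
        // This is the call that can otherwise throw
        // "Error in state! Internal state_handle is NULL."
        return controller.get_state().id();
      }

    private:
      std::mutex mutex_;
    };

Note that this only helps if every call path in the process shares the same lock, which is why a backported fix in rclcpp_lifecycle itself would be preferable.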

Environment (please complete the following information):

  • OS: ubuntu 22.04
  • Version humble
@ejalaa12 ejalaa12 added the bug label Mar 30, 2023
@ejalaa12 ejalaa12 changed the title Possible race condition in controller manager in humble Possible race condition in controller manager on humble Mar 30, 2023
@DatSpace

DatSpace commented Mar 30, 2023

I can verify I am facing the same issue on Humble (Ubuntu 22.04) with the latest packages of this month.

No custom controller, just using joint_trajectory_controller/JointTrajectoryController!

Below is the stack trace after it crashes. It happens very often (almost 50% of the time), but randomly.

[ros2_control_node-7] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-7]   what():  Error in state! Internal state_handle is NULL.
[ros2_control_node-7] Stack trace (most recent call last) in thread 167431:
[ros2_control_node-7] #16   Object "", at 0xffffffffffffffff, in 
[ros2_control_node-7] #15   Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 100, in __clone [0x7f6b1284dbb3]
[ros2_control_node-7] #14   Source "./nptl/pthread_create.c", line 442, in start_thread [0x7f6b127bcb42]
[ros2_control_node-7] #13   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a4c2b2, in 
[ros2_control_node-7] #12   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c89909, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long)
[ros2_control_node-7] #11   Object "/opt/ros/humble/lib/librclcpp.so", at 0x7f6b12c82650, in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)
[ros2_control_node-7] #10   Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e56c94, in 
[ros2_control_node-7] #9    Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e0f22c, in diagnostic_updater::Updater::update()
[ros2_control_node-7] #8    Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f6b12e23748, in controller_manager::ControllerManager::controller_activity_diagnostic_callback(diagnostic_updater::DiagnosticStatusWrapper&)
[ros2_control_node-7] #7    Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f6b1257f981, in 
[ros2_control_node-7] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e517, in __cxa_throw
[ros2_control_node-7] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e2b6, in std::terminate()
[ros2_control_node-7] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a1e24b, in 
[ros2_control_node-7] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f6b12a12bbd, in 
[ros2_control_node-7] #2    Source "./stdlib/abort.c", line 79, in abort [0x7f6b127507f2]
[ros2_control_node-7] #1    Source "../sysdeps/posix/raise.c", line 26, in raise [0x7f6b1276a475]
[ros2_control_node-7] #0  | Source "./nptl/pthread_kill.c", line 89, in __pthread_kill_internal
[ros2_control_node-7]     | Source "./nptl/pthread_kill.c", line 78, in __pthread_kill_implementation
[ros2_control_node-7]       Source "./nptl/pthread_kill.c", line 44, in __pthread_kill [0x7f6b127bea7c]
[ros2_control_node-7] Aborted (Signal sent by tkill() 167250 1000)
[ERROR] [ros2_control_node-7]: process has died [pid 167250, exit code -6, cmd '/opt/ros/humble/lib/controller_manager/ros2_control_node --ros-ar...

@roncapat

Unfortunately I asked in ros2/rclcpp#1756 and there won't be a backport; it seems it is an ABI-breaking fix.

@ejalaa12
Contributor Author

@DatSpace how did you reproduce this?
@roncapat I didn't notice that you had asked for the backport, that's a shame. It means users need to protect access to this method themselves, which is neither ideal nor safe.

@pepeRossRobotics

I just encountered the same issue when soak testing my robot (running the robot non-stop for days).
This is the output of the terminal:

[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1]   what():  Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 44154:
[ros2_control_node-1] #14   Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[ros2_control_node-1] #13   Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f96385d1b, in 
[ros2_control_node-1] #12   Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631d5c7, in 
[ros2_control_node-1] #11   Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965531fb, in 
[ros2_control_node-1] #10   Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x557944384b, in 
[ros2_control_node-1] #9    Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7f9695e3af, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #8    Object "/opt/ros/humble/lib/libdiff_drive_controller.so", at 0x7f904cb64f, in diff_drive_controller::DiffDriveController::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7    Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7f95ffc5bb, in rclcpp_lifecycle::State::id() const
[ros2_control_node-1] #6    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522d23, in __cxa_throw
[ros2_control_node-1] #5    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f96522a3f, in std::terminate()
[ros2_control_node-1] #4    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965229db, in 
[ros2_control_node-1] #3    Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30", at 0x7f965251fb, in __gnu_cxx::__verbose_terminate_handler()
[ros2_control_node-1] #2    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962c712f, in abort
[ros2_control_node-1] #1    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f962da67b, in raise
[ros2_control_node-1] #0    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0x7f9631f200, in 
[ros2_control_node-1] Aborted (Signal sent by tkill() 43824 0)

Is there any way to fix this? Using Humble on Ubuntu 22.04.

@pepeRossRobotics

Quick follow-up question. In my case, respawning the ros2_control_node when it crashes would fix the issue, but because I am using the controller_manager spawner to start the joint_state_broadcaster and the base_controller of my robot, and the spawners exit after loading their controller, respawning the ros2_control_node alone does not fully recover the system. Is there a way to write the launch file in Python so that if the ros2_control_node dies it relaunches along with the two controller_manager spawners?
My current relevant section of my launch files is the following:

    control_node = Node(
        package="controller_manager",
        executable="ros2_control_node",
        parameters=[robot_description, robot_controllers],
        remappings=[("/diagnostics", DIAGNOSTICS_CONTROL_TOPIC)],
        output={
            "stdout": "screen",
            "stderr": "screen",
        },
        respawn=True,
        respawn_delay=2
    )

    joint_state_broadcaster_spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["joint_state_broadcaster",
                   "-c",
                   "/controller_manager",
                   "--controller-manager-timeout",
                   "10"]
    )

    robot_controller_spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["base_controller",
                   "-c",
                   "/controller_manager",
                   "--controller-manager-timeout",
                   "10"]
    )

@ejalaa12
Contributor Author

@pepeRossRobotics The only way we found to prevent the crash from happening is forking ros2_control and surrounding the places where the problematic method is called with a while/try/catch, but that is not a good way to do it.
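
For reference, a rough sketch of that while/try/catch pattern, assuming a hypothetical wrapper in the fork; it only papers over the race by retrying the read when the known runtime_error surfaces:

    #include <cstdint>
    #include <stdexcept>

    // Hypothetical wrapper used in place of direct controller->get_state().id()
    // calls: retry until the read succeeds instead of letting the exception
    // escape and abort the controller_manager. This hides the race, it does
    // not fix it.
    template <typename ControllerT>
    uint8_t get_state_id_with_retry(ControllerT & controller)
    {
      while (true) {
        try {
          return controller.get_state().id();
        } catch (const std::runtime_error &) {
          // "Error in state! Internal state_handle is NULL." -> try again.
        }
      }
    }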

Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.

@pepeRossRobotics

Many thanks for the help. I assumed something like this was possible but didn't know how to implement it. I will give it a go now.

@destogl
Member

destogl commented Apr 20, 2023

Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.

There are some of those in our example code. Maybe you will find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186

@pepeRossRobotics

Concerning the launch file, I believe you can use event handlers to start the "spawner" when the controller_manager process starts. I haven't tried it myself, but I think it should work.

There are some of those in our example code. Maybe you will find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186

this is really helpful, many thanks!

@schornakj
Contributor

Just wanted to add another stack trace to the pile. This is also on Humble on Ubuntu 22.04 LTS.

[ros2_control_node-1] terminate called after throwing an instance of 'std::runtime_error'
[ros2_control_node-1]   what():  Error in state! Internal state_handle is NULL.
[ros2_control_node-1] Stack trace (most recent call last) in thread 178:
[ros2_control_node-1] #13   Object "", at 0xffffffffffffffff, in 
[ros2_control_node-1] #12   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead00ce9ff, in 
[ros2_control_node-1] #11   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003cb42, in 
[ros2_control_node-1] #10   Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead02cc2b2, in 
[ros2_control_node-1] #9    Object "/opt/ros/humble/lib/controller_manager/ros2_control_node", at 0x564dbe0793db, in 
[ros2_control_node-1] #8    Object "/opt/ros/humble/lib/libcontroller_manager.so", at 0x7fead069f61c, in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&)
[ros2_control_node-1] #7    Object "/opt/ros/humble/lib/librclcpp_lifecycle.so", at 0x7feacfe3d981, in 
[ros2_control_node-1] #6    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e517, in __cxa_throw
[ros2_control_node-1] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e2b6, in std::terminate()
[ros2_control_node-1] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead029e24b, in 
[ros2_control_node-1] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fead0292bbd, in 
[ros2_control_node-1] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffd07f2, in abort
[ros2_control_node-1] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7feacffea475, in raise
[ros2_control_node-1] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fead003ea7c, in pthread_kill
[ros2_control_node-1] Aborted (Signal sent by tkill() 104 0)

@schornakj
Contributor

Two updates on this:

I think in the near term #922 should be reverted. While diagnostic info is nice to have, it's not acceptable for the robot control node to crash like this.

@ejalaa12
Contributor Author

@schornakj Thanks for the work and the PR to rclcpp.
Yes, we also noticed that the issue appeared when the diagnostics were added, but we weren't sure that it was the only source of errors.
As you can see in the diagram I posted in the first message, there are three call hierarchies that lead to calling get_state():

  • service callbacks (which are mutually exclusive)
  • update method
  • diagnostic timer

We guessed that there is some mechanism that prevents the get_state() method from being called from the update method and the service callbacks at the same time, but we didn't look too much into it.
But, yeah, once you add the diagnostic timer, there is definitely a race with the two other threads.
