Skip to content

Race condition when shutting down contexts (or ctrl+c) #812

@ivanpauno

Description

@ivanpauno

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04 (it should be reproducible in others)
  • Installation type:
    • from source
  • Version or commit hash:
  • DDS implementation:
    • fastrtps (it should happen in the others)
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Modify demo_nodes_cpp talker publish rate:

-    timer_ = this->create_wall_timer(1s, publish_message);
+    timer_ = this->create_wall_timer(10ms, publish_message);

Use the following script:

#!/bin/bash 

set -e
COUNTER=0
while [  $COUNTER -lt 1000 ]; do
    echo "Retrying. Counter: $COUNTER"
    exec -a "TALKER${COUNTER}" <demo_nodes_cpp PREFIX>/lib/demo_nodes_cpp/talker > /dev/null &
    sleep 0.1
    pkill --signal 2 -f "TALKER${COUNTER}"
    let COUNTER=COUNTER+1 
done

Expected behavior

Script should finish without showing any error message.

Actual behavior

I got once this:

Retrying. Counter: 940
Retrying. Counter: 941
Failed to publish log message to rosout: publisher's context is invalid, at /home/ivanpauno/ros2_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:343
Retrying. Counter: 942
Retrying. Counter: 943
Retrying. Counter: 944

The script continued, so the return code was 0.
The script does one thousands try, and I got just once of this errors.

Additional information

rclcpp shuts down context when handling SIGINT:

void
SignalHandler::deferred_signal_handler()
{
while (true) {
if (signal_received_.exchange(false)) {
RCLCPP_DEBUG(get_logger(), "deferred_signal_handler(): SIGINT received, shutting down");
for (auto context_ptr : rclcpp::get_contexts()) {
if (context_ptr->get_init_options().shutdown_on_sigint) {
RCLCPP_DEBUG(
get_logger(),
"deferred_signal_handler(): "
"shutting down rclcpp::Context @ %p, because it had shutdown_on_sigint == true",
context_ptr.get());
context_ptr->shutdown("signal handler");
}
}
}
if (!is_installed()) {
RCLCPP_DEBUG(get_logger(), "deferred_signal_handler(): signal handling uninstalled");
break;
}
RCLCPP_DEBUG(get_logger(), "deferred_signal_handler(): waiting for SIGINT or uninstall");
wait_for_signal();
RCLCPP_DEBUG(get_logger(), "deferred_signal_handler(): woken up due to SIGINT or uninstall");
}
}

There is no synchronization between this shut down and the executors (which are still running).
It may happen that something that is being executed needs a context that has been shut down, so an error is possible.
Depending on how an invalid context is handled in each case, it could be just an error message or something else (e.g.: throwing an exception).

The error is unlikely as most of the time the executor is waiting for work, and after waiting we check if the context is still valid. But without any synchronization between shutdown and the executors, I think that a variety of different errors are possible.

Metadata

Metadata

Assignees

No one assigned

    Labels

    backlogbugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions