Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Install signal handlers in recorder only inside record method #1464

Conversation

MichaelOrlov
Copy link
Contributor

@MichaelOrlov MichaelOrlov commented Sep 9, 2023

Notes for reviewers:

  • It turns out that a possible situation when Recorder instance is instantiated and needs to spin the node for some time before calling Recorder::record(..) method for instance handling some service events before initiating recording. During those times signal handlers shall not be overridden.
  • I removed rclcpp initialization with rclcpp::SignalHandlerOptions::None and shutdown_on_signal options since it doesn't have any effect because we are replacing signals with our own signal handlers even if they we installed before or if context has already been initialized before.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov MichaelOrlov force-pushed the morlov/dont_override_signal_handler_in_recorder_constructor branch from ff476d4 to d950c67 Compare September 9, 2023 02:28
@MichaelOrlov MichaelOrlov marked this pull request as ready for review September 9, 2023 04:20
@MichaelOrlov MichaelOrlov requested a review from a team as a code owner September 9, 2023 04:20
@MichaelOrlov MichaelOrlov requested review from gbiggs, james-rms and emersonknapp and removed request for a team and gbiggs September 9, 2023 04:20
@MichaelOrlov
Copy link
Contributor Author

@tonynajjar Please review.

@tonynajjar
Copy link

I cherrypicked and tested it on iron and it works as intended 👍 I'd appreciate a backport of this to iron when merged 🙏

@tonynajjar
Copy link

I cherrypicked and tested it on iron and it works as intended 👍 I'd appreciate a backport of this to iron when merged 🙏

Correction, now exiting works if we are not recording but if CTRL-C is pressed while recording then the node still doesn't exit

@MichaelOrlov
Copy link
Contributor Author

@tonynajjar Can you please double-check or clarify how are you reproducing failure "if CTRL-C is pressed while recording then the node still doesn't exit"?
We do have integration tests for scenarios when we call rosbag2 record command from one process and then calling SIGINT and SIGTERM in another process and they are passing green. I've double-checked them locally.
For reference, we have record_end_to_end_test_start_paused

TEST_P(RecordFixture, record_end_to_end_test_start_paused) {
auto message = get_messages_strings()[0];
message->string_value = "test";
rosbag2_test_common::PublicationManager pub_manager;
pub_manager.setup_publisher("/test_topic", message, 10);
auto process_handle = start_execution(
get_base_record_command() + " --max-cache-size 0 --start-paused /test_topic");
auto cleanup_process_handle = rcpputils::make_scope_exit(
[process_handle]() {
stop_execution(process_handle);
});
ASSERT_TRUE(pub_manager.wait_for_matched("/test_topic")) <<
"Expected find rosbag subscription";
wait_for_storage_file();
pub_manager.run_publishers();
stop_execution(process_handle);
cleanup_process_handle.cancel();
finalize_metadata_kludge();
wait_for_metadata();
auto test_topic_messages = get_messages_for_topic<test_msgs::msg::Strings>("/test_topic");
EXPECT_THAT(test_topic_messages, IsEmpty());
}

and record_end_to_end_exits_gracefully_on_sigterm
TEST_P(RecordFixture, record_end_to_end_exits_gracefully_on_sigterm) {
const std::string topic_name = "/test_sigterm";
auto message = get_messages_strings()[0];
message->string_value = "test";
rosbag2_test_common::PublicationManager pub_manager;
pub_manager.setup_publisher(topic_name, message, 10);
auto process_handle = start_execution(
get_base_record_command() + " " + topic_name);
ASSERT_TRUE(pub_manager.wait_for_matched(topic_name.c_str())) <<
"Expected find rosbag subscription";
wait_for_storage_file();
pub_manager.run_publishers();
stop_execution(process_handle, SIGTERM);
wait_for_metadata();
}

@MichaelOrlov
Copy link
Contributor Author

@tonynajjar Friendly ping on the question in this pull request ^^^

@tonynajjar
Copy link

tonynajjar commented Sep 27, 2023

Here is an example code purely for reproduction purposes:

from rosbag2_py import Recorder, RecordOptions, StorageOptions
import time

import rclpy
from rclpy.node import Node

class RosBagRecorder(Node):
    def __init__(self):
        super().__init__("rosbag_recorder")
        storage_options = StorageOptions(
            uri=f"/tmp/{time.strftime('%Y-%m-%d-%H-%M-%S')}.mcap",
            storage_id="mcap",
        )
        record_options = RecordOptions()
        record_options.all = True
        recorder = Recorder()
        print("created recorder")
        time.sleep(3)
        recorder.record(storage_options, record_options)
    
def main():
    rclpy.init()
    node = RosBagRecorder()
    rclpy.spin(node)


if __name__ == "__main__":
    main()

I you run this node as part of a launch file and do CTRL-C while it's recording then we get:

[INFO] [launch]: All log files can be found below /data/logs/ros/2023-09-27-14-09-51-029032-logi-XMG-CORE-REN-M20-2078587
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [record-1]: process started with pid [2078591]
[INFO] [1695816591.226398854] [launch_ros_2078587]: process record-1 started with pid [2078591]
[record-1] created recorder
[record-1] stdin is not a terminal device. Keyboard handling disabled.[INFO] [1695816594.413853745] [rosbag2_recorder]: Press SPACE for pausing/resuming
[record-1] [INFO] [1695816594.415423513] [rosbag2_recorder]: Listening for topics...
[record-1] [INFO] [1695816594.415444336] [rosbag2_recorder]: Event publisher thread: Starting
[record-1] [INFO] [1695816594.417384573] [rosbag2_recorder]: Subscribed to topic '/rosout'
[record-1] [INFO] [1695816594.418710217] [rosbag2_recorder]: Subscribed to topic '/parameter_events'
[record-1] [INFO] [1695816594.419183524] [rosbag2_recorder]: Subscribed to topic '/events/write_split'
[record-1] [INFO] [1695816594.419223394] [rosbag2_recorder]: Recording...
^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[record-1] [INFO] [1695816595.955508147] [rosbag2_recorder]: Event publisher thread: Exiting
[record-1] [INFO] [1695816595.955803867] [rosbag2_recorder]: Recording stopped
[record-1] [INFO] [1695816595.957987791] [rosbag2_recorder]: Recording stopped
[ERROR] [record-1]: process[record-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [record-1]: sending signal 'SIGTERM' to process[record-1]

It shows that when CTRL-C (^C), was pressed, the recording indeed stopped but not the node. The launch had to resort to sending SIGTERM after 5 seconds to shutdown the node properly which is not ideal

@MichaelOrlov
Copy link
Contributor Author

@tonynajjar Thanks for providing a reproducer and sorry for the delay.
I will try to reproduce and debug on the earliest of the next week.

@MichaelOrlov
Copy link
Contributor Author

@tonynajjar The problem is not in the rosbag2 recorder but rather in the way how you are trying to use it.
You are trying to create a node for the rosbag2 recorded via inheritance in the rosBagRecorder class and then spin it in the main function via rclpy.spin(node).

But this is a wrong approach.
Because we already creating node and spinning it inside recorder.record(storage_options, record_options)
See

auto recorder = std::make_shared<rosbag2_transport::Recorder>(
std::move(writer), storage_options, record_options, node_name);
recorder->record();
exec->add_node(recorder);
// Run exec->spin() in a separate thread, because we need to call exec->cancel() after
// recorder->stop() to be able to send notifications about bag split and close.
auto spin_thread = std::thread(
[&exec]() {
exec->spin();
});
{
// Release the GIL for long-running record, so that calling Python code
// can use other threads
py::gil_scoped_release release;
std::unique_lock<std::mutex> lock(wait_for_exit_mutex_);
wait_for_exit_cv_.wait(lock, [] {return rosbag2_py::Recorder::exit_.load();});
recorder->stop();
}
exec->cancel();
if (spin_thread.joinable()) {
spin_thread.join();
}
exec->remove_node(recorder);

What happened in your reproducer is the following:

  1. The one node will be created in the rosBagRecorder via inheritance on the upper python layer.
  2. Code will run until recorder.record(storage_options, record_options) since this is a blocking call.
  3. We will create one more node inside rosbag2 recorder and spin it in the aforementioned rosbag2_py wrapper
  4. We will gracefully handle SIGINT inside rosbag2_py layer and exit from the recorder.record(storage_options, record_options)
  5. Then finally the code will reach out to the rclpy.spin(node) line and will hang out there since SIGINT has already been processed and rosbag2 recorder successfully finished its work.

Please note that the composable rosbag2 recorder and player are not supported yet and this feature is still WIP Make Player and Recorder Composable (#902) #1419

@MichaelOrlov
Copy link
Contributor Author

@clalancette May I have approval for this PR?
The author is not responding for a reasonable time and I truly believe that this PR will fix potential issues with signal handlers.
Since now we are overriding signal handlers only during our rosbag2::Recorder::record() blocking call. Before we were starting overriding signal handlers in the constructor of the rosbag2::Recorder() class.
The changes are pretty trivial.

@tonynajjar
Copy link

tonynajjar commented Nov 10, 2023

But this is a wrong approach.

Hello @MichaelOrlov, can you give me a minimal example of a correct approach? Basically I need a node with a service server whose callback starts recording a bag using the Recorder. I need to spin the external node so that the service server works, how else would you do it?

@MichaelOrlov
Copy link
Contributor Author

But this is a wrong approach.

Hello @MichaelOrlov, can you give me a minimal example of a correct approach? Basically I need a node with a service server whose callback starts recording a bag using the Recorder. I need to spin the external node so that the service server works, how else would you do it?

See

  1. Then finally the code will reach out to the rclpy.spin(node) line and will hang out there since SIGINT has already been processed and rosbag2 recorder successfully finished its work.

Just remove those rclpy.spin(node) from your main function.

@tonynajjar
Copy link

But then the service server callback would not be called (not shown in code)

@MichaelOrlov
Copy link
Contributor Author

But then the service server callback would not be called (not shown in code)

@tonynajjar I can't comment on what is not included in the minimum valuable reproducer.
I only can guess what you are trying to achieve.
I am guessing that for your use case, you will need to have a main node with your services and instance of the rosbag2 Recorder and spin this node to handle service's callbacks in a separate thread. Although recorder.record() shall be called after processing service callback. Calling it from the callback directly is not a good idea since as I've said before the recorder.record() is a blocking call and has its own node and executor which is spinning it inside.
You will need to orchestrate your code in such a way that you will trigger some flag from the callback to start recording and wait for those flag in another worker thread which will actually be calling the recorder.record() method. Otherwise if you will call record() from the callback - it will block the executor thread and you will not be able to handle any events during recording. Writing this on Python with multi-threading support and synchronization primitives is going to be a non-trivial task.
I would recommend using C++.

With the provided minimum valuable reproducer removing rclpy.spin(node) from the main function will make it work properly without hngout.

@MichaelOrlov
Copy link
Contributor Author

@clalancette @fujitatomoya I would appreciate a review and approval of this PR.
IMO this PR does what is described in the description and addressing the problem described in the #1458 in the context of the provided reproduction steps.

@MichaelOrlov
Copy link
Contributor Author

@clalancette @fujitatomoya Friendly ping for review.

@fujitatomoya
Copy link
Contributor

@tonynajjar just a comment. rosbag2_py::Recorder is just a python wrapper class to be called by python.(i am not sure original intention for rosbag2_py is for rclpy or just a python...) all implementation is rclcpp underneath, not rclpy. that means i do not think we can manage these python classes with rclpy interfaces... i think current rosbag2_py is not rclpy friendly at all, but for doing that we need surgery to add more rclpy based interfaces under rosbag2_py.

IMO, i agree that it makes sense to manage rosbag2_py::Recorder as inherited from rclpy::Node, so that we can add that Recorder in our own executor in rclpy. i am also guessing that is what you wanted to do originally?

Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as described in #1464 (comment), there would be more user-friendly enhancement for rosbag2_py::Recorder, but that can be categorized as an another topic. lgtm for this fix.

@MichaelOrlov
Copy link
Contributor Author

Gist: https://gist.githubusercontent.com/MichaelOrlov/c860b7642218ea1655d1b603c91a9ff7/raw/42757621600bdaf3d7e3c479b7c313022320fc3a/ros2.repos
BUILD args: --packages-above-and-dependencies rosbag2_py
TEST args: --packages-above rosbag2_py
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13090

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Contributor Author

@clalancette @emersonknapp Could you please formally approve this PR to be able to merge it?

@MichaelOrlov MichaelOrlov merged commit 195e406 into rolling Jan 4, 2024
14 checks passed
@delete-merged-branch delete-merged-branch bot deleted the morlov/dont_override_signal_handler_in_recorder_constructor branch January 4, 2024 07:40
@MichaelOrlov
Copy link
Contributor Author

https://github.com/Mergifyio backport iron humble

Copy link

mergify bot commented Jan 4, 2024

backport iron humble

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jan 4, 2024
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
(cherry picked from commit 195e406)

# Conflicts:
#	rosbag2_py/src/rosbag2_py/_transport.cpp
mergify bot pushed a commit that referenced this pull request Jan 4, 2024
Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
(cherry picked from commit 195e406)
MichaelOrlov added a commit that referenced this pull request Jan 4, 2024
…#1525)

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
(cherry picked from commit 195e406)

Co-authored-by: Michael Orlov <michael.orlov@apex.ai>
MichaelOrlov added a commit that referenced this pull request Jan 9, 2024
…d (backport #1464) (#1526)

* Install signal handlers in recorder only inside record method (#1464)

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
(cherry picked from commit 195e406)

# Conflicts:
#	rosbag2_py/src/rosbag2_py/_transport.cpp

* Address merge conflicts

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>

---------

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
Co-authored-by: Michael Orlov <michael.orlov@apex.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

python node with Recorder does not exit with CTRL-C
4 participants