fix: prevent SIGSEGV on shutdown in plugins and demo nodes#361
Merged
Conversation
…on Humble Add shutdown_requested_ atomic flag, explicit destructors calling shutdown(), and early-bail guards in callbacks for ParameterBeaconPlugin, TopicBeaconPlugin, GraphProviderPlugin, and SovdServiceInterface. On Humble, timer/subscription callbacks can fire after cancel() returns because cancel is non-blocking and does not drain the executor ready list. Without these guards, callbacks dereference partially-destroyed plugin state during gateway shutdown, causing intermittent SIGSEGV (-11). Closes #359
Move initial_check_timer self-cancel inside callback_mutex to prevent race with destructor resetting the timer. Cancel old scan_timer before recreation in parameter change callback. The initial_check_timer callback previously called check_and_report_faults() (which acquires/releases callback_mutex_) then accessed initial_check_timer_ without the mutex. The destructor could reset the timer between these two operations, causing a null pointer dereference.
Cancel old timer before recreation in engine_temp_sensor parameter callback. Cancel discovery_retry_timer_ in RosbagCapture::stop() for consistency with post_fault_timer_. Add destructor to LongCalibrationAction calling prepare_shutdown(). Add defensive executor->cancel() in test TearDown.
Contributor
There was a problem hiding this comment.
Pull request overview
Hardens shutdown/teardown behavior across gateway plugins and demo nodes to prevent timing-dependent use-after-free crashes (SIGSEGV), especially during CI/integration-test teardown on Humble where cancel/reset does not necessarily drain queued callbacks.
Changes:
- Add
shutdown_requested_guards + explicit destructors/shutdown idempotency in multiple plugins to prevent late callbacks from touching partially torn-down state. - Fix demo node timer/shutdown races by adding mutex-guarded checks and explicit timer cancellation before recreation.
- Add additional teardown safety in a gateway SSE handler unit test and in fault-manager rosbag capture timer cleanup.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.hpp | Adds atomic shutdown flag and destructor declaration for service-interface plugin. |
| src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.cpp | Makes shutdown idempotent and guards service callbacks during shutdown. |
| src/ros2_medkit_plugins/ros2_medkit_graph_provider/include/ros2_medkit_graph_provider/graph_provider_plugin.hpp | Adds shutdown flag and explicit destructor/shutdown override. |
| src/ros2_medkit_plugins/ros2_medkit_graph_provider/src/graph_provider_plugin.cpp | Resets diagnostics subscription on shutdown and guards diagnostics callback. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp | Adds shutdown flag and destructor declaration. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/src/topic_beacon_plugin.cpp | Makes shutdown idempotent and guards beacon callback during shutdown. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/include/ros2_medkit_param_beacon/param_beacon_plugin.hpp | Adds explicit destructor override. |
| src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp | Makes shutdown safer (thread join + guarded polling) and clears client caches under lock. |
| src/ros2_medkit_integration_tests/demo_nodes/lidar_sensor.cpp | Fixes initial timer self-cancel race and cancels scan timer before recreating it. |
| src/ros2_medkit_integration_tests/demo_nodes/engine_temp_sensor.cpp | Cancels publish timer before recreating it after parameter updates. |
| src/ros2_medkit_integration_tests/demo_nodes/long_calibration_action.cpp | Ensures shutdown preparation runs from destructor as well. |
| src/ros2_medkit_gateway/test/test_sse_fault_handler.cpp | Cancels executor in TearDown to avoid teardown-time races. |
| src/ros2_medkit_fault_manager/src/rosbag_capture.cpp | Cancels/resets discovery retry timer during stop() to avoid late callbacks during teardown. |
Comments suppressed due to low confidence (1)
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp:147
shutdown()uses an atomic + mutex, but the initial guard is a plainload(). Usingshutdown_requested_.exchange(true)(like the other plugins in this PR) would make the shutdown path strictly single-entry and avoid duplicated notify/join work if shutdown is ever called concurrently/re-entrantly.
void ParameterBeaconPlugin::shutdown() {
if (shutdown_requested_.load()) {
return;
}
{
std::lock_guard<std::mutex> lock(shutdown_mutex_);
shutdown_requested_ = true;
}
src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.hpp
Outdated
Show resolved
Hide resolved
...ry_plugins/ros2_medkit_topic_beacon/include/ros2_medkit_topic_beacon/topic_beacon_plugin.hpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.cpp
Show resolved
Hide resolved
src/ros2_medkit_discovery_plugins/ros2_medkit_topic_beacon/src/topic_beacon_plugin.cpp
Show resolved
Hide resolved
src/ros2_medkit_plugins/ros2_medkit_graph_provider/src/graph_provider_plugin.cpp
Show resolved
Hide resolved
src/ros2_medkit_discovery_plugins/ros2_medkit_param_beacon/src/param_beacon_plugin.cpp
Show resolved
Hide resolved
Include integration tests in sanitizer CI jobs to catch use-after-free and data races in the full gateway + HTTP + subscription path. This is a diagnostic change to investigate the Rolling SIGSEGV in test_data_read. Changes: - ASAN job: ctest filter from -LE 'linter|integration' to -LE 'linter' - TSAN job: same filter change - Both: timeout increased to 30min, integration test timeout 120s -> 360s
…pshotCapture Add node_ops_mutex_ to serialize create_callback_group() and create_generic_subscription() calls across concurrent capture threads. TSAN found a data race: when multiple faults are reported simultaneously, each spawns a capture thread that calls capture_topic_on_demand(), which calls node_->create_generic_subscription(). rclcpp node internals (rcutils_hash_map) are not thread-safe for concurrent entity creation, causing concurrent read/write on the subscription tracking hash map. This race can corrupt internal rclcpp state and cause SIGSEGV on distros with different DDS timing (observed on Rolling CI).
Enable bugprone-dangling-handle, bugprone-exception-escape, bugprone-use-after-move, and cppcoreguidelines-special-member-functions in .clang-tidy config. These checks catch object lifecycle and thread safety issues at compile time. Fix all 8 special-member-functions warnings by adding explicit deleted copy/move operators to classes with non-default destructors (GatewayNode, FaultManagerNode, PluginManager, TriggerManager, UpdateManager) and defaulted destructors to classes with explicit copy/move (FutureAndRequestId, RouteDescriptions, ScriptManager).
Add override specifier to TopicBeaconPlugin and SovdServiceInterface destructors for consistency with ParameterBeaconPlugin and GraphProviderPlugin, and to catch base class changes at compile time.
Verify that plugin callbacks are no-ops after shutdown(): - TopicBeaconPlugin: message after shutdown does not mutate store - ParamBeaconPlugin: poll cycle after shutdown does not grow store - GraphProviderPlugin: diagnostics callback after shutdown is safe - SovdServiceInterface: services unavailable after shutdown
…ationale comments Replace load()+store() TOCTOU pattern with exchange(true) for consistency with TopicBeaconPlugin, GraphProviderPlugin, and SovdServiceInterface. Restore rationale comments on deleted copy/move in PluginManager (owns dlopen handles), TriggerManager (owns trigger state), UpdateManager (owns async tasks) that were removed during clang-tidy fixes.
mfaferek93
approved these changes
Apr 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
shutdown_requested_atomic flag, explicit destructors, and callback guards to 4 gateway plugins (ParameterBeacon, TopicBeacon, GraphProvider, SovdServiceInterface)initial_check_timer_self-cancel race and timer recreation race inlidar_sensordemo nodecreate_generic_subscription()inSnapshotCapture(TSAN-confirmed data race)bugprone-*andcppcoreguidelines-special-member-functionsclang-tidy checksIssue
Type
Testing
Root causes found:
Shutdown race (issue [BUG] Flaky test_beacon_param: gateway SIGSEGV on shutdown #359): On Humble,
timer->cancel()is non-blocking and does not drain the executor ready list. Callbacks already queued can fire after cancel returns, dereferencing partially-destroyed plugin state during gateway shutdown.SnapshotCapture data race (TSAN): Concurrent
handle_report_faultcalls spawn capture threads that both callnode_->create_generic_subscription()without synchronization.rcutils_hash_map(rclcpp internal) is not thread-safe for concurrent entity creation - corrupts node state, can cause SIGSEGV.Scope: Systematic codebase audit found shutdown race patterns in 4 plugins and 2 demo nodes beyond the 2 files reported in the issue. All gateway managers already followed the correct pattern. Enabling TSAN for integration tests revealed the SnapshotCapture race that unit tests never exercised.
Checklist