fix: three-layer protection against unbounded snapshot growth #324
Conversation
Pull request overview
This PR aims to prevent unbounded fault snapshot growth by adding rate/volume limiting at multiple layers (fault reporter filtering, storage caps, and fault-manager recapture throttling) to address issue #308.
Changes:
- Add PASSED-event tracking/filtering logic to `LocalFilter` and corresponding unit tests.
- Add a per-fault snapshot cap to `FaultStorage` backends (SQLite + in-memory) and unit tests for SQLite.
- Add a snapshot recapture cooldown to `FaultManagerNode` and new ROS parameters (`snapshot.max_per_fault`, `snapshot.recapture_cooldown_sec`).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/ros2_medkit_fault_reporter/test/test_local_filter.cpp | Adds PASSED filtering unit tests for LocalFilter. |
| src/ros2_medkit_fault_reporter/src/local_filter.cpp | Implements PASSED filtering state + tracker reset/clear changes. |
| src/ros2_medkit_fault_reporter/include/ros2_medkit_fault_reporter/local_filter.hpp | Exposes should_forward_passed() and adds a PASSED tracker map. |
| src/ros2_medkit_fault_manager/test/test_sqlite_storage.cpp | Adds tests validating snapshot cap behavior for SQLite storage. |
| src/ros2_medkit_fault_manager/src/sqlite_fault_storage.cpp | Implements reject-new snapshot limiting for SQLite backend. |
| src/ros2_medkit_fault_manager/src/fault_storage.cpp | Implements reject-new snapshot limiting for in-memory backend. |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Adds snapshot-related params, applies snapshot cap, adds recapture cooldown logic. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/sqlite_fault_storage.hpp | Adds snapshot cap setter + member state. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_storage.hpp | Extends FaultStorage interface with snapshot cap setter; adds state to in-memory backend. |
| src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/fault_manager_node.hpp | Adds cooldown config/state (mutex + map) for per-fault capture throttling. |
```cpp
/// Check if a PASSED event should be forwarded
bool should_forward_passed(const std::string & fault_code);
```
LocalFilter::should_forward_passed() is added and tested, but FaultReporter::report_passed() still bypasses local filtering and always forwards PASSED events (see src/ros2_medkit_fault_reporter/src/fault_reporter.cpp). As-is, the new PASSED filtering will never be applied in production; update the PASSED reporting path to call the new method (or reuse should_forward(...)) before sending the request.
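A minimal sketch of the debounce the reviewer wants `report_passed()` to consult before sending the request. This uses a count-only threshold; the actual `LocalFilter` also applies a time window, and the member names here are illustrative, not the real implementation:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sketch: PASSED events go through the same per-fault debounce
// state as FAILED events instead of being forwarded unconditionally.
class LocalFilter
{
public:
  explicit LocalFilter(std::size_t threshold)
  : threshold_(threshold) {}

  // Forward a PASSED event only once `threshold_` observations accumulate,
  // then reset the counter so the next burst is debounced again.
  bool should_forward_passed(const std::string & fault_code)
  {
    std::size_t & count = passed_counts_[fault_code];
    if (++count < threshold_) {
      return false;  // still debouncing
    }
    count = 0;
    return true;
  }

private:
  std::size_t threshold_;
  std::unordered_map<std::string, std::size_t> passed_counts_;
};
```

With this in place, `report_passed()` would call the filter and return early when it answers `false`, mirroring the existing FAILED path.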
```cpp
void InMemoryFaultStorage::store_snapshot(const SnapshotData & snapshot)
{
  std::lock_guard<std::mutex> lock(mutex_);
  if (max_snapshots_per_fault_ > 0) {
    size_t count = 0;
    for (const auto & s : snapshots_) {
      if (s.fault_code == snapshot.fault_code) {
        ++count;
      }
    }
    if (count >= max_snapshots_per_fault_) {
      return;  // Reject new - keep earliest snapshots
    }
  }
  snapshots_.push_back(snapshot);
}
```
InMemoryFaultStorage::store_snapshot() now scans the entire snapshots_ vector to count per-fault snapshots on every insert. This is O(N) per snapshot and can become a bottleneck even with a small per-fault limit if many faults are present. Consider tracking per-fault snapshot counts in a map (and decrementing on clear/delete) to keep this check O(1).
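A sketch of the reviewer's O(1) alternative: keep a per-fault counter beside the snapshot vector so inserts no longer scan it, and keep the counter in sync on removal. `SnapshotData` and the class shape are simplified stand-ins for the actual storage, not its real interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct SnapshotData
{
  std::string fault_code;
};

class InMemoryFaultStorage
{
public:
  void set_max_snapshots_per_fault(std::size_t max_count) {max_snapshots_per_fault_ = max_count;}

  // O(1) cap check: the per-fault count is looked up, not recomputed.
  bool store_snapshot(const SnapshotData & snapshot)
  {
    std::size_t & count = counts_[snapshot.fault_code];
    if (max_snapshots_per_fault_ > 0 && count >= max_snapshots_per_fault_) {
      return false;  // reject-new: earliest snapshots are kept
    }
    ++count;
    snapshots_.push_back(snapshot);
    return true;
  }

  // Counters must be erased (or decremented) whenever snapshots are removed,
  // otherwise the cap would keep rejecting after a clear.
  void clear_fault(const std::string & fault_code)
  {
    snapshots_.erase(
      std::remove_if(
        snapshots_.begin(), snapshots_.end(),
        [&](const SnapshotData & s) {return s.fault_code == fault_code;}),
      snapshots_.end());
    counts_.erase(fault_code);
  }

  std::size_t size() const {return snapshots_.size();}

private:
  std::size_t max_snapshots_per_fault_ = 0;
  std::vector<SnapshotData> snapshots_;
  std::unordered_map<std::string, std::size_t> counts_;
};
```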
Acceptable with max_per_fault limit (default 10). Total snapshots bounded by num_faults * 10.
```cpp
/// Set maximum snapshots per fault code (0 = unlimited)
virtual void set_max_snapshots_per_fault(size_t max_count) = 0;
```
FaultStorage is a public abstract interface, and adding a new pure-virtual method (set_max_snapshots_per_fault) is a source/ABI breaking change for any downstream storage implementations. If backwards compatibility matters, consider providing a default no-op implementation in the base class (non-pure virtual), or bump the relevant version / clearly mark this as a breaking change.
Fixed - changed to virtual with default no-op to avoid breaking downstream.
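The compatibility pattern in question, sketched in isolation (the interface here is reduced to the one method; the real `FaultStorage` has more):

```cpp
#include <cstddef>

// A plain virtual with a no-op default body instead of a pure virtual, so
// storage backends written against the old FaultStorage interface still
// compile without overriding the new setter.
class FaultStorage
{
public:
  virtual ~FaultStorage() = default;

  /// Set maximum snapshots per fault code (0 = unlimited).
  /// Default no-op: backends that ignore the cap keep their old behavior.
  virtual void set_max_snapshots_per_fault(std::size_t /*max_count*/) {}
};

// A pre-existing downstream backend: no override needed, still concrete.
class LegacyStorage : public FaultStorage {};

// An updated backend opts in by overriding the setter.
class CappedStorage : public FaultStorage
{
public:
  void set_max_snapshots_per_fault(std::size_t max_count) override {cap_ = max_count;}
  std::size_t cap() const {return cap_;}

private:
  std::size_t cap_ = 0;
};
```

Note the trade-off: a default no-op preserves source compatibility, but a backend that silently ignores the cap reintroduces unbounded growth for that backend, which is why callers should not assume the limit is enforced everywhere.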
Fixes unbounded snapshot accumulation when faults cycle between CONFIRMED and CLEARED states (issue #308). Root cause: the diagnostic bridge forwarded every PASSED event without filtering, causing rapid fault status cycling and a snapshot capture on each re-confirmation.

Three-layer defense:

1. FaultReporter PASSED debounce: `should_forward_passed()` in `LocalFilter` applies the same threshold/window filtering as FAILED. Previously `report_passed()` bypassed all local filtering.
2. Snapshot storage limit: `set_max_snapshots_per_fault()` on both SQLite and InMemory storage (default 10, reject-new strategy). Keeps the earliest snapshots, which are the most valuable for diagnostics.
3. Snapshot recapture cooldown: `snapshots.recapture_cooldown_sec` parameter (default 60s) skips capture if the same fault_code was captured within the cooldown window. Entries are evicted on fault clear, including auto-cleared correlation symptoms.

Also updates the diagnostic_bridge integration test to send enough OK diagnostics to meet the PASSED filtering threshold.

Closes #308
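Layer 3 can be sketched as a small throttle keyed by fault code; the class and method names below are illustrative, not the `FaultManagerNode` internals, and the clock is injected to keep the sketch testable:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical recapture cooldown: a capture is skipped when the same
// fault_code was captured within the cooldown window, and the entry is
// evicted when the fault is cleared (per the PR description).
class RecaptureThrottle
{
public:
  using Clock = std::chrono::steady_clock;

  explicit RecaptureThrottle(std::chrono::seconds cooldown)
  : cooldown_(cooldown) {}

  // Returns true (and records the capture time) unless a capture for this
  // fault_code happened less than `cooldown_` ago.
  bool should_capture(const std::string & fault_code, Clock::time_point now)
  {
    auto it = last_capture_.find(fault_code);
    if (it != last_capture_.end() && now - it->second < cooldown_) {
      return false;  // still inside the cooldown window
    }
    last_capture_[fault_code] = now;
    return true;
  }

  // Eviction on fault clear keeps the map bounded by the set of live faults.
  void on_fault_cleared(const std::string & fault_code)
  {
    last_capture_.erase(fault_code);
  }

private:
  std::chrono::seconds cooldown_;
  std::unordered_map<std::string, Clock::time_point> last_capture_;
};
```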
Add `OperationManager::shutdown()`, called from the GatewayNode destructor, to clear subscriptions, tracked goals, and service clients while the executor can still safely process pending callbacks.

Fix the test action server to abort unfinished goals on shutdown: destroying an `rclcpp_action::ServerGoalHandle` without calling succeed/cancel/abort triggers "terminate called without an active exception" (SIGABRT). This was the root cause of the persistent test_operation_handlers crashes across all distros.
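The teardown invariant can be modeled without any ROS dependency. `GoalHandle` and `TestActionServer` below are toy stand-ins: the real `rclcpp_action::ServerGoalHandle` terminates the process when destroyed while still executing, so shutdown must finalize every tracked goal before the handles are dropped:

```cpp
#include <memory>
#include <vector>

class GoalHandle
{
public:
  void abort() {finalized_ = true;}
  void succeed() {finalized_ = true;}
  bool finalized() const {return finalized_;}
  // Real handle: std::terminate() here when !finalized_, observed as
  // "terminate called without an active exception" (SIGABRT) in the tests.
  ~GoalHandle() = default;

private:
  bool finalized_ = false;
};

class TestActionServer
{
public:
  std::shared_ptr<GoalHandle> accept_goal()
  {
    auto handle = std::make_shared<GoalHandle>();
    active_goals_.push_back(handle);
    return handle;
  }

  // The fix: abort any goal that was never finished before handles are dropped.
  void shutdown()
  {
    for (auto & goal : active_goals_) {
      if (!goal->finalized()) {
        goal->abort();
      }
    }
    active_goals_.clear();
  }

private:
  std::vector<std::shared_ptr<GoalHandle>> active_goals_;
};
```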
Summary
Three-layer protection against unbounded snapshot growth (issue #308) plus fix for persistent test_operation_handlers SIGABRT.
Snapshot storm fix:
- `should_forward_passed()` applies the same threshold/window filtering as FAILED events (previously all PASSED events bypassed filtering)
- `snapshots.max_per_fault` (default 10, reject-new strategy: keeps the earliest snapshots)
- `snapshots.recapture_cooldown_sec` (default 60s, evicted on fault clear including auto-cleared symptoms)

Action goal teardown fix:
- `OperationManager::shutdown()` called from the GatewayNode destructor to clear subscriptions/clients while the executor is still running
- `goal_handle->abort()` on shutdown exit: destroying a `ServerGoalHandle` without finalization causes SIGABRT

Issue
Type
Testing
New parameters
- `snapshots.max_per_fault`
- `snapshots.recapture_cooldown_sec`

Checklist