fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401
Open
bburda wants to merge 4 commits into
Open
fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401bburda wants to merge 4 commits into
bburda wants to merge 4 commits into
Conversation
Per-entity fault routes used the addressed entity's namespace_path as a
transport-level prefix filter. When namespace_path was empty (host-derived
or synthetic components, manifest components without an explicit
`namespace` field, Areas, Functions), the filter short-circuited:
- GET /{entity}/faults/{code} returned faults reported by apps that lived
in entirely different entities.
- DELETE /{entity}/faults/{code} cleared them, since ClearFault.srv has
no scope argument and the handler did not check ownership first.
Both handlers now resolve the entity to the set of FQNs of every hosted
app via the new HandlerContext::resolve_entity_source_fqns helper and
require the fault's reporting_sources to intersect that set. Empty set
means "nothing in scope" -> 404. DELETE additionally fetches the fault
and verifies scope before issuing the clear.
The GetFault.srv / ClearFault.srv contracts are unchanged.
The first pass on #395 only fixed the per-fault GET/DELETE routes. The collection routes (handle_list_faults, handle_clear_all_faults) still passed entity_info.namespace_path as the transport-level prefix filter for App entities. Apps with a wildcard ros_binding.namespace_pattern yield an empty effective_fqn, the transport silently disabled the filter, and the list/clear-all routes leaked - or cleared - every fault in the system. Both App branches now route through resolve_entity_source_fqns plus filter_faults_by_sources, matching how Component / Area / Function are already handled. Also: - Move the #395 CHANGELOG entry to Breaking Changes (the 200 -> 404 flip is observable to clients). - Document 404 on GET /{entity}/faults/{fault_code} in docs/api/rest.rst. - Add @verifies REQ_INTEROP_013 / REQ_INTEROP_015 tags to the scope-isolation tests so the generated verification matrix picks them up. - Add a DELETE happy-path test that clears the fault on the owning component and asserts 204. - Add unit coverage for resolve_entity_source_fqns on FUNCTION entities (aggregation + skip-unbound-hosts cases). - Note the wildcard-App limitation in the helper doxygen. - Drop unused launch_testing.actions import in the integration test.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes per-entity fault scoping so fault reads and clears are based on hosted app FQNs instead of empty or overly broad namespace prefixes.
Changes:
- Adds
HandlerContext::resolve_entity_source_fqns()and uses it in fault handlers. - Updates per-entity fault GET/list/DELETE paths to apply hosted-app source filtering.
- Adds regression tests plus API/changelog documentation for the behavior change.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp |
Applies FQN-based fault scoping before returning or clearing faults. |
src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp |
Adds helper to resolve entity-owned source FQNs. |
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/handler_context.hpp |
Declares and documents the new source-FQN resolver. |
src/ros2_medkit_gateway/test/test_handler_context.cpp |
Adds unit coverage for source-FQN resolution. |
src/ros2_medkit_integration_tests/test/features/test_faults_scope_isolation.test.py |
Adds integration regression coverage for cross-component fault leakage. |
src/ros2_medkit_gateway/CHANGELOG.rst |
Documents the fault scoping behavior change. |
docs/api/rest.rst |
Updates fault endpoint response-code documentation. |
Comments suppressed due to low confidence (2)
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:610
- This uses the existing prefix-based collection filter with exact app FQNs. For an app
/ns/node, a fault reported by/ns/node_extrawill be included because it starts with the owned FQN, so the collection route can still leak similarly named nodes' faults. The collection filtering should use exact FQN membership (or a path-boundary-aware match) when the filter set contains app FQNs.
const auto & cache = ctx_.node()->get_thread_safe_cache();
auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
auto result = fault_mgr->list_faults("", filter.include_pending, filter.include_confirmed, filter.include_cleared,
filter.include_healed, include_muted, include_clusters);
if (result.success) {
json filtered_faults = filter_faults_by_sources(result.data["faults"], app_fqns);
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:1029
faults_to_clearis selected by intersection, but the later clear call is still byfault_codeonly. A fault with one reporting source in this app and another source outside it will be selected here and then cleared globally, including the out-of-scope source. Exclude or reject mixed-scope faults before issuing unscoped clears.
const auto & cache = ctx_.node()->get_thread_safe_cache();
auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
faults_to_clear = filter_faults_by_sources(result.data["faults"], app_fqns);
…onents Address the eight review findings from the post-push self-review pass: - Source matching now enforces a path boundary: a node is in scope iff it equals one of the entity's owned FQNs or is a strict path-child (`<fqn>/<...>`). Pre-fix, `rfind(prefix, 0) == 0` let sibling nodes like `/ns/node_extra` pass for an entity owning `/ns/node`. - `fault_in_source_scope` now requires ALL reporting sources to be in scope. The old "any match" semantic exposed two cross-entity paths: the GET response carries the fault's full `reporting_sources` and environment data; the DELETE issues an unscoped `ClearFault.srv` that clears the aggregated record across every source for the code. `filter_faults_by_sources` inherits the same semantic so per-entity collection routes and per-fault routes agree on what counts as in scope. - `resolve_entity_source_fqns` AREA branch walks `get_subareas()` recursively. The manifest model nests areas (e.g. `engine` under `powertrain`) and components attach to subareas, so a top-level area query previously returned an empty set. - `resolve_entity_source_fqns` FUNCTION branch handles host IDs that are components (Function.hosts is documented as "App or Component IDs"), expanding component hosts to their apps. The cache's `function_to_apps_` index only resolves app hosts. - `fault_in_source_scope` is now a public static method on FaultHandlers so it can be unit-tested directly without a launch test. Added test_fault_handlers cases for exact match, path-child match, prefix collision, multi-source all-in-scope, multi-source partial, empty-scope, missing field, malformed entry, and root-FQN edge. - Added test_handler_context cases for nested-subarea recursion and function-hosting-component expansion. - Integration test setUp now verifies both the owning and the unrelated component were actually discovered before asserting 404. Without it, a 404 from an undiscovered entity is indistinguishable from the 404 the scope filter produces, so the regression would silently pass even with the fix reverted. - Doxygen for `resolve_entity_source_fqns` reworded to be generic about what callers do with an empty result (per-fault routes 404, list 200 with empty `items`, clear-all 204).
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:844
- With the new all-sources scope rule, this error also fires for mixed-source faults where one reporting source is inside the entity and another is outside. The message says the fault is not reported by any app in scope, which is inaccurate for that case and makes 404s misleading; it should describe that not all reporting sources are within the entity scope.
HandlerContext::send_error(res, 404, ERR_RESOURCE_NOT_FOUND, "Fault not found",
{{"details", "Fault is not reported by any app within this entity's scope"},
{entity_info.id_field, entity_id},
N1: Per-entity App fault list response no longer emits the global muted_count / cluster_count / muted_faults / clusters correlation metadata. include_muted / include_clusters URL params are inert on per-entity routes - they remain on the global GET /api/v1/faults route where the aggregates are meaningful. N2: ClearFault.srv gains a `skip_correlation_auto_clear` request flag. fault_manager_node honors it: when true, the correlation engine is NOT asked to compute the auto-clear cascade for the cleared fault code, and auto_cleared_codes in the response is empty. Per-entity DELETE routes (both single-fault and clear-all) set the flag to true so a scoped clear cannot reach faults reported by apps in other entities via auto_clear_with_root rules. Global DELETE /api/v1/faults keeps the flag false to preserve cluster-wide clearing. N3: handle_list_faults and handle_clear_all_faults Function/Area branches now route through HandlerContext::resolve_entity_source_fqns. Previously the Function branch used cache.get_entity_configurations() which only resolves App hosts (so a Function whose `hosts` list carried a Component ID returned an empty FQN set), and the Area branch walked cache.get_components_for_area() without recursing into subareas (so a top-level area like `powertrain` whose components lived in the `engine` subarea returned an empty set). Both classes are fixed via the shared helper. N4: 404 messages on per-fault GET/DELETE now describe the actual all-sources rule: a fault must have every reporting_source in the entity's scope; mixed-source faults that include any out-of-entity reporter are rejected. The previous wording said "not reported by any app within scope" which was wrong for the mixed-source case. N5: CHANGELOG behavior-change description now matches the implementation - the rule is "every reporting_source must be in scope", not just "reporting_sources must intersect". Documents the include_muted / include_clusters drop and the new ClearFault flag for per-entity DELETE. Test changes: - MockFaultServiceTransport in test_fault_manager_routing tracks the new skip_correlation_auto_clear arg. - Two new FaultManagerRoutingTest cases pin the default (false, for global routes) and the per-entity routing (true, for scoped DELETE).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Per-entity fault routes (
GET/DELETE /{entity-path}/faults/{fault_code}) leaked faults across entities. When the addressed entity had an emptynamespace_path(host-derived or synthetic components, manifest components without anamespacefield, Areas, Functions), the transport-level prefix filter short-circuited:GETreturned faults reported by apps that lived in entirely different entities.DELETEcleared them too, becauseClearFault.srvcarries no scope argument and the handler issued the clear without checking ownership first.Both handlers now resolve the addressed entity to the set of FQNs of every hosted app via a new
HandlerContext::resolve_entity_source_fqnshelper and require the fault'sreporting_sourcesto intersect that set. An empty set means "no app is in scope" -> 404.DELETEadditionally fetches the fault and verifies scope before issuing the clear, so a per-entity DELETE can never clear a fault that does not belong to the entity.The
GetFault.srv/ClearFault.srvcontracts are unchanged.Issue
Type
Testing
TDD red-green verified locally:
test_faults_scope_isolation(new integration test,lidar_sensor+temp_sensorin hybrid mode withdemo_nodes_manifest.yaml):test_get_fault_returns_404_on_unrelated_componentgets200with the leaked lidar fault body andtest_clear_fault_returns_404_on_unrelated_componentgets204(the unrelated component cleared the lidar fault).HandlerContext::resolve_entity_source_fqnscovering App / Component / Area / Unknown plus empty-effective-fqn and missing-from-cache edges.test_faults_api,test_locking_faults,test_triggers_faults,test_scenario_fault_inspection,test_scenario_fault_lifecycle) pass.colcon test --packages-select ros2_medkit_gatewaylinters (clang_format, cppcheck, flake8, lint_cmake, pep257, xmllint) andclang-tidybuild green.Reviewer steps:
colcon build --packages-up-to ros2_medkit_integration_testscolcon test --packages-select ros2_medkit_gateway ros2_medkit_integration_tests --ctest-args -R 'test_faults_scope_isolation|test_handler_context|test_fault_handlers'Checklist
Backwards compatibility note
Clients that previously received
200/204for a fault outside an entity's scope now get404. This is a behavior change for any client that was relying on the leaked response. The CHANGELOG entry documents the contract change.