Skip to content

fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401

Open
bburda wants to merge 4 commits into
mainfrom
fix/issue-395-fault-scope-leak
Open

fix(gateway): scope per-entity fault GET/DELETE by hosted-app FQNs#401
bburda wants to merge 4 commits into
mainfrom
fix/issue-395-fault-scope-leak

Conversation

@bburda
Copy link
Copy Markdown
Collaborator

@bburda bburda commented May 14, 2026

Summary

Per-entity fault routes (GET / DELETE /{entity-path}/faults/{fault_code}) leaked faults across entities. When the addressed entity had an empty namespace_path (host-derived or synthetic components, manifest components without a namespace field, Areas, Functions), the transport-level prefix filter short-circuited:

  • GET returned faults reported by apps that lived in entirely different entities.
  • DELETE cleared them too, because ClearFault.srv carries no scope argument and the handler issued the clear without checking ownership first.

Both handlers now resolve the addressed entity to the set of FQNs of every hosted app via a new HandlerContext::resolve_entity_source_fqns helper and require the fault's reporting_sources to intersect that set. An empty set means "no app is in scope" -> 404. DELETE additionally fetches the fault and verifies scope before issuing the clear, so a per-entity DELETE can never clear a fault that does not belong to the entity.

The GetFault.srv / ClearFault.srv contracts are unchanged.

Issue

Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

TDD red-green verified locally:

  • test_faults_scope_isolation (new integration test, lidar_sensor + temp_sensor in hybrid mode with demo_nodes_manifest.yaml):
    • With the handler reverted, test_get_fault_returns_404_on_unrelated_component gets 200 with the leaked lidar fault body and test_clear_fault_returns_404_on_unrelated_component gets 204 (the unrelated component cleared the lidar fault).
    • With the fix in place, all 4 cases pass (404 on the unrelated component, 200 on the owning component, fault still present after the rejected DELETE).
  • 7 new unit tests for HandlerContext::resolve_entity_source_fqns covering App / Component / Area / Unknown plus empty-effective-fqn and missing-from-cache edges.
  • Full gateway unit suite passes; existing fault integration tests (test_faults_api, test_locking_faults, test_triggers_faults, test_scenario_fault_inspection, test_scenario_fault_lifecycle) pass.
  • colcon test --packages-select ros2_medkit_gateway linters (clang_format, cppcheck, flake8, lint_cmake, pep257, xmllint) and clang-tidy build green.

Reviewer steps:

  1. colcon build --packages-up-to ros2_medkit_integration_tests
  2. colcon test --packages-select ros2_medkit_gateway ros2_medkit_integration_tests --ctest-args -R 'test_faults_scope_isolation|test_handler_context|test_fault_handlers'

Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Backwards compatibility note

Clients that previously received 200 / 204 for a fault outside an entity's scope now get 404. This is a behavior change for any client that was relying on the leaked response. The CHANGELOG entry documents the contract change.

bburda added 2 commits May 14, 2026 13:00
Per-entity fault routes used the addressed entity's namespace_path as a
transport-level prefix filter. When namespace_path was empty (host-derived
or synthetic components, manifest components without an explicit
`namespace` field, Areas, Functions), the filter short-circuited:

- GET /{entity}/faults/{code} returned faults reported by apps that lived
  in entirely different entities.
- DELETE /{entity}/faults/{code} cleared them, since ClearFault.srv has
  no scope argument and the handler did not check ownership first.

Both handlers now resolve the entity to the set of FQNs of every hosted
app via the new HandlerContext::resolve_entity_source_fqns helper and
require the fault's reporting_sources to intersect that set. Empty set
means "nothing in scope" -> 404. DELETE additionally fetches the fault
and verifies scope before issuing the clear.

The GetFault.srv / ClearFault.srv contracts are unchanged.
The first pass on #395 only fixed the per-fault GET/DELETE routes. The
collection routes (handle_list_faults, handle_clear_all_faults) still
passed entity_info.namespace_path as the transport-level prefix filter
for App entities. Apps with a wildcard ros_binding.namespace_pattern
yield an empty effective_fqn, the transport silently disabled the
filter, and the list/clear-all routes leaked - or cleared - every
fault in the system.

Both App branches now route through resolve_entity_source_fqns plus
filter_faults_by_sources, matching how Component / Area / Function are
already handled.

Also:
- Move the #395 CHANGELOG entry to Breaking Changes (the 200 -> 404
  flip is observable to clients).
- Document 404 on GET /{entity}/faults/{fault_code} in docs/api/rest.rst.
- Add @verifies REQ_INTEROP_013 / REQ_INTEROP_015 tags to the
  scope-isolation tests so the generated verification matrix picks them up.
- Add a DELETE happy-path test that clears the fault on the owning
  component and asserts 204.
- Add unit coverage for resolve_entity_source_fqns on FUNCTION entities
  (aggregation + skip-unbound-hosts cases).
- Note the wildcard-App limitation in the helper doxygen.
- Drop unused launch_testing.actions import in the integration test.
@bburda bburda self-assigned this May 14, 2026
@bburda bburda requested a review from mfaferek93 May 14, 2026 12:10
@bburda bburda marked this pull request as ready for review May 14, 2026 12:10
Copilot AI review requested due to automatic review settings May 14, 2026 12:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes per-entity fault scoping so fault reads and clears are based on hosted app FQNs instead of empty or overly broad namespace prefixes.

Changes:

  • Adds HandlerContext::resolve_entity_source_fqns() and uses it in fault handlers.
  • Updates per-entity fault GET/list/DELETE paths to apply hosted-app source filtering.
  • Adds regression tests plus API/changelog documentation for the behavior change.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Applies FQN-based fault scoping before returning or clearing faults.
src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp Adds helper to resolve entity-owned source FQNs.
src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/handler_context.hpp Declares and documents the new source-FQN resolver.
src/ros2_medkit_gateway/test/test_handler_context.cpp Adds unit coverage for source-FQN resolution.
src/ros2_medkit_integration_tests/test/features/test_faults_scope_isolation.test.py Adds integration regression coverage for cross-component fault leakage.
src/ros2_medkit_gateway/CHANGELOG.rst Documents the fault scoping behavior change.
docs/api/rest.rst Updates fault endpoint response-code documentation.
Comments suppressed due to low confidence (2)

src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:610

  • This uses the existing prefix-based collection filter with exact app FQNs. For an app /ns/node, a fault reported by /ns/node_extra will be included because it starts with the owned FQN, so the collection route can still leak similarly named nodes' faults. The collection filtering should use exact FQN membership (or a path-boundary-aware match) when the filter set contains app FQNs.
    const auto & cache = ctx_.node()->get_thread_safe_cache();
    auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
    auto result = fault_mgr->list_faults("", filter.include_pending, filter.include_confirmed, filter.include_cleared,
                                         filter.include_healed, include_muted, include_clusters);

    if (result.success) {
      json filtered_faults = filter_faults_by_sources(result.data["faults"], app_fqns);

src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:1029

  • faults_to_clear is selected by intersection, but the later clear call is still by fault_code only. A fault with one reporting source in this app and another source outside it will be selected here and then cleared globally, including the out-of-scope source. Exclude or reject mixed-scope faults before issuing unscoped clears.
      const auto & cache = ctx_.node()->get_thread_safe_cache();
      auto app_fqns = HandlerContext::resolve_entity_source_fqns(cache, entity_info);
      faults_to_clear = filter_faults_by_sources(result.data["faults"], app_fqns);

Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
…onents

Address the eight review findings from the post-push self-review pass:

- Source matching now enforces a path boundary: a node is in scope iff it
  equals one of the entity's owned FQNs or is a strict path-child
  (`<fqn>/<...>`). Pre-fix, `rfind(prefix, 0) == 0` let sibling nodes like
  `/ns/node_extra` pass for an entity owning `/ns/node`.

- `fault_in_source_scope` now requires ALL reporting sources to be in
  scope. The old "any match" semantic exposed two cross-entity paths:
  the GET response carries the fault's full `reporting_sources` and
  environment data; the DELETE issues an unscoped `ClearFault.srv` that
  clears the aggregated record across every source for the code.
  `filter_faults_by_sources` inherits the same semantic so per-entity
  collection routes and per-fault routes agree on what counts as in
  scope.

- `resolve_entity_source_fqns` AREA branch walks `get_subareas()`
  recursively. The manifest model nests areas (e.g. `engine` under
  `powertrain`) and components attach to subareas, so a top-level area
  query previously returned an empty set.

- `resolve_entity_source_fqns` FUNCTION branch handles host IDs that are
  components (Function.hosts is documented as "App or Component IDs"),
  expanding component hosts to their apps. The cache's `function_to_apps_`
  index only resolves app hosts.

- `fault_in_source_scope` is now a public static method on FaultHandlers
  so it can be unit-tested directly without a launch test. Added
  test_fault_handlers cases for exact match, path-child match, prefix
  collision, multi-source all-in-scope, multi-source partial,
  empty-scope, missing field, malformed entry, and root-FQN edge.

- Added test_handler_context cases for nested-subarea recursion and
  function-hosting-component expansion.

- Integration test setUp now verifies both the owning and the unrelated
  component were actually discovered before asserting 404. Without it, a
  404 from an undiscovered entity is indistinguishable from the 404 the
  scope filter produces, so the regression would silently pass even with
  the fix reverted.

- Doxygen for `resolve_entity_source_fqns` reworded to be generic about
  what callers do with an empty result (per-fault routes 404, list 200
  with empty `items`, clear-all 204).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp:844

  • With the new all-sources scope rule, this error also fires for mixed-source faults where one reporting source is inside the entity and another is outside. The message says the fault is not reported by any app in scope, which is inaccurate for that case and makes 404s misleading; it should describe that not all reporting sources are within the entity scope.
      HandlerContext::send_error(res, 404, ERR_RESOURCE_NOT_FOUND, "Fault not found",
                                 {{"details", "Fault is not reported by any app within this entity's scope"},
                                  {entity_info.id_field, entity_id},

Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/handler_context.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Outdated
Comment thread src/ros2_medkit_gateway/CHANGELOG.rst Outdated
N1: Per-entity App fault list response no longer emits the global
muted_count / cluster_count / muted_faults / clusters correlation
metadata. include_muted / include_clusters URL params are inert on
per-entity routes - they remain on the global GET /api/v1/faults route
where the aggregates are meaningful.

N2: ClearFault.srv gains a `skip_correlation_auto_clear` request flag.
fault_manager_node honors it: when true, the correlation engine is NOT
asked to compute the auto-clear cascade for the cleared fault code, and
auto_cleared_codes in the response is empty. Per-entity DELETE routes
(both single-fault and clear-all) set the flag to true so a scoped clear
cannot reach faults reported by apps in other entities via
auto_clear_with_root rules. Global DELETE /api/v1/faults keeps the flag
false to preserve cluster-wide clearing.

N3: handle_list_faults and handle_clear_all_faults Function/Area
branches now route through HandlerContext::resolve_entity_source_fqns.
Previously the Function branch used cache.get_entity_configurations()
which only resolves App hosts (so a Function whose `hosts` list carried
a Component ID returned an empty FQN set), and the Area branch walked
cache.get_components_for_area() without recursing into subareas (so a
top-level area like `powertrain` whose components lived in the `engine`
subarea returned an empty set). Both classes are fixed via the shared
helper.

N4: 404 messages on per-fault GET/DELETE now describe the actual
all-sources rule: a fault must have every reporting_source in the
entity's scope; mixed-source faults that include any out-of-entity
reporter are rejected. The previous wording said "not reported by any
app within scope" which was wrong for the mixed-source case.

N5: CHANGELOG behavior-change description now matches the implementation
- the rule is "every reporting_source must be in scope", not just
"reporting_sources must intersect". Documents the include_muted /
include_clusters drop and the new ClearFault flag for per-entity DELETE.

Test changes:
- MockFaultServiceTransport in test_fault_manager_routing tracks the
  new skip_correlation_auto_clear arg.
- Two new FaultManagerRoutingTest cases pin the default (false, for
  global routes) and the per-entity routing (true, for scoped DELETE).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] handle_get_fault returns cross-entity faults for synthetic components (scope leak)

2 participants