
Prevent configuration timeouts from wedging gateway workers#335

Closed
eclipse0922 wants to merge 1 commit into selfpatch:main from eclipse0922:emdash/open-lamps-matter-6ev

Conversation

@eclipse0922
Contributor

Pull Request

Summary

Prevent configuration requests from permanently wedging the gateway when a ROS 2 parameter service stalls.

  • Bound ConfigurationManager spin-lock acquisition and SyncParametersClient IPC to the configured timeout
  • Surface explicit TIMEOUT failures and negative-cache slow/unresponsive nodes to avoid repeated worker starvation
  • Add regression coverage for slow parameter responses and concurrent configuration requests
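
The negative-caching idea in the second bullet can be illustrated with a small, standalone sketch. This is not the code from this PR: `NegativeCache`, `mark_unavailable`, and `is_unavailable` are hypothetical names, and the TTL-based expiry is an assumption about how such a cache might behave.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of ros2_medkit_gateway): remembers nodes whose
// parameter service recently timed out, so that later requests can fail fast
// instead of tying up another worker for the full timeout.
class NegativeCache {
public:
  using Clock = std::chrono::steady_clock;

  explicit NegativeCache(std::chrono::milliseconds ttl) : ttl_(ttl) {}

  // Record that `node` timed out at time `now`; the entry expires after ttl_.
  void mark_unavailable(const std::string & node, Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    expiry_[node] = now + ttl_;
  }

  // True while the entry for `node` has not expired. Expired entries are erased
  // so the next request is allowed to probe the node again.
  bool is_unavailable(const std::string & node, Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = expiry_.find(node);
    if (it == expiry_.end()) {
      return false;
    }
    if (now >= it->second) {
      expiry_.erase(it);
      return false;
    }
    return true;
  }

private:
  std::chrono::milliseconds ttl_;
  std::mutex mutex_;
  std::unordered_map<std::string, Clock::time_point> expiry_;
};
```

A request handler would call `is_unavailable` before touching the parameter service and `mark_unavailable` whenever a call ends in TIMEOUT. Passing `now` explicitly keeps the expiry logic testable without sleeping.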

Issue

Link the related issue (required):


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

  • Devcontainer image build
  • rosdep install --from-paths src --ignore-src -r -y inside the devcontainer
  • colcon build --build-base /tmp/colcon-build --install-base /tmp/colcon-install --packages-up-to ros2_medkit_gateway --cmake-args -DCMAKE_DISABLE_PRECOMPILE_HEADERS=ON
  • colcon test --build-base /tmp/colcon-build --install-base /tmp/colcon-install --packages-select ros2_medkit_gateway --ctest-args -R test_configuration_manager --event-handlers console_direct+

Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Notes:

  • Devcontainer GCC 13 hit an internal compiler error in the PCH path, so verification used -DCMAKE_DISABLE_PRECOMPILE_HEADERS=ON.

Bound ConfigurationManager spin lock acquisition and SyncParametersClient IPC to the configured service timeout. Slow or hung parameter services now surface TIMEOUT, feed the negative cache, and have regression coverage for slow responses and concurrent contention.

Constraint: SyncParametersClient spins a shared node and may block indefinitely when a remote parameter service never replies
Rejected: Per-request parameter client nodes | larger structural change without addressing IPC timeout semantics
Rejected: HTTP-layer timeouts only | would still leave spin_mutex_ contention inside ConfigurationManager
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep parameter IPC time-bounded while holding spin_mutex_; do not reintroduce unbounded SyncParametersClient calls
Tested: Devcontainer rosdep install; colcon build --packages-up-to ros2_medkit_gateway with -DCMAKE_DISABLE_PRECOMPILE_HEADERS=ON; colcon test --packages-select ros2_medkit_gateway --ctest-args -R test_configuration_manager
Not-tested: Full ros2_medkit_gateway test suite; PCH-enabled GCC 13 devcontainer build path (compiler ICE)
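
The directive to keep parameter IPC time-bounded can be sketched without ROS 2 by modelling the pending reply as a `std::shared_future`. This is a simplified stand-in, not the change in this PR: `IpcStatus`, `IpcResult`, and `wait_bounded` are hypothetical names, and a real parameter client would need its own way of obtaining the future.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <optional>
#include <string>

// Hypothetical status and result types used only for this illustration.
enum class IpcStatus { OK, TIMEOUT };

template <typename T>
struct IpcResult {
  IpcStatus status;
  std::optional<T> value;
};

// Wait for `reply` for at most `timeout`. If the reply is not ready in time
// (or the future is deferred and was never started), report TIMEOUT instead of
// blocking the calling worker indefinitely.
template <typename T>
IpcResult<T> wait_bounded(std::shared_future<T> reply, std::chrono::milliseconds timeout) {
  if (reply.wait_for(timeout) != std::future_status::ready) {
    return {IpcStatus::TIMEOUT, std::nullopt};
  }
  return {IpcStatus::OK, reply.get()};
}
```

The caller gets an explicit TIMEOUT it can map to an error response, and any lock held around the call is released after a bounded delay rather than never.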
@eclipse0922 eclipse0922 changed the title from "[codex] Prevent configuration timeouts from wedging gateway workers" to "Prevent configuration timeouts from wedging gateway workers" on Apr 1, 2026
@mfaferek93
Collaborator

Hey @eclipse0922
This is now resolved by #333 which was merged. It uses std::timed_mutex with try_lock_for (timeout derived from parameter_service_timeout_sec + 1s margin) to prevent thread pool exhaustion. Same root cause, same fix approach. Thanks for picking this up :) Closing as duplicate.
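
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of bounded lock acquisition with `std::timed_mutex` and `try_lock_for`. It is not the code merged in #333: `GuardedSection`, `LockStatus`, and `run` are hypothetical names, and the constructor parameters only mirror the "service timeout plus 1 s margin" idea.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical status type for this illustration.
enum class LockStatus { OK, TIMEOUT };

// Hypothetical wrapper: serialises work behind a timed mutex, but gives up
// after (service timeout + margin) instead of waiting forever.
class GuardedSection {
public:
  explicit GuardedSection(
    std::chrono::milliseconds service_timeout,
    std::chrono::milliseconds margin = std::chrono::seconds(1))
  : lock_timeout_(service_timeout + margin) {}

  // Runs `work` while holding the mutex, or returns TIMEOUT if the mutex could
  // not be acquired within lock_timeout_. The lock is released on return.
  LockStatus run(const std::function<void()> & work) {
    std::unique_lock<std::timed_mutex> lock(mutex_, std::defer_lock);
    if (!lock.try_lock_for(lock_timeout_)) {
      return LockStatus::TIMEOUT;
    }
    work();
    return LockStatus::OK;
  }

private:
  std::timed_mutex mutex_;
  std::chrono::milliseconds lock_timeout_;
};
```

With a plain `std::mutex`, every worker queued behind a stalled holder would block until the holder returns. Here each waiter fails after a bounded delay, so the thread pool can keep serving other endpoints.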

@mfaferek93 mfaferek93 closed this Apr 1, 2026
@mfaferek93
Collaborator

This duplicates #333

@eclipse0922
Contributor Author

@mfaferek93 my apologies — I didn’t notice that #333 was already assigned. Thanks for the clarification!

Development

Successfully merging this pull request may close these issues.

Gateway becomes permanently unresponsive when configurations endpoint blocks spin_mutex

2 participants