Skip to content

Fix potential deadlock in TimeSource::destroy_clock_sub#3110

Open
PavelGuzenfeld wants to merge 2 commits intoros2:rollingfrom
PavelGuzenfeld:fix/time-source-deadlock
Open

Fix potential deadlock in TimeSource::destroy_clock_sub#3110
PavelGuzenfeld wants to merge 2 commits intoros2:rollingfrom
PavelGuzenfeld:fix/time-source-deadlock

Conversation

@PavelGuzenfeld
Copy link

Summary

Fixes #2962

destroy_clock_sub() held clock_sub_lock_ while calling clock_executor_thread_.join(). If the executor thread's callback tried to access state protected by clock_sub_lock_ before fully exiting, this produced a deadlock: the main thread waits for the executor thread to finish, while the executor thread waits for the main thread to release the lock.

The fix: Move the thread, executor, and callback group into local variables under the lock, release the lock, then join outside the critical section. This ensures the executor thread can complete its final callback without contending on clock_sub_lock_.

Before:

Main thread: lock(clock_sub_lock_) → cancel() → join() [BLOCKED] → unlock
Executor thread: callback needs clock_sub_lock_ [BLOCKED] → exit
→ DEADLOCK

After:

Main thread: lock(clock_sub_lock_) → cancel() → move thread out → unlock → join() [WAITS]
Executor thread: callback completes freely → exit
→ JOIN SUCCEEDS

Test plan

  • test_time_source: 5/5 runs pass (16 tests each) — no freezes
  • test_node construction/destruction: 3/3 runs pass — no freezes
  • Full rclcpp build: clean

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/time-source-deadlock branch from bba7586 to 97c11a0 Compare March 21, 2026 21:13
@christophebedard
Copy link
Member

Thank you for the PR! Please see my note here: #3109 (comment)

@PavelGuzenfeld PavelGuzenfeld changed the base branch from humble to rolling March 22, 2026 05:43
@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/time-source-deadlock branch 2 times, most recently from bc5e080 to 10967dd Compare March 22, 2026 06:15
@PavelGuzenfeld
Copy link
Author

The CI failure is not caused by this PR. The build error is in subscription_base.cpp:486:

error: 'rcl_subscription_is_cft_supported' was not declared in this scope

This function was added to rclcpp rolling HEAD by #3089, but the CI's rcl package hasn't been updated to include the corresponding rcl_subscription_is_cft_supported API yet. The same failure affects all open rolling PRs — e.g., #3105 (unrelated to this PR) also fails identically.

This PR only modifies time_source.cpp and does not touch subscription_base.cpp. The failure will resolve once the CI's rcl package catches up with rolling HEAD.

@jmachowinski
Copy link
Collaborator

This PR would create a dangling thread in the test conditions from #2962.

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/time-source-deadlock branch from 10967dd to 4959f5c Compare March 22, 2026 12:26
@PavelGuzenfeld
Copy link
Author

PavelGuzenfeld commented Mar 22, 2026

@jmachowinski Thanks for the review!

Regarding the "dangling thread" concern — the thread is moved to a local variable but is unconditionally joined before destroy_clock_sub() returns. It is never detached or leaked.

Improvement in this push (addressing your feedback): The subscription (clock_subscription_) is now kept alive until after the thread is joined. In the previous version, it was reset inside the lock before join, which could theoretically allow the executor thread to dispatch a callback on a destroyed subscription. The new order is:

cancel() → move thread out → release lock → join() → remove_callback_group()
→ reacquire lock → reset subscription

Why moving the thread is necessary:
Without moving, two concurrent calls to destroy_clock_sub() (e.g., from detachNode() and post_set_parameters()) could both see clock_executor_thread_.joinable() == true and attempt join() simultaneously — which is undefined behavior. Moving the thread ensures only one caller gets the joinable thread.

Added 4 stress tests (test_time_source_deadlock) that reproduce the conditions from #2962:

  1. Rapid construct/destroy — 50 cycles of creating and immediately destroying sim-time nodes
  2. Concurrent construct/destroy — 4 threads × 10 iterations, creating/destroying sim-time nodes
  3. Toggle use_sim_time while spinning — exercises create_clock_sub/destroy_clock_sub interplay
  4. No lingering thread — verifies the context is clean after destruction

All 4 tests pass. The full rclcpp suite (2962 tests) also passes with 0 errors. The existing test_time_source passes 5/5 consecutive runs with no deadlocks.

If this approach doesn't fully address the concern, I'm happy to explore adding a guard flag to prevent create_clock_sub() from running while a join is pending (Option B) as a follow-up.

destroy_clock_sub() held clock_sub_lock_ while calling
clock_executor_thread_.join(). If the executor thread's shutdown path
contends on clock_sub_lock_, this produces a deadlock.

Fix: move the thread into a local variable under the lock, release the
lock, then join outside the critical section. The thread is not dangling
— it is unconditionally joined before the function returns.

Key improvement over the naive move approach: the subscription is kept
alive until after join completes, ensuring the executor thread can
finish any in-flight callback without accessing a destroyed subscription.

Cleanup order: cancel → (release lock) → join → remove_callback_group
→ (reacquire lock) → reset subscription.

Fixes ros2#2962

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
Tests reproduce the conditions from ros2#2962:
- Rapid construct/destroy of nodes with use_sim_time=true (50 cycles)
- Concurrent construct/destroy from 4 threads
- Toggle use_sim_time while spinning
- Verify no lingering thread after destruction

If the deadlock is present, these tests will hang and be caught by the
60-second timeout.

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/time-source-deadlock branch from 4959f5c to e0274ab Compare March 22, 2026 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock in destroy_clock_sub waiting for clock_executor_thread_ to join

3 participants