Fix potential deadlock in TimeSource::destroy_clock_sub#3110
Fix potential deadlock in TimeSource::destroy_clock_sub#3110PavelGuzenfeld wants to merge 2 commits intoros2:rollingfrom
Conversation
bba7586 to
97c11a0
Compare
|
Thank you for the PR! Please see my note here: #3109 (comment) |
bc5e080 to
10967dd
Compare
|
The CI failure is not caused by this PR. The build error is in This function was added to rclcpp rolling HEAD by #3089, but the CI's This PR only modifies |
|
This PR would create a dangling thread in the test conditions from #2962. |
10967dd to
4959f5c
Compare
|
@jmachowinski Thanks for the review! Regarding the "dangling thread" concern — the thread is moved to a local variable but is unconditionally joined before Improvement in this push (addressing your feedback): The subscription ( Why moving the thread is necessary: Added 4 stress tests (
All 4 tests pass. The full rclcpp suite (2962 tests) also passes with 0 errors. The existing If this approach doesn't fully address the concern, I'm happy to explore adding a guard flag to prevent |
destroy_clock_sub() held clock_sub_lock_ while calling clock_executor_thread_.join(). If the executor thread's shutdown path contends on clock_sub_lock_, this produces a deadlock. Fix: move the thread into a local variable under the lock, release the lock, then join outside the critical section. The thread is not dangling — it is unconditionally joined before the function returns. Key improvement over the naive move approach: the subscription is kept alive until after join completes, ensuring the executor thread can finish any in-flight callback without accessing a destroyed subscription. Cleanup order: cancel → (release lock) → join → remove_callback_group → (reacquire lock) → reset subscription. Fixes ros2#2962 Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
Tests reproduce the conditions from ros2#2962: - Rapid construct/destroy of nodes with use_sim_time=true (50 cycles) - Concurrent construct/destroy from 4 threads - Toggle use_sim_time while spinning - Verify no lingering thread after destruction If the deadlock is present, these tests will hang and be caught by the 60-second timeout. Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
4959f5c to
e0274ab
Compare
Summary
Fixes #2962
destroy_clock_sub()heldclock_sub_lock_while callingclock_executor_thread_.join(). If the executor thread's callback tried to access state protected byclock_sub_lock_before fully exiting, this produced a deadlock: the main thread waits for the executor thread to finish, while the executor thread waits for the main thread to release the lock.The fix: Move the thread, executor, and callback group into local variables under the lock, release the lock, then join outside the critical section. This ensures the executor thread can complete its final callback without contending on
clock_sub_lock_.Before:
After:
Test plan
test_time_source: 5/5 runs pass (16 tests each) — no freezestest_nodeconstruction/destruction: 3/3 runs pass — no freezes