Fix panics on updating DelayQueue entries #3270

wabain · 2020-12-14T06:06:23Z

Motivation

This PR addresses two related bugs related to edge cases in tokio_util::time::DelayQueue.

When an entry is inserted into the queue with a deadline that has already elapsed it is flagged and put on a special-cased subqueue. If the entry's key is subsequently reset with a deadline in the future, the expired flag is not cleared. Trying to use DelayQueue::{reset, reset_at, remove} with the same key a second time will either cause a debug assert or trigger various kinds of breakage in builds without debug assertions—the key may be left in the timer wheel after a remove call and then reused for a separate storage entry, etc.
Under specific circumstances, calling DelayQueue::{reset, reset_at, remove} with an entry which has expired but has not yet been retrieved with DelayQueue::poll_expiration will cause a panic. For this to occur multiple entries need to be inserted with a deadline that happens to be mapped to slot 0 of a level in the timer wheel. When the first of these entries expires and is retrieved the wheel's internal time offset is set to the entries' deadline. Trying to look up another of these entries by key then causes the timer wheel to panic because the call to compute the wheel level for the entry incidentally relies on either the entry's expiration being in the future or the slot index being non-zero.

Solution

Clear the expired flag in DelayQueue::reset_at before re-inserting the entry into the queue, which will set the flag again if needed.
I believe the Wheel::remove method is the only point where the wheel implementation does not gracefully handle there being an entry that expires at the current time offset. All the logic which can run from within the wheel's poll method already has to handle that at least transiently being the case, which accounts for pretty much all of the wheel's surface area apart from the insert and remove methods. There is nothing special about slot 0 in a wheel level, so this PR modifies the level computation to work at the current time offset with any slot value.

I ported the change in the level computation back to the tokio core timer wheel, but both the current core timer implementation and the 0.2 time driver, which shared its Wheel implementation with DelayQueue, remove expired entries from the wheel all at once in a non-preemptive loop, so neither can hit the panic. (Although PR time: Fix race condition in timer drop #3229 recently addressed a bug in the core timer which is very similar to these.)

The DelayQueue remove and reset methods can panic when attempting to remove certain keys whose deadlines have been reached but which have not yet been retrieved using poll_expired. The panic occurs in the remove method of the timer wheel implementation ported from the tokio core, which asserts on an attempt to convert from a millisecond offset to an internal wheel level if given a value where (a) both the level and slot index would be 0 and (b) the value is equal to the wheel's current time offset. That assert would be valid on the assumption that the wheel should not contain an entry which expires at the wheel's current time, but there may be multiple wheel entries with the same deadline and only one entry is removed from the wheel per poll call, with the wheel advancing its internal time to those entries' deadline as it shifts entries between levels before popping the first of the entries. Since the general case where there are multiple entries with the same deadline is already well handled, it's only necessary to avoid panicking in Wheel::remove on the special case where the slot index is 0, which we do by masking the slot bits before computing the level.

Darksonn

I assume that we have an identical implementation in the main Tokio crate as well. Should that be updated too?

Darksonn · 2020-12-26T21:53:15Z

tokio-util/src/time/delay_queue.rs

@@ -548,6 +548,8 @@ impl<T> DelayQueue<T> {
        let when = self.normalize_deadline(when);

        self.slab[key.index].when = when;
+        self.slab[key.index].expired = false;


So, just to confirm, the reason this is correct is that the call to insert_idx will set it back to true if it is expired?

Yes, remove_key combined with these assignments should leave the slab entry in the same state as if it was a newly created Data value in insert_at. Then insert_idx is responsible for placing the entry in the wheel or expired queue and setting the flag.

wabain · 2020-12-27T22:30:37Z

I assume that we have an identical implementation in the main Tokio crate as well. Should that be updated too?

The scenario I listed first in the pull request description is specific to the delay queue. (Previously #3229 fixed a logically similar bug about dropping Tokio timer entries which had been moved from the wheel into the pending list, but the implementation was separate.)

The second scenario concerns timer wheel functions which do have a similar implementation in the main Tokio crate and I ported the changes to those over, although in Tokio 0.2–1 (I haven't checked 0.1) it's impossible to trigger the assert through the time driver given its access patterns.

Darksonn · 2020-12-27T22:33:54Z

Alright, thanks. I hope to take another look at the PR tomorrow.

wabain · 2020-12-27T22:41:57Z

tokio-util/src/time/wheel/mod.rs

    pub(crate) fn remove(&mut self, item: &T::Borrowed, store: &mut T::Store) {
        let when = T::when(item, store);
+
+        assert!(


By the way, I added this assert because while the assert in level_for wasn't valid in general, it turned out to be helpful in surfacing some forms of inconsistent state; that's how issue #2473 (looking up entries in the wheel which had been moved to the expired queue) manifested. Asserting this invariant can catch at least some cases of invalid API use or delay queue corruption.

I added the same condition as a debug_assert in the main Tokio crate, where I don't think it could be hit unless internal state were corrupted.

Darksonn

The changes here seem good to me, but it'd be great to have someone more familiar with the timer take a look too. Especially regarding the asserts.

I expect others will be able to review once we get out of vacation territory.

wabain · 2020-12-30T03:21:06Z

@Darksonn Thanks for taking a look!

paulhauner · 2021-01-20T22:46:40Z

We've been observing this panic since May 2020 (#2473) and are still observing it to this day in production.

I'm not trying to be pushy, just communicating that this is an issue in production and we're keen to see a fix. Thanks to the maintainers and the PR author for their great efforts.

paulhauner · 2021-01-21T23:40:48Z

Thank you @Darksonn, much appreciated!

wabain added 4 commits December 13, 2020 23:08

time: port wheel boundary asserts from delay queue

7653585

util: fix reset of DelayQueue expired queue entry

8fac182

docs: update stray mentions of NotReady

736de3c

Darksonn added A-tokio-util Area: The tokio-util crate M-time Module: tokio/time labels Dec 14, 2020

AgeManning mentioned this pull request Dec 21, 2020

tokio-runtime-worker panicked sigp/lighthouse#2099

Closed

Darksonn reviewed Dec 26, 2020

View reviewed changes

wabain added 2 commits December 27, 2020 16:26

time: avoid magic number in mask calculation

09f8daf

util: add more description of regression test

053effe

wabain commented Dec 27, 2020

View reviewed changes

Darksonn reviewed Dec 29, 2020

View reviewed changes

AgeManning mentioned this pull request Dec 29, 2020

Shorten discv5 vote timeout and time pings per peer sigp/lighthouse#2131

Closed

Darksonn approved these changes Jan 21, 2021

View reviewed changes

Darksonn merged commit c4f66ed into tokio-rs:master Jan 21, 2021

Darksonn mentioned this pull request Jan 21, 2021

Prepare tokio-util 0.6.2 #3453

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix panics on updating DelayQueue entries #3270

Fix panics on updating DelayQueue entries #3270

wabain commented Dec 14, 2020 •

edited

Loading

Darksonn left a comment

Darksonn Dec 26, 2020

wabain Dec 27, 2020 •

edited

Loading

wabain commented Dec 27, 2020

Darksonn commented Dec 27, 2020

wabain Dec 27, 2020 •

edited

Loading

Darksonn left a comment

wabain commented Dec 30, 2020

paulhauner commented Jan 20, 2021

paulhauner commented Jan 21, 2021

Fix panics on updating DelayQueue entries #3270

Fix panics on updating DelayQueue entries #3270

Conversation

wabain commented Dec 14, 2020 • edited Loading

Motivation

Solution

Darksonn left a comment

Choose a reason for hiding this comment

Darksonn Dec 26, 2020

Choose a reason for hiding this comment

wabain Dec 27, 2020 • edited Loading

Choose a reason for hiding this comment

wabain commented Dec 27, 2020

Darksonn commented Dec 27, 2020

wabain Dec 27, 2020 • edited Loading

Choose a reason for hiding this comment

Darksonn left a comment

Choose a reason for hiding this comment

wabain commented Dec 30, 2020

paulhauner commented Jan 20, 2021

paulhauner commented Jan 21, 2021

wabain commented Dec 14, 2020 •

edited

Loading

wabain Dec 27, 2020 •

edited

Loading

wabain Dec 27, 2020 •

edited

Loading