Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panics on updating DelayQueue entries #3270

Merged
merged 6 commits into from
Jan 21, 2021

Conversation

wabain
Copy link
Contributor

@wabain wabain commented Dec 14, 2020

Motivation

This PR addresses two related bugs related to edge cases in tokio_util::time::DelayQueue.

  1. When an entry is inserted into the queue with a deadline that has already elapsed it is flagged and put on a special-cased subqueue. If the entry's key is subsequently reset with a deadline in the future, the expired flag is not cleared. Trying to use DelayQueue::{reset, reset_at, remove} with the same key a second time will either cause a debug assert or trigger various kinds of breakage in builds without debug assertions—the key may be left in the timer wheel after a remove call and then reused for a separate storage entry, etc.

  2. Under specific circumstances, calling DelayQueue::{reset, reset_at, remove} with an entry which has expired but has not yet been retrieved with DelayQueue::poll_expiration will cause a panic. For this to occur multiple entries need to be inserted with a deadline that happens to be mapped to slot 0 of a level in the timer wheel. When the first of these entries expires and is retrieved the wheel's internal time offset is set to the entries' deadline. Trying to look up another of these entries by key then causes the timer wheel to panic because the call to compute the wheel level for the entry incidentally relies on either the entry's expiration being in the future or the slot index being non-zero.

Solution

  1. Clear the expired flag in DelayQueue::reset_at before re-inserting the entry into the queue, which will set the flag again if needed.

  2. I believe the Wheel::remove method is the only point where the wheel implementation does not gracefully handle there being an entry that expires at the current time offset. All the logic which can run from within the wheel's poll method already has to handle that at least transiently being the case, which accounts for pretty much all of the wheel's surface area apart from the insert and remove methods. There is nothing special about slot 0 in a wheel level, so this PR modifies the level computation to work at the current time offset with any slot value.

    I ported the change in the level computation back to the tokio core timer wheel, but both the current core timer implementation and the 0.2 time driver, which shared its Wheel implementation with DelayQueue, remove expired entries from the wheel all at once in a non-preemptive loop, so neither can hit the panic. (Although PR time: Fix race condition in timer drop #3229 recently addressed a bug in the core timer which is very similar to these.)

The DelayQueue remove and reset methods can panic when attempting to
remove certain keys whose deadlines have been reached but which have not
yet been retrieved using poll_expired.

The panic occurs in the remove method of the timer wheel implementation
ported from the tokio core, which asserts on an attempt to convert from
a millisecond offset to an internal wheel level if given a value where
(a) both the level and slot index would be 0 and (b) the value is equal
to the wheel's current time offset.

That assert would be valid on the assumption that the wheel should not
contain an entry which expires at the wheel's current time, but there
may be multiple wheel entries with the same deadline and only one entry
is removed from the wheel per poll call, with the wheel advancing its
internal time to those entries' deadline as it shifts entries between
levels before popping the first of the entries.

Since the general case where there are multiple entries with the same
deadline is already well handled, it's only necessary to avoid panicking
in Wheel::remove on the special case where the slot index is 0, which we
do by masking the slot bits before computing the level.
@Darksonn Darksonn added A-tokio-util Area: The tokio-util crate M-time Module: tokio/time labels Dec 14, 2020
Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that we have an identical implementation in the main Tokio crate as well. Should that be updated too?

@@ -548,6 +548,8 @@ impl<T> DelayQueue<T> {
let when = self.normalize_deadline(when);

self.slab[key.index].when = when;
self.slab[key.index].expired = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to confirm, the reason this is correct is that the call to insert_idx will set it back to true if it is expired?

Copy link
Contributor Author

@wabain wabain Dec 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, remove_key combined with these assignments should leave the slab entry in the same state as if it was a newly created Data value in insert_at. Then insert_idx is responsible for placing the entry in the wheel or expired queue and setting the flag.

@wabain
Copy link
Contributor Author

wabain commented Dec 27, 2020

I assume that we have an identical implementation in the main Tokio crate as well. Should that be updated too?

The scenario I listed first in the pull request description is specific to the delay queue. (Previously #3229 fixed a logically similar bug about dropping Tokio timer entries which had been moved from the wheel into the pending list, but the implementation was separate.)

The second scenario concerns timer wheel functions which do have a similar implementation in the main Tokio crate and I ported the changes to those over, although in Tokio 0.2–1 (I haven't checked 0.1) it's impossible to trigger the assert through the time driver given its access patterns.

@Darksonn
Copy link
Contributor

Alright, thanks. I hope to take another look at the PR tomorrow.

pub(crate) fn remove(&mut self, item: &T::Borrowed, store: &mut T::Store) {
let when = T::when(item, store);

assert!(
Copy link
Contributor Author

@wabain wabain Dec 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I added this assert because while the assert in level_for wasn't valid in general, it turned out to be helpful in surfacing some forms of inconsistent state; that's how issue #2473 (looking up entries in the wheel which had been moved to the expired queue) manifested. Asserting this invariant can catch at least some cases of invalid API use or delay queue corruption.

I added the same condition as a debug_assert in the main Tokio crate, where I don't think it could be hit unless internal state were corrupted.

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here seem good to me, but it'd be great to have someone more familiar with the timer take a look too. Especially regarding the asserts.

I expect others will be able to review once we get out of vacation territory.

@wabain
Copy link
Contributor Author

wabain commented Dec 30, 2020

@Darksonn Thanks for taking a look!

@paulhauner
Copy link

We've been observing this panic since May 2020 (#2473) and are still observing it to this day in production.

I'm not trying to be pushy, just communicating that this is an issue in production and we're keen to see a fix. Thanks to the maintainers and the PR author for their great efforts.

@Darksonn Darksonn merged commit c4f66ed into tokio-rs:master Jan 21, 2021
@paulhauner
Copy link

Thank you @Darksonn, much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio-util Area: The tokio-util crate M-time Module: tokio/time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants