
net: replace RwLock<Slab> with a lock free slab #1625

Merged 94 commits from eliza/net-driver-slab into master on Oct 28, 2019

Conversation

hawkw (Member) commented Oct 2, 2019

Motivation

The tokio_net::driver module currently stores the state associated
with scheduled IO resources in a Slab implementation from the slab
crate. Because inserting items into and removing items from slab::Slab
requires mutable access, the slab must be placed within a RwLock. This
has the potential to be a performance bottleneck, especially in the
context of the work-stealing scheduler, where tasks and the reactor are
often located on
the same thread.

tokio-net currently reimplements the ShardedRwLock type from
crossbeam on top of parking_lot's RwLock in an attempt to squeeze
as much performance as possible out of the read-write lock around the
slab. This introduces several dependencies that are not used elsewhere.

Solution

This branch replaces the RwLock<Slab> with a lock-free sharded slab
implementation.

The sharded slab is based on the concept of free list sharding
described by Leijen, Zorn, and de Moura in Mimalloc: Free List
Sharding in Action, which describes the implementation of a
concurrent memory allocator. In this approach, the slab is sharded so
that each thread has its own thread-local list of slab pages. Objects
are always inserted into the local slab of the thread where the
insertion is performed. Therefore, the insert operation need not be
synchronized.

However, since objects can be removed from the slab by threads other
than the one on which they were inserted, removal operations can still
occur concurrently. Therefore, Leijen et al. introduce a concept of
local and global free lists. When an object is removed on the same
thread it was originally inserted on, it is placed on the local free
list; if it is removed on another thread, it goes on the global free
list for the heap of the thread from which it originated. To find a free
slot to insert into, the local free list is used first; if it is empty,
the entire global free list is popped onto the local free list. Since
the local free list is only ever accessed by the thread it belongs to,
it does not require synchronization at all, and because the global free
list is popped from infrequently, the cost of synchronization has a
reduced impact. A majority of insertions can occur without any
synchronization at all, and removals require synchronization only when
an object has left its parent thread.
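
To make the free-list routing concrete, here is a deliberately simplified, non-concurrent sketch of the bookkeeping (the Shard type and its fields are hypothetical; the real slab replaces these Vecs with pages of atomic slots and a lock-free remote list rather than a Mutex):

use std::sync::Mutex;

struct Shard<T> {
    // Slot storage; the real slab uses pages of atomic slots instead.
    slots: Vec<Option<T>>,
    // Touched only by the shard's owning thread: needs no synchronization.
    local_free: Vec<usize>,
    // Slots freed by *other* threads land here; a Mutex stands in for the
    // real lock-free list to keep the sketch short.
    remote_free: Mutex<Vec<usize>>,
}

impl<T> Shard<T> {
    fn insert(&mut self, value: T) -> usize {
        if self.local_free.is_empty() {
            // Only when the local list is empty, drain the entire remote
            // list into it, paying the synchronization cost once per batch.
            let mut remote = self.remote_free.lock().unwrap();
            self.local_free.append(&mut remote);
        }
        match self.local_free.pop() {
            Some(idx) => {
                self.slots[idx] = Some(value);
                idx
            }
            // No free slot anywhere: grow the shard.
            None => {
                self.slots.push(Some(value));
                self.slots.len() - 1
            }
        }
    }

    // Removal on the owning thread: unsynchronized.
    fn remove_local(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].take();
        self.local_free.push(idx);
        value
    }

    // Removal from another thread: only the synchronized list is touched
    // (clearing the slot itself is elided here; the real slab does that
    // with atomics).
    fn remove_remote(&self, idx: usize) {
        self.remote_free.lock().unwrap().push(idx);
    }
}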

The sharded slab was initially implemented in a separate crate (soon to
be released), vendored in-tree to decrease tokio-net's dependencies.
Some code from the original implementation was removed or simplified,
since it is only necessary to support tokio-net's use case, rather
than to provide a fully generic implementation.

Performance

These graphs were produced by out-of-tree criterion benchmarks of the
sharded slab implementation.

The first shows the results of a benchmark where an increasing number of
items are inserted and then removed into a slab concurrently by five
threads. It compares the performance of the sharded slab implementation
with a RwLock<slab::Slab>:

[Graph: concurrent insert/remove benchmark, five threads — sharded slab vs. RwLock<slab::Slab>]
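
The benchmarks themselves live out-of-tree and are not shown here; the following is only a rough sketch of what the contended RwLock<slab::Slab> case could look like under criterion (a hypothetical benchmark body, assuming the criterion and slab crates as dev-dependencies):

use std::sync::{Arc, RwLock};

use criterion::{criterion_group, criterion_main, Criterion};

fn insert_remove_contended(c: &mut Criterion) {
    c.bench_function("rwlock_slab/5_threads", |b| {
        b.iter(|| {
            let slab = Arc::new(RwLock::new(slab::Slab::new()));
            let threads: Vec<_> = (0..5)
                .map(|_| {
                    let slab = slab.clone();
                    std::thread::spawn(move || {
                        // Every insert and remove takes the write lock,
                        // so all five threads contend on it.
                        let keys: Vec<usize> = (0..100)
                            .map(|i| slab.write().unwrap().insert(i))
                            .collect();
                        for key in keys {
                            slab.write().unwrap().remove(key);
                        }
                    })
                })
                .collect();
            for t in threads {
                t.join().unwrap();
            }
        })
    });
}

criterion_group!(benches, insert_remove_contended);
criterion_main!(benches);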

The second graph shows the results of a benchmark where an increasing
number of items are inserted and then removed by a single thread. It
compares the performance of the sharded slab implementation with an
RwLock<slab::Slab> and a mut slab::Slab.

[Graph: single-threaded insert/remove benchmark — sharded slab vs. RwLock<slab::Slab> vs. mut slab::Slab]

Note that while the mut slab::Slab (i.e. no read-write lock) is
(unsurprisingly) faster than the sharded slab in the single-threaded
benchmark, the sharded slab outperforms the uncontended
RwLock<slab::Slab>. This case, where the lock is uncontended and only
accessed from a single thread, represents the best case for the current
use of slab in tokio-net, since the lock cannot be conditionally
removed in the single-threaded case.

These benchmarks demonstrate that, while the sharded approach introduces
a small constant-factor overhead, it offers significantly better
performance across concurrent accesses.

Notes

This branch removes the following dependencies from tokio-net:

  • parking_lot
  • num_cpus
  • crossbeam-utils
  • slab

This branch adds the following dev-dependencies:

  • proptest
  • loom

Note that these dev dependencies were used to implement tests for the
sharded-slab crate out-of-tree, and were necessary in order to vendor
the existing tests. Alternatively, since the implementation is tested
externally, we could remove these tests in order to avoid picking up
dev-dependencies. However, this means that we should try to ensure that
tokio-net's vendored implementation doesn't diverge significantly from
upstream's, since it would be missing a majority of its tests.

Signed-off-by: Eliza Weisman eliza@buoyant.io

@hawkw hawkw added the C-enhancement Category: A PR with an enhancement or bugfix. label Oct 2, 2019
@hawkw hawkw requested a review from carllerche October 2, 2019 20:21
@hawkw hawkw self-assigned this Oct 2, 2019
Review comment from a Member on the line `trait Pack: Sized {`:

Why is this a trait? I'm not seeing any bounds.

@hawkw hawkw requested a review from carllerche October 7, 2019 22:55
hawkw (Member, Author) commented Oct 10, 2019

@carllerche as per our offline conversation yesterday, I've made the following changes:

  • removed the Config trait; the page size is now overridden in tests using path attributes
  • removed the per-slot ABA guard and put it back on the reactor
  • added loom tests for token uniqueness (one of these will reliably fail without the ABA guard)

Outdated review threads (since resolved) on:

  • tokio-net/src/driver/sharded_slab/slab.rs
  • tokio-net/src/driver/sharded_slab/pack.rs
  • tokio-net/src/driver/sharded_slab/page/mod.rs
  • tokio-net/src/driver/sharded_slab/tid.rs
carllerche (Member) commented Oct 15, 2019

I've been thinking about the steps to get this fully implemented.

Currently, there is no releasing of pages. This is fine, as the current impl does not release either. However, the PR as proposed lazily creates instances of the slab per thread. This, paired with the blocking API spawning more threads, could result in a leak.

What do you think of, as an initial step, having a single slab with insertions guarded by a Mutex? All other ops should still be lock-free, which will be a huge win in and of itself; access in particular should remain lock-free. For remove, we could do a try_lock and treat the removal as local on success and as remote on failure.
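
A minimal sketch of that try_lock routing, under the stated assumptions (single shard, insertions behind a Mutex); the names are hypothetical, and the remote free list is shown as a Mutex purely for brevity where a lock-free stack would be used:

use std::sync::Mutex;

struct SingleShardSlab {
    // Insertions (and "local" removals) are serialized by this lock.
    local_free: Mutex<Vec<usize>>,
    // Stand-in for a lock-free stack of remotely freed slots.
    remote_free: Mutex<Vec<usize>>,
}

impl SingleShardSlab {
    fn remove(&self, idx: usize) {
        // try_lock never blocks: if the insertion lock happens to be
        // free, recycle the slot as if this were a local removal...
        if let Ok(mut local) = self.local_free.try_lock() {
            local.push(idx);
        } else {
            // ...otherwise fall back to the remote path.
            self.remote_free.lock().unwrap().push(idx);
        }
    }
}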

debug_assert!(self.slab.with(|s| unsafe { (*s).is_none() }));

let mut slab = Vec::with_capacity(self.size);
slab.extend((1..self.size).map(Slot::new));
Member (carllerche) commented:

We don't want to eagerly initialize slots. This could be very expensive. We want to amortize the cost by doing it lazily.

Member Author (hawkw) commented:
@carllerche we need to allocate the whole page eagerly (rather than using Vec::push) to avoid reallocating while data is being accessed by another thread. Do you mean that we could allocate uninitialized memory and only initialize the slots as needed? Because (AFAICT) that calls for an additional atomic tracking the last written slot.

Member Author (hawkw) commented:

actually...it occurs to me just now that we don't need to use MaybeUninit and a counter tracking the last initialized slot. We can just add new slots with Vec::push, and use Vec::with_capacity to avoid ever triggering a realloc in push. That way, we initialize the slots lazily but have a single allocation that we never need to reallocate.

Since pushes would only occur in insert, which is called only from the local thread (or when the slab is locked in the single-shard case), we don't need to use an atomic.
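
A sketch of that push-based scheme (the Slot and Page shapes are hypothetical): with_capacity reserves the whole page in one allocation, so the later push calls can never reallocate and never move slots another thread might be reading:

struct Slot<T> {
    item: Option<T>,
    next: usize, // index of the next slot on the free list
}

struct Page<T> {
    slots: Vec<Slot<T>>,
    capacity: usize,
}

impl<T> Page<T> {
    fn new(capacity: usize) -> Self {
        // A single allocation; no slots are initialized yet.
        Page {
            slots: Vec::with_capacity(capacity),
            capacity,
        }
    }

    // Called only from the owning thread (or under the lock in the
    // single-shard case), so no atomics are needed.
    fn alloc_slot(&mut self) -> Option<usize> {
        if self.slots.len() < self.capacity {
            let idx = self.slots.len();
            // Initialize the slot lazily; capacity was fully reserved,
            // so this push cannot trigger a realloc.
            self.slots.push(Slot { item: None, next: idx + 1 });
            Some(idx)
        } else {
            None // page is full; the caller falls back to the free lists
        }
    }
}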

Member Author (hawkw) commented:

@carllerche The approach I mentioned above does appear to work, but it makes it much harder to test the code using loom (the Vec mutation can occur concurrently with accesses which we know are to slots that will not be mutated, but loom doesn't know that). Using CausalCell<MaybeUninit<Slot>> plays much nicer with loom, as the mutation can be tracked at the level of each slot in the slab.

I have a branch that makes that change: https://github.com/tokio-rs/tokio/compare/eliza/net-driver-slab...eliza/net-driver-slab-lazy-init?expand=1#diff-398083c85d9c10f01b1c6aeb84bcdaf1 However, I'm not sure if this is worth it; as far as I can tell, this requires adding an atomic tracking the last slot that was initialized, and insert, get, and remove operations all have to access that atomic. I'd be surprised if lazily initializing slots makes a big enough performance difference to be worth adding so much more unsafe code. What do you think?
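
For comparison, a rough sketch of the MaybeUninit alternative with the extra atomic high-water mark (hypothetical shapes; the linked branch additionally wraps each slot in loom's CausalCell so the per-slot mutation can be tracked):

use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Page<T> {
    slots: Box<[MaybeUninit<T>]>,
    // The extra atomic: insert, get, and remove must all consult it.
    initialized: AtomicUsize,
}

impl<T> Page<T> {
    fn with_capacity(capacity: usize) -> Self {
        // Allocate the whole page up front, leaving every slot uninitialized.
        let slots = (0..capacity).map(|_| MaybeUninit::uninit()).collect();
        Page {
            slots,
            initialized: AtomicUsize::new(0),
        }
    }

    // Insertions happen on the owning thread, hence &mut self here.
    fn init_next_slot(&mut self, value: T) -> usize {
        let idx = self.initialized.load(Ordering::Relaxed);
        self.slots[idx] = MaybeUninit::new(value);
        // Publish the write; pairs with the Acquire load in `get`.
        self.initialized.store(idx + 1, Ordering::Release);
        idx
    }

    fn get(&self, idx: usize) -> Option<&T> {
        // Reject indices beyond the initialized prefix before the
        // unsafe read.
        if idx >= self.initialized.load(Ordering::Acquire) {
            return None;
        }
        // Safety: every slot below `initialized` was fully written
        // before the counter was bumped with Release ordering.
        Some(unsafe { &*self.slots[idx].as_ptr() })
    }
}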

The `debug_assert!` family of macros is guarded by an
`if cfg!(debug_assertions)`, _not_ a `#[cfg(debug_assertions)]`
attribute. This means that the code in the assertion is still type
checked in release mode, and using code that only exists in debug mode
will thus fail in release mode. See rust-lang/rust#62527.

Therefore, since the `shard.tid` field has a `#[cfg(debug_assertions)]`
attribute on it, it's necessary to also put that attribute on the
assertions that use that field.

cc @kleimkuhler, since you had previously asked why putting the cfg
attribute on these assertions was necessary
(#1625 (comment)). I
thought you might be interested in seeing the answer. :)

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
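
A minimal reproduction of the pitfall (hypothetical field and function names):

struct Shard {
    #[cfg(debug_assertions)]
    tid: usize,
}

impl Shard {
    fn check(&self, _expected: usize) {
        // Broken in release mode: the runtime check is disabled, but the
        // expression is still type checked, and `tid` does not exist.
        //
        //     debug_assert_eq!(self.tid, _expected);
        //
        // Correct: strip the whole statement when the field is absent.
        #[cfg(debug_assertions)]
        debug_assert_eq!(self.tid, _expected);
    }
}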
@hawkw hawkw merged commit 7eb264a into master Oct 28, 2019
@carllerche carllerche deleted the eliza/net-driver-slab branch October 30, 2019 03:31
carllerche added a commit that referenced this pull request Aug 12, 2020
The I/O driver uses a slab to store per-resource state. Doing this
provides two benefits. First, allocating state is streamlined. Second,
resources may be safely indexed using a `usize` type. The `usize` is
passed to the OS's selector when registering to receive events.

The original slab implementation used a `Vec` backed by `RwLock`. This
primarily caused contention when reading state. This implementation also
only **grew** the slab capacity but never shrank. In #1625, the slab was
rewritten to use a lock-free strategy. The lock contention was removed
but this implementation was still grow-only.

This change adds the ability to release memory. Similar to the previous
implementation, it structures the slab to use a vector of pages. This
enables growing the slab without having to move any previous entries. It
also adds the ability to release pages. This is done by introducing a
lock when allocating/releasing slab entries. This does not impact
benchmarks, primarily due to the existing implementation not being
"done" and also having a lock around allocating and releasing.

A `Slab::compact()` function is added. Pages are iterated. When a page
is found with no slots in use, the page is freed. The `compact()`
function is called occasionally by the I/O driver.

Fixes #2505
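
A sketch of what such a compaction pass could look like (hypothetical Page and Slab shapes, not the actual tokio code):

struct Page<T> {
    // Backing storage; None once the page has been released.
    slots: Option<Box<[Option<T>]>>,
    // Number of slots currently in use.
    used: usize,
}

struct Slab<T> {
    pages: Vec<Page<T>>,
}

impl<T> Slab<T> {
    // Walk the pages and free the backing storage of any page with no
    // live entries; called occasionally by the I/O driver.
    fn compact(&mut self) {
        for page in &mut self.pages {
            if page.used == 0 {
                // Drop the allocation but keep the page header, so
                // indices into later pages remain valid.
                page.slots = None;
            }
        }
    }
}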