net: replace RwLock<Slab> with a lock free slab (#1625)
## Motivation

The `tokio_net::driver` module currently stores the state associated
with scheduled IO resources in a `Slab` implementation from the `slab`
crate. Because inserting items into and removing items from `slab::Slab`
require mutable access, the slab must be placed within a `RwLock`. This
has the potential to become a performance bottleneck, especially in the
context of the work-stealing scheduler, where tasks and the reactor are
often located on the same thread.

`tokio-net` currently reimplements the `ShardedRwLock` type from
`crossbeam` on top of `parking_lot`'s `RwLock` in an attempt to squeeze
as much performance as possible out of the read-write lock around the
slab. This introduces several dependencies that are not used elsewhere.

## Solution

This branch replaces the `RwLock<Slab>` with a lock-free sharded slab
implementation. 

The sharded slab is based on the concept of _free list sharding_
described by Leijen, Zorn, and de Moura in [_Mimalloc: Free List
Sharding in Action_][mimalloc], which describes the implementation of a
concurrent memory allocator. In this approach, the slab is sharded so
that each thread has its own thread-local list of slab _pages_. Objects
are always inserted into the local slab of the thread where the
insertion is performed. Therefore, the insert operation need not be
synchronized.

However, since objects can be _removed_ from the slab by threads other
than the one on which they were inserted, removal operations can still
occur concurrently. Therefore, Leijen et al. introduce a concept of
_local_ and _global_ free lists. When an object is removed on the same
thread it was originally inserted on, it is placed on the local free
list; if it is removed on another thread, it goes on the global free
list for the heap of the thread from which it originated. To find a free
slot to insert into, the local free list is used first; if it is empty,
the entire global free list is popped onto the local free list. Since
the local free list is only ever accessed by the thread it belongs to,
it does not require synchronization at all, and because the global free
list is popped from infrequently, the cost of synchronization has a
reduced impact. The majority of insertions can occur without any
synchronization at all, and removals require synchronization only when
an object has left its parent thread.
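
As a rough sketch of this scheme, the insert path looks something like the
following (names are illustrative rather than the vendored types, and a
`Mutex`-guarded `Vec` stands in for the lock-free free list that the real
implementation uses):

```rust
use std::sync::Mutex;

/// One thread's shard. `insert` and `remove_local` are called only by the
/// owning thread; `remove_remote` may be called from any thread.
struct Shard<T> {
    slots: Vec<Option<T>>,
    /// Local free list: touched only by the owning thread, so it needs no
    /// synchronization at all.
    local_free: Vec<usize>,
    /// Global free list: remote threads push freed indices here. A `Mutex`
    /// keeps the sketch short; the vendored slab uses a lock-free list.
    global_free: Mutex<Vec<usize>>,
}

impl<T> Shard<T> {
    fn insert(&mut self, value: T) -> usize {
        // Fast path: pop from the unsynchronized local free list.
        let idx = self.local_free.pop().or_else(|| {
            // Slow path: drain the *entire* global free list into the local
            // one in a single synchronized operation.
            let mut global = self.global_free.lock().unwrap();
            self.local_free.append(&mut global);
            self.local_free.pop()
        });
        match idx {
            Some(idx) => {
                self.slots[idx] = Some(value);
                idx
            }
            None => {
                // No free slots anywhere: grow the shard.
                self.slots.push(Some(value));
                self.slots.len() - 1
            }
        }
    }

    /// Removal on the thread that owns the shard: unsynchronized.
    fn remove_local(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].take();
        self.local_free.push(idx);
        value
    }

    /// Removal from any other thread: the only synchronized path. (For
    /// brevity, the old value here is dropped when the slot is reused.)
    fn remove_remote(&self, idx: usize) {
        self.global_free.lock().unwrap().push(idx);
    }
}
```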

The sharded slab was initially implemented in a separate crate (soon to
be released), and is vendored in-tree to reduce `tokio-net`'s dependencies.
Some code from the original implementation was removed or simplified,
since it is only necessary to support `tokio-net`'s use case, rather
than to provide a fully generic implementation.

[mimalloc]: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf

## Performance

These graphs were produced by out-of-tree `criterion` benchmarks of the
sharded slab implementation.


The first shows the results of a benchmark in which an increasing number
of items are inserted into and then removed from a slab concurrently by
five threads. It compares the performance of the sharded slab
implementation with a `RwLock<slab::Slab>`:

<img width="1124" alt="Screen Shot 2019-10-01 at 5 09 49 PM" src="https://user-images.githubusercontent.com/2796466/66078398-cd6c9f80-e516-11e9-9923-0ed6292e8498.png">

The second graph shows the results of a benchmark where an increasing
number of items are inserted and then removed by a _single_ thread. It
compares the performance of the sharded slab implementation with an
`RwLock<slab::Slab>` and a `mut slab::Slab`.

<img width="925" alt="Screen Shot 2019-10-01 at 5 13 45 PM" src="https://user-images.githubusercontent.com/2796466/66078469-f0974f00-e516-11e9-95b5-f65f0aa7e494.png">

Note that while the `mut slab::Slab` (i.e. no read-write lock) is
(unsurprisingly) faster than the sharded slab in the single-threaded
benchmark, the sharded slab outperforms the uncontended
`RwLock<slab::Slab>`. This case, where the lock is uncontended and only
accessed from a single thread, represents the best case for the current
use of `slab` in `tokio-net`, since the lock cannot be conditionally
removed in the single-threaded case.

These benchmarks demonstrate that, while the sharded approach introduces
a small constant-factor overhead, it offers significantly better
performance across concurrent accesses.

## Notes

This branch removes the following dependencies from `tokio-net`:
- `parking_lot`
- `num_cpus`
- `crossbeam-utils`
- `slab`

This branch adds the following dev-dependencies:
- `proptest`
- `loom`

Note that these dev-dependencies were used to implement tests for the
sharded-slab crate out-of-tree, and were necessary in order to vendor
the existing tests. Alternatively, since the implementation is tested
externally, we _could_ remove these tests to avoid picking up the
dev-dependencies. However, we would then need to take care that
`tokio-net`'s vendored implementation doesn't diverge significantly from
upstream's, since it would be missing the majority of its tests.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw committed Oct 28, 2019
1 parent 1195263 commit 7eb264a
Showing 28 changed files with 2,326 additions and 314 deletions.
1 change: 1 addition & 0 deletions azure-pipelines.yml
@@ -82,6 +82,7 @@ jobs:
    rust: beta
    crates:
      - tokio-executor
      - tokio

# Try cross compiling
- template: ci/azure-cross-compile.yml
10 changes: 9 additions & 1 deletion tokio/Cargo.toml
@@ -41,7 +41,7 @@ io-util = ["io-traits", "pin-project", "memchr"]
io = ["io-traits", "io-util"]
macros = ["tokio-macros"]
net-full = ["tcp", "udp", "uds"]
net-driver = ["mio", "tokio-executor/blocking"]
net-driver = ["mio", "tokio-executor/blocking", "lazy_static"]
rt-current-thread = [
"timer",
"tokio-executor/current-thread",
@@ -113,6 +113,10 @@ version = "0.3.8"
default-features = false
optional = true

[target.'cfg(loom)'.dependencies]
# play nice with loom tests in other crates.
loom = "0.2.11"

[dev-dependencies]
tokio-test = { version = "=0.2.0-alpha.6", path = "../tokio-test" }
tokio-util = { version = "=0.2.0-alpha.6", path = "../tokio-util" }
@@ -130,5 +134,9 @@ serde_json = "1.0"
tempfile = "3.1.0"
time = "0.1"

# sharded slab tests
loom = "0.2.11"
proptest = "0.9.4"

[package.metadata.docs.rs]
all-features = true
4 changes: 3 additions & 1 deletion tokio/src/lib.rs
@@ -69,7 +69,6 @@
//! }
//! }
//! ```

macro_rules! if_runtime {
    ($($i:item)*) => ($(
        #[cfg(any(
@@ -97,6 +96,9 @@ pub mod io;
#[cfg(feature = "net-driver")]
pub mod net;

#[cfg(feature = "net-driver")]
mod loom;

pub mod prelude;

#[cfg(feature = "process")]
45 changes: 45 additions & 0 deletions tokio/src/loom.rs
@@ -0,0 +1,45 @@
//! This module abstracts over `loom` and `std::sync` depending on whether we
//! are running tests or not.
pub(crate) use self::inner::*;

#[cfg(all(test, loom))]
mod inner {
    pub(crate) use loom::sync::CausalCell;
    pub(crate) use loom::sync::Mutex;
    pub(crate) mod atomic {
        pub(crate) use loom::sync::atomic::*;
        pub(crate) use std::sync::atomic::Ordering;
    }
}

#[cfg(not(all(test, loom)))]
mod inner {
    use std::cell::UnsafeCell;
    pub(crate) use std::sync::atomic;
    pub(crate) use std::sync::Mutex;

    #[derive(Debug)]
    pub(crate) struct CausalCell<T>(UnsafeCell<T>);

    impl<T> CausalCell<T> {
        pub(crate) fn new(data: T) -> CausalCell<T> {
            CausalCell(UnsafeCell::new(data))
        }

        #[inline(always)]
        pub(crate) fn with<F, R>(&self, f: F) -> R
        where
            F: FnOnce(*const T) -> R,
        {
            f(self.0.get())
        }

        #[inline(always)]
        pub(crate) fn with_mut<F, R>(&self, f: F) -> R
        where
            F: FnOnce(*mut T) -> R,
        {
            f(self.0.get())
        }
    }
}
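
For illustration, a hypothetical caller of `CausalCell` (the `Slot` type
below is invented, not part of this commit) might look like this:

```rust
use crate::loom::CausalCell;

struct Slot {
    readiness: CausalCell<usize>,
}

impl Slot {
    fn new() -> Self {
        Slot {
            readiness: CausalCell::new(0),
        }
    }

    fn set_readiness(&self, val: usize) {
        // The caller promises exclusive access for the duration of the
        // closure. Under `cfg(loom)` that promise is checked by the model
        // checker; otherwise this is a plain pointer write with no overhead.
        self.readiness.with_mut(|ptr| unsafe { *ptr = val });
    }

    fn readiness(&self) -> usize {
        self.readiness.with(|ptr| unsafe { *ptr })
    }
}
```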
8 changes: 8 additions & 0 deletions tokio/src/net/driver/mod.rs
@@ -124,7 +124,15 @@
//! [`PollEvented`]: struct.PollEvented.html
//! [`std::io::Read`]: https://doc.rust-lang.org/std/io/trait.Read.html
//! [`std::io::Write`]: https://doc.rust-lang.org/std/io/trait.Write.html
#[cfg(loom)]
macro_rules! loom_thread_local {
    ($($tts:tt)+) => { loom::thread_local!{ $($tts)+ } }
}

#[cfg(not(loom))]
macro_rules! loom_thread_local {
    ($($tts:tt)+) => { std::thread_local!{ $($tts)+ } }
}
pub(crate) mod platform;
mod reactor;
mod registration;
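
As a usage sketch, a hypothetical call site for the `loom_thread_local!`
macro above (the `SHARD_ID` thread-local is invented for illustration):

```rust
loom_thread_local! {
    // The index of this thread's shard in the slab (hypothetical).
    static SHARD_ID: std::cell::Cell<usize> = std::cell::Cell::new(0);
}

// Reads and writes go through `with`, exactly as with `std::thread_local!`:
// SHARD_ID.with(|id| id.set(id.get() + 1));
```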
53 changes: 53 additions & 0 deletions tokio/src/net/driver/reactor/dispatch/iter.rs
@@ -0,0 +1,53 @@
use super::{
    page::{self, ScheduledIo},
    Shard,
};
use std::slice;

pub(in crate::net::driver::reactor) struct UniqueIter<'a> {
    pub(super) shards: slice::IterMut<'a, Shard>,
    pub(super) pages: slice::Iter<'a, page::Shared>,
    pub(super) slots: Option<page::Iter<'a>>,
}

impl<'a> Iterator for UniqueIter<'a> {
    type Item = &'a ScheduledIo;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(item) = self.slots.as_mut().and_then(|slots| slots.next()) {
                return Some(item);
            }

            if let Some(page) = self.pages.next() {
                self.slots = page.iter();
            } else if let Some(shard) = self.shards.next() {
                self.pages = shard.iter();
            } else {
                return None;
            }
        }
    }
}

pub(in crate::net::driver::reactor) struct ShardIter<'a> {
    pub(super) pages: slice::IterMut<'a, page::Shared>,
    pub(super) slots: Option<page::Iter<'a>>,
}

impl<'a> Iterator for ShardIter<'a> {
    type Item = &'a ScheduledIo;
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            if let Some(item) = self.slots.as_mut().and_then(|slots| slots.next()) {
                return Some(item);
            }
            if let Some(page) = self.pages.next() {
                self.slots = page.iter();
            } else {
                return None;
            }
        }
    }
}
36 changes: 36 additions & 0 deletions tokio/src/net/driver/reactor/dispatch/mod.rs
@@ -0,0 +1,36 @@
//! A lock-free concurrent slab.

#[cfg(all(test, loom))]
macro_rules! test_println {
    ($($arg:tt)*) => {
        println!("{:?} {}", crate::net::driver::reactor::dispatch::Tid::current(), format_args!($($arg)*))
    }
}

mod iter;
mod pack;
mod page;
mod sharded_slab;
mod tid;

#[cfg(all(test, loom))]
// this is used by sub-modules
use self::tests::test_util;
use pack::{Pack, WIDTH};
use sharded_slab::Shard;
#[cfg(all(test, loom))]
pub(crate) use sharded_slab::Slab;
pub(crate) use sharded_slab::{SingleShard, MAX_SOURCES};
use tid::Tid;

#[cfg(target_pointer_width = "64")]
const MAX_THREADS: usize = 4096;
#[cfg(target_pointer_width = "32")]
const MAX_THREADS: usize = 2048;
const INITIAL_PAGE_SIZE: usize = 32;
const MAX_PAGES: usize = WIDTH / 4;
// Chosen arbitrarily.
const RESERVED_BITS: usize = 5;

#[cfg(test)]
mod tests;
89 changes: 89 additions & 0 deletions tokio/src/net/driver/reactor/dispatch/pack.rs
@@ -0,0 +1,89 @@
pub(super) const WIDTH: usize = std::mem::size_of::<usize>() * 8;

/// Trait encapsulating the calculations required for bit-packing slab indices.
///
/// This allows us to avoid manually repeating some calculations when packing
/// and unpacking indices.
pub(crate) trait Pack: Sized {
    // ====== provided by each implementation =================================

    /// The number of bits occupied by this type when packed into a usize.
    ///
    /// This must be provided to determine the number of bits into which to pack
    /// the type.
    const LEN: usize;

    /// The type packed on the less significant side of this type.
    ///
    /// If this type is packed into the least significant bits of a usize, this
    /// should be `()`, which occupies no bytes.
    ///
    /// This is used to calculate the shift amount for packing this value.
    type Prev: Pack;

    // ====== calculated automatically ========================================

    /// A number consisting of `Self::LEN` 1 bits, starting at the least
    /// significant bit.
    ///
    /// This is the highest value this type can represent. This number is shifted
    /// left by `Self::SHIFT` bits to calculate this type's `MASK`.
    ///
    /// This is computed automatically based on `Self::LEN`.
    const BITS: usize = {
        let shift = 1 << (Self::LEN - 1);
        shift | (shift - 1)
    };

    /// The number of bits to shift a number to pack it into a usize with other
    /// values.
    ///
    /// This is calculated automatically based on the `LEN` and `SHIFT` constants
    /// of the previous value.
    const SHIFT: usize = Self::Prev::SHIFT + Self::Prev::LEN;

    /// The mask to extract only this type from a packed `usize`.
    ///
    /// This is calculated by shifting `Self::BITS` left by `Self::SHIFT`.
    const MASK: usize = Self::BITS << Self::SHIFT;

    fn as_usize(&self) -> usize;
    fn from_usize(val: usize) -> Self;

    #[inline(always)]
    fn pack(&self, to: usize) -> usize {
        let value = self.as_usize();
        debug_assert!(value <= Self::BITS);

        (to & !Self::MASK) | (value << Self::SHIFT)
    }

    #[inline(always)]
    fn from_packed(from: usize) -> Self {
        let value = (from & Self::MASK) >> Self::SHIFT;
        debug_assert!(value <= Self::BITS);
        Self::from_usize(value)
    }
}

impl Pack for () {
    const BITS: usize = 0;
    const LEN: usize = 0;
    const SHIFT: usize = 0;
    const MASK: usize = 0;

    type Prev = ();

    fn as_usize(&self) -> usize {
        unreachable!()
    }
    fn from_usize(_val: usize) -> Self {
        unreachable!()
    }

    fn pack(&self, _to: usize) -> usize {
        unreachable!()
    }

    fn from_packed(_from: usize) -> Self {
        unreachable!()
    }
}
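
To make the packing arithmetic concrete, here is a hypothetical pair of
`Pack` implementations (illustrative only; these are not the actual index
types from this commit) that pack a page index directly above a slot index:

```rust
struct SlotIdx(usize);
struct PageIdx(usize);

impl Pack for SlotIdx {
    const LEN: usize = 6; // BITS = 0b11_1111, SHIFT = 0, MASK = 0b11_1111
    type Prev = ();       // packed at the least significant bits
    fn as_usize(&self) -> usize {
        self.0
    }
    fn from_usize(val: usize) -> Self {
        SlotIdx(val)
    }
}

impl Pack for PageIdx {
    const LEN: usize = 4; // BITS = 0b1111, SHIFT = 6, MASK = 0b1111 << 6
    type Prev = SlotIdx;  // shifted left past SlotIdx's 6 bits
    fn as_usize(&self) -> usize {
        self.0
    }
    fn from_usize(val: usize) -> Self {
        PageIdx(val)
    }
}

// Packing page 3 and slot 9 into a single usize:
//
//     let key = PageIdx(3).pack(SlotIdx(9).pack(0));
//     assert_eq!(key, (3 << 6) | 9);
//     assert_eq!(SlotIdx::from_packed(key).0, 9);
//     assert_eq!(PageIdx::from_packed(key).0, 3);
```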
