
Write a parallel deque for work stealing #4877

Closed
brson opened this issue Feb 10, 2013 · 14 comments · Fixed by #10678
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

brson (Contributor) commented Feb 10, 2013

The work-stealing algorithm uses a deque. That deque has a property that might make further optimizations possible later: one end is used only by a single thread. We only need something simple to start with: a locked vector that just pushes, pops, shifts, and unshifts. A circular buffer would be better, and lock-free better still (maybe).
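For illustration, a minimal sketch of that simple starting point in modern Rust: one mutex around a growable ring buffer, with the four operations named above. The names here are illustrative, not taken from the runtime.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// A minimal locked work queue of the kind described above: one mutex
// around a ring buffer, with push/pop at one end and shift/unshift at
// the other.
struct LockedDeque<T> {
    inner: Mutex<VecDeque<T>>,
}

impl<T> LockedDeque<T> {
    fn new() -> Self {
        LockedDeque { inner: Mutex::new(VecDeque::new()) }
    }
    fn push(&self, v: T) { self.inner.lock().unwrap().push_back(v) }
    fn pop(&self) -> Option<T> { self.inner.lock().unwrap().pop_back() }
    fn unshift(&self, v: T) { self.inner.lock().unwrap().push_front(v) }
    fn shift(&self) -> Option<T> { self.inner.lock().unwrap().pop_front() }
}
```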

Also probably relevant is the paper on data locality in work stealing, though I haven't read it yet.

There are some useful data structures for atomically reference-counted types and mutexes in core::private.

brson (Contributor, author) commented Feb 10, 2013

Related to #3095

brson (Contributor, author) commented Feb 15, 2013

Since I'm not sure that lock-free deques even exist, I've scaled back the scope of this slightly.

nikomatsakis (Contributor) commented:

The deques used in work stealing typically have a special property: one thread (the owner) only ever PUSHES and POPS at one end, and other threads (the thieves) only ever DEQUEUE from the other; nobody ever enqueues there. This allows a more efficient multi-threaded implementation, but it means these deques are not general-purpose.
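To make that concrete, here is a hypothetical sketch of the API shape this property implies: the owner holds one handle, the thieves each hold another. It is backed here by a trivial lock just to show the interface; a real version would be lock-free. All names are illustrative.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical split-handle API implied by the single-owner property.
struct Worker<T> { q: Arc<Mutex<VecDeque<T>>> }
struct Stealer<T> { q: Arc<Mutex<VecDeque<T>>> }

fn deque<T>() -> (Worker<T>, Stealer<T>) {
    let q = Arc::new(Mutex::new(VecDeque::new()));
    (Worker { q: q.clone() }, Stealer { q })
}

impl<T> Worker<T> {
    fn push(&self, v: T) { self.q.lock().unwrap().push_back(v) }   // owner's end
    fn pop(&self) -> Option<T> { self.q.lock().unwrap().pop_back() }
}

impl<T> Clone for Stealer<T> {
    fn clone(&self) -> Self { Stealer { q: self.q.clone() } }       // one handle per thief
}

impl<T> Stealer<T> {
    fn steal(&self) -> Option<T> { self.q.lock().unwrap().pop_front() } // thieves' end
}
```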

brson (Contributor, author) commented Feb 20, 2013

Somebody pointed out that the Chase-Lev deque is a lock-free parallel deque for work stealing. Sounds good to me.
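For reference, the core of the Chase-Lev algorithm is small. Below is a hedged sketch in modern Rust, fixed-capacity and with SeqCst orderings throughout; growing the buffer (the hard part, discussed later in this thread) is omitted, and all names are illustrative.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicIsize, Ordering::SeqCst};

// Fixed-capacity Chase-Lev deque sketch. The owner calls push/pop;
// any number of thieves call steal. Not production-ready.
struct ChaseLev<T> {
    top: AtomicIsize,        // thieves' end
    bottom: AtomicIsize,     // owner's end
    buf: Vec<UnsafeCell<T>>, // ring buffer, capacity fixed at creation
}

unsafe impl<T: Send> Sync for ChaseLev<T> {}

impl<T: Copy + Default> ChaseLev<T> {
    fn with_capacity(cap: usize) -> Self {
        ChaseLev {
            top: AtomicIsize::new(0),
            bottom: AtomicIsize::new(0),
            buf: (0..cap).map(|_| UnsafeCell::new(T::default())).collect(),
        }
    }

    fn slot(&self, i: isize) -> *mut T {
        self.buf[i as usize % self.buf.len()].get()
    }

    // Owner only: push onto the bottom end.
    fn push(&self, v: T) {
        let b = self.bottom.load(SeqCst);
        let t = self.top.load(SeqCst);
        assert!(b - t < self.buf.len() as isize, "deque full");
        unsafe { *self.slot(b) = v };
        self.bottom.store(b + 1, SeqCst);
    }

    // Owner only: pop from the bottom end, racing thieves for the last item.
    fn pop(&self) -> Option<T> {
        let b = self.bottom.load(SeqCst) - 1;
        self.bottom.store(b, SeqCst);
        let t = self.top.load(SeqCst);
        if t > b {
            self.bottom.store(b + 1, SeqCst); // empty: restore bottom
            return None;
        }
        let v = unsafe { *self.slot(b) };
        if t == b {
            // Last item: settle the race with thieves via CAS on top.
            let won = self.top.compare_exchange(t, t + 1, SeqCst, SeqCst).is_ok();
            self.bottom.store(b + 1, SeqCst);
            return if won { Some(v) } else { None };
        }
        Some(v)
    }

    // Thieves: steal from the top end; a failed CAS means we lost a race.
    fn steal(&self) -> Option<T> {
        let t = self.top.load(SeqCst);
        let b = self.bottom.load(SeqCst);
        if t >= b {
            return None; // empty
        }
        let v = unsafe { *self.slot(t) };
        self.top
            .compare_exchange(t, t + 1, SeqCst, SeqCst)
            .ok()
            .map(|_| v)
    }
}
```

The key design point is that the owner and the thieves only ever contend over the last remaining element, and that race is arbitrated by the CAS on top.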

ILyoan (Contributor) commented Apr 30, 2013

Is there any progress on this?

brson (Contributor, author) commented May 2, 2013

@ILyoan No. There is a type in place at rt::work_queue, but it is not implemented.

brson (Contributor, author) commented May 16, 2013

A related paper: "A Dynamic-Sized Nonblocking Work Stealing Deque" by Hendler, Lev, Moir, and Shavit (http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf).

emberian (Member) commented:

@Aatch was working on this at some point but stopped.

toddaaro (Contributor) commented:

http://www.di.ens.fr/~zappa/readings/ppopp13.pdf

This recent paper is a wonderfully detailed description of the atomic memory-ordering issues involved with the data structure. It includes pseudocode for a C11 memory-model version with every atomic operation specified.

cartazio (Contributor) commented Sep 4, 2013

A good example implementation of such a scheduler can be found in this Haskell library: https://github.com/ekmett/structures/blob/master/src/Control/Concurrent/Deque.hs

brson (Contributor, author) commented Oct 17, 2013

I am wondering whether we really want to use a work-stealing deque as our per-thread work queue. Something that allows random access would prevent the starvation problems we have to hack around, and would be more fair. Is there such a data structure that is lock-free?

toffaletti (Contributor) commented:

I've got a work-in-progress implementation of the Chase-Lev deque. I started working from the C11 paper, but I found some bugs and omissions in its implementation, so I'm having to go back to the original paper. Any help would be appreciated. I'm currently trying to wrap my head around Section 4 of the paper, which discusses how to adapt the algorithm to grow and shrink the array without a garbage collector.

https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs

Even if this isn't used for the scheduler, I've been told it might be useful for Servo rendering work.

cartazio (Contributor) commented Nov 6, 2013

If you want to see a worked-out Chase-Lev implementation that's pretty readable, look at the deque branch of Edward Kmett's structures library: https://github.com/ekmett/structures/blob/deque/src/Control/Concurrent/Deque.hs

It uses some other primops, which you can see defined here: http://hackage.haskell.org/package/atomic-primops-0.4/docs/Data-Atomics-Counter-Reference.html

Some of that may not be relevant in a no-GC context, but it at least gives a pretty readable working implementation to look at explicitly.

toffaletti (Contributor) commented:

Thanks, @cartazio, I will take a look. I haven't seen a fully working implementation that does array resizing and reclamation without a GC. The original paper is light on specifics, just outlining a solution and then saying in a footnote: "It is straightforward, however, to use the same solution for reclaiming buffers also when growing."

I decided to use the Relacy Race Detector to help get a correct implementation, because Helgrind and DRD report too many false positives. That attempt is here: https://github.com/toffaletti/chase-lev

It currently passes the tests, but I've gone the extremely heavy-handed route of making all atomic accesses use memory_order_seq_cst. I'll work on relaxing that and fixing any other bugs.
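As a rough illustration of what that relaxation looks like, here is the push path with weakened orderings, extending the fixed-capacity ChaseLev sketch from earlier in this thread and loosely following the shape of the C11 paper. Assumptions: pop and steal additionally need SeqCst fences, which are omitted here, and this is a sketch rather than a verified relaxation.

```rust
use std::sync::atomic::{fence, Ordering::{Acquire, Relaxed, Release}};

impl<T: Copy + Default> ChaseLev<T> {
    // Sketch of a relaxed push. Only the owner ever writes `bottom`,
    // so its load can be Relaxed; the Acquire load of `top`
    // synchronizes with thieves' successful CAS operations.
    fn push_relaxed(&self, v: T) {
        let b = self.bottom.load(Relaxed);
        let t = self.top.load(Acquire);
        assert!(b - t < self.buf.len() as isize, "deque full");
        unsafe { *self.slot(b) = v };
        fence(Release);                    // publish the element...
        self.bottom.store(b + 1, Relaxed); // ...before advancing bottom
    }
}
```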

alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 29, 2013
This adds an implementation of the Chase-Lev work-stealing deque to libstd
under std::rt::deque. I've been unable to break the implementation of the deque
itself, and it's not super highly optimized just yet (everything uses a SeqCst
memory ordering).

The major snag in implementing the Chase-Lev deque is that the buffers used to
store data internally cannot be deallocated back to the OS. In the meantime, a
shared buffer pool (synchronized by a normal mutex) is used to allocate and
deallocate buffers. This is done in the hope of not overcommitting too much
memory. It is in theory possible to eventually free the buffers, but one must
be very careful in doing so.

I was unable to get good numbers from the src/test/bench tests (I don't think
many of them slam the work queue that much), but I was able to get good numbers
from one of my own tests. In a recent rewrite of select::select(), I found that
my implementation was incredibly slow due to contention on the shared work
queue. Upon switching to the parallel deque, I saw the contention drop to zero
and the runtime go from 1.6s to 0.9s, with most of the time spent in libuv
waking up the schedulers (plus allocations).

Closes rust-lang#4877
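A hypothetical sketch of the buffer-pool idea described in the commit message: retired ring buffers are parked behind a plain mutex and reused rather than returned to the OS. Names are illustrative, not taken from the actual patch.

```rust
use std::sync::Mutex;

// Mutex-guarded free list of retired buffers. Memory is recycled
// instead of being deallocated back to the OS.
struct BufferPool<T> {
    pool: Mutex<Vec<Vec<T>>>,
}

impl<T> BufferPool<T> {
    fn new() -> Self {
        BufferPool { pool: Mutex::new(Vec::new()) }
    }

    // Hand out a buffer with at least `cap` slots, reusing one if possible.
    fn alloc(&self, cap: usize) -> Vec<T> {
        let mut pool = self.pool.lock().unwrap();
        match pool.iter().position(|b| b.capacity() >= cap) {
            Some(i) => pool.swap_remove(i),
            None => Vec::with_capacity(cap),
        }
    }

    // Return a buffer to the pool instead of freeing it.
    fn free(&self, mut buf: Vec<T>) {
        buf.clear();
        self.pool.lock().unwrap().push(buf);
    }
}
```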
bors added two commits that referenced this issue Nov 29, 2013, each with the same message as above.