
Write a parallel deque for work stealing #4877

Closed
brson opened this issue Feb 10, 2013 · 14 comments · Fixed by #10678
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

brson (Contributor) commented Feb 10, 2013

The work-stealing algorithm uses a deque. That deque has a property that might make further optimizations possible later: one end is used only by a single thread. We only need something simple to start with: a locked vector that just pushes, pops, shifts, and unshifts. A circular buffer would be better, and lock-free better still (maybe).
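For illustration, a minimal sketch of that simple starting point in modern Rust: one mutex around a growable ring buffer, with the four operations named above. The names here are illustrative, not taken from the runtime.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// A minimal locked work queue of the kind described above: one mutex
// around a ring buffer, with push/pop at one end and shift/unshift at
// the other.
struct LockedDeque<T> {
    inner: Mutex<VecDeque<T>>,
}

impl<T> LockedDeque<T> {
    fn new() -> Self {
        LockedDeque { inner: Mutex::new(VecDeque::new()) }
    }
    fn push(&self, v: T) { self.inner.lock().unwrap().push_back(v) }
    fn pop(&self) -> Option<T> { self.inner.lock().unwrap().pop_back() }
    fn unshift(&self, v: T) { self.inner.lock().unwrap().push_front(v) }
    fn shift(&self) -> Option<T> { self.inner.lock().unwrap().pop_front() }
}
```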

Also probably relevant is the paper on data locality in work stealing, though I haven't read it yet.

There are some useful data structures for atomically reference-counted types and mutexes in core::private.

brson (Contributor, author) commented Feb 10, 2013

Related to #3095

brson (Contributor, author) commented Feb 15, 2013

Since I'm not sure that lock-free deques even exist, I've scaled back the scope of this slightly.

nikomatsakis (Contributor) commented:

The deques used in work stealing typically have a special property: one thread (the owner) only ever PUSHES and POPS at one end, and other threads (the thieves) only ever DEQUEUE from the other; nobody ever enqueues there. This allows a more efficient multi-threaded implementation, but it means these deques are not general-purpose.
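To make that concrete, here is a hypothetical sketch of the API shape this property implies: the owner holds one handle, the thieves each hold another. It is backed here by a trivial lock just to show the interface; a real version would be lock-free. All names are illustrative.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical split-handle API implied by the single-owner property.
struct Worker<T> { q: Arc<Mutex<VecDeque<T>>> }
struct Stealer<T> { q: Arc<Mutex<VecDeque<T>>> }

fn deque<T>() -> (Worker<T>, Stealer<T>) {
    let q = Arc::new(Mutex::new(VecDeque::new()));
    (Worker { q: q.clone() }, Stealer { q })
}

impl<T> Worker<T> {
    fn push(&self, v: T) { self.q.lock().unwrap().push_back(v) }   // owner's end
    fn pop(&self) -> Option<T> { self.q.lock().unwrap().pop_back() }
}

impl<T> Clone for Stealer<T> {
    fn clone(&self) -> Self { Stealer { q: self.q.clone() } }       // one handle per thief
}

impl<T> Stealer<T> {
    fn steal(&self) -> Option<T> { self.q.lock().unwrap().pop_front() } // thieves' end
}
```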

brson (Contributor, author) commented Feb 20, 2013

Somebody pointed out that the Chase-Lev deque is a lock-free parallel deque for work stealing. Sounds good to me.
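For reference, the core of the Chase-Lev algorithm is small. Below is a hedged sketch in modern Rust, fixed-capacity and with SeqCst orderings throughout; growing the buffer (the hard part, discussed later in this thread) is omitted, and all names are illustrative.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicIsize, Ordering::SeqCst};

// Fixed-capacity Chase-Lev deque sketch. The owner calls push/pop;
// any number of thieves call steal. Not production-ready.
struct ChaseLev<T> {
    top: AtomicIsize,        // thieves' end
    bottom: AtomicIsize,     // owner's end
    buf: Vec<UnsafeCell<T>>, // ring buffer, capacity fixed at creation
}

unsafe impl<T: Send> Sync for ChaseLev<T> {}

impl<T: Copy + Default> ChaseLev<T> {
    fn with_capacity(cap: usize) -> Self {
        ChaseLev {
            top: AtomicIsize::new(0),
            bottom: AtomicIsize::new(0),
            buf: (0..cap).map(|_| UnsafeCell::new(T::default())).collect(),
        }
    }

    fn slot(&self, i: isize) -> *mut T {
        self.buf[i as usize % self.buf.len()].get()
    }

    // Owner only: push onto the bottom end.
    fn push(&self, v: T) {
        let b = self.bottom.load(SeqCst);
        let t = self.top.load(SeqCst);
        assert!(b - t < self.buf.len() as isize, "deque full");
        unsafe { *self.slot(b) = v };
        self.bottom.store(b + 1, SeqCst);
    }

    // Owner only: pop from the bottom end, racing thieves for the last item.
    fn pop(&self) -> Option<T> {
        let b = self.bottom.load(SeqCst) - 1;
        self.bottom.store(b, SeqCst);
        let t = self.top.load(SeqCst);
        if t > b {
            self.bottom.store(b + 1, SeqCst); // empty: restore bottom
            return None;
        }
        let v = unsafe { *self.slot(b) };
        if t == b {
            // Last item: settle the race with thieves via CAS on top.
            let won = self.top.compare_exchange(t, t + 1, SeqCst, SeqCst).is_ok();
            self.bottom.store(b + 1, SeqCst);
            return if won { Some(v) } else { None };
        }
        Some(v)
    }

    // Thieves: steal from the top end; a failed CAS means we lost a race.
    fn steal(&self) -> Option<T> {
        let t = self.top.load(SeqCst);
        let b = self.bottom.load(SeqCst);
        if t >= b {
            return None; // empty
        }
        let v = unsafe { *self.slot(t) };
        self.top
            .compare_exchange(t, t + 1, SeqCst, SeqCst)
            .ok()
            .map(|_| v)
    }
}
```

The key design point is that the owner and the thieves only ever contend over the last remaining element, and that race is arbitrated by the CAS on top.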

ILyoan (Contributor) commented Apr 30, 2013

Is there any progress on this?

brson (Contributor, author) commented May 2, 2013

@ILyoan No. There is a type in place at rt::work_queue, but it is not implemented.

brson (Contributor, author) commented May 16, 2013

A related paper: "A Dynamic-Sized Nonblocking Work Stealing Deque" by Hendler, Lev, Moir, and Shavit (http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf).

emberian (Member) commented:

@Aatch was working on this at some point but stopped.

toddaaro (Contributor) commented:

http://www.di.ens.fr/~zappa/readings/ppopp13.pdf

This recent paper is a wonderfully detailed description of the atomic memory-ordering issues involved with the data structure. It includes pseudocode for a C11 memory-model version with every atomic operation specified.

cartazio (Contributor) commented Sep 4, 2013

A good example implementation of such a scheduler can be found in this Haskell library: https://github.com/ekmett/structures/blob/master/src/Control/Concurrent/Deque.hs

brson (Contributor, author) commented Oct 17, 2013

I am wondering whether we really want to use a work-stealing deque as our per-thread work queue. Something that allows random access would prevent the starvation problems we have to hack around, and would be more fair. Is there such a data structure that is lock-free?

toffaletti (Contributor) commented:

I've got a work-in-progress implementation of the Chase-Lev deque. I started working from the C11 paper, but I found some bugs and omissions in its implementation, so I'm having to go back to the original paper. Any help would be appreciated. I'm currently trying to wrap my head around Section 4 of the paper, which discusses how to adapt the algorithm to grow and shrink the array without a garbage collector.

https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs

Even if this isn't used for the scheduler, I've been told it might be useful for Servo rendering work.

cartazio (Contributor) commented Nov 6, 2013

If you want to see a worked-out Chase-Lev implementation that's pretty readable, look at the deque branch of Edward Kmett's structures library: https://github.com/ekmett/structures/blob/deque/src/Control/Concurrent/Deque.hs

It uses some other primops, which you can see defined here: http://hackage.haskell.org/package/atomic-primops-0.4/docs/Data-Atomics-Counter-Reference.html

Some of that may not be relevant in a no-GC context, but it at least gives a pretty readable working implementation to look at explicitly.

toffaletti (Contributor) commented:

Thanks, @cartazio, I will take a look. I haven't seen a fully working implementation that does array resizing and reclamation without a GC. The original paper is light on specifics, just outlining a solution and then saying in a footnote: "It is straightforward, however, to use the same solution for reclaiming buffers also when growing."

I decided to use the Relacy Race Detector to help get a correct implementation, because Helgrind and DRD report too many false positives. That attempt is here: https://github.com/toffaletti/chase-lev

It currently passes the tests, but I've gone the extremely heavy-handed route of making all atomic accesses use memory_order_seq_cst. I'll work on relaxing that and fixing any other bugs.
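As a rough illustration of what that relaxation looks like, here is the push path with weakened orderings, extending the fixed-capacity ChaseLev sketch from earlier in this thread and loosely following the shape of the C11 paper. Assumptions: pop and steal additionally need SeqCst fences, which are omitted here, and this is a sketch rather than a verified relaxation.

```rust
use std::sync::atomic::{fence, Ordering::{Acquire, Relaxed, Release}};

impl<T: Copy + Default> ChaseLev<T> {
    // Sketch of a relaxed push. Only the owner ever writes `bottom`,
    // so its load can be Relaxed; the Acquire load of `top`
    // synchronizes with thieves' successful CAS operations.
    fn push_relaxed(&self, v: T) {
        let b = self.bottom.load(Relaxed);
        let t = self.top.load(Acquire);
        assert!(b - t < self.buf.len() as isize, "deque full");
        unsafe { *self.slot(b) = v };
        fence(Release);                    // publish the element...
        self.bottom.store(b + 1, Relaxed); // ...before advancing bottom
    }
}
```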

alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 29, 2013
This adds an implementation of the Chase-Lev work-stealing deque to libstd
under std::rt::deque. I've been unable to break the implementation of the deque
itself, and it's not super highly optimized just yet (everything uses a SeqCst
memory ordering).

The major snag in implementing the Chase-Lev deque is that the buffers used to
store data internally cannot be deallocated back to the OS. In the meantime, a
shared buffer pool (synchronized by a normal mutex) is used to allocate and
deallocate buffers. This is done in the hope of not overcommitting too much
memory. It is in theory possible to eventually free the buffers, but one must
be very careful in doing so.

I was unable to get good numbers from the src/test/bench tests (I don't think
many of them slam the work queue that much), but I was able to get good numbers
from one of my own tests. In a recent rewrite of select::select(), I found that
my implementation was incredibly slow due to contention on the shared work
queue. Upon switching to the parallel deque, I saw the contention drop to zero
and the runtime go from 1.6s to 0.9s, with most of the time spent in libuv
waking up the schedulers (plus allocations).

Closes rust-lang#4877
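A hypothetical sketch of the buffer-pool idea described in the commit message: retired ring buffers are parked behind a plain mutex and reused rather than returned to the OS. Names are illustrative, not taken from the actual patch.

```rust
use std::sync::Mutex;

// Mutex-guarded free list of retired buffers. Memory is recycled
// instead of being deallocated back to the OS.
struct BufferPool<T> {
    pool: Mutex<Vec<Vec<T>>>,
}

impl<T> BufferPool<T> {
    fn new() -> Self {
        BufferPool { pool: Mutex::new(Vec::new()) }
    }

    // Hand out a buffer with at least `cap` slots, reusing one if possible.
    fn alloc(&self, cap: usize) -> Vec<T> {
        let mut pool = self.pool.lock().unwrap();
        match pool.iter().position(|b| b.capacity() >= cap) {
            Some(i) => pool.swap_remove(i),
            None => Vec::with_capacity(cap),
        }
    }

    // Return a buffer to the pool instead of freeing it.
    fn free(&self, mut buf: Vec<T>) {
        buf.clear();
        self.pool.lock().unwrap().push(buf);
    }
}
```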
bors added two commits that referenced this issue Nov 29, 2013, each with the same message as above.