
scope without Send #562

Closed · jonhoo opened this issue Apr 3, 2018 · 21 comments

@jonhoo commented Apr 3, 2018

This ties somewhat into the discussion over in #522 (comment).

Currently, ThreadPool::scope requires the passed closure to be Send because the closure itself is executed on the thread pool. However, if ThreadPool is used as a more generic thread pool (rather than explicitly for data-dependent computation), it is not unreasonable for some existing thread to wish to spin off a number of jobs with access to its stack, and then wait for them all to complete (essentially as a pool of scoped threads). With the Send bound in place, that thread is pretty restricted in what it can use to generate jobs (e.g., anything with Rc is a no-go). It'd be good if there was an alternate version of scope that did not require Send for its closure, and which instead executed the closure on the current thread (but still waited for any spawned jobs to complete).
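For concreteness, here is a minimal sketch of the usage being asked for, written against the `in_place_scope` entry point that PR #844 eventually added (which did not exist at the time of this comment): the outer closure holds an `Rc` and runs on the calling thread, while the spawned jobs, which must still be `Send`, run on the pool.

```rust
use std::rc::Rc;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    // `Rc` is !Send, so a closure capturing it could not be sent into the
    // pool, which is what the plain `scope` entry point requires.
    let config = Rc::new(vec![1, 2, 3, 4]);
    let mut results = vec![0; config.len()];

    pool.in_place_scope(|s| {
        // This closure runs on the *current* thread, so holding `Rc` is fine.
        for (i, slot) in results.iter_mut().enumerate() {
            let n = config[i];
            // Each spawned job must still be `Send`; only the outer closure is exempt.
            s.spawn(move |_| *slot = n * n);
        }
    }); // blocks here until every spawned job has completed
    assert_eq!(results, vec![1, 4, 9, 16]);
}
```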

@cuviper (Member) commented Apr 3, 2018

I expect it's possible, but I'm not sure that we should.

One reason that we enter the threadpool for scope is so that its calls to spawn can push directly to a local job queue. Pushing to the global queue requires taking a lock. But maybe that synchronization wouldn't be a big deal for some use cases.

jonhoo added a commit to mit-pdos/noria that referenced this issue Apr 3, 2018
With this change, generators monitor how quickly clients are draining
queued jobs, and stop issuing jobs when they detect that clients have
enough queued work to last for the remaining duration of the experiment.
This is mostly a work-around for rayon-rs/rayon#544.

Note that the load generator now runs *in* the thread pool, so the
`threads` argument should now be set to the total number of cores rather
than #core - #generators. This is due to rayon-rs/rayon#562. It's a
little unfortunate because it means that *all* job distribution requires
stealing (the generator will put all jobs on its local queue).

Note also that (because of the same linked rayon issue) the creation of
`id_rng` is now in a closure. This is so that the argument can be `Send`
so we can get it into the thread pool in the first place.

@alecmocatta

I just bumped into this. I'm using scoped_threadpool as well as rayon, and noticed that the latter ostensibly contains the former's functionality, which would let me unify on one thread pool. Unification would be convenient because I want a thread per core, and ensuring that across two independent thread pools is nontrivial (I wouldn't want cores to sit idle while one pool is full and the other is empty).

@rocallahan (Contributor)

I bumped into this too. https://github.com/reem/rust-scoped-pool is unmaintained so I'm trying to migrate to rayon's ThreadPool and in some places that doesn't work because the scope closure is not Send (and making it Send requires significant API changes in callers).

@ghost commented Aug 18, 2018

@cuviper

Pushing to the global queue requires taking a lock. But maybe that synchronization wouldn't be a big deal for some use cases.

What if we used MsQueue for the global queue and avoided the lock?

Alternatively, I'm considering implementing a new Deque that supports push/pop/steal/steal_half, implements Sync, and doesn't separate workers and stealers. The intended use case for it is shared global deque of tasks, which is not owned by any worker thread.
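To make the proposed interface concrete, here is a rough sketch of the shape being described; the type and method names are hypothetical, and the lock-based body is only a placeholder for what would really be a lock-free structure.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical shared global deque of tasks, not owned by any worker thread.
/// It is `Sync` (usable through `&self` from many threads) and exposes
/// push/pop/steal/steal_half without a worker/stealer split.
pub struct SharedDeque<T> {
    // Placeholder: a real implementation would be lock-free.
    inner: Mutex<VecDeque<T>>,
}

impl<T> SharedDeque<T> {
    pub fn new() -> Self {
        SharedDeque { inner: Mutex::new(VecDeque::new()) }
    }
    pub fn push(&self, task: T) {
        self.inner.lock().unwrap().push_back(task);
    }
    pub fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_back()
    }
    pub fn steal(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
    /// Steal roughly half of the queued tasks in one call.
    pub fn steal_half(&self) -> Vec<T> {
        let mut queue = self.inner.lock().unwrap();
        let take = queue.len() / 2;
        queue.drain(..take).collect()
    }
}
```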

@cuviper (Member) commented Aug 20, 2018

What if we used MsQueue for the global queue and avoided the lock?

It's worth trying!

@cuviper (Member) commented Dec 14, 2018

FWIW, #615 is changing the global queue to a SegQueue (without a lock), so we could follow up by experimenting here without an initial Send.
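For reference, SegQueue is crossbeam's unbounded lock-free queue; the snippet below only illustrates the queue type itself (assuming the crossbeam-queue 0.3 API, where pop returns an Option), not rayon's internal use of it in #615.

```rust
use crossbeam_queue::SegQueue;

fn main() {
    // Pushing and popping require only a shared reference and no lock.
    let injector: SegQueue<u32> = SegQueue::new();
    injector.push(1);
    injector.push(2);
    assert_eq!(injector.pop(), Some(1)); // FIFO order
    assert_eq!(injector.pop(), Some(2));
    assert_eq!(injector.pop(), None);
}
```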

@abonander

I've run into a need for this in #676.

@DzmitryFil

+1 for this; the Send requirement is very restrictive in some cases. I work primarily with WebAssembly, and none of the web APIs can be used across threads (they are !Send), which severely limits how rayon can be used. A scope without Send would let me do useful work with !Send values on the worker thread (inside the scope closure) while rayon jobs run in that scope. Currently I have to choose: either I make progress on the !Send work on the main thread, or I run a rayon scope/join.

@rocallahan (Contributor) commented Jan 16, 2021

I hacked together a version of this here: https://github.com/rocallahan/rayon/commits/downstream

Apart from eliminating the Send bound on the op closure, it's a bit more efficient because you don't need to send the op across threads when you're not already on a worker thread. Also, the op can safely wait for spawned closures to complete; with regular scope, you can't, because blocking a worker thread on spawned work can deadlock.
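As an illustration of that last point, here is a small sketch (using the `in_place_scope` name that eventually landed in #844) of the outer closure blocking on its spawned jobs; the comment marks why this is only safe when the caller is not itself a pool worker.

```rust
use std::sync::mpsc;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    let (tx, rx) = mpsc::channel();

    pool.in_place_scope(|s| {
        for i in 0..4 {
            let tx = tx.clone();
            s.spawn(move |_| tx.send(i * 10).unwrap());
        }
        drop(tx); // close our sender so the iterator below can finish

        // Blocking here is safe only because this closure runs on the calling
        // (non-worker) thread; parking a pool worker like this could deadlock.
        let total: i32 = rx.iter().sum();
        assert_eq!(total, 60);
    });
}
```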

Should I clean it up and submit it?

@cuviper (Member) commented Feb 13, 2021

@rocallahan I would be interested in that, but I hope we don't have to fork Scope for it. Can we refactor enough to let external_scope just be an alternate entry point that still uses the same Scope? Then we can probably do the same for ScopeFifo.

I'm still wary of changing the current scope entry behavior, since at some level that also serves as an alternate install. That is, by entering the thread pool it also sets the implicit pool for join, iterators, etc.
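A small example of the "alternate install" behavior being referred to: once scope has moved execution onto the pool, nested rayon calls implicitly target that pool.

```rust
use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    pool.scope(|_| {
        // Because `scope` entered `pool`, this parallel iterator runs on
        // `pool` rather than on the global thread pool.
        let sum: i32 = (0..100).into_par_iter().sum();
        assert_eq!(sum, 4950);
    });
}
```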

Apart from eliminating the Send bound on the op closure, it's a bit more efficient because you don't need to send the op across threads when you're not already on a worker thread.

Yes, but then everything you spawn has to be sent across threads to the pool-wide injector queue, instead of just pushing to the thread-local worker queue. That tradeoff will probably vary widely by workload.

Also, the op can safely wait for spawned closures to complete; with regular scope, you can't, because blocking a worker thread on spawned work can deadlock.

This is only true if you're sure that you're not already on a worker thread.

@rocallahan (Contributor)

Can we refactor enough to let external_scope just be an alternate entry point that still uses the same Scope?

We can, but then there will be slightly higher overhead for scope users (some conditional branches at least). Is that acceptable?

I'm still wary of changing the current scope entry behavior, since at some level that also serves as an alternate install. That is, by entering the thread pool it also sets the implicit pool for join, iterators, etc.

Yes, we can't break that, we need a new entry point.

Yes, but then everything you spawn has to be sent across threads to the pool-wide injector queue, instead of just pushing to the thread-local worker queue. That tradeoff will probably vary widely by workload.

Right.

This is only true if you're sure that you're not already on a worker thread.

Right. In our case we have a dedicated thread pool and restrict access to it so only specific code can run in it.

@cuviper (Member) commented Feb 13, 2021

Can we refactor enough to let external_scope just be an alternate entry point that still uses the same Scope?

We can, but then there will be slightly higher overhead for scope users (some conditional branches at least). Is that acceptable?

I suspect a few branches won't be noticeable compared to the existing synchronization, but let's see what it looks like.

@rocallahan (Contributor)

The tricky bit here is the latch. Currently ScopeBase contains a CountLatch, which is two atomics (state, job count). In my PoC, I replaced this with a LockLatch (Mutex, CondVar) plus an atomic job count. With a unified Scope, we need to support both. We could just enum our way to victory here, but maybe we can do better.
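To make the two latch flavors concrete, here is a heavily simplified sketch; these are stand-ins for the idea only, not rayon's actual CountLatch/LockLatch internals (in particular, the real CountLatch keeps a separate state word alongside the count, which is the two-atomics question raised below).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

/// Counting latch, reduced to a single atomic counter: jobs increment it when
/// spawned and decrement it when done; the waiter may proceed at zero.
struct CountLatchSketch {
    counter: AtomicUsize,
}

impl CountLatchSketch {
    fn new() -> Self {
        CountLatchSketch { counter: AtomicUsize::new(1) }
    }
    fn increment(&self) {
        self.counter.fetch_add(1, Ordering::SeqCst);
    }
    /// Returns true if this decrement brought the count to zero.
    fn set(&self) -> bool {
        self.counter.fetch_sub(1, Ordering::SeqCst) == 1
    }
}

/// Blocking latch (mutex + condvar): an external, non-worker thread sleeps
/// until the last job signals completion.
struct LockLatchSketch {
    done: Mutex<bool>,
    cond: Condvar,
}

impl LockLatchSketch {
    fn new() -> Self {
        LockLatchSketch { done: Mutex::new(false), cond: Condvar::new() }
    }
    fn set(&self) {
        *self.done.lock().unwrap() = true;
        self.cond.notify_all();
    }
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cond.wait(done).unwrap();
        }
    }
}
```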

One thing I don't understand which may be relevant: Why is CountLatch two atomics? Couldn't it use a single atomic? wait_until_cold would have to be generic over L: Latch but that doesn't look like a problem. (You could argue that the 'state' atomic changes less frequently which might mean less cache traffic as waiters poll it, but it will share the same cache line as the counter in most cases.)

It might be simplest to allow CountLatch to be used as a raw atomic job count for the external scope case, and just add a LockLatch to Scope to be used for the external case. This could be whittled down to just one extra conditional branch when the last job completes in the Scope. This would increase the size of Scope by 32 bytes, is that OK?

(I considered trying to reuse Registry's LOCK_LATCH but I guess that could lead to deadlocks if an external scope op tries to call in_worker.)

@cuviper (Member) commented Feb 13, 2021

You've jogged my memory enough to remember that I had tinkered with this, but not finished. I've now pushed that branch so I can share it here, in case that helps to compare notes and think through some of the issues:
master...cuviper:raw_scope

One possibility your prototype didn't cover was if we call this from one thread pool into another. It's not great to use a LockLatch in that case, blocking the thread in the old pool. In general, we try to let that sort of blocking go into work stealing too, and I tried to allow that in my branch.
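A sketch of the cross-pool case in question, again using the `in_place_scope` name that eventually landed: the closure runs on a worker of `pool_a` while the spawned jobs run on `pool_b`, so a pure LockLatch wait would leave that `pool_a` worker idle.

```rust
fn main() {
    let pool_a = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    let pool_b = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();

    pool_a.install(|| {
        // This closure runs on a worker thread of `pool_a`...
        pool_b.in_place_scope(|s| {
            // ...while the jobs spawned here run on `pool_b`.
            for i in 0..4u64 {
                s.spawn(move |_| {
                    let _ = i * i; // stand-in for real work
                });
            }
        });
        // The pool_a worker is blocked above until pool_b finishes the jobs;
        // ideally it would keep work-stealing rather than sleeping.
    });
}
```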

Why is CountLatch two atomics?

I believe this is primarily so we can share the logic in the sleep module, which is fairly tricky.

This would increase the size of Scope by 32 bytes, is that OK?

Size shouldn't be of much concern, because the Scope is created once on the stack and only passed around by reference.

@rocallahan (Contributor)

I believe this is primarily so we can share the logic in the sleep module, which is fairly tricky.

Ok.

One possibility your prototype didn't cover was if we call this from one thread pool into another.

Yes, this had crossed my mind and then I forgot about it.

Stealing your ScopeLatch seems like the way to go.

@rocallahan (Contributor)

Using ScopeLatch means a conditional branch when we increment the latch too, but I guess that's OK.

@rocallahan (Contributor)

Here's what I have:
Pernosco@external-scope
I probably should write some more tests. Should I do that and submit a PR?

@rocallahan (Contributor)

Also, is external_scope a good name here? I didn't put any thought into choosing it.

@cuviper (Member) commented Apr 2, 2021

Naming is hard -- external_scope sounds better than my raw_scope, at least. Let's rekindle this discussion in a PR and we can think about it more from there.

@rocallahan (Contributor)

Submitted PR #844

bors bot added a commit that referenced this issue May 4, 2021
844: Implement `in_place_scope` r=cuviper a=rocallahan

As discussed in #562.

Co-authored-by: Robert O'Callahan <robert@ocallahan.org>

@rocallahan (Contributor)

Fixed by #844 6a01573
