
Usage with “Read” or “Iterator” as provider? #46

Closed
flying-sheep opened this issue Feb 5, 2016 · 7 comments

flying-sheep commented Feb 5, 2016

I see that only ranges and slices can be converted into parallel iterators.

I wonder how best to utilize rayon while reading from a file.

At the moment I read files into buffers, then parse them using an iterator. I imagine there's a better way to use rayon than:

let mut buffer = String::new();
File::open(some_path)?.read_to_string(&mut buffer)?;  // iterate unindexed data once
let mut records: Vec<_> = ParseIter::new(buffer).collect();  // iterate unindexed data again
records.par_iter_mut().for_each(do_stuff);  // iterate indexed data in parallel
@nikomatsakis
Member

Well, right now there isn't a much better way (though you could hand-write a recursive function with join), but I'd like to add a kind of "work queue" abstraction (basically similar to scoped threads) that would probably address this use case.

@Michael-F-Bryan

This seems like a fairly popular feature, judging by the number of issues that are slight variations on this general Iterator-to-ParallelIterator adapter idea. Has there been any progress on this?

My particular use case is that I've got an iterator which yields every permutation of 4 numbers from 18 to 200 (around 167,961,600,000,000 permutations), then does a bunch of filters and maps to find a bunch of "ideal" combinations. In this case it's not feasible to save the combinations into a Vec, yet the filters and maps are all trivially parallelizable.

@cuviper
Member

cuviper commented Oct 16, 2017

It's a popular request, but there's not an obvious way to implement it. We've had a few ideas, but not any progress to report for it, sadly.

My particular use case is that I've got an iterator which yields every permutation of 4 numbers from 18 to 200 (around 167,961,600,000,000 permutations),

Perhaps I misunderstand, but wouldn't the number of permutations be (200-18)^4 -- roughly 10^9? If you can set up that generator with ranges, you can parallelize that.
(18..200).into_par_iter().for_each(|x| (18..200).into_par_iter().for_each(|y| ...))

In this case it's not feasible to save the combinations into a Vec,

Another strategy is to batch this work: generate a manageable amount into a Vec, process those in parallel, and repeat in a loop until you've done them all.

@grahame

grahame commented Dec 6, 2017

I'm using Rayon to parallelise a fairly CPU-heavy operation on a bunch of String instances coming to me via the csv crate. I wound up here after doing exactly what @cuviper suggested: batching into a Vec. That gave me a good performance gain, but not as much as should be possible – the csv crate has to do enough work that I lose a lot of time building the batches.

Would a generic solution be to allow Rayon to .par_map() on the rx side of a channel, or have a templated function that makes that happen given the type of the channel? I could then stick the iterator bit in a thread, have it spit String instances down the channel, and Rayon could do its thing. This would be even more general than supporting Read or Iterator as a provider.

(I'm fairly new to Rust, so apologies if I'm missing something and this is not actually easier than just supporting Iterator.)

@lnicola

lnicola commented Dec 6, 2017

@grahame See also https://github.com/QuietMisdreavus/polyester, but I'm not sure that the cost of allocating Strings makes this worth it.

bors bot added a commit that referenced this issue Jun 6, 2018
550: add bridge from Iterator to ParallelIterator r=cuviper a=QuietMisdreavus

Half of #46

This started getting reviewed in QuietMisdreavus/polyester#6, but I decided to move my work to Rayon proper.

This PR adds a new trait, `AsParallel`, an implementation on `Iterator + Send`, and an iterator adapter `IterParallel` that implements `ParallelIterator` with a similar "cache items as you go" methodology as Polyester. I introduced a new trait because `ParallelIterator` was implemented on `Range`, which is itself an `Iterator`.

The basic idea is that you would start with a quick sequential `Iterator`, call `.as_parallel()` on it, and be able to use `ParallelIterator` adapters after that point, to do more expensive processing in multiple threads.

The design of `IterParallel` is like this:

* `IterParallel` defers background work to `IterParallelProducer`, which implements `UnindexedProducer`.
* `IterParallelProducer` will split as many times as there are threads in the current pool. (I've been told that #492 is a better way to organize this, but until that's in, this is how I wrote it. `>_>`)
* When folding items, `IterParallelProducer` keeps a `Stealer` from `crossbeam-deque` (added as a dependency, but using the same version as `rayon-core`) to access a deque of items that have already been loaded from the iterator.
* If the `Stealer` is empty, a worker will attempt to lock the Mutex to access the source `Iterator` and the `Deque`.
  * If the Mutex is already locked, it will call `yield_now`. The implementation in polyester used a `synchronoise::SignalEvent`, but I've been told that worker threads should not block. In lieu of #548, a regular spin-loop was chosen instead.
  * If the Mutex is available, the worker will load a number of items from the iterator (currently (number of threads * number of threads * 2)) before releasing the Mutex and continuing.
  * (If the Mutex is poisoned, the worker will just... stop. Is there a recommended approach here? `>_>`)

This design is effectively a first brush, has [the same caveats as polyester](https://docs.rs/polyester/0.1.0/polyester/trait.Polyester.html#implementation-note), probably needs some extra features in rayon-core, and needs some higher-level docs before I'm willing to let it go. However, I'm putting it here because it was not in the right place when I talked to @cuviper about it last time.

Co-authored-by: QuietMisdreavus <grey@quietmisdreavus.net>
Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
@adamreichold
Collaborator

It looks like this is still open because #550, which added ParallelBridge, was considered only "half" of this issue. But does that really capture the requirement expressed in the original report? Read, in contrast with iterators based on it like Lines or the ParseIter above, does not have a fixed item count or buffer size. To me, it appears that how to go from bytes to items is out of scope for Rayon, so this could be closed as resolved by ParallelBridge?

@cuviper
Member

cuviper commented Feb 24, 2023

Yes, I think it's fair to say that ParBridge solves Iterator input, and we have no plans for Read.


7 participants