-
Notifications
You must be signed in to change notification settings - Fork 506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switching par_iter makes memory usage explode (40x) #1189
Comments
Unfortunately, You could use let mut data = vec![0u8; num_iterations.len() * 4];
data.par_chunks_mut(4).zip(num_iterations).for_each(|(data, num_iterations)| {
// fill in this chunk
}); |
Hmm, that makes sense. Thanks for the explanation. This seems quite unfortunate, especially since being a drop-in replacement for iter is the main selling point of par_iter and flat_map is (in my experience) often used in this manner. Especially since this performance footgun is this large. But without specialization, I doubt there is anything that can be done to address this, without also impacting using flat_map with large (and efficiently parallelizable) Iterators. Unfortunate :( |
The naive Line 127 in 7b3ef64
That could possibly be improved by keeping the same It would be really cool if These are hard problems to make the user side easy. :) Did the workaround that I suggested fare better for you? |
Yes. I switched from |
If I understand it correctly, |
It might be possible by adding a way to (if possible) get a len from the type directly (not an instance). For example adding an associated function With that it would be possible to do a sort of runtime specialization. |
Crap, I fell into the same trap as the original proposal of This would need to be a function on |
I have experimented a bit with my Idea for a possible This has improved the memory footprint (and therefore also runtime, since most of the runtime seems to be memory allocation). This does not fix the slow runtime of But I guess 15x slower vs DNF is a start |
It doesn't help the inner iterator, but I meant that the outer |
That is exactly what I did: https://github.com/BloodStainedCrow/rayon/blob/main/src/iter/flat_map.rs#L47. If I understand |
Definitely not, since it can examine
This might be better on |
For |
see below |
The same assumptions about the length being correct should apply to So as long as all "base" consumers, which use |
The issue with moving it to the It is both a We can distinguish different "Instances" but before we start consuming the iterators we do not have an instance yet. And at that point it is too late |
Ah - Maybe we can have both? The blanket |
I am not sure I understand you correctly. Unless we change arrays iter type to something like |
The current way to obtain a length for fn opt_len(&self) -> Option<usize> {
let base_len = self.base.opt_len()?;
let sub_len = PI::const_length()?;
base_len.checked_mul(sub_len)
} So fn const_length() -> Option<usize> Which I do not see how to implement without changing the types of ArrayIterators (or other's we want to have a size) to include the size in their generics...? |
There are 3 implementations of Lines 14 to 39 in 7b3ef64
Line 2435 in 7b3ef64
|
I think I see now. |
With this it is also possible to conditionally have an |
Okay quick list of everything that now has
Any idea how to get &array in there too? Probably now without changing arrays to stop using slice iterators :( |
Nevermind again. Both of them are already are not object-safe. This change is a possibly-breaking change |
This comment was marked as off-topic.
This comment was marked as off-topic.
I think it could, and diff --git a/src/iter/blocks.rs b/src/iter/blocks.rs
index 1691e8b15c3e..29658de3e6cf 100644
--- a/src/iter/blocks.rs
+++ b/src/iter/blocks.rs
@@ -9,7 +9,7 @@ struct BlocksCallback<S, C> {
impl<T, S, C> ProducerCallback<T> for BlocksCallback<S, C>
where
- C: UnindexedConsumer<T>,
+ C: Consumer<T>,
S: Iterator<Item = usize>,
{
type Output = C::Result;
@@ -20,7 +20,7 @@ where
// we need a local variable for the accumulated results
// we call the reducer's identity by splitting at 0
- let (left_consumer, right_consumer, _) = consumer.split_at(0);
+ let (left_consumer, right_consumer, mut reducer) = consumer.split_at(0);
let mut leftmost_res = left_consumer.into_folder().complete();
consumer = right_consumer;
@@ -36,13 +36,14 @@ where
producer = right_producer;
// split the consumer
- let (left_consumer, right_consumer, _) = consumer.split_at(capped_size);
+ let (left_consumer, right_consumer, next_reducer) = consumer.split_at(capped_size);
consumer = right_consumer;
- leftmost_res = consumer.to_reducer().reduce(
+ leftmost_res = reducer.reduce(
leftmost_res,
bridge_producer_consumer(capped_size, left_producer, left_consumer),
);
+ reducer = next_reducer;
}
leftmost_res
} (Our "specialized" |
I have done some memory profiling. It seems that (0..1_000_000).into_par_iter().flat_map(|_| [0, 1, 2, 3]).collect() It will create 4_000_000 Linked list elements. Each of these contains 2 pointers (next, prev) and the |
While this exact example would benefit from the (0..1_000_000).into_par_iter().flat_map(|_| ParallelIterator::repeat(0).take(4)).collect() would still inhibit this property |
Yes, that's the effect of the non-indexed (non- |
So if I understand correctly, |
No, with |
If |
A I see, and you mentioned |
It doesn't in general, but 4 items is so few that this case does end up on single elements. |
Ah unfortunate. So back to square one: The problem is that |
The only way I can currently see, to stop It could also be |
I'll start writing a PR for the |
rayon 1.10.0
I wrote a small Mandelbrot visualizer including this snippet:
num_iterations is a Vec of length 10_000 * 10_000, counts a Vec of length 10_000. The largest memory usage I observed was around 380MB which seems to fit the math quite well: 10_000 * 10_000 * 4 bytes ~= 400MB.
I am on a 6/12 core processor, so the increase is not linear with the number of cores as I first suspected.
When I switch to using
par_iter
The usage spikes to over 16GB (increase of 40x) after a couple of seconds. The runtime is also much slower than single threaded. (I was unable to time it, since it took too long on windows and kept getting OOM killed on Linux).Incase it is useful the repo with the full code is here.
The text was updated successfully, but these errors were encountered: