Why is .fold() faster than .filter().count()? #33038

Closed
alubbe opened this Issue Apr 16, 2016 · 15 comments

10 participants
alubbe commented Apr 16, 2016

I was running some microbenchmarks today and noticed that .fold() runs a lot faster than .filter().count(). I find .filter().count() a lot more readable, so I was wondering whether there is some fundamental reason for this, or whether it is something that could be sped up.
Also, being new to the language, I am more than happy to receive feedback on the code below.

    // 1.rs
    println!("{}", (0..1u32<<30).fold(0, |paths, i| if i.count_ones() == 15 {paths + 1} else {paths}));

    // 2.rs
    println!("{}", (0..1u32<<30).filter(|&i| i.count_ones() == 15).count());

    $ time target/release/1
    155117520

    real    0m0.785s
    user    0m0.770s
    sys     0m0.012s

    $ time target/release/2
    155117520

    real    0m2.244s
    user    0m2.209s
    sys     0m0.028s
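For reference, the two snippets above can be combined into one self-contained program. The function names and the smaller default range below are my additions so the sketch runs quickly; the original binaries iterated up to 1<<30 and both printed 155117520.

```rust
// Version 1: manual fold with a conditional addition.
fn count_fold(bound: u32) -> u32 {
    (0..bound).fold(0, |paths, i| if i.count_ones() == 15 { paths + 1 } else { paths })
}

// Version 2: filter + count, reported as roughly 3x slower at the time.
fn count_filter(bound: u32) -> usize {
    (0..bound).filter(|&i| i.count_ones() == 15).count()
}

fn main() {
    // A smaller range than the original 1 << 30 keeps this quick to run;
    // both versions must agree on the count either way.
    let bound = 1u32 << 20;
    println!("fold:   {}", count_fold(bound));
    println!("filter: {}", count_filter(bound));
}
```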
TimNN (Contributor) commented Apr 16, 2016

One factor is probably that 1.rs is summing into a u32 (or i32, I'm not sure; it's a 32-bit type in any case), whereas count uses a usize. At least on the playpen, the time of 1.rs is doubled (when running up to 1<<28).

jonas-schievink (Contributor) commented Apr 16, 2016

Playpen link: http://is.gd/dSFFkg

The filter-count version doesn't seem to vectorize, while the fold version does.

alubbe commented Apr 16, 2016

So is there some way to adjust the types in .filter().count() to get the same performance? I just converted everything to usize and got the same (slow) speed.

Aatch (Contributor) commented Apr 17, 2016

Hmm, this is interesting, as there's no clear reason why one version should vectorize while the other doesn't. The most likely answer is, unfortunately, that the filter-count version produces code such that the pass where vectorisation happens runs before some other optimisation that would have allowed it to vectorize that loop.

I think the issue is that the CFG for filter-count, while functionally equivalent to the fold, is just different enough that the optimiser can't quite handle it. The optimiser is always going to miss stuff, since it can't run for hours. It's just unfortunate that in this case the difference is so noticeable, due to it being able to vectorize one loop and not the other.

alubbe commented Apr 17, 2016

Should I leave this issue open, since investigating it might reveal something useful, or close it as a non-priority?

nagisa added the I-slow label Apr 17, 2016

nagisa (Contributor) commented Apr 17, 2016

Keep it open.

dotdash (Contributor) commented Apr 17, 2016

The difference that makes vectorization happen is that in the fold case, LLVM can eliminate the conditional addition by using the zero extended result of the comparison as the second operand for the addition.
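That transformation can also be written out by hand in the source, which is one way to nudge the optimiser toward branch-free code. This sketch (the helper name is mine) adds the zero-extended comparison result unconditionally instead of branching:

```rust
// Branch-free variant: the bool result of the comparison is converted
// (zero-extended) to an integer and added on every iteration, so the
// loop body contains no conditional addition.
fn count_branchless(bound: u32) -> u32 {
    (0..bound).fold(0u32, |paths, i| paths + (i.count_ones() == 15) as u32)
}

fn main() {
    println!("{}", count_branchless(1 << 20));
}
```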

alubbe commented Apr 28, 2016

Is this something the MIR could help with?

JonasOlson commented Aug 1, 2016

Wouldn't it be nice to have count take a predicate as an argument, and count only those items for which the predicate is true, so that .count(predicate) would be the same as .filter(predicate).count()? I think that might make even more sense readability-wise.

Edit: Eh, you can't have optional arguments, can you? One might have to make that .count_selectively(predicate) instead. Perhaps not worth it anymore, if it ever was.
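For illustration, the suggestion could be sketched as an extension trait; `CountWhere` and `count_where` are hypothetical names and not part of the standard library:

```rust
// Hypothetical extension trait giving every iterator a predicate-taking
// count, equivalent to .filter(pred).count() but written as a fold.
trait CountWhere: Iterator + Sized {
    fn count_where<P>(self, mut pred: P) -> usize
    where
        P: FnMut(&Self::Item) -> bool,
    {
        self.fold(0, |n, item| if pred(&item) { n + 1 } else { n })
    }
}

// Blanket impl: available on any iterator.
impl<I: Iterator> CountWhere for I {}

fn main() {
    let n = (0..(1u32 << 20)).count_where(|&i| i.count_ones() == 15);
    println!("{}", n);
}
```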

alubbe commented Oct 29, 2016

Just an update: MIR has not helped; this issue still persists on stable and nightly.

sinkuu (Contributor) commented Oct 20, 2018

This seems to have been fixed.

|                    | `-O -Ctarget-cpu=native` [ms] | `-O` [ms] |
|--------------------|-------------------------------|-----------|
| `fold(0i32, ..)`   | 272.8                         | 540.9     |
| `fold(0usize, ..)` | 622.5                         | 1957      |
| `filter().count()` | 622.5                         | 1958      |

    rustc 1.29.2 (17a9dc751 2018-10-05)
    binary: rustc
    commit-hash: 17a9dc7513b9fea883dc9505f09f97c63d1d601b
    commit-date: 2018-10-05
    host: x86_64-unknown-linux-gnu
    release: 1.29.2
    LLVM version: 7.0
shepmaster (Member) commented Oct 21, 2018

Thanks for the follow-up @sinkuu! I'm going to take the initiative to close this — if anyone thinks this is incorrect, let me know.

shepmaster closed this Oct 21, 2018

alubbe commented Oct 22, 2018

Maybe I'm misunderstanding the results posted by @sinkuu, but when I run these tests with rustc 1.29.2 on my Mac I still get a >2x performance gap.

    // with target-cpu=native
    time 1.rs   0.277s
    time 2.rs   0.612s

    // without any flags
    time 1.rs   0.559s
    time 2.rs   1.874s
sinkuu (Contributor) commented Oct 22, 2018

Are you using the same 1.rs and 2.rs as in #33038 (comment)?
1.rs uses i32 (the default numeric type, which is inferred for 0) for counting, while 2.rs uses usize. That's why I put both fold(0i32, ..) and fold(0usize, ..) in the comparison.

alubbe commented Oct 23, 2018

Yes, you're right; thanks for pointing that out!
I can confirm that these are equivalent:

    println!("{}", (0..1u32<<30).fold(0, |paths, i| if i.count_ones() == 15 {paths + 1} else {paths}));
    println!("{}", (0..1u32<<30).filter(|&i| i.count_ones() == 15).fold(0, |count, _| count + 1));

and these are equivalent:

    println!("{}", (0..1u32<<30).filter(|&i| i.count_ones() == 15).fold(0u64, |count, _| count + 1));
    println!("{}", (0..1u32<<30).filter(|&i| i.count_ones() == 15).count());

So in short, the slowdown is caused by .count() using 64-bit integers. Which is strange, because the length of the iterator is known to be less than 2^32. Is there any way to optimize Vec or count() to use something smaller than usize?
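When the count is known to fit in 32 bits, one possible workaround (a sketch, not a std API; the helper name is mine) is to fold with an explicit u32 accumulator instead of calling .count():

```rust
// Count the items of any iterator into a u32 accumulator, avoiding the
// 64-bit usize accumulator that Iterator::count uses on this target.
// The caller must know the count fits in u32, or the sum wraps/panics.
fn count_u32<I: Iterator>(iter: I) -> u32 {
    iter.fold(0u32, |n, _| n + 1)
}

fn main() {
    let n = count_u32((0..(1u32 << 20)).filter(|&i| i.count_ones() == 15));
    println!("{}", n);
}
```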
