[WIP] impl FromIterator for Option/Result via scan #59605

Closed
wants to merge 4 commits into from


pnkfelix
Member

@pnkfelix pnkfelix commented Apr 1, 2019

This PR consists of three main things:

  • It swaps in a simpler (at least in terms of lines-of-code) implementation of FromIterator for Option and Result that uses the scan method to do the bulk of the work rather than the specialized adapter struct that the old implementation used.
  • It adds a micro-benchmark of FromIterator for Result to measure the performance of this operation, so that this change (and any future change) does not slow it down significantly.
  • It revises the implementations of Vec::extend and Iterator::scan in order to address performance issues uncovered by the above micro-benchmark.
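
For readers unfamiliar with the idea, here is a minimal sketch of what a scan-based collection into Result can look like. It is not the code from this PR: `collect_via_scan` is a hypothetical free function, and it collects into a Vec only, whereas the real impl is generic over any FromIterator target.

```rust
/// Collects an iterator of `Result<T, E>` into `Result<Vec<T>, E>` using
/// `Iterator::scan`, stopping at the first `Err`.
/// (`collect_via_scan` is a hypothetical name used for illustration.)
fn collect_via_scan<T, E, I>(iter: I) -> Result<Vec<T>, E>
where
    I: IntoIterator<Item = Result<T, E>>,
{
    // State threaded through `scan`: the first error seen, if any.
    let mut first_err: Option<E> = None;

    let values: Vec<T> = iter
        .into_iter()
        .scan(&mut first_err, |err, item| match item {
            // Pass successful values straight through.
            Ok(value) => Some(value),
            // Record the error and end the iteration by returning `None`.
            Err(e) => {
                **err = Some(e);
                None
            }
        })
        .collect();

    match first_err {
        Some(e) => Err(e),
        None => Ok(values),
    }
}
```

The closure returns `None` on the first `Err`, which ends the scan early; the error is then recovered from the state once the collection has been built.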

cc #11084

Some (lightly edited) notes from the original PR post follow, but I have removed three of the four original benchmarks (you can find more about them in #11084).


The PR was initially marked WIP because, in my experiments on my Linux desktop machine, I still see a performance regression on this particular micro-benchmark even when I compile with optimize=true, debug=false, codegen-units=1 and incremental=false.

  • I have a bunch of notes and data about this in the comment thread here; it has an LTO off/on comparison. Compare the lines that say "using_baseline" (or "using_adapter", which should be roughly equivalent) with the lines that say "with_scan", to get an idea of the effect of this PR in various contexts.
  • The only way I have seen to reliably bring the performance back in line with expectations is to enable LTO in some form.
  • In particular, if you have codegen-units=1, you need to explicitly enable -C lto=thin in order to get competitive performance out of the "with_scan" implementation. The default with codegen-units=1 is suboptimal; -C lto=thin gives you "whole crate graph" LTO.
  • And if you have codegen-units > 1, then the default (which corresponds to something called "local Thin LTO") will yield the "best" performance.
    • I put "best" in quotes, because the performance for codegen-units > 1 here is far worse than codegen-units = 1.

Anyway, the micro-benchmarks added here include an explicit encoding of the adapter-based implementation of FromIterator for Result, so that one can see how the new implementation compares out of the box (that is, without enabling ThinLTO for codegen-units=1 on the bootstrapped benchmark build).
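
As a rough illustration of the adapter-based approach referred to above, the sketch below wraps the source iterator in a struct that remembers the first error. The names `ErrAdapter` and `collect_via_adapter` are assumptions made for this example; the actual libcore adapter and the benchmark encoding may differ in detail.

```rust
/// Hypothetical adapter: yields the `Ok` values of the wrapped iterator and
/// records the first `Err` it meets, at which point it yields `None`.
struct ErrAdapter<I, E> {
    iter: I,
    err: Option<E>,
}

impl<T, E, I: Iterator<Item = Result<T, E>>> Iterator for ErrAdapter<I, E> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        match self.iter.next() {
            Some(Ok(value)) => Some(value),
            Some(Err(e)) => {
                self.err = Some(e);
                None
            }
            None => None,
        }
    }
}

/// Collects into any `FromIterator` target, returning the first error if
/// one occurred. (Hypothetical helper, not a std API.)
fn collect_via_adapter<T, E, I, C>(iter: I) -> Result<C, E>
where
    I: IntoIterator<Item = Result<T, E>>,
    C: std::iter::FromIterator<T>,
{
    let mut adapter = ErrAdapter { iter: iter.into_iter(), err: None };
    // `by_ref` lets us inspect `adapter.err` after the collection is built.
    let collected: C = adapter.by_ref().collect();
    match adapter.err {
        Some(e) => Err(e),
        None => Ok(collected),
    }
}
```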

@rust-highfive
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 1, 2019
@pnkfelix
Member Author

pnkfelix commented Apr 1, 2019

As I noted in the description, I included micro-benchmarks for the old implementation.

This makes it really easy to see the performance regression that this re-implementation currently introduces: just run the benchmark (x.py bench src/libcore) and look at the lines that look like this (these are taken from my Linux desktop box, a dual-processor (= 8-core) Intel i7-4790 @ 3.60GHz):

test iter::bench_result_from_iter_into_last                     ... bench:       3,640 ns/iter (+/- 121)
test iter::bench_result_from_iter_into_last_old                 ... bench:         665 ns/iter (+/- 0)
test iter::bench_result_from_iter_into_vec                      ... bench:       3,483 ns/iter (+/- 12)
test iter::bench_result_from_iter_into_vec_old                  ... bench:       1,929 ns/iter (+/- 33)

As you can see, when collecting into a Vec, we see a slowdown of 1.8x.

When you collect into Last (an "anti-collection" type that side-steps allocation costs for benchmarking purposes), you see the slowdown is 5.5x.
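
The benchmark's `Last` type is not shown in this thread, so the following is only a guess at its shape: a minimal sketch of an "anti-collection" that consumes every element but stores just the final one, and therefore never allocates.

```rust
use std::iter::FromIterator;

/// Hypothetical "anti-collection": consumes every element but keeps only
/// the last one, so building it never allocates.
#[derive(Debug, PartialEq)]
struct Last<T>(Option<T>);

impl<T> FromIterator<T> for Last<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        let mut last = None;
        for item in iter {
            // Overwrite the previous element; only the final one survives.
            last = Some(item);
        }
        Last(last)
    }
}
```

Because `Last<T>` implements FromIterator, it can be the target of `collect::<Result<Last<T>, E>>()`, which isolates the cost of the Result plumbing from the cost of growing a Vec.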

That's why I've marked this PR a WIP: I don't want to blindly commit this change in the name of "code simplification" without seeing evidence that the microbenchmarks proposed in this PR are irrelevant.

Nonetheless, I still posted the PR itself (rather than abandoning these bits of code entirely). I did this for three main reasons:

  1. If we do reject this change to the impl FromIterator for these two types, then these benchmarks should probably be added to the benchmark suite.
  2. Also, if we do reject this change to the impl FromIterator for these two types, then we should also remove the // FIXME in each of them that suggests switching to a scan-based implementation once issue #11084 ("rust doesn't optimize closure in scan iterator") is resolved.

rust/src/libcore/option.rs

Lines 1341 to 1342 in 6315221

// FIXME(#11084): This could be replaced with Iterator::scan when this
// performance bug is closed.

rust/src/libcore/result.rs

Lines 1235 to 1236 in 6315221

// FIXME(#11084): This could be replaced with Iterator::scan when this
// performance bug is closed.

  3. There's a decent chance that these micro-benchmarks actually are irrelevant, and that this change to the impl FromIterator for these two types should still land.

@alexcrichton
Member

Some interesting numbers! @pnkfelix have you run a profiler to see if there are any particular hot spots in the new implementation that weren't in the old one? If it requires LTO to be turned on to be fast, that probably means that something performance critical isn't getting inlined across crates and maybe requires #[inline]?

@jonas-schievink jonas-schievink added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 1, 2019
@pnkfelix
Member Author

pnkfelix commented Apr 2, 2019

@shepmaster ran a profiler early on in the investigation and found some "interesting" codegen in the hotspot: #11084 (comment)

I myself haven't run a profiler, not yet. I'll give it a quick whirl.

@pnkfelix
Member Author

pnkfelix commented Apr 2, 2019

Running the profiler on the benchmark iter::bench_result_from_iter_into_last (the one that shows the most egregious regression) shows a code sequence similar to that identified in @shepmaster's hotspot quoted above:

       │ 80:┌─→cmpq   $0x1,-0x8(%rbx)
 30.06 │    │↓ jne    b0
       │    │  lea    0x30(%rsp),%rdi
       │    │  mov    %rbx,%rsi
       │    │→ callq  *0x68a7b(%rip)        # 74ff0 <<alloc::string::String as core::clone::Clone>::clone>
       │    │  mov    0x30(%rsp),%rax
       │    │  lea    0x38(%rsp),%rcx
       │    │  movups (%rcx),%xmm0
       │    │  movaps %xmm0,(%rsp)
       │    │  mov    $0x1,%ecx
       │    │↓ jmp    b5
       │    │  nop
  0.24 │ b0:│  mov    (%rbx),%rax
       │    │  xor    %ecx,%ecx
       │ b5:│  mov    %rcx,0x50(%rsp)
 25.21 │    │  mov    %rax,0x58(%rsp)
 26.02 │    │  movaps (%rsp),%xmm0
  6.58 │    │  movups %xmm0,0x0(%r13)
  4.89 │    │  movups 0x0(%r13),%xmm0
  6.69 │    │  movaps %xmm0,(%rsp)
  0.09 │    │  test   %rcx,%rcx
       │    │↓ jne    100
       │    │  add    $0x20,%rbx
       │    │  mov    $0x1,%r12d
       │    │  mov    %rax,%rbp
       │    ├──add    $0xffffffffffffffe0,%r15
       │    └──jne    80

@shepmaster
Member

@shepmaster ran a profiler

I think you mean @dotdash, but I'm happy you thought of me ❤️

@alexcrichton
Member

Ok thanks! That looks like a pretty reasonable trace, without much to illuminate. I wonder though if you could gist a version that's a profile of what's there today? That code looks relatively optimal (no extraneous function calls at least), but it may be the case that the old version vectorized better or something like that.

@pnkfelix
Member Author

pnkfelix commented Apr 2, 2019

I wonder though if you could gist a version that's a profile of what's there today?

What are you asking me to gist here; the analogous perf annotate output for the iter::bench_result_from_iter_into_last_old? Or something else?

@bors
Contributor

bors commented Apr 2, 2019

☔ The latest upstream changes (presumably #59632) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Member

Oh sure yeah, if *_old matches the current implementation in master that'd do it!

I'm basically just curious at the assembly level what the differences are to help understand why the new version is slower than the old

@pnkfelix
Member Author

pnkfelix commented Apr 3, 2019

Okay here are some more complete transcriptions of the perf annotate output for the three cases of interest.

https://gist.github.com/pnkfelix/1b54b3272201d9f096a2289fd5712b52

Since I've taken the effort to transcribe the full machine code provided by perf annotate, I'll attempt to at least do a cursory comparison of these outputs. (And maybe also peek at the original MIR and/or LLVM IR we generated that led to these machine code sequences, though of course one must remember the machine code is post LTO...)

@alexcrichton
Member

Ok thanks! Unfortunately nothing obviously jumps out at me, so it seems like it's just inherently more branchy in the version in this PR for whatever reason, but without digging into the LLVM IR and such I wouldn't know why

@scottmcm
Member

scottmcm commented Apr 4, 2019

A possible thought: The general FromIterator for vec uses a while-let loop

while let Some(element) = iterator.next() {

You might try flipping that to a .for_each to hit the specialized implementation in Scan:

self.iter.try_fold(init, move |acc, x| {
    match f(state, x) {
        None => LoopState::Break(Try::from_ok(acc)),
        Some(x) => LoopState::from_try(fold(acc, x)),
    }
}).into_try()

And if that doesn't work, scan is only overriding try_fold; it's possible that a custom fold would simplify easier in LLVM and get the old codegen back.
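
To make the distinction concrete, here is a small sketch contrasting the two loop styles. Both function names are hypothetical; the real code lives in Vec's extend machinery and is more involved (it also deals with capacity reservation, for instance).

```rust
/// Pushes every element by calling `next()` in a loop (external iteration).
fn extend_with_while_let<T, I: Iterator<Item = T>>(dst: &mut Vec<T>, mut iter: I) {
    while let Some(element) = iter.next() {
        dst.push(element);
    }
}

/// Pushes every element through `for_each` (internal iteration): the
/// iterator drives the loop itself, so an adapter that overrides `fold`
/// or `try_fold` gets a chance to use its specialized path.
fn extend_with_for_each<T, I: Iterator<Item = T>>(dst: &mut Vec<T>, iter: I) {
    iter.for_each(|element| dst.push(element));
}
```

Both functions produce the same contents; the only intended difference is which iterator method ends up driving the loop, and therefore what the optimizer gets to see.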

@pnkfelix
Member Author

pnkfelix commented Apr 4, 2019

You might try flipping that to a .for_each to hit the specialized implementation in Scan:

[...]

And if that doesn't work, scan is only overriding try_fold; it's possible that a custom fold would simplify easier in LLVM and get the old codegen back.

I went ahead and made changes based on this advice, and it does help iter::bench_result_from_iter_into_vec significantly, bringing the new implementation's performance in line with our expectations.

test iter::bench_result_from_iter_into_last_new                 ... bench:       3,339 ns/iter (+/- 98)
test iter::bench_result_from_iter_into_last_old                 ... bench:         661 ns/iter (+/- 2)
test iter::bench_result_from_iter_into_vec_new                  ... bench:       2,063 ns/iter (+/- 2)
test iter::bench_result_from_iter_into_vec_old                  ... bench:       1,934 ns/iter (+/- 8)

That's enough to convince me that we might be able to land this change to the impl FromIterator for Result (and for Option). I'm willing to throw away the _into_last micro-benchmark as not measuring anything interesting.

Of course, it also requires that someone review my revisions to Vec::extend_desugared and a new specialized <Scan as Iterator>::fold. I'll put them up shortly. (I want to double check whether both revisions are actually necessary to get the desired performance, or if the Vec::extend_desugared change is sufficient on its own.)
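
For illustration only, a dedicated fold on a scan-like adapter might look like the sketch below. The fields of the real `Scan` are private, so `MyScan` is a hypothetical stand-in, and the body is an assumption about the general shape rather than the code proposed in this PR.

```rust
/// Hypothetical stand-in for `core::iter::Scan`, whose fields are private.
struct MyScan<I, St, F> {
    iter: I,
    state: St,
    f: F,
}

impl<B, I, St, F> Iterator for MyScan<I, St, F>
where
    I: Iterator,
    F: FnMut(&mut St, I::Item) -> Option<B>,
{
    type Item = B;

    fn next(&mut self) -> Option<B> {
        let item = self.iter.next()?;
        (self.f)(&mut self.state, item)
    }

    // Dedicated `fold`: a plain loop that stops as soon as the closure
    // returns `None`, without going through `try_fold`.
    fn fold<Acc, G>(mut self, init: Acc, mut g: G) -> Acc
    where
        G: FnMut(Acc, B) -> Acc,
    {
        let mut acc = init;
        while let Some(item) = self.iter.next() {
            match (self.f)(&mut self.state, item) {
                Some(out) => acc = g(acc, out),
                None => break,
            }
        }
        acc
    }
}
```

Callers that consume the adapter with `fold` (or with `for_each`, which is built on it) would then run this loop directly instead of the generic path.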

(This is an attempt to ensure we do not regress performance here,
since the performance of this operation varied pretty wildly over the
course of rust-lang#11084.)
…end_desugared`.

This makes use of specialized Iterator methods (when available).
…old`.

(It is easier to subsequently optimize this body, rather than starting
from `Scan::try_fold`.)
@pnkfelix pnkfelix changed the title [WIP] impl FromIterator for Option/Result via scan impl FromIterator for Option/Result via scan Apr 5, 2019
@pnkfelix
Member Author

pnkfelix commented Apr 5, 2019

(hmm, after a rebase, I am now seeing stack overflows with this PR applied. Marking WIP again.)

@pnkfelix pnkfelix changed the title impl FromIterator for Option/Result via scan [WIP] impl FromIterator for Option/Result via scan Apr 5, 2019
@pnkfelix pnkfelix added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 5, 2019
@rust-highfive
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

travis_time:end:1a470f26:start=1554457028609868861,finish=1554457136322246540,duration=107712377679
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
Setting environment variables from .travis.yml
---
travis_time:start:test_assembly
Check compiletest suite=assembly mode=assembly (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:19:29] 
[01:19:29] running 9 tests
[01:19:29] iiiiiiiii
[01:19:29] 
[01:19:29]  finished in 0.164
[01:19:29] travis_fold:end:test_assembly

---
travis_time:start:test_debuginfo
Check compiletest suite=debuginfo mode=debuginfo-both (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:19:47] 
[01:19:47] running 121 tests
[01:20:17] .iiiii...i.....i..i...i..i.i.i..i.ii...i.....i..i....i..........iiii..........i...ii...i.......ii.i. 100/121
[01:20:23] i.i......iii.i.....ii
[01:20:23] 
[01:20:23]  finished in 35.531
[01:20:23] travis_fold:end:test_debuginfo

---
[01:32:15] ...............................................................................i.i.................. 400/931
[01:32:15] .................................................................................................... 500/931
[01:32:15] .................................................................................................... 600/931
[01:32:15] .................................................................................................... 700/931
[01:32:15] ......................F........................................F.................................... 800/931
[01:32:17] ...............................
[01:32:17] failures:
[01:32:17] 
[01:32:17] ---- option::test_collect stdout ----
---
[01:32:17] 
[01:32:17] error: test failed, to rerun pass '--test coretests'
[01:32:17] 
[01:32:17] 
[01:32:17] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "x86_64-unknown-linux-gnu" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "core" "--" "--quiet"
[01:32:17] 
[01:32:17] 
[01:32:17] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:32:17] Build completed unsuccessfully in 0:25:57
[01:32:17] Build completed unsuccessfully in 0:25:57
[01:32:17] Makefile:48: recipe for target 'check' failed
[01:32:17] make: *** [check] Error 1
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:0b08d661
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri Apr  5 11:11:25 UTC 2019
---
travis_time:end:11e8dcf2:start=1554462687366630339,finish=1554462687372440308,duration=5809969
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:076c7964
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:11870736
travis_time:start:11870736
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:1968b032
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@Dylan-DPC-zz

ping from triage @pnkfelix any updates?

@pnkfelix
Member Author

I have higher priority items to attack in the near term, and there isn't much clear value provided here anyway. Closing PR.

@pnkfelix pnkfelix closed this May 17, 2019
Centril added a commit to Centril/rust that referenced this pull request Jul 28, 2019
…scottmcm

Refactoring use common code between option, result and accum

`Option` and `Result` have almost exactly the same code as the code in `accum.rs` that implements `Sum` and `Product`. This PR just moves some code so that all of them share it. I believe it is better not to implement this `Iterator` feature twice.

I'm not very familiar with pub visibility; I hope I didn't make them public. However, maybe these adapters could be useful and we could think about making them pub.

rust-lang#59605
rust-lang#11084

r? @pnkfelix
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

8 participants