Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterate over the smaller list #78323

Merged
merged 2 commits into from
Oct 28, 2020
Merged

Conversation

est31
Copy link
Member

@est31 est31 commented Oct 24, 2020

If there are two lists of different sizes,
iterating over the smaller list and then
looking up in the larger list is cheaper
than vice versa, because lookups scale
sublinearly.

If there are two lists of different sizes,
iterating over the smaller list and then
looking up in the larger list is cheaper
than vice versa, because lookups scale
sublinearly.
@rust-highfive
Copy link
Collaborator

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 24, 2020
@est31
Copy link
Member Author

est31 commented Oct 24, 2020

see #78317 for context.

@Mark-Simulacrum could this get a perf run?

@jonas-schievink
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 24, 2020

⌛ Trying commit a21c2eb with merge 6d375b994aad05109a6a39de4ae01e8ed8940f4d...

@bors
Copy link
Contributor

bors commented Oct 24, 2020

☀️ Try build successful - checks-actions
Build commit: 6d375b994aad05109a6a39de4ae01e8ed8940f4d (6d375b994aad05109a6a39de4ae01e8ed8940f4d)

@rust-timer
Copy link
Collaborator

Queued 6d375b994aad05109a6a39de4ae01e8ed8940f4d with parent 2e8a54a, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (6d375b994aad05109a6a39de4ae01e8ed8940f4d): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@est31
Copy link
Member Author

est31 commented Oct 25, 2020

The biggest change is in the bootstrap timings, many of which going down in total compile time by 2% and 1%. I'm not sure how noisy they are but if multiple go down at the same time, can it be due to noise?

There's a tiny change in packed-simd-opt run incr-unchanged of 0.062 seconds to 0.059 seconds for crate_inherent_impls_overlap_check, so from 1.70% to 1.63% of the total compile time.

Local benchmarks of stm32f4 give me a regression, likely because its impl blocks are highly-homogenous and thus don't benefit from the optimization (the optimization is best for heterogenous impl blocks) and instead one feels the overhead of doing the length comparison... It also calls traits::overlapping_impls very often so that the pre-check code is only fluff that's in the way. Edit: turns out to be an artifact of how I measured, see my comment below.

I have added a commit to this PR that saves some allocations in the arena. Would it be possible to get another perf run for the PR? Then I can compare the two commits to get the impact of that new commit... that works, right?

@jyn514
Copy link
Member

jyn514 commented Oct 25, 2020

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 25, 2020

⌛ Trying commit 5afdb1efe64d306a5ed9dbb528da7c0f50d4258b with merge f5b6e9824f9430f5bbe58b7c12c4a69612e93a30...

@bors
Copy link
Contributor

bors commented Oct 25, 2020

☀️ Try build successful - checks-actions
Build commit: f5b6e9824f9430f5bbe58b7c12c4a69612e93a30 (f5b6e9824f9430f5bbe58b7c12c4a69612e93a30)

@rust-timer
Copy link
Collaborator

Queued f5b6e9824f9430f5bbe58b7c12c4a69612e93a30 with parent 3e0dd24, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (f5b6e9824f9430f5bbe58b7c12c4a69612e93a30): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@est31
Copy link
Member Author

est31 commented Oct 25, 2020

Now it's red again for the bootstrap times, but green in the instruction counts. Maybe it's noise? Maybe it was noise in #78317 as well?

IDK should I close this or should it be merged? What do you think?

Copy link
Contributor

@bugadani bugadani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how wall times are measured exactly (well, besides starting and stopping a clock along with the process) and how sensitive is perf to it's environment (processes on the VM, other VMs on the node, etc.). I wouldn't consider a 2% change a huge one there. On the other hand, I don't know enough about the setup to know for sure.

Around half of the averages are 0.0%, others show a tiny improvement. Not much, but I can't see any significant regressions, so I don't see why this shouldn't be accepted.

The associated_items(def_id) call
allocates internally.
Previously, we'd have called it for
each pair, so we'd have had O(n^2)
many calls. By precomputing the
associated items, we avoid
repeating so many allocations.

The only instance where this precomputation
would be a regression is if there's only
one inherent impl block for the type,
as the inner loop then doesn't run.
In that instance, we just early return.

Also, use SmallVec to avoid doing an
allocation at all if the number is small
(the case for most impl blocks out there).
@est31
Copy link
Member Author

est31 commented Oct 26, 2020

Sooo... apparently the way I made local benchmarks of stm32f4 was flawed: I rebased my PR onto the commit of the latest nightly, then compared a local build of rustc with the PR with the nightly. But they were built in different environments, so the local build was a bit slower. If I take the PR's base, the same code that the nightly uses, and compare it to the nightly, I get a slowdown of my local build as well. So I changed the way I run benchmarks to compare two local builds: once of the code as of this PR, and with a local build of the PR's base. Then I see a noticeable speedup:

features cargo +local-bare check cargo +local check
stm32f401,stm32f405 31.838s 31.340s
stm32f401,stm32f405,stm32f407,stm32f410,
stm32f411
1m34.270s 1m33.241s
stm32f401,stm32f405,stm32f407,stm32f410,
stm32f411,stm32f412
2m3.626s 2m1.451s
stm32f401,stm32f405,stm32f407,stm32f410,
stm32f411,stm32f412,stm32f413,stm32f427
3m35.581s 3m30.222s
stm32f401,stm32f405,stm32f407,stm32f410,
stm32f411,stm32f412,stm32f413,stm32f427,
stm32f429,stm32f446,stm32f469
7m28.523s 7m17.673s

Up to 9 seconds speedup if all stm32f4 features are enabled!

This helped me convince of the worth of this PR.

@est31
Copy link
Member Author

est31 commented Oct 26, 2020

Also ran some -Zself-profile. In the instance of 8 enabled features (4th line in the table) crate_inherent_impls_overlap_check took 28.64s before this PR, and 23.53s after, using 8-10% of the entire compile time.

@varkor
Copy link
Member

varkor commented Oct 28, 2020

@bors r+

@bors
Copy link
Contributor

bors commented Oct 28, 2020

📌 Commit 6c9b8ad has been approved by varkor

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 28, 2020
@bors
Copy link
Contributor

bors commented Oct 28, 2020

⌛ Testing commit 6c9b8ad with merge 2eb4fc8...

@bors
Copy link
Contributor

bors commented Oct 28, 2020

☀️ Test successful - checks-actions
Approved by: varkor
Pushing 2eb4fc8 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 28, 2020
@bors bors merged commit 2eb4fc8 into rust-lang:master Oct 28, 2020
@rustbot rustbot added this to the 1.49.0 milestone Oct 28, 2020
@Mark-Simulacrum
Copy link
Member

Appears to have been a slight win on packed-simd incr-unchanged instruction counts, though it is unclear that this had any effect on cpu cycles or wall times.

@est31
Copy link
Member Author

est31 commented Nov 4, 2020

@Mark-Simulacrum yes that's consistent with what the manual benchmark runs showed. The wall clock time changes are too small for it to matter for smaller workloads in the builtin benchmark testsuite, but they are observable with real life workloads like stm32f4 (see comments above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants