Strange behaviour of inlining #1969

Open · Sahnvour opened this issue Feb 16, 2019 · 1 comment
Labels: use case (Describes a real use case that is difficult or impossible, but does not propose a solution.)

Sahnvour (Contributor) commented Feb 16, 2019

I was trying various example usages of https://github.com/Sahnvour/zig-benchmark and found something that looks odd.

Using this code https://gist.github.com/Sahnvour/829e3d7930964f976ce97de60a644a27
and running it with zig run test.zig --release-fast, we see dramatically different results for insert sequential depending on whether the other 2 benchmarks (initially commented out) are executed or not.

(These benchmarked functions are not really a good way to measure the performance of HashMap, but that's not the point here.)

Example output from my machine, first only with insert sequential:

benchmarkArgs("insert sequential", mapInsertSequential, maps);
// benchmarkArgs("insert sequential evens", mapInsertSequentialEvens, maps);
// benchmarkArgs("insert sequential odds", mapInsertSequentialOdds, maps);
$ zig run test.zig --release-fast
insert sequential <HashMap(u32,u32,getAutoHashFn(u32)_hash,getAutoEqlFn(u32)_eql)>: avg 1.429ms (350 iterations)

Then with the other 2 benchmarks:

benchmarkArgs("insert sequential", mapInsertSequential, maps);
benchmarkArgs("insert sequential evens", mapInsertSequentialEvens, maps);
benchmarkArgs("insert sequential odds", mapInsertSequentialOdds, maps);
$ zig run test.zig --release-fast
insert sequential <HashMap(u32,u32,getAutoHashFn(u32)_hash,getAutoEqlFn(u32)_eql)>: avg 2.456ms (204 iterations)
insert sequential evens <HashMap(u32,u32,getAutoHashFn(u32)_hash,getAutoEqlFn(u32)_eql)>: avg 3.493ms (144 iterations)
insert sequential odds <HashMap(u32,u32,getAutoHashFn(u32)_hash,getAutoEqlFn(u32)_eql)>: avg 3.191ms (157 iterations)

As you can see, the average time for insert sequential greatly increased, even though we didn't modify it in any way. The only difference is that more code is executed in main.

I looked quickly at the assembly, and the main difference I spotted is that in the first case HashMap.put() is inlined into mapInsertSequential, while in the second it is not.
Note: my library forces the benchmarked functions not to be inlined when called. So, as I understand it, the outer scope (here main) shouldn't have any influence on the inlining decisions taken inside them?
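
The gist's harness isn't quoted here, but for context, here is a minimal sketch of how a call can be kept out of line, assuming the @inlineCall/@noInlineCall builtins Zig had at the time; the work function is a made-up stand-in, not the gist's code:

const std = @import("std");

// Hypothetical stand-in for a benchmarked function, not the gist's code.
fn work(n: u32) u32 {
    var sum: u32 = 0;
    var i: u32 = 0;
    while (i < n) : (i += 1) sum +%= i;
    return sum;
}

pub fn main() void {
    // @noInlineCall forces a real call: the callee stays a separate
    // function, so code added or removed around the call site should
    // not change the optimizer's view of the callee's own body.
    const r = @noInlineCall(work, 1000);
    std.debug.warn("{}\n", r);
}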

rohlem (Contributor) commented Feb 16, 2019

I don't have much inside knowledge, but here's an explanation I can come up with:
A function with only a single call site can be inlined and its "detached" implementation discarded, always resulting in a net size reduction. Following that logic, inlining a function that is only called in one place makes perfect sense, and I would assume all/most other functions (map.init, map.clear, etc.) are inlined as well.
With multiple call sites, it's the optimizer's decision to balance per-call-site performance against global code bloat. I don't think Zig does any function analysis in this regard itself (though I really have no information on this); I assume this decision is left entirely to LLVM's optimization passes (barring explicit requests via @inlineCall/@noInlineCall).
Under this premise, while the performance difference is noticeable, it doesn't look to me like a bug on Zig's part, but simply suboptimal behaviour on the part of LLVM's optimizations.

That leaves several options:

  • Realizing "hot code" like this is a bottleneck in an application, it's a good idea to try manually forcing inlining via @inlineCall.
  • The generated LLVM IR could be analyzed and altered by hand, to see if we (by which I mean someone more informed/competent in this than present-me) can
    • generate more optimizer-friendly IR; this might be a very involved process, where improving one scenario could degrade performance in another.
    • trace this behaviour back to a past change in LLVM, to file it as an upstream bug and make LLVM smarter about these scenarios.
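
As a minimal sketch of the first option, assuming the call-site builtin of that era (@inlineCall takes the function first, then its arguments; all names below are made up for illustration):

const std = @import("std");

fn accumulate(total: *u64, x: u64) void {
    total.* +%= x *% x;
}

pub fn main() void {
    var total: u64 = 0;
    var i: u64 = 0;
    while (i < 1000) : (i += 1) {
        // Request inlining at this call site regardless of what LLVM's
        // heuristics would otherwise decide for this hot loop.
        @inlineCall(accumulate, &total, i);
    }
    std.debug.warn("{}\n", total);
}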

All that being said, there is probably room for more fine-grained tuning from within Zig. --release-fast (in contrast to --release-small) should already prioritize speed over code size (although past a certain application size, smaller code is faster), but maybe there could be a builtin for scope-level flagging, similar to @setFloatMode, to specify "in this loop, inline N levels of function calls".
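
To illustrate what such a builtin would automate, here is how "inline two levels" has to be spelled out by hand today, under the same assumptions as above (all names hypothetical):

const std = @import("std");

fn inner(x: u64) u64 {
    return x *% 2654435761;
}

fn outer(x: u64) u64 {
    // Second level: the nested call must also be marked by hand.
    return @inlineCall(inner, x) +% 1;
}

pub fn main() void {
    var acc: u64 = 0;
    var i: u64 = 0;
    while (i < 1000) : (i += 1) {
        // First level: a scope-level builtin on this loop would replace
        // both of these per-call-site annotations.
        acc +%= @inlineCall(outer, i);
    }
    std.debug.warn("{}\n", acc);
}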

@andrewrk andrewrk modified the milestones: 0.4.0, 0.5.0 Feb 18, 2019
@andrewrk andrewrk modified the milestones: 0.5.0, 0.6.0 Sep 20, 2019
@andrewrk andrewrk modified the milestones: 0.6.0, 0.7.0 Dec 31, 2019
@andrewrk andrewrk modified the milestones: 0.7.0, 0.8.0 Oct 10, 2020
@andrewrk andrewrk modified the milestones: 0.8.0, 0.9.0 Nov 6, 2020
@SpexGuy SpexGuy added the use case Describes a real use case that is difficult or impossible, but does not propose a solution. label Mar 20, 2021
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 May 19, 2021