Moving function into a separate crate results in a more effective code for some reason #41894

Open
newpavlov opened this Issue May 10, 2017 · 3 comments

newpavlov commented May 10, 2017

While working on optimizations for crypto-hashes I've noticed the very strange behaviour described in the title. I've isolated the relevant code into this repository, so you can run it yourself.

Enabling LTO produces the same slow result for the separate-crate case as for the in-crate one. Optimal code is generated only if #[inline] or #[inline(always)] is used on the compress function. Also, the generated assembly for the two cases is quite different despite the identical source code.
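To illustrate the attribute placement described above, here is a minimal sketch. The function body and the RC table are hypothetical stand-ins, not the code from the repository; only the name compress and the #[inline] attribute come from this report.

```rust
/// Hypothetical round constants (assumption: the real crate has its own table).
pub const RC: [u32; 4] = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee];

/// Stand-in for the `compress` function discussed in this issue.
/// `#[inline]` makes the body available for inlining into downstream
/// crates even without LTO. It is a hint to the compiler, not a guarantee.
#[inline]
pub fn compress(state: &mut [u32; 4], block: &[u32; 4]) {
    for i in 0..4 {
        // Wrapping adds mirror the modular arithmetic typical of hash rounds.
        let t = state[i].wrapping_add(block[i]).wrapping_add(RC[i]);
        state[i] = t.rotate_left(7);
    }
}
```

Calling compress(&mut state, &block) from another crate would then let the compiler see both the function body and the constant table at the call site.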

It's probably due to some mis-optimization that gets turned off when the function is in a different crate but still available for inlining.

UPD: It was reported on Reddit that 64-bit ARM shows the same performance for both cases.

newpavlov commented May 11, 2017

Citing doener:

It's tied to the RC array. Moving just that array out of the original crate makes all the difference. Not sure why (yet).

OK, so with the array values available, LLVM can collapse two add instructions into a single lea instruction.
For example:

add    0x4(%r10),%edx # %r10 is the base address of RC
add    %ecx,%edx

Becomes:

lea    -0x173848aa(%rdi,%rax,1),%eax
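The folding can be sketched in Rust as follows. The function names and the two-element table are hypothetical and only illustrate the two shapes of the computation. The one value taken from the disassembly is the displacement -0x173848aa, which is 0xe8c7b756 when reinterpreted as an unsigned 32-bit integer.

```rust
/// Hypothetical two-entry table; the second entry equals the `lea`
/// displacement above when that displacement is read as a u32.
pub const RC: [u32; 2] = [0xd76aa478, 0xe8c7b756];

/// Table value not known at compile time: the constant is loaded from
/// memory and combined with two separate additions.
pub fn step_load(a: u32, b: u32, rc: &[u32; 2], i: usize) -> u32 {
    a.wrapping_add(rc[i]).wrapping_add(b)
}

/// Constant visible to the optimizer: it may fold the constant into a
/// single address computation of the form `lea disp(%a,%b,1)`.
pub fn step_const(a: u32, b: u32) -> u32 {
    a.wrapping_add(b).wrapping_add(0xe8c7b756)
}

/// The displacement from the disassembly, reinterpreted as unsigned.
pub fn lea_displacement() -> u32 {
    (-0x173848aa_i32) as u32
}
```

Both functions compute the same value. Whether a given compiler version actually emits add or lea for either one depends on the target and optimization settings.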

Now the problem is that this causes frontend stalls. perf stat shows:

1,946,734,091 stalled-cycles-frontend # 65.05% frontend cycles idle ( +- 15.66% )

vs.

1,062,250,896 stalled-cycles-frontend # 44.28% frontend cycles idle ( +- 21.81% )

This is an issue on Intel CPUs since Sandy Bridge, where the three-operand LEA instruction has higher latency and limited dispatch port choices; see https://software.intel.com/en-us/node/544484

samlh commented Jun 1, 2017

newpavlov commented Oct 29, 2018

The issue is still reproducible on Nightly 2018-10-28.
