Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apple platforms: Disabled frame pointer elimination causes perf issues and is not in line with what clang does #86196

Closed
hkratz opened this issue Jun 10, 2021 · 5 comments · Fixed by #86652
Assignees
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-ios Operating system: iOS O-macos Operating system: macOS P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Milestone

Comments

@hkratz
Copy link
Contributor

hkratz commented Jun 10, 2021

PR #85706 causes massive performance issues for not-inlined leaf functions since the frame pointer and the link register are now always saved and restored at the beginning and end of a function, even if this function e.g. just returns a constant. In contrast to the PR description this is not what clang does on Macos aarch64. The clang default on Macos is "frame-pointer"="non-leaf" instead.

E.g. clang compiles

unsigned long test(unsigned long num) {
    return num % 64;
}

to this LLVM IR (with clang -c test.c -S -emit-llvm -O3)

[...]
; Function Attrs: norecurse nounwind readnone ssp uwtable
define i64 @test(i64 %0) local_unnamed_addr #0 {
  %2 = and i64 %0, 63
  ret i64 %2
}

attributes #0 = { norecurse nounwind readnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="non-leaf" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "probe-stack"="__chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="apple-a12" "target-features"="+aes,+crc,+crypto,+fp-armv8,+fullfp16,+lse,+neon,+ras,+rcpc,+rdm,+sha2,+v8.3a,+zcm,+zcz" "unsafe-fp-math"="false" "use-soft-float"="false" }
[...]

which leads to this assembly:

_test:                                  ; @test
	.cfi_startproc
; %bb.0:
	and	x0, x0, #0x3f
	ret
	.cfi_endproc

Since the inclusion of the PR Rust Nightly sets "frame-pointer"="all", which causes this assembly to be generated:

_test:                                  ; @test
   .cfi_startproc
; %bb.0:
   stp	x29, x30, [sp, #-16]!           ; 16-byte Folded Spill
   mov	x29, sp
   .cfi_def_cfa w29, 16
   .cfi_offset w30, -8
   .cfi_offset w29, -16
   and	x0, x0, #0x3f
   ldp	x29, x30, [sp], #16             ; 16-byte Folded Reload
   ret
   .cfi_endproc

@jrmuizel: FYI

@hkratz hkratz changed the title Frame pointer elimination configuration on Apple platforms is not in line with what clang does Apple platformes: Disabled frame pointer elimination causes perf issues and is not in line with what clang does Jun 10, 2021
@hkratz hkratz changed the title Apple platformes: Disabled frame pointer elimination causes perf issues and is not in line with what clang does Apple platforms: Disabled frame pointer elimination causes perf issues and is not in line with what clang does Jun 10, 2021
@jrmuizel
Copy link
Contributor

Indeed. Here's what clang does:
https://github.com/llvm/llvm-project/blob/c3cc14f87f78f8172b74175bbd2557cfb9384900/clang/lib/Driver/ToolChains/Clang.cpp#L604

For AArch64 it omits the frame pointer for leaf functions. Interestingly at https://github.com/llvm/llvm-project/blob/c3cc14f87f78f8172b74175bbd2557cfb9384900/clang/lib/Driver/ToolChains/Clang.cpp#L517 you'd think the same would apply ARM and Thumb on Darwin but with a closer reading it appears that's not the case.

It looks like Rust doesn't currently support non-leaf frame pointer omission so that's going to need to be hooked up before this can fixed.

Can you elaborate on the massive performance issues? Do you have a benchmark where you noticed this?

@inquisitivecrystal
Copy link
Contributor

@rustbot label A-codegen C-bug I-prioritize I-slow O-ios O-macos regression-from-stable-to-nightly T-compiler

Not entirely sure "bug" and "regression" fit a deliberate change, but since it seems to have been based on a false assumption and have caused performance issues I'm applying them. If someone thinks they don't fit, feel free to undo.

@rustbot rustbot added A-codegen Area: Code generation C-bug Category: This is a bug. I-prioritize Issue: Indicates that prioritization has been requested for this issue. I-slow Issue: Problems and improvements with respect to performance of generated code. O-ios Operating system: iOS O-macos Operating system: macOS regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 10, 2021
@hkratz
Copy link
Contributor Author

hkratz commented Jun 11, 2021

I have to admit that I was writing a non-inlined partial load function using Neon intrinsics which was heavily affected by this. So it might not be as bad in the general case. I will run some benchmarks.

@hkratz
Copy link
Contributor Author

hkratz commented Jun 14, 2021

The time for an idem function call (returning the u64 passed in as an argument) on Apple M1 is:

  • 0.34 ns inlined
  • 0.96 ns without saving the frame pointer
  • 1.44 ns with saving the frame pointer

The performance difference in the real world applications I have measured (ripgrep, rustc) is negligable. The size of generated binaries is somewhat surprising: The __text section size is 1817500 on ripgrep without saving the fp vs 1846828 (x 1.016) saving the fp. However the __unwind_info and __const sections are much larger leading to an overall stripped binary size of 2926720 without saving the fp vs 2811168 saving the fp.

So all in all while I still think that using "frame-pointer"="non-leaf" on Macos is generally a good idea since it has the potential to affect performance in rare cases it is not as urgent as I thought. Mea culpa.

@pietroalbini pietroalbini added this to the 1.54.0 milestone Jun 14, 2021
@apiraino
Copy link
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Jun 17, 2021
@nagisa nagisa self-assigned this Jun 24, 2021
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Jul 1, 2021
…henkov

Add support for leaf function frame pointer elimination

This PR adds ability for the target specifications to specify frame
pointer emission type that's not just “always” or “whatever cg decides”.

In particular there's a new mode that allows omission of the frame
pointer for leaf functions (those that don't call any other functions).

We then set this new mode for Aarch64-based Apple targets.

Fixes rust-lang#86196
@bors bors closed this as completed in 9b67cba Jul 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-ios Operating system: iOS O-macos Operating system: macOS P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants