Skip to content

[BOLT] Fix density for jump-through functions #145619

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

aaupov
Copy link
Contributor

@aaupov aaupov commented Jun 25, 2025

Address the issue that stems from how the density is computed.

Binary function density is the ratio of its total dynamic number of
executed bytes over the static size in bytes. The meaning of it is the
amount of dynamic profile information relative to its static size.

Binary profile density is the minimum function density among well-
-profiled
functions, taken as functions covering p99 samples, or, in
other words, excluding functions in the tail 1% of samples. p99 is an
arbitrary cutoff. The meaning of profile density is the minimum amount
of profile information per function
to be able to optimize the program
well. The threshold for profile density is set empirically.

The dynamically executed bytes are taken directly from LBR fall-throughs
and for LBRs recorded in trampoline functions, such as

000000001a941ec0 <Sleef_expf8_u10>:
1a941ec0: jmpq *0x37b911fa(%rip) # <pnt_expf8_u10>
1a941ec6: nopw %cs:(%rax,%rax)

the fall-through has zero length:

# Branch   Target   NextBranch  Count
T 1b171cf6 1a941ec0 1a941ec0    568562

But it's not correct to say this function has zero executed bytes, just
the size of the next branch is not included in the fall-through.

If such functions have non-trivial sample count, they will fall in p99
samples, and cause the profile density to be zero.

To solve this, we can either:

  1. Include fall-through end jump size into executed bytes:
    is logically sound but technically challenging: the size needs to
    come from disassembly (expensive), and the threshold need to be
    reevaluated with updated definition of binary function density.
  2. Exclude pass-through functions from density computation:
    follows the intent of profile density which is to set the amount of
    profile information needed to optimize the function well. Single
    instruction pass-through functions don't need samples many times
    the size to be optimized well.

Go with option 2 as a reasonable compromise.

Test Plan: added bolt/test/X86/zero-density.s

Created using spr 1.3.4
@llvmbot
Copy link
Member

llvmbot commented Jun 25, 2025

@llvm/pr-subscribers-bolt

Author: Amir Ayupov (aaupov)

Changes

Address the issue that stems from how the density is computed.

Binary function density is the ratio of its total dynamic number of
executed bytes over the static size in bytes. The meaning of it is the
amount of dynamic profile information relative to its static size.

Binary profile density is the minimum function density among well-
-profiled
functions, taken as functions covering p99 samples, or, in
other words, excluding functions in the tail 1% of samples. p99 is an
arbitrary cutoff. The meaning of profile density is the minimum amount
of profile information per function
to be able to optimize the program
well. The threshold for profile density is set empirically.

The dynamically executed bytes are taken directly from LBR fall-throughs
and for LBRs recorded in trampoline functions, such as

000000001a941ec0 &lt;Sleef_expf8_u10&gt;:
1a941ec0: jmpq *0x37b911fa(%rip) # &lt;pnt_expf8_u10&gt;
1a941ec6: nopw %cs:(%rax,%rax)

the fall-through has zero length:

T 1b171cf6 1a941ec0 1a941ec0    568562

But it's not correct to say this function has zero executed bytes, just
the size of the (next) branch is not included in the fall-through.

If such functions have non-trivial sample count, they will fall in p99
samples, and cause the profile density to be zero.

To solve this, we can either:

  1. Include fall-through end jump size into executed bytes.
  2. Exclude pass-through functions from density computation.
  1. is logically sound but technically challenging: the data needs to
    come from disassembly (expensive), and the thresholds need to be
    reset with updated definition of binary function density.
  2. is a reasonable compromise that follows the intent of profile density
    which is to set the amount of profile information needed to optimize
    the function well. Single instruction pass-through functions don't
    need samples many times the size to optimize well.

Test Plan: added bolt/test/X86/zero-density.s


Full diff: https://github.com/llvm/llvm-project/pull/145619.diff

2 Files Affected:

  • (modified) bolt/lib/Passes/BinaryPasses.cpp (+16-17)
  • (added) bolt/test/X86/zero-density.s (+32)
diff --git a/bolt/lib/Passes/BinaryPasses.cpp b/bolt/lib/Passes/BinaryPasses.cpp
index e356481bbdc7c..5d44e1a1a4902 100644
--- a/bolt/lib/Passes/BinaryPasses.cpp
+++ b/bolt/lib/Passes/BinaryPasses.cpp
@@ -1765,27 +1765,26 @@ Error PrintProgramStats::runOnFunctions(BinaryContext &BC) {
 
   if (opts::ShowDensity) {
     double Density = 0.0;
-    // Sorted by the density in descending order.
-    llvm::stable_sort(FuncDensityList,
-                      [&](const std::pair<double, uint64_t> &A,
-                          const std::pair<double, uint64_t> &B) {
-                        if (A.first != B.first)
-                          return A.first > B.first;
-                        return A.second < B.second;
-                      });
+    llvm::sort(FuncDensityList);
 
     uint64_t AccumulatedSamples = 0;
-    uint32_t I = 0;
     assert(opts::ProfileDensityCutOffHot <= 1000000 &&
            "The cutoff value is greater than 1000000(100%)");
-    while (AccumulatedSamples <
-               TotalSampleCount *
-                   static_cast<float>(opts::ProfileDensityCutOffHot) /
-                   1000000 &&
-           I < FuncDensityList.size()) {
-      AccumulatedSamples += FuncDensityList[I].second;
-      Density = FuncDensityList[I].first;
-      I++;
+    // Subtract samples in zero-density functions (no fall-throughs) from
+    // TotalSampleCount (not used anywhere below).
+    for (const auto [CurDensity, CurSamples] : FuncDensityList) {
+      if (CurDensity != 0.0)
+        break;
+      TotalSampleCount -= CurSamples;
+    }
+    const uint64_t CutoffSampleCount =
+        1.f * TotalSampleCount * opts::ProfileDensityCutOffHot / 1000000;
+    // Process functions in decreasing density order
+    for (const auto [CurDensity, CurSamples] : llvm::reverse(FuncDensityList)) {
+      if (AccumulatedSamples >= CutoffSampleCount)
+        break;
+      AccumulatedSamples += CurSamples;
+      Density = CurDensity;
     }
     if (Density == 0.0) {
       BC.errs() << "BOLT-WARNING: the output profile is empty or the "
diff --git a/bolt/test/X86/zero-density.s b/bolt/test/X86/zero-density.s
new file mode 100644
index 0000000000000..7843804ffed8c
--- /dev/null
+++ b/bolt/test/X86/zero-density.s
@@ -0,0 +1,32 @@
+## Check that trampoline functions are excluded from density computation.
+
+# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s -o %t.o
+# RUN: ld.lld %t.o -o %t
+# RUN: link_fdata %s %t %t.preagg PREAGG
+# RUN: llvm-strip -NLjmp %t
+# RUN: perf2bolt %t -p %t.preagg --pa -o %t.fdata | FileCheck %s
+# CHECK: Functions with density >= {{.*}} account for 99.00% total sample counts.
+# CHECK-NOT: the output profile is empty or the --profile-density-cutoff-hot option is set too low.
+
+  .text
+  .globl trampoline
+trampoline:
+  mov main,%rax
+  jmpq *%rax
+.size trampoline,.-trampoline
+# PREAGG: f #trampoline# #trampoline# 2
+
+	.globl main
+main:
+	.cfi_startproc
+	vmovaps %zmm31,%zmm3
+
+	add    $0x4,%r9
+	add    $0x40,%r10
+	dec    %r14
+Ljmp:
+	jne    main
+# PREAGG: T #Ljmp# #main# #Ljmp# 10
+	ret
+	.cfi_endproc
+.size main,.-main

@aaupov aaupov merged commit 4984714 into main Jun 26, 2025
9 checks passed
@aaupov aaupov deleted the users/aaupov/spr/bolt-fix-density-for-jump-through-functions branch June 26, 2025 05:20
anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025
Address the issue that stems from how the density is computed.

Binary *function* density is the ratio of its total dynamic number of
executed bytes over the static size in bytes. The meaning of it is the
amount of dynamic profile information relative to its static size.

Binary *profile* density is the minimum *function* density among *well-
-profiled* functions, taken as functions covering p99 samples, or, in
other words, excluding functions in the tail 1% of samples. p99 is an
arbitrary cutoff. The meaning of profile density is the *minimum amount
of profile information per function* to be able to optimize the program
well. The threshold for profile density is set empirically.

The dynamically executed bytes are taken directly from LBR fall-throughs
and for LBRs recorded in trampoline functions, such as
```
000000001a941ec0 <Sleef_expf8_u10>:
1a941ec0: jmpq *0x37b911fa(%rip) # <pnt_expf8_u10>
1a941ec6: nopw %cs:(%rax,%rax)
```
the fall-through has zero length:
```
# Branch   Target   NextBranch  Count
T 1b171cf6 1a941ec0 1a941ec0    568562
```

But it's not correct to say this function has zero executed bytes, just
the size of the next branch is not included in the fall-through.

If such functions have non-trivial sample count, they will fall in p99
samples, and cause the profile density to be zero.

To solve this, we can either:
1. Include fall-through end jump size into executed bytes:
   is logically sound but technically challenging: the size needs to
   come from disassembly (expensive), and the threshold need to be
   reevaluated with updated definition of binary function density.
2. Exclude pass-through functions from density computation:
   follows the intent of profile density which is to set the amount of 
   profile information needed to optimize the function well. Single
   instruction pass-through functions don't need samples many times 
   the size to be optimized well.

Go with option 2 as a reasonable compromise.

Test Plan: added bolt/test/X86/zero-density.s
rlavaee pushed a commit to rlavaee/llvm-project that referenced this pull request Jul 1, 2025
Address the issue that stems from how the density is computed.

Binary *function* density is the ratio of its total dynamic number of
executed bytes over the static size in bytes. The meaning of it is the
amount of dynamic profile information relative to its static size.

Binary *profile* density is the minimum *function* density among *well-
-profiled* functions, taken as functions covering p99 samples, or, in
other words, excluding functions in the tail 1% of samples. p99 is an
arbitrary cutoff. The meaning of profile density is the *minimum amount
of profile information per function* to be able to optimize the program
well. The threshold for profile density is set empirically.

The dynamically executed bytes are taken directly from LBR fall-throughs
and for LBRs recorded in trampoline functions, such as
```
000000001a941ec0 <Sleef_expf8_u10>:
1a941ec0: jmpq *0x37b911fa(%rip) # <pnt_expf8_u10>
1a941ec6: nopw %cs:(%rax,%rax)
```
the fall-through has zero length:
```
# Branch   Target   NextBranch  Count
T 1b171cf6 1a941ec0 1a941ec0    568562
```

But it's not correct to say this function has zero executed bytes, just
the size of the next branch is not included in the fall-through.

If such functions have non-trivial sample count, they will fall in p99
samples, and cause the profile density to be zero.

To solve this, we can either:
1. Include fall-through end jump size into executed bytes:
   is logically sound but technically challenging: the size needs to
   come from disassembly (expensive), and the threshold need to be
   reevaluated with updated definition of binary function density.
2. Exclude pass-through functions from density computation:
   follows the intent of profile density which is to set the amount of 
   profile information needed to optimize the function well. Single
   instruction pass-through functions don't need samples many times 
   the size to be optimized well.

Go with option 2 as a reasonable compromise.

Test Plan: added bolt/test/X86/zero-density.s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants