
Performance regression in 1.48.0 #79246

Closed
marmeladema opened this issue Nov 20, 2020 · 19 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@marmeladema
Contributor

marmeladema commented Nov 20, 2020

Hello everyone!

We detected a performance regression in a closed-source project after upgrading Rust from version 1.47.0 to version 1.48.0.
The benchmarks we run internally use the low-level perf_event_open Linux API to measure various hardware performance counters on specific code sections.
The most stable and reliable counter we care about is the number of retired instructions, corresponding to instructions:u in classic perf output.
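
For illustration, counting retired instructions around a specific code section can be done from Rust with the third-party perf-event crate; this is only a sketch of the technique, not our actual harness:

use perf_event::{events::Hardware, Builder};

fn main() -> std::io::Result<()> {
    // Count retired instructions (roughly `instructions:u` in perf terms)
    // around a specific code section.
    let mut counter = Builder::new().kind(Hardware::INSTRUCTIONS).build()?;

    counter.enable()?;
    let v: Vec<u64> = (0..1_000_000).collect(); // section under measurement
    counter.disable()?;

    println!("{} instructions retired (len = {})", counter.read()?, v.len());
    Ok(())
}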

On certain datasets, the regressions are up to +16% in the number of instructions executed.

I used cargo-bisect-rustc as instructed on https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Performance.20regressions.20in.201.2E48.2E0 in order to search for the commit that introduced the regression:

searched nightlies: from nightly-2020-09-03 to nightly-2020-11-20
regressed nightly: nightly-2020-09-04
searched commits: from 80fc9b0 to 0d0f6b1
regressed commit: 0d0f6b1

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script=./bisect.sh 

The bisection points to #70793

I was able to reproduce the regression on two different machines.
I will also try to provide a minimized code example to reproduce the issue but that will take some time as the code being measured is quite big.

If anyone else has noticed this regression, any tools, ideas, or other pointers that could help me reproduce the issue would be greatly appreciated!

EDIT:
MCVE at https://github.com/marmeladema/bitvec-perf-regression

@Mark-Simulacrum Mark-Simulacrum added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Nov 20, 2020
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Nov 20, 2020
@Mark-Simulacrum Mark-Simulacrum added I-slow Issue: Problems and improvements with respect to performance of generated code. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Nov 20, 2020
@Mark-Simulacrum
Member

I suspect a good next step would be to try to narrow down the code being regressed -- maybe with something like a cachegrind diff -- to attempt to isolate the problem, and perhaps propose a fix.

@JulianKnodt
Contributor

JulianKnodt commented Nov 21, 2020

Just curious, are you using LTO when you notice the regression? I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline], which in my experience can cause a slowdown.

Edit: after reading that you're specifically looking at instructions:u, I'm less certain: I would imagine that more inlining leads to more instructions retired, but now I'm not sure.

@marmeladema
Contributor Author

Just curious, are you using LTO when you notice the regression? I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline], which in my experience can cause a slowdown.

Yes, LTO is enabled for release builds, which is what is being measured here.

@the8472
Member

the8472 commented Nov 21, 2020

If it helps narrow down which code to test: the changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again, optionally involving some of the iterator adapters from the core crate. Anything else would be unintended.
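
For example, something of this shape goes through the new path (just an illustrative sketch; the map adapter is a placeholder):

// Illustrative sketch: a Vec is consumed with into_iter() and collected back
// into a Vec, so the new specialization can reuse the original allocation.
fn double_all(v: Vec<u32>) -> Vec<u32> {
    v.into_iter().map(|x| x * 2).collect()
}

fn main() {
    assert_eq!(double_all(vec![1, 2, 3]), vec![2, 4, 6]);
}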

The primary goal of that PR is to reduce malloc/free traffic by reusing allocations where possible. I only measured with the system allocator. Do you use that or a custom one?

Part of that optimization also involves switching from loops over next() to try_fold(). That might affect vectorization, but most of the microbenchmarks that I ran under perf seemed to vectorize fine, even with iterators involving several adapters.
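
Sketched very roughly (this is not the actual library code), the difference is between these two styles of iteration:

// External iteration: the caller drives the loop with next().
fn sum_with_next(mut it: impl Iterator<Item = u32>) -> u32 {
    let mut sum = 0;
    while let Some(x) = it.next() {
        sum += x;
    }
    sum
}

// Internal iteration: try_fold() lets the iterator drive the loop, which can
// optimize (and vectorize) differently for chains of adapters.
fn sum_with_try_fold(mut it: impl Iterator<Item = u32>) -> u32 {
    it.try_fold(0u32, |acc, x| Some(acc + x)).unwrap()
}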

There are also a bunch of smaller tweaks relating to allocations in various Vec iteration/collection methods, such as changing a memcpy between allocations to an internal memmove.

I was looking at the code in the PR and there are a lot of small methods which aren't marked with #[inline]

AFAIU, #[inline] annotations are only needed on non-generic methods, and the SourceIter methods are all generic.

@the8472
Member

the8472 commented Nov 21, 2020

The changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again

Small correction: it may also affect some iterators involving Vec::into_iter() and zip without collecting into one of the above types again.

@marmeladema
Contributor Author

If it helps narrowing down which code to test: The changes should only affect code using into_iter() on Vec, Box<[]> or BinaryHeap and then collecting into one of those types again, optionally involving some of the iterator adapters from the core crate. Anything else would be unintended.

Ok, thanks! That is valuable information. FYI, the measured code makes some (maybe heavy?) use of Vec/Box<[]>, iterating over and collecting into those, but there is no usage of BinaryHeap.

@marmeladema
Contributor Author

I continued my investigation and did a perf diff between two runs of the crate, comparing stable 1.47.0 to 1.48.0:

  • with rust 1.47.0, most of the time is spent in:
bitvec::vec::iter::<impl core::iter::traits::collect::Extend<bool> for bitvec::vec::BitVec<O,T>>::extend
  • whereas with rust 1.48.0, most of the time is spent in:
bitvec::vec::api::<impl bitvec::vec::BitVec<O,T>>::push

This led me to focus on the bitvec crate itself, and especially on the BitVec::extend method, which seems to be the one severely impacted by the regression.
The bitvec version being tested was 0.17.4; after updating to version 0.19.4, I could not trigger the regression anymore.

I managed to provide a self-contained and (relatively) minimized example on https://github.com/marmeladema/bitvec-perf-regression
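
The regressing pattern is essentially extending a BitVec from an iterator of bools; a simplified, hypothetical sketch (not the exact MCVE code, assuming bitvec's prelude and its default type parameters) looks like this:

use bitvec::prelude::*;

fn main() {
    // Extend a BitVec from an iterator of bools; this goes through
    // <BitVec as Extend<bool>>::extend, the method impacted by the regression.
    let input: Vec<u64> = (0..1_000_000).collect();
    let mut bits: BitVec = BitVec::new();
    bits.extend(input.iter().map(|&x| x % 3 == 0));
    assert_eq!(bits.len(), input.len());
}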

I ran the bisection again and it failed with:

********************************************************************************
Regression in nightly-2020-09-02
********************************************************************************

fetching https://static.rust-lang.org/dist/2020-09-01/channel-rust-nightly-git-commit-hash.txt
ERROR: Tarball not found at https://static.rust-lang.org/dist/2020-09-01/channel-rust-nightly-git-commit-hash.txt

but this time it seems to point to nightly-2020-09-02 instead of nightly-2020-09-04 ...
Not sure exactly which one is the culprit?

I have also bisected the bitvec crate itself, and the regression disappears at ferrilab/bitvec@f821122, which is basically a rewrite of the crate.

@marmeladema
Contributor Author

I managed to manually bisect rustc, and it seems the bitvec regression was introduced by commit 1d22f75, which corresponds to the merge of #76030.

@marmeladema
Contributor Author

marmeladema commented Nov 22, 2020

The regression still triggers with the latest nightly rustc 1.50.0-nightly (da3846948 2020-11-21) and I can confirm that reverting #76030:

patch
diff --git a/compiler/rustc_codegen_llvm/src/type_of.rs b/compiler/rustc_codegen_llvm/src/type_of.rs
index 8ea4768f77d..0876907e119 100644
--- a/compiler/rustc_codegen_llvm/src/type_of.rs
+++ b/compiler/rustc_codegen_llvm/src/type_of.rs
@@ -40,9 +40,7 @@ fn uncached_llvm_type<'a, 'tcx>(
         // FIXME(eddyb) producing readable type names for trait objects can result
         // in problematically distinct types due to HRTB and subtyping (see #47638).
         // ty::Dynamic(..) |
-        ty::Adt(..) | ty::Closure(..) | ty::Foreign(..) | ty::Generator(..) | ty::Str
-            if !cx.sess().fewer_names() =>
-        {
+        ty::Adt(..) | ty::Closure(..) | ty::Foreign(..) | ty::Generator(..) | ty::Str => {
             let mut name = with_no_trimmed_paths(|| layout.ty.to_string());
             if let (&ty::Adt(def, _), &Variants::Single { index }) =
                 (layout.ty.kind(), &layout.variants)
@@ -58,12 +56,6 @@ fn uncached_llvm_type<'a, 'tcx>(
             }
             Some(name)
         }
-        ty::Adt(..) => {
-            // If `Some` is returned then a named struct is created in LLVM. Name collisions are
-            // avoided by LLVM (with increasing suffixes). If rustc doesn't generate names then that
-            // can improve perf.
-            Some(String::new())
-        }
         _ => None,
     };
 

fixes the regression.

I am not sure what to do from there.

cc @eddyb @davidtwco

@eddyb
Member

eddyb commented Nov 23, 2020

Ugh, the choice of names shouldn't cause perf changes. At a glance, this looks like an LLVM bug.

I was going to suggest testing with -Z fewer-names=off, but that doesn't actually do anything (the lack of --emit=llvm-ir means it's forced to "fewer names"; we should maybe fix that).
And you tested the revert anyway, so that's confirmation enough.

@marmeladema if you have the time, one thing you could try is "bisecting" for the name (i.e. choosing between Some(name) and Some(String::new()) based on some property, such as name.len() or name.contains("...") etc.). Presumably the exact type that needs a name to not cause the perf regression could be determined that way.
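
In standalone form, the predicate trick is something like this (just a sketch; in practice the check would go into the match arm from the patch above):

// Sketch of the "bisect by name" idea: keep the descriptive struct name only
// when a predicate matches, otherwise return an empty name so that LLVM
// numbers the type. Tighten the predicate until the offending type is found.
fn llvm_struct_name(pretty: String, keep: impl Fn(&str) -> bool) -> Option<String> {
    if keep(&pretty) { Some(pretty) } else { Some(String::new()) }
}

fn main() {
    // Example predicate: keep names only for types whose path mentions "BitVec".
    let keep = |name: &str| name.contains("BitVec");
    assert_eq!(
        llvm_struct_name("bitvec::prelude::BitVec".into(), &keep).as_deref(),
        Some("bitvec::prelude::BitVec")
    );
    assert_eq!(
        llvm_struct_name("alloc::vec::Vec<u8>".into(), &keep).as_deref(),
        Some("")
    );
}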

Then you pretty much want to diff the output of -Z no-parallel-llvm -C llvm-args=-print-after-all runs with and without names for that one type, and see where optimizations diverge. Or you can start with that kind of diff, if you can figure out a way to ignore most of the differences from having named vs numbered types.

@marmeladema
Contributor Author

marmeladema commented Nov 23, 2020

I have managed to isolate the precise name that needs to be generated to not cause the performance regression:

  • bitvec::prelude::BitVec (exactly)

@marmeladema
Contributor Author

I have tried to decompile BitVec::extend with Ghidra, but it's the first time I've used that tool and I am not really sure what I did, nor whether it's really useful 😄 The only difference I can see in the decompiled code is an additional endless loop when unwrapping the Option<bool>:

--- bitvec.extend.slow.decompiled.c     2020-11-23 23:19:02.971052349 +0000
+++ bitvec.extend.fast.decompiled.c     2020-11-23 23:10:45.442938517 +0000
@@ -99,11 +99,13 @@
   do {
     bVar2 = *param_2;
     if (bVar2 == 2) {
-                    /* WARNING: Subroutine does not return */
       core::panicking::panic
                 (
                  "called `Option::unwrap()` on a `None`valuelibrary/std/src/panicking.rslibrary/std/src/env.rsfailed to get environmentvariable ``: already mutablyborrowedlibrary/std/src/sys_common/thread_info.rsassertion failed:c.borrow().is_none()Rust panics must be rethrownRust cannot catch foreignexceptionslibrary/std/src/sys/unix/rwlock.rs<unnamed>note: run with`RUST_BACKTRACE=1` environment variable to display a backtrace\n\' panicked at\'\', rwlock maximum reader count exceededrwlock read lock would result indeadlockthread panicked while panicking. aborting.\nthread panicked whileprocessing panic.aborting.\n.debug_library/std/src/../../backtrace/src/symbolize/gimli/elf.rs/home/adema/code/rust/library/alloc/src/vec.rslibrary/std/src/path.rsinternal error:entered unreachable code"
                  ,0x2b,&PTR_UINT_0014f3c8);
+      do {
+        invalidInstructionException();
+      } while( true );
     }
     pointer::BitPtr<T>::into_bitslice(*param_1,param_1[1]);
     pointer::BitPtr<T>::into_bitslice(*param_1);

Does that help at all, or am I not looking in the right place?

@the8472
Member

the8472 commented Nov 23, 2020

Are you sure you're looking at the right method? Previously you said most of the time is spent in bitvec::vec::api::<impl bitvec::vec::BitVec<O,T>>::push, not extend.

@marmeladema
Contributor Author

I decompiled this one because that's actually the method that is called in my reproducible code. push shows up in the perf diff, but only because it must be called at some point during extend. At least that's how I understand it, but again, I might be wrong.
@the8472 have you had a chance to look at the MCVE yet?

@bjorn3 bjorn3 removed the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Nov 24, 2020
@pnkfelix pnkfelix added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Nov 24, 2020
@apiraino apiraino added I-prioritize Issue: Indicates that prioritization has been requested for this issue. I-nominated labels Dec 16, 2020
@spastorino
Member

@tmiasko
Contributor

tmiasko commented Sep 28, 2021

LLVM bug report https://bugs.llvm.org/show_bug.cgi?id=51667 and proposed patch https://reviews.llvm.org/D109294.

@marmeladema
Contributor Author

@tmiasko I see the LLVM bug has been resolved. Has our LLVM fork been updated with the fix? If so, should this be closed?

@nikic
Contributor

nikic commented Jan 8, 2022

@marmeladema No, the fix will be pulled in with the LLVM 14 upgrade.

@marmeladema
Contributor Author

Closing this issue now that we are at LLVM 15.
