
Improve performance of gimli dwarf parsing #232

Closed
alexcrichton opened this issue Jul 31, 2019 · 13 comments

@alexcrichton
Member

Today I was playing around with the various smoke benchmarks we have in this repository, and I executed cargo +nightly bench followed by cargo +nightly bench --features gimli-symbolize. The results were pretty interesting! Running them through cargo benchcmp I get:

 name                                 current ns/iter  gimli ns/iter  diff ns/iter    diff %  speedup 
 new                                  13,833           134,499             120,666   872.31%   x 0.10 
 new_unresolved                       5,432            5,490                    58     1.07%   x 0.99 
 new_unresolved_and_resolve_separate  13,896           134,745             120,849   869.67%   x 0.10 
 trace                                2,715            2,801                    86     3.17%   x 0.97 
 trace_and_resolve_callback           4,174            109,484             105,310  2523.00%   x 0.04 
 trace_and_resolve_separate           4,632            116,182             111,550  2408.25%   x 0.04 

Overall it looks like the gimli implementation in this crate is 10-20x slower than libbacktrace at symbolicating symbols in a backtrace.

Some of the top entries in a perf record are:

  18.70%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::line::LineRows<R,Program,Offset>::next_row
   7.89%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] backtrace::symbolize::gimli::resolve
   5.75%  benchmarks-2de6  libc-2.27.so                 [.] _dl_addr
   4.39%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] addr2line::Context<R>::find_frames
   3.90%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::leb128::read::unsigned
   3.66%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::leb128::read::signed
   3.08%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::unit::parse_attribute
   2.84%  benchmarks-2de6  libc-2.27.so                 [.] _int_malloc
   2.81%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::unit::DebuggingInformationEntry<R,Offset>::attr
   2.22%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <std::path::Components as core::iter::traits::iterator::Iterator>::next
   1.69%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <core::str::lossy::Utf8LossyChunksIter as core::iter::traits::iterator::Iterator>::next
   1.56%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] rustc_demangle::legacy::demangle
   1.33%  benchmarks-2de6  [kernel.kallsyms]            [k] __entry_trampoline_start
   1.31%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <intervaltree::QueryIter<K,V> as core::iter::traits::iterator::Iterator>::next
   1.23%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] core::iter::traits::iterator::Iterator::eq
   1.22%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] std::path::Components::parse_next_component
   1.07%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] rustc_demangle::demangle
   1.00%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::line::LineRow::apply_line_advance

Nothing immediately jumps out as an obvious improvement, but this is something we should address before considering switching to gimli by default.
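For context on the gimli::leb128 entries in the profile above: LEB128 is the variable-length integer encoding DWARF uses pervasively for attributes and line programs, so its decoder is hot by construction. A minimal sketch of an unsigned decoder (not gimli's actual implementation) looks roughly like:

```rust
// Minimal sketch of unsigned LEB128 decoding (DWARF's variable-length
// integer encoding). Returns the decoded value and bytes consumed.
fn read_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
    let mut result: u64 = 0;
    let mut shift = 0;
    for (i, &byte) in buf.iter().enumerate() {
        // Each byte contributes its low 7 bits; the high bit is "continue".
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
        shift += 7;
        if shift >= 64 {
            return None; // value would overflow u64
        }
    }
    None // ran out of input mid-value
}

fn main() {
    // 624485 encodes as [0xe5, 0x8e, 0x26], the classic DWARF spec example.
    assert_eq!(read_uleb128(&[0xe5, 0x8e, 0x26]), Some((624485, 3)));
    assert_eq!(read_uleb128(&[0x7f]), Some((127, 1)));
    println!("ok");
}
```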

cc @fitzgen, I figure you're probably interested in this, and others who work on gimli may be as well. This may well just be a bug in how this crate uses gimli, in which case the fix would be local rather than in gimli itself.

I will say, though, that from briefly looking at libbacktrace it appears to be doing roughly the same thing: traversing the DWARF whenever something is symbolicated.

@fitzgen
Member

fitzgen commented Jul 31, 2019

This is mildly surprising since we previously found that the gimli addr2line was faster than binutils, which is using libbacktrace IIRC.

I wonder if the integration in this crate is not reusing structures enough to fully amortize some of the caching / side tables addr2line internally uses.

cc @philipc

@fitzgen
Member

fitzgen commented Jul 31, 2019

> This is mildly surprising since we previously found that the gimli addr2line was faster than binutils, which is using libbacktrace IIRC.

er wait, that's not what that says at all...

@philipc
Contributor

philipc commented Jul 31, 2019

@fitzgen Yeah, that benchmark result changed with the switch to main--'s code, because it makes a different memory/time tradeoff. However, I wouldn't expect it to be 10x slower. I also haven't benchmarked gimli recently, so maybe something has regressed. I'll definitely look into what's causing this.

@philipc
Contributor

philipc commented Aug 1, 2019

I'm pretty sure binutils doesn't use libbacktrace, and I don't think we've ever benchmarked against libbacktrace since I don't think it has an addr2line clone?

Changing the gimli data structures to store rows instead of sequences helps some:

 name                                 before.txt ns/iter  after.txt ns/iter  diff ns/iter   diff %  speedup 
 new                                  181,292             135,659                 -45,633  -25.17%   x 1.34 
 new_unresolved                       8,264               8,190                       -74   -0.90%   x 1.01 
 new_unresolved_and_resolve_separate  180,402             136,169                 -44,233  -24.52%   x 1.32 
 trace                                4,020               3,356                      -664  -16.52%   x 1.20 
 trace_and_resolve_callback           136,862             91,427                  -45,435  -33.20%   x 1.50 
 trace_and_resolve_separate           156,649             101,124                 -55,525  -35.45%   x 1.55 

Still not close to libbacktrace though. I'll keep investigating.
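The rows-instead-of-sequences change trades memory for lookup speed: rather than re-running the DWARF line-number program for each address, precompute the sorted rows once and binary-search them. A minimal sketch of the idea, with hypothetical names (not gimli's actual types):

```rust
// Hypothetical sketch: store precomputed (address, line) rows sorted by
// address, so each lookup is a binary search instead of a re-execution of
// the DWARF line-number program. Uses more memory, but lookups are fast.
#[derive(Clone, Copy)]
struct Row {
    address: u64,
    line: u32,
}

struct LineTable {
    rows: Vec<Row>, // kept sorted by address
}

impl LineTable {
    fn new(mut rows: Vec<Row>) -> Self {
        rows.sort_by_key(|r| r.address);
        LineTable { rows }
    }

    /// Return the line of the row whose range covers `addr`, if any.
    fn lookup(&self, addr: u64) -> Option<u32> {
        match self.rows.binary_search_by_key(&addr, |r| r.address) {
            Ok(i) => Some(self.rows[i].line),
            Err(0) => None, // addr precedes the first row
            Err(i) => Some(self.rows[i - 1].line),
        }
    }
}

fn main() {
    let table = LineTable::new(vec![
        Row { address: 0x400, line: 10 },
        Row { address: 0x100, line: 1 },
        Row { address: 0x200, line: 5 },
    ]);
    assert_eq!(table.lookup(0x250), Some(5));
    assert_eq!(table.lookup(0x50), None);
    println!("ok");
}
```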

@philipc
Contributor

philipc commented Aug 2, 2019

Things that look like low-hanging fruit:

Inside gimli

Inside this crate

@alexcrichton
Member Author

Thanks for investigating @philipc! That all definitely sounds plausible. For the items that are more related to this crate:

> _dl_addr is being called for the gimli benchmark, but not for the libbacktrace benchmark, which seems strange

If gimli symbolication doesn't turn up any results, we fall back to calling dladdr for an address. Currently this handles the case where the executable has stripped debug information and DWARF finds nothing, but we may be calling it too aggressively in some circumstances, or on symbols that libbacktrace can find in the debug info but gimli doesn't for one reason or another. (Or we're just not searching everything in the gimli implementation that libbacktrace searches.)
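The fallback described here is a simple try-DWARF-first chain; a sketch with stand-in functions (the real crate calls into gimli/addr2line and libc's dladdr):

```rust
// Hypothetical sketch of the fallback order described above: try
// DWARF-based symbolication first, and only fall back to a dladdr-style
// dynamic-symbol lookup when the debug info yields nothing.
fn resolve_dwarf(addr: u64) -> Option<String> {
    // Stand-in for the gimli/addr2line path.
    if addr == 0x1000 { Some("foo".to_string()) } else { None }
}

fn resolve_dladdr(addr: u64) -> Option<String> {
    // Stand-in for the dladdr fallback (dynamic symbol table only).
    Some(format!("sym_{addr:#x}"))
}

fn resolve(addr: u64) -> Option<String> {
    // The fallback only runs when the DWARF lookup finds nothing; calling
    // it more often than that is the over-aggressiveness discussed above.
    resolve_dwarf(addr).or_else(|| resolve_dladdr(addr))
}

fn main() {
    assert_eq!(resolve(0x1000).as_deref(), Some("foo"));
    assert_eq!(resolve(0x2000).as_deref(), Some("sym_0x2000"));
    println!("ok");
}
```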

> we convert function names to and from UTF-8

FWIW, git2 and similar crates often provide both a bytes accessor and a str accessor, and we could just be careful to use the bytes accessors in this crate.
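The bytes-vs-str accessor pattern can be sketched as follows; the type and method names here are illustrative, not this crate's actual API:

```rust
// Hypothetical sketch of the dual-accessor pattern: hand out the raw
// symbol name cheaply, and only pay for UTF-8 validation when the caller
// actually asks for a &str.
struct SymbolName<'a> {
    bytes: &'a [u8],
}

impl<'a> SymbolName<'a> {
    /// Cheap: no validation, no allocation.
    fn as_bytes(&self) -> &'a [u8] {
        self.bytes
    }

    /// Potentially costly: UTF-8 validation on demand.
    fn as_str(&self) -> Option<&'a str> {
        std::str::from_utf8(self.bytes).ok()
    }
}

fn main() {
    let name = SymbolName { bytes: b"_ZN4main4main17h0000000000000000E" };
    // Hot paths (e.g. comparisons) can stay on the bytes accessor.
    assert!(!name.as_bytes().is_empty());
    // Display paths validate once, lazily.
    assert!(name.as_str().is_some());
    println!("ok");
}
```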

> MAPPINGS_CACHE compares PathBufs, but this comparison is showing up in the profile. Is it enough to compare the path bytes?

This is likely pretty easy to fix. I don't think the cache lookup has really been optimized at all, and we could almost certainly make it much more efficient, perhaps even avoiding PathBuf entirely. In any case, comparing bytes is likely all we need here since we're getting all the paths from the same sources.

> goblin symbol iteration is inefficient

If that's the case, this is probably something we should cache on our end; I wasn't sure whether it was when I first looked into it!

@philipc
Contributor

philipc commented Aug 5, 2019

I'll work on the gimli side of this. It also looks like the interval tree we use for inlined functions is slow, but I'm still investigating that.

alexcrichton added a commit that referenced this issue Aug 7, 2019
Instead of relying on lazy iteration, iterate over all symbols
up front and then sort them by address. We already do this on Windows,
and symbol iteration looks somewhat slow in benchmarks for goblin.
By sorting we should also be able to run through these lists
much more quickly!

Overall on top of #235 this yields another 30% speedup or so for
symbolication.

cc #232
alexcrichton added a commit that referenced this issue Aug 7, 2019
This commit updates the gimli implementation to iterate over all shared
libraries only once, caching information about them to be processed
later. This information is usually static for the lifetime of the
program, and this change improves the performance of resolving a
backtrace significantly in some situations (by up to 30%).

The organization of the gimli backend has been updated to reflect this
cache where `Cache` is now a dedicated type with methods hanging off it.
Some helper code had to be duplicated from the `findshlibs` crate to do
this entirely.

cc #232
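The one-time enumeration described in the commit above can be sketched with a process-lifetime cache; the types and the hard-coded library list here are stand-ins, not the crate's actual `Cache`:

```rust
// Hypothetical sketch of the shared-library cache idea: enumerate loaded
// libraries once, keep the result for the lifetime of the process, and
// route every symbolication request through the cached list.
use std::sync::OnceLock;

#[derive(Debug)]
struct Library {
    name: String,
    base_address: usize,
}

fn cache() -> &'static Vec<Library> {
    static CACHE: OnceLock<Vec<Library>> = OnceLock::new();
    CACHE.get_or_init(|| {
        // The real crate walks the loaded shared libraries here (e.g. via
        // dl_iterate_phdr); a hard-coded list stands in for that walk.
        vec![Library {
            name: "libc.so.6".into(),
            base_address: 0x7f00_0000,
        }]
    })
}

fn main() {
    // The expensive enumeration runs only on the first call...
    let libs = cache();
    assert_eq!(libs.len(), 1);
    // ...and later calls return the exact same cached allocation.
    assert!(std::ptr::eq(libs, cache()));
    println!("ok");
}
```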
@alexcrichton
Member Author

Ok, I made some ticky boxes on #232 (comment) and also merged a few PRs related to this crate itself. What I'm seeing now is:

 name                                 0.3.34 ns/iter  master ns/iter  diff ns/iter   diff %  speedup 
 new                                  144,725         67,446               -77,279  -53.40%   x 2.15 
 new_unresolved_and_resolve_separate  134,998         67,445               -67,553  -50.04%   x 2.00 
 trace_and_resolve_callback           106,756         54,770               -51,986  -48.70%   x 1.95 
 trace_and_resolve_separate           113,928         51,604               -62,324  -54.70%   x 2.21 

which is great progress! Still a ways to go to catch up to libbacktrace:

 name                                 libbacktrace ns/iter  master ns/iter  diff ns/iter    diff %  speedup 
 new                                  12,506                67,446                54,940   439.31%   x 0.19 
 new_unresolved_and_resolve_separate  12,625                67,445                54,820   434.22%   x 0.19 
 trace_and_resolve_callback           4,124                 54,770                50,646  1228.08%   x 0.08 
 trace_and_resolve_separate           4,523                 51,604                47,081  1040.92%   x 0.09 

@philipc
Contributor

philipc commented Aug 8, 2019

Ok great, with those changes plus my current hacked up gimli changes I can get that to:

 name                                 libbacktrace ns/iter  gimli ns/iter  diff ns/iter   diff %  speedup 
 new                                  24,320                16,692               -7,628  -31.37%   x 1.46 
 new_unresolved                       8,236                 8,153                   -83   -1.01%   x 1.01 
 new_unresolved_and_resolve_separate  18,114                14,564               -3,550  -19.60%   x 1.24 
 trace                                3,416                 3,388                   -28   -0.82%   x 1.01 
 trace_and_resolve_callback           7,247                 6,665                  -582   -8.03%   x 1.09 
 trace_and_resolve_separate           5,894                 8,023                 2,129   36.12%   x 0.73 

@alexcrichton
Member Author

Holy cow, very nice! That's basically everything :)

@philipc
Contributor

philipc commented Aug 15, 2019

Just an update on my progress.

The backtrace-rs benchmarks don't cover the cost of building the caches. So while my gimli changes did make repeated lookups faster, building the cache is slower and uses more memory. I also don't have benchmarks of this for libbacktrace, so I'm not sure how we compare there.

I'm still playing around with the gimli attribute parsing. I know we can do better there, but the challenge is keeping an API that is easy to use and not too low level.

@alexcrichton
Member Author

No worries, and thanks!

I think that for this crate itself that tradeoff is fine, and if it's something this crate opts into specifically, I think that's also fine.

@alexcrichton
Member Author

alexcrichton commented Jan 13, 2020

Current numbers as of #273 are:

 name                                 baseline ns/iter  gimli ns/iter  diff ns/iter  diff %  speedup
 new                                  12,223            13,732                1,509  12.35%   x 0.89
 new_unresolved                       5,693             5,722                    29   0.51%   x 0.99
 new_unresolved_and_resolve_separate  12,257            13,731                1,474  12.03%   x 0.89
 trace                                2,938             2,915                   -23  -0.78%   x 1.01
 trace_and_resolve_callback           4,373             5,193                   820  18.75%   x 0.84
 trace_and_resolve_separate           4,813             5,993                 1,180  24.52%   x 0.80

Thanks so much for all your help here @philipc, I think we're basically done now!

To get a better idea, this is a benchmark of gimli roughly when I opened this issue vs now:

 name                                 gimli-before ns/iter  gimli ns/iter  diff ns/iter   diff %  speedup
 new                                  123,536               13,732             -109,804  -88.88%   x 9.00
 new_unresolved                       5,687                 5,722                    35    0.62%   x 0.99
 new_unresolved_and_resolve_separate  121,364               13,731             -107,633  -88.69%   x 8.84
 trace                                2,913                 2,915                     2    0.07%   x 1.00
 trace_and_resolve_callback           107,671               5,193              -102,478  -95.18%  x 20.73
 trace_and_resolve_separate           111,442               5,993              -105,449  -94.62%  x 18.60

Nothing like some 10-20x speedups :)
