
Improve performance of gimli dwarf parsing #232

Closed
alexcrichton opened this issue Jul 31, 2019 · 13 comments

@alexcrichton
Member

Today I was playing around with the various smoke benchmarks we have in this repository, and I executed cargo +nightly bench followed by cargo +nightly bench --features gimli-symbolize. The results were pretty interesting! Running them through cargo benchcmp I get:

 name                                 current ns/iter  gimli ns/iter  diff ns/iter    diff %  speedup 
 new                                  13,833           134,499             120,666   872.31%   x 0.10 
 new_unresolved                       5,432            5,490                    58     1.07%   x 0.99 
 new_unresolved_and_resolve_separate  13,896           134,745             120,849   869.67%   x 0.10 
 trace                                2,715            2,801                    86     3.17%   x 0.97 
 trace_and_resolve_callback           4,174            109,484             105,310  2523.00%   x 0.04 
 trace_and_resolve_separate           4,632            116,182             111,550  2408.25%   x 0.04 

Overall it looks like the gimli implementation in this crate is 10-20x slower than libbacktrace at symbolicating symbols in a backtrace.

Some of the top entries in a perf record are:

  18.70%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::line::LineRows<R,Program,Offset>::next_row
   7.89%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] backtrace::symbolize::gimli::resolve
   5.75%  benchmarks-2de6  libc-2.27.so                 [.] _dl_addr
   4.39%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] addr2line::Context<R>::find_frames
   3.90%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::leb128::read::unsigned
   3.66%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::leb128::read::signed
   3.08%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::unit::parse_attribute
   2.84%  benchmarks-2de6  libc-2.27.so                 [.] _int_malloc
   2.81%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::unit::DebuggingInformationEntry<R,Offset>::attr
   2.22%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <std::path::Components as core::iter::traits::iterator::Iterator>::next
   1.69%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <core::str::lossy::Utf8LossyChunksIter as core::iter::traits::iterator::Iterator>::next
   1.56%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] rustc_demangle::legacy::demangle
   1.33%  benchmarks-2de6  [kernel.kallsyms]            [k] __entry_trampoline_start
   1.31%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] <intervaltree::QueryIter<K,V> as core::iter::traits::iterator::Iterator>::next
   1.23%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] core::iter::traits::iterator::Iterator::eq
   1.22%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] std::path::Components::parse_next_component
   1.07%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] rustc_demangle::demangle
   1.00%  benchmarks-2de6  benchmarks-2de6689c5c5a5583  [.] gimli::read::line::LineRow::apply_line_advance

Nothing immediately jumps out as an obvious improvement, but this is something we should address before considering switching to gimli by default.
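For context on the gimli::leb128 entries in the profile above: LEB128 is the variable-length integer encoding DWARF uses pervasively for attributes and line programs, so its decoder is hot by construction. A minimal sketch of an unsigned decoder (not gimli's actual implementation) looks roughly like:

```rust
// Minimal sketch of unsigned LEB128 decoding (DWARF's variable-length
// integer encoding). Returns the decoded value and bytes consumed.
fn read_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
    let mut result: u64 = 0;
    let mut shift = 0;
    for (i, &byte) in buf.iter().enumerate() {
        // Each byte contributes its low 7 bits; the high bit is "continue".
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
        shift += 7;
        if shift >= 64 {
            return None; // value would overflow u64
        }
    }
    None // ran out of input mid-value
}

fn main() {
    // 624485 encodes as [0xe5, 0x8e, 0x26], the classic DWARF spec example.
    assert_eq!(read_uleb128(&[0xe5, 0x8e, 0x26]), Some((624485, 3)));
    assert_eq!(read_uleb128(&[0x7f]), Some((127, 1)));
    println!("ok");
}
```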

cc @fitzgen, I figure you're probably interested in this, and others who work on gimli may be as well. This may well just be a bug in how this crate uses gimli, in which case the fix would be local rather than in gimli itself.

I will say, though, that from briefly looking at libbacktrace it appears to be doing roughly the same thing: traversing the DWARF whenever something is symbolicated.

@fitzgen
Member

fitzgen commented Jul 31, 2019

This is mildly surprising since we previously found that the gimli addr2line was faster than binutils, which is using libbacktrace IIRC.

I wonder if the integration in this crate is not reusing structures enough to fully amortize some of the caching / side tables addr2line internally uses.

cc @philipc

@fitzgen
Member

fitzgen commented Jul 31, 2019

> This is mildly surprising since we previously found that the gimli addr2line was faster than binutils, which is using libbacktrace IIRC.

er wait, that's not what that says at all...

@philipc
Contributor

philipc commented Jul 31, 2019

@fitzgen Yeah, that benchmark result changed with the switch to main--'s code, because it makes a different memory/time tradeoff. However, I wouldn't expect it to be 10x slower. I also haven't benchmarked gimli recently, so maybe something has regressed. I'll definitely look into what's causing this.

@philipc
Contributor

philipc commented Aug 1, 2019

I'm pretty sure binutils doesn't use libbacktrace, and I don't think we've ever benchmarked against libbacktrace since I don't think it has an addr2line clone?

Changing the gimli data structures to store rows instead of sequences helps some:

 name                                 before.txt ns/iter  after.txt ns/iter  diff ns/iter   diff %  speedup 
 new                                  181,292             135,659                 -45,633  -25.17%   x 1.34 
 new_unresolved                       8,264               8,190                       -74   -0.90%   x 1.01 
 new_unresolved_and_resolve_separate  180,402             136,169                 -44,233  -24.52%   x 1.32 
 trace                                4,020               3,356                      -664  -16.52%   x 1.20 
 trace_and_resolve_callback           136,862             91,427                  -45,435  -33.20%   x 1.50 
 trace_and_resolve_separate           156,649             101,124                 -55,525  -35.45%   x 1.55 

Still not close to libbacktrace though. I'll keep investigating.
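The rows-instead-of-sequences change trades memory for lookup speed: rather than re-running the DWARF line-number program for each address, precompute the sorted rows once and binary-search them. A minimal sketch of the idea, with hypothetical names (not gimli's actual types):

```rust
// Hypothetical sketch: store precomputed (address, line) rows sorted by
// address, so each lookup is a binary search instead of a re-execution of
// the DWARF line-number program. Uses more memory, but lookups are fast.
#[derive(Clone, Copy)]
struct Row {
    address: u64,
    line: u32,
}

struct LineTable {
    rows: Vec<Row>, // kept sorted by address
}

impl LineTable {
    fn new(mut rows: Vec<Row>) -> Self {
        rows.sort_by_key(|r| r.address);
        LineTable { rows }
    }

    /// Return the line of the row whose range covers `addr`, if any.
    fn lookup(&self, addr: u64) -> Option<u32> {
        match self.rows.binary_search_by_key(&addr, |r| r.address) {
            Ok(i) => Some(self.rows[i].line),
            Err(0) => None, // addr precedes the first row
            Err(i) => Some(self.rows[i - 1].line),
        }
    }
}

fn main() {
    let table = LineTable::new(vec![
        Row { address: 0x400, line: 10 },
        Row { address: 0x100, line: 1 },
        Row { address: 0x200, line: 5 },
    ]);
    assert_eq!(table.lookup(0x250), Some(5));
    assert_eq!(table.lookup(0x50), None);
    println!("ok");
}
```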

@philipc
Contributor

philipc commented Aug 2, 2019

Things that look like low-hanging fruit:

Inside gimli

Inside this crate

@alexcrichton
Member Author

Thanks for investigating @philipc! That all definitely sounds plausible. For the items that are more related to this crate:

> _dl_addr is being called for the gimli benchmark, but not for the libbacktrace benchmark, which seems strange

If gimli symbolication doesn't turn up any results, we fall back to calling dladdr for an address. Currently this handles the case where the executable has stripped debug information and DWARF finds nothing, but we may be calling it too aggressively in some circumstances, or on symbols that libbacktrace can find in the debug info but gimli doesn't for one reason or another. (Or we're just not searching everything in the gimli implementation that libbacktrace searches.)
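The fallback described here is a simple try-DWARF-first chain; a sketch with stand-in functions (the real crate calls into gimli/addr2line and libc's dladdr):

```rust
// Hypothetical sketch of the fallback order described above: try
// DWARF-based symbolication first, and only fall back to a dladdr-style
// dynamic-symbol lookup when the debug info yields nothing.
fn resolve_dwarf(addr: u64) -> Option<String> {
    // Stand-in for the gimli/addr2line path.
    if addr == 0x1000 { Some("foo".to_string()) } else { None }
}

fn resolve_dladdr(addr: u64) -> Option<String> {
    // Stand-in for the dladdr fallback (dynamic symbol table only).
    Some(format!("sym_{addr:#x}"))
}

fn resolve(addr: u64) -> Option<String> {
    // The fallback only runs when the DWARF lookup finds nothing; calling
    // it more often than that is the over-aggressiveness discussed above.
    resolve_dwarf(addr).or_else(|| resolve_dladdr(addr))
}

fn main() {
    assert_eq!(resolve(0x1000).as_deref(), Some("foo"));
    assert_eq!(resolve(0x2000).as_deref(), Some("sym_0x2000"));
    println!("ok");
}
```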

> we convert function names to and from UTF-8

FWIW, git2 and similar crates often provide both a bytes accessor and a str accessor, and we could just be careful to use the bytes accessors in this crate.
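The bytes-vs-str accessor pattern can be sketched as follows; the type and method names here are illustrative, not this crate's actual API:

```rust
// Hypothetical sketch of the dual-accessor pattern: hand out the raw
// symbol name cheaply, and only pay for UTF-8 validation when the caller
// actually asks for a &str.
struct SymbolName<'a> {
    bytes: &'a [u8],
}

impl<'a> SymbolName<'a> {
    /// Cheap: no validation, no allocation.
    fn as_bytes(&self) -> &'a [u8] {
        self.bytes
    }

    /// Potentially costly: UTF-8 validation on demand.
    fn as_str(&self) -> Option<&'a str> {
        std::str::from_utf8(self.bytes).ok()
    }
}

fn main() {
    let name = SymbolName { bytes: b"_ZN4main4main17h0000000000000000E" };
    // Hot paths (e.g. comparisons) can stay on the bytes accessor.
    assert!(!name.as_bytes().is_empty());
    // Display paths validate once, lazily.
    assert!(name.as_str().is_some());
    println!("ok");
}
```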

> MAPPINGS_CACHE compares PathBufs, but this comparison is showing up in the profile. Is it enough to compare the path bytes?

This is likely pretty easy to fix. I don't think the cache lookup has really been optimized at all, and we could almost certainly make it much more efficient, perhaps even avoiding PathBuf entirely. In any case, comparing bytes is likely all we need here since we're getting all the paths from the same sources.

> goblin symbol iteration is inefficient

If that's the case, this is probably something we should cache on our end; I wasn't sure whether it was when I first looked into it!

@philipc
Contributor

philipc commented Aug 5, 2019

I'll work on the gimli side of this. It also looks like the interval tree we use for inlined functions is slow, but I'm still investigating that.

alexcrichton added a commit that referenced this issue Aug 7, 2019
Instead of relying on lazy iteration, iterate over all symbols
up front and then sort them by address. We already do this on Windows,
and symbol iteration looks somewhat slow in benchmarks for goblin.
By sorting we should also be able to run through these lists
much more quickly!

Overall on top of #235 this yields another 30% speedup or so for
symbolication.

cc #232
alexcrichton added a commit that referenced this issue Aug 7, 2019
This commit updates the gimli implementation to iterate over all shared
libraries only once, caching information about them to be processed
later. This information is usually static for the lifetime of the
program, and this change improves the performance of resolving a
backtrace significantly in some situations (by up to 30%).

The organization of the gimli backend has been updated to reflect this
cache where `Cache` is now a dedicated type with methods hanging off it.
Some helper code had to be duplicated from the `findshlibs` crate to do
this entirely.

cc #232
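The one-time enumeration described in the commit above can be sketched with a process-lifetime cache; the types and the hard-coded library list here are stand-ins, not the crate's actual `Cache`:

```rust
// Hypothetical sketch of the shared-library cache idea: enumerate loaded
// libraries once, keep the result for the lifetime of the process, and
// route every symbolication request through the cached list.
use std::sync::OnceLock;

#[derive(Debug)]
struct Library {
    name: String,
    base_address: usize,
}

fn cache() -> &'static Vec<Library> {
    static CACHE: OnceLock<Vec<Library>> = OnceLock::new();
    CACHE.get_or_init(|| {
        // The real crate walks the loaded shared libraries here (e.g. via
        // dl_iterate_phdr); a hard-coded list stands in for that walk.
        vec![Library {
            name: "libc.so.6".into(),
            base_address: 0x7f00_0000,
        }]
    })
}

fn main() {
    // The expensive enumeration runs only on the first call...
    let libs = cache();
    assert_eq!(libs.len(), 1);
    // ...and later calls return the exact same cached allocation.
    assert!(std::ptr::eq(libs, cache()));
    println!("ok");
}
```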
@alexcrichton
Member Author

Ok, I made some ticky boxes on #232 (comment) and also merged a few PRs related to this crate itself. What I'm seeing now is:

 name                                 0.3.34 ns/iter  master ns/iter  diff ns/iter   diff %  speedup 
 new                                  144,725         67,446               -77,279  -53.40%   x 2.15 
 new_unresolved_and_resolve_separate  134,998         67,445               -67,553  -50.04%   x 2.00 
 trace_and_resolve_callback           106,756         54,770               -51,986  -48.70%   x 1.95 
 trace_and_resolve_separate           113,928         51,604               -62,324  -54.70%   x 2.21 

which is great progress! Still a ways to go to catch up to libbacktrace:

 name                                 libbacktrace ns/iter  master ns/iter  diff ns/iter    diff %  speedup 
 new                                  12,506                67,446                54,940   439.31%   x 0.19 
 new_unresolved_and_resolve_separate  12,625                67,445                54,820   434.22%   x 0.19 
 trace_and_resolve_callback           4,124                 54,770                50,646  1228.08%   x 0.08 
 trace_and_resolve_separate           4,523                 51,604                47,081  1040.92%   x 0.09 

@philipc
Contributor

philipc commented Aug 8, 2019

Ok great, with those changes plus my current hacked up gimli changes I can get that to:

 name                                 libbacktrace ns/iter  gimli ns/iter  diff ns/iter   diff %  speedup 
 new                                  24,320                16,692               -7,628  -31.37%   x 1.46 
 new_unresolved                       8,236                 8,153                   -83   -1.01%   x 1.01 
 new_unresolved_and_resolve_separate  18,114                14,564               -3,550  -19.60%   x 1.24 
 trace                                3,416                 3,388                   -28   -0.82%   x 1.01 
 trace_and_resolve_callback           7,247                 6,665                  -582   -8.03%   x 1.09 
 trace_and_resolve_separate           5,894                 8,023                 2,129   36.12%   x 0.73 

@alexcrichton
Member Author

Holy cow, very nice! That's basically everything :)

@philipc
Contributor

philipc commented Aug 15, 2019

Just an update on my progress.

The backtrace-rs benchmarks don't cover the cost of building the caches. So while my gimli changes did make repeated lookups faster, building the cache is slower and uses more memory. I also don't have benchmarks of this for libbacktrace, so I'm not sure how we compare there.

I'm still playing around with the gimli attribute parsing. I know we can do better there, but the challenge is keeping an API that is easy to use and not too low level.

@alexcrichton
Member Author

No worries, and thanks!

I think that for this crate itself that tradeoff is fine, and if it's something this crate opts into specifically, I think that's also fine.

@alexcrichton
Member Author

alexcrichton commented Jan 13, 2020

Current numbers as of #273 are:

 name                                 baseline ns/iter  gimli ns/iter  diff ns/iter  diff %  speedup
 new                                  12,223            13,732                1,509  12.35%   x 0.89
 new_unresolved                       5,693             5,722                    29   0.51%   x 0.99
 new_unresolved_and_resolve_separate  12,257            13,731                1,474  12.03%   x 0.89
 trace                                2,938             2,915                   -23  -0.78%   x 1.01
 trace_and_resolve_callback           4,373             5,193                   820  18.75%   x 0.84
 trace_and_resolve_separate           4,813             5,993                 1,180  24.52%   x 0.80

Thanks so much for all your help here @philipc, I think we're basically done now!

To get a better idea, this is a benchmark of gimli roughly when I opened this issue vs now:

 name                                 gimli-before ns/iter  gimli ns/iter  diff ns/iter   diff %  speedup
 new                                  123,536               13,732             -109,804  -88.88%   x 9.00
 new_unresolved                       5,687                 5,722                    35    0.62%   x 0.99
 new_unresolved_and_resolve_separate  121,364               13,731             -107,633  -88.69%   x 8.84
 trace                                2,913                 2,915                     2    0.07%   x 1.00
 trace_and_resolve_callback           107,671               5,193              -102,478  -95.18%  x 20.73
 trace_and_resolve_separate           111,442               5,993              -105,449  -94.62%  x 18.60

Nothing like some 10-20x speedups :)
