Add image cache to avoid calling dladdr() and add libbacktrace elf image list refresh #674

tiago-rodrigues · 2023-11-29T18:39:37Z

On glibc platforms, add an image cache to avoid calling dladdr() when doing offline symbol resolution, as dladdr() can cause contention with other dl* calls (see: #665 (comment)).

Add support for libbacktrace to detect that new ELF images have been dynamically loaded after backtrace_initialize() has been called, and to consider them for further symbol resolution.
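As an illustration of the idea only (not the code from this PR), a minimal sketch of an image snapshot on glibc could walk the loaded objects with dl_iterate_phdr() and record one address range per image, so that later lookups need no dladdr() call. The names LoadedImage, CollectImage, SnapshotLoadedImages and FindImage are hypothetical, and the range is assumed to span all PT_LOAD segments of an object.

```cpp
#include <link.h>     // dl_iterate_phdr, dl_phdr_info, ElfW, PT_LOAD (glibc)
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record type; the ImageEntry used in the PR may differ.
struct LoadedImage
{
    uintptr_t start;
    uintptr_t end;      // one past the last mapped byte
    std::string path;   // empty for the main executable on glibc
};

// Called by dl_iterate_phdr() once per loaded object.
static int CollectImage( dl_phdr_info* info, size_t, void* data )
{
    uintptr_t lo = UINTPTR_MAX, hi = 0;
    for( int i = 0; i < info->dlpi_phnum; i++ )
    {
        const ElfW(Phdr)& ph = info->dlpi_phdr[i];
        if( ph.p_type != PT_LOAD ) continue;
        const uintptr_t segStart = info->dlpi_addr + ph.p_vaddr;
        const uintptr_t segEnd = segStart + ph.p_memsz;
        if( segStart < lo ) lo = segStart;
        if( segEnd > hi ) hi = segEnd;
    }
    if( lo < hi )
    {
        auto* out = static_cast<std::vector<LoadedImage>*>( data );
        // Copy the name: the pointer is only guaranteed valid during the callback.
        out->push_back( { lo, hi, info->dlpi_name ? info->dlpi_name : "" } );
    }
    return 0;   // a non-zero return would stop the iteration
}

// Takes a snapshot of every image currently mapped into the process.
std::vector<LoadedImage> SnapshotLoadedImages()
{
    std::vector<LoadedImage> images;
    dl_iterate_phdr( CollectImage, &images );
    return images;
}

// Linear lookup, kept simple on purpose; returns nullptr on a miss.
const LoadedImage* FindImage( const std::vector<LoadedImage>& images, const void* address )
{
    const auto addr = reinterpret_cast<uintptr_t>( address );
    for( const auto& img : images )
        if( addr >= img.start && addr < img.end ) return &img;
    return nullptr;
}
```

A miss in FindImage() is the natural point at which to take a fresh snapshot, since it may mean a library was loaded after the previous one.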

YaLTeR · 2023-12-02T05:11:31Z

Gave this a try and it seems to work very well! I am able to record GNOME Shell with offline symbol resolving, seemingly without much extra lag.

wolfpld · 2023-12-03T00:22:33Z

public/client/TracyCallstack.cpp

+    }
+};
+#endif //#ifdef TRACY_USE_IMAGE_CACHE
+
 namespace tracy


Is there a reason for keeping the above code outside the tracy namespace?

not really, moved it under

wolfpld · 2023-12-03T00:23:44Z

public/client/TracyCallstack.cpp

+        if( !entry )
+        {
+            //printf("* addr not found: %p (%d entries) refreshing (m_numberOfRefreshes: %d)\n",
+            //       address, m_images->size(), m_numberOfRefreshes);


Remove debug code or convert to TracyDebug().

wolfpld · 2023-12-03T00:25:18Z

public/client/TracyCallstack.cpp

+
+        if( it != m_images->end() && address < it->m_endAddress )
+        {
+            return &(*it);


Why not return it?
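For context, a minimal sketch of the lookup this hunk belongs to, assuming entries sorted by descending start address (the comparator shown later in this review) and a binary search for the first entry whose start is not above the queried address. Range, SortDescending and FindRange are hypothetical names, and std::vector stands in for Tracy's own container.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical half-open address range [start, end).
struct Range { uintptr_t start; uintptr_t end; };

// Sort by descending start address.
void SortDescending( std::vector<Range>& ranges )
{
    std::sort( ranges.begin(), ranges.end(),
        []( const Range& l, const Range& r ) { return l.start > r.start; } );
}

// Returns the range containing addr, or nullptr if there is none.
const Range* FindRange( const std::vector<Range>& ranges, uintptr_t addr )
{
    // In descending order, lower_bound yields the first element with start <= addr.
    auto it = std::lower_bound( ranges.begin(), ranges.end(), addr,
        []( const Range& r, uintptr_t a ) { return r.start > a; } );
    if( it != ranges.end() && addr < it->end ) return &*it;
    return nullptr;
}
```

With std::vector the `&*it` conversion is needed because its iterator is not guaranteed to be a raw pointer. If the container's iterator is a plain pointer, which the later commit title "simplify return from tracy::FastVector iterator" suggests is the case here, the iterator can be returned directly.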

wolfpld · 2023-12-03T00:32:53Z

CMakeLists.txt

@@ -86,6 +86,7 @@ set_option(TRACY_NO_CRASH_HANDLER "Disable crash handling" OFF)
 set_option(TRACY_TIMER_FALLBACK "Use lower resolution timers" OFF)
 set_option(TRACE_CLIENT_LIBUNWIND_BACKTRACE "Use libunwind backtracing where supported" OFF)
 set_option(TRACY_SYMBOL_OFFLINE_RESOLVE "Instead of full runtime symbol resolution, only resolve the image path and offset to enable offline symbol resolution" OFF)
+set_option(TRACY_ENABLE_IMAGE_CACHE "On glibc platforms, when doing offline symbol resolution, use an cache to determine the image path and offset instead of dladdr()" OFF)


What is the reason for keeping this as an option?

The description of that option should have included the runtime resolve case. For offline symbol resolving, having the image cache be used by default sounds ok to me.
I wasn't sure what to do with the runtime resolve case, where the image cache is used to detect that new images were loaded and trigger libbacktrace state recreation. Maybe the option should be renamed to something like "TRACY_ENABLE_LIBBACKTRACE_DYNAMIC_LOADED_IMAGES" and only control enabling that code for the runtime resolve case. It could then be disabled by default, as maybe not too many users need it (if they don't do dynamic loading of shared libs), and also because it comes with libbacktrace state leaking?

Every program, even helloworldware, will do dynamic loading of shared libs. For example, when I put a breakpoint on elf_add in libbacktrace in the tracy test application, I see hits on /proc/self/exe, /usr/lib/libdebuginfod.so.1, /usr/lib/libstdc++.so.6, /usr/lib/libm.so.6, /usr/lib/libgcc_s.so.1, /usr/lib/libc.so.6, /usr/lib/libcurl.so.4, /usr/lib/libelf.so.1, /lib64/ld-linux-x86-64.so.2, and so on.

The call stack for each of these library loads is:

#0 tracy::elf_add (state=0x7fffee270080, filename=0x7fffee260120 "/usr/lib/libgcc_s.so.1", descriptor=232, memory=0x0, memory_size=0, base_address=140737350017024, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0, fileline_fn=0x7fffef27e608, found_sym=0x7fffef27e754, found_dwarf=0x7fffef27e604, fileline_entry=0x0, exe=0, debuginfo=0, with_buildid_data=0x0, with_buildid_size=0) at ../public/libbacktrace/elf.cpp:6578
#1 0x000055555560f2a1 in tracy::phdr_callback (info=0x7fffee3c00d0, pdata=0x7fffef27e708) at ../public/libbacktrace/elf.cpp:7411
#2 0x000055555560c80c in tracy::backtrace_initialize (state=0x7fffee270080, filename=0x555555633d60 "/proc/self/exe", descriptor=232, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0, fileline_fn=0x7fffef27e7f8) at ../public/libbacktrace/elf.cpp:7459
#3 0x0000555555609bc8 in tracy::fileline_initialize (state=0x7fffee270080, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0) at ../public/libbacktrace/fileline.cpp:264
#4 0x000055555560340d in tracy::backtrace_pcinfo (state=0x7fffee270080, pc=93824992316766, callback=0x555555603780 <tracy::CallstackDataCb(void*, unsigned long, unsigned long, char const*, int, char const*)>, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0) at ../public/libbacktrace/fileline.cpp:298
#5 0x00005555555f2eea in tracy::DecodeCallstackPtr (ptr=93824992316766) at ../public/client/TracyCallstack.cpp:1075
#6 0x00005555555ede59 in tracy::Profiler::HandleSymbolQueueItem (this=0x555555644700 <tracy::s_profiler>, si=...) at ../public/client/TracyProfiler.cpp:3266
#7 0x00005555555f4b3a in tracy::Profiler::SymbolWorker (this=0x555555644700 <tracy::s_profiler>) at ../public/client/TracyProfiler.cpp:3388
#8 0x000055555562afc5 in tracy::Profiler::LaunchSymbolWorker (ptr=0x555555644700 <tracy::s_profiler>) at ../public/common/../client/TracyProfiler.hpp:796
#9 0x000055555562c6ad in tracy::Thread::Launch (ptr=0x7ffff63600e0) at ../public/common/../client/TracyThread.hpp:80

Note that DecodeCallstackPtr is on the stack here, so I'm not sure of your reasoning in that sentence:

This cache can also be used in the runtime symbol resolution case to detect we should recreate "backtrace_state" when new images have been loaded, as libbacktrace symbol resolution will only consider images loaded before the first symbol resolve operation.

Can you tell me why you think the libbacktrace state should be reset when a new library appears? The trace above shows that the library should handle this out of the box.

Hm, that is strange. I'll unoptimize the code and do some stepping again in my case to make sure what I was seeing was actually correct (I'll hopefully get to it later today). But after the first symbol resolve was hit, I was missing symbol resolving for shared libraries loaded after that. Looking through the libbacktrace code, I could see it was using backtrace_state::fileline_fn as the control value indicating that backtrace_initialize() (which is responsible for the shared library indexing) had already been called, and that it was set to something non-null after the call. That doesn't appear to be the case for you though; maybe I misinterpreted the results in the debugger with optimized code and something else was causing the issues.

I have to preface this comment by admitting that the code path I've been testing with our main app is still using a 0.9.1 client code branch with extra cherry-picks on top (we haven't been able to move all workflows to 0.10 just yet), so it's not really the same test as the one you did with HEAD code (I'll try to do this with HEAD code tomorrow). However, I don't see many changes in the libbacktrace code that would have changed this behavior.
I only see it hit fileline_initialize and then backtrace_initialize once, on the first resolve, because it then sets state->fileline_fn at the end of fileline_initialize(), and that prevents it from calling backtrace_initialize on any subsequent calls.
Do you recall whether something changed between 0.9.1 and HEAD that would have allowed backtrace_initialize to be called multiple times?

So as I understand it, the problem you have is that shared objects that are dlopened at a later time are not visible because libbacktrace has already collected a list of shared objects that the executable is linking to and will not look for more

exactly.
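The behavior agreed on here can be modeled with a toy sketch: a state object that indexes the loader's image list once, on the first resolve, and never looks again. This is not libbacktrace code. MockState, LoaderView and CanResolve are hypothetical, and the `initialized` flag only plays the role that a non-null fileline_fn is described as playing above.

```cpp
#include <string>
#include <vector>

// Stand-in for the dynamic loader's current list of mapped objects (hypothetical).
using LoaderView = std::vector<std::string>;

struct MockState
{
    bool initialized = false;          // plays the role of a non-null fileline_fn
    std::vector<std::string> known;    // images indexed at initialization time
};

// Indexes the images on the first call only; later calls skip the scan.
bool CanResolve( MockState& state, const LoaderView& loader, const std::string& image )
{
    if( !state.initialized )
    {
        state.known = loader;
        state.initialized = true;
    }
    for( const auto& k : state.known )
        if( k == image ) return true;
    return false;
}
```

In this model, an image appended to the loader view after the first CanResolve() call is never found by the same state, while a freshly created state finds it. The two remedies discussed in this thread correspond to recreating the state or rerunning the scan.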

This feels like something libbacktrace should handle itself (in an ideal world).

I guess before I started looking more closely at the code I didn't realize you had already modified that libbacktrace code quite a bit and it wasn't just a dump from upstream.

you already know when you want to reset libbacktrace state. Have you tried running dl_iterate_phdr to call phdr_callback instead?

Yes, libbacktrace code is already doing pretty much what I was doing once, so I can try to expose an update method I can call from outside - I'll try to do this this week and update this PR.

I guess before I started looking more closely at the code I didn't realize you had already modified that libbacktrace code quite a bit and it wasn't just a dump from upstream.

There have been some changes that add some sorely missing features, but there have been no substantial changes. The changes to dl_iterate_phdr are technical: 7e8961d

Yes, libbacktrace code is already doing pretty much what I was doing once, so I can try to expose an update method I can call from outside - I'll try to do this this week and update this PR.

Please check if it would be feasible to implement this update functionality within libbacktrace itself.

The changes to dl_iterate_phdr are technical: 7e8961d

I think this was needed because elf_add can now call debuginfod, which might issue network requests, and this didn't play well with the requirements of dl_iterate_phdr.

Please check if it would be feasible to implement this update functionality within libbacktrace itself.

I made an attempt at it in the latest commit

wolfpld · 2023-12-06T15:44:57Z

public/libbacktrace/dwarf.cpp

+       return ret;
+    }
+
+    // if we failed to obtain an entry in range, it can mean that the address map has been cahnges and new entries


wolfpld · 2023-12-06T15:45:49Z

public/libbacktrace/elf.cpp

This whole file is marked as changed.

sorry line endings shenanigans - should be fixed now.
The indentation in those libbacktrace files is also somewhat "special"

wolfpld · 2023-12-09T11:31:48Z

public/client/TracyCallstack.cpp

+                }
+            }
+            image->m_name = cache->m_imageName;
+        }


Is image->m_name valid after this function returns? My guess would be that the info argument is only valid inside this function, and the dlInfo provided filename will be invalidated after the next call to dladdr.

good point - fixed.

Where is this memory released?
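To make the ownership question concrete, here is a minimal sketch, under the assumption that the entry owns a private copy of the name and releases it in its destructor. NamedImage and SetName are hypothetical names, and plain malloc/free stand in for whatever allocator the client code actually uses.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical entry that owns its name buffer.
struct NamedImage
{
    char* name = nullptr;

    NamedImage() = default;
    NamedImage( const NamedImage& ) = delete;             // avoid double free
    NamedImage& operator=( const NamedImage& ) = delete;
    ~NamedImage() { std::free( name ); }                  // the copy is released here

    // Replaces the stored name with a private copy of src (src may be null).
    void SetName( const char* src )
    {
        std::free( name );
        name = nullptr;
        if( !src ) return;
        const std::size_t len = std::strlen( src ) + 1;
        name = static_cast<char*>( std::malloc( len ) );
        if( name ) std::memcpy( name, src, len );
    }
};
```

Because the entry never stores the caller's pointer, it does not matter whether the string handed in by dladdr() or dl_iterate_phdr() stays valid afterwards.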

wolfpld · 2023-12-09T11:33:42Z

public/client/TracyCallstack.cpp

+            []( const ImageEntry& lhs, const ImageEntry& rhs ) { return lhs.m_startAddress > rhs.m_startAddress; } );
+    }
+
+    const ImageEntry* GetImageEntryForAddress( void* address ) const 


Rename to GetImageForAddressImpl.

wolfpld · 2023-12-09T11:34:50Z

public/client/TracyCallstack.cpp

+        const ImageEntry* entry = GetImageEntryForAddress( address );
+        if( !entry )
+        {
+            Refresh();


Is there a mechanism that will prevent constant reloads when a large set of unmapped addresses is queried?

There isn't one atm. The expectation is that it wouldn't actually be called that often in real scenarios. In my case, with a few hundred dynamically loaded shared libs and all memory allocations instrumented with callstack captures, I only get it called ~5 times until it actually has the full map, as symbol resolution is relatively slow in itself.
What would be your suggestion: a "min ms between refreshes", and then either skip (and return unknown symbols for that period) or "stall" the resolution thread for a period of time?

Stalling to allow batch processing of a larger set may be a solution. But I was more interested in whether you have considered this case. I don't think this is immediately actionable.
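The "min ms between refreshes" idea floated above was left as not immediately actionable, so the following is a sketch of one possible shape for it, not something this PR implements. RefreshThrottle and MaybeRefresh are hypothetical names, and the "skip" variant is assumed: a miss inside the interval simply does not trigger a rescan.

```cpp
#include <chrono>
#include <functional>

// Hypothetical guard that limits how often an expensive refresh may run.
class RefreshThrottle
{
public:
    using Clock = std::chrono::steady_clock;

    explicit RefreshThrottle( Clock::duration minInterval ) : m_minInterval( minInterval ) {}

    // Runs refresh() unless one already ran within minInterval.
    // Returns true if the refresh was executed.
    bool MaybeRefresh( const std::function<void()>& refresh, Clock::time_point now = Clock::now() )
    {
        if( m_hasRefreshed && now - m_last < m_minInterval ) return false;
        refresh();
        m_last = now;
        m_hasRefreshed = true;
        return true;
    }

private:
    Clock::duration m_minInterval;
    Clock::time_point m_last{};
    bool m_hasRefreshed = false;
};
```

The `now` parameter exists only to make the behavior easy to exercise in isolation. The trade-off of skipping is the one named in the thread: addresses queried inside the interval come back as unknown.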

wolfpld · 2023-12-09T11:39:47Z

public/libbacktrace/elf.cpp

@@ -7366,6 +7393,12 @@ phdr_callback_mock (struct dl_phdr_info *info, size_t size ATTRIBUTE_UNUSED,
  }
  else ptr->dlpi_name = nullptr;
  ptr->dlpi_addr = info->dlpi_addr;
+
+  // calculate the address range so we can quickly determine is a PC is within the range of this image


wolfpld · 2023-12-09T16:42:56Z

public/client/TracyCallstack.cpp

+        // so we must get it in an alternative way and cache it
+        if( info->dlpi_name && info->dlpi_name[0] != '\0' )
+        {
+            image->m_name = info->dlpi_name;


What about this assignment?

fair point - copied it there as well.
I now free and rebuild image names with every refresh for simplicity. It would probably be better if I only created the new image entries, although that would require searching and comparing the paths to be sure, so I'm not sure it's worth it.

YaLTeR · 2023-12-12T04:50:06Z

I'm testing libunwind bt + dynload and I'm getting funny strings in the backtrace occasionally, could be memory corruption?

YaLTeR · 2023-12-12T05:08:23Z

Okay, I get this without dynload too, so I'll make an issue instead.

YaLTeR · 2023-12-12T05:59:48Z

Made an issue at #684; I somewhat suspect the latter changes in this PR though, because when I tested it in the beginning it seemed to work fine.

Add image cache to avoid calling dladdr() when doing offline symbol r…

132419d

…esolution. This cache can also be used in the runtime symbol resolution case to detect we should recreate "backtrace_state" when new images have been loaded

wolfpld reviewed Dec 3, 2023

Tiago Rodrigues added 4 commits December 3, 2023 09:23

move under the tracy namespace, remove commented out code

a9d039e

simplify return from tracy::FastVector iterator

a618b6e

remove option to enable image cache, use it for TRACY_HAS_CALLSTACK =…

55f53b9

…= 3 to obtain image path and addreses instead of dladdr()

Add support for libbacktrace to detect new elfs have been dynamically…

b835d73

… loaded after backtrace_initialize() has been called, and consider them for symbol resolution

wolfpld reviewed Dec 6, 2023

Tiago Rodrigues added 4 commits December 6, 2023 12:32

fix typo

24b6c64

checkout elf.cpp as it looks like line ending were screwed up

3855917

re-apply diff

8dfc5fe

fix line endings

e80e1d2

wolfpld reviewed Dec 9, 2023

trodrigues added 2 commits December 9, 2023 09:37

fix typos and compilation warnings

8503f32

make a copy of dli_fname after calling dladdr. Call ImageCache destru…

15f1b6b

…ctor.

tiago-rodrigues changed the title ~~Add image cache to avoid calling dladdr() and enable detecting image loading and recreate "backtrace_state"~~ Add image cache to avoid calling dladdr() and add libbacktrace elf image list refresh Dec 9, 2023

wolfpld reviewed Dec 9, 2023

make sure we always copy the image name in ImageCache

ab1ec3f

wolfpld merged commit 9bc014b into wolfpld:master Dec 11, 2023
5 checks passed

wolfpld mentioned this pull request Mar 23, 2024

call stacks: dladdr() doesn't work with -fvisibility=hidden symbols #414

Closed
