Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption in callstack recording #684

Closed
YaLTeR opened this issue Dec 12, 2023 · 12 comments
Closed

Memory corruption in callstack recording #684

YaLTeR opened this issue Dec 12, 2023 · 12 comments

Comments

@YaLTeR
Copy link
Contributor

YaLTeR commented Dec 12, 2023

Random strings end up in the callstack Image column:

image

Both when capturing with the UI and with the capture executable.

I tried UBSan, it shows this, not sure if related:

../subprojects/tracy/public/libbacktrace/dwarf.cpp:2337:20: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2337:20: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2397:19: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2397:19: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:2399:15: runtime error: load of value 16646, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3588:30: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3588:30: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3590:19: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3592:19: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3622:17: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'
../subprojects/tracy/public/libbacktrace/dwarf.cpp:3622:17: runtime error: load of value 16647, which is not a valid value for type 'dwarf_tag'

I'm building as a shared library with tracy_on_demand. The same issue occurs with/without libunwind, the new dynload option, manual lifetime, offline.

@wolfpld
Copy link
Owner

wolfpld commented Dec 12, 2023

Have you tried bisecting this issue?

@YaLTeR
Copy link
Contributor Author

YaLTeR commented Dec 12, 2023

ab1ec3f is the first bad commit

@wolfpld
Copy link
Owner

wolfpld commented Dec 12, 2023

Thanks. @tiago-rodrigues can you take a look?

@tiago-rodrigues
Copy link

will do

@tiago-rodrigues
Copy link

tiago-rodrigues commented Dec 12, 2023

Unfortunately I haven't been able to repro this locally with our own app or the tracy test one.
Likely there is something going on with the image name string lifetime there, but it's not obvious me just looking at the code either.
Would you happen to have a bit more details on the repro, or maybe there is a publicly available I could try to run locally?

@wolfpld: maybe the best would be best to make TRACY_USE_IMAGE_CACHE disabled by by default until we figure out what is the issue? @YaLTeR: I assume you don't get the issue if TRACY_USE_IMAGE_CACHE is disabled?

@YaLTeR
Copy link
Contributor Author

YaLTeR commented Dec 12, 2023

I assume you don't get the issue if TRACY_USE_IMAGE_CACHE is disabled?

Yeah, if I comment out that define, it seems to work fine.

@tiago-rodrigues
Copy link

I've created this: #685 to add disable the option for now.

@YaLTeR
Copy link
Contributor Author

YaLTeR commented Dec 12, 2023

Would you happen to have a bit more details on the repro, or maybe there is a publicly available I could try to run locally?

It's just mutter (GNOME's Wayland compositor) instrumented with Tracy. I run it, connect with Tracy, move the cursor around and wait a little for the data to flow in, then stop, and check the sampled call stacks. Most will contain those glitched strings.

Might be a little annoying to run, but:

  1. Clone and apply this MR: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3391

  2. Open meson.build, comment out this line:

    --- a/meson.build
    +++ b/meson.build
    @@ -461,7 +461,7 @@ if have_profiler
             'tracy_only_localhost=true',
             # Disable symbol resolving parts of Tracy, because they contend with
             # GJS spamming dlsym, causing massive slowdowns.
    -        'tracy_no_callstack=true',
    +        # 'tracy_no_callstack=true',
             # Build as a shared, rather than static, library, because otherwise
             # some binaries end up with two copies of Tracy, which is not supported.
             'default_library=shared',
  3. Install build dependencies, on Fedora you can do sudo dnf builddep mutter

  4. meson setup build -Dprofiler_tracy=true, ninja -C build, build/src/mutter --nested

    The ninja step might hang in the end, which seems to also be caused by this issue. In this case, just Ctrl-C and run it again.

@wolfpld
Copy link
Owner

wolfpld commented Dec 12, 2023

I can reproduce this with my test game.

obraz

@tiago-rodrigues
Copy link

@YaLTeR. I only have access to Ubuntu systems right now, so I'm not sure I can build mutter easily there. The repro from @wolfpld might be simpler, I'm guessing it's not a public project?
I still can't repro on my end after trying a few runs, but I may be just (un)lucky.

@wolfpld: could this be due to the fact that on image cache refresh, the old strings get deleted and we've returned that address on DecodeCallstackPtr() already, before it gets send back? Does it need a CopyStringFast() into CallstackEntryData::imageName? It wasn't being done before as it just returned dlinfo.dli_fname directly so I didn't do it either.

@wolfpld
Copy link
Owner

wolfpld commented Dec 12, 2023

I disabled clearing the cache on refresh in 5decb91, so that the refresh code only adds new entries to the cache. It should fix the problem.

@YaLTeR
Copy link
Contributor Author

YaLTeR commented Dec 13, 2023

That appears to fix the problem, yeah.

@wolfpld wolfpld closed this as completed Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants