Attaching a Debugger to BPF #14756

Open
5 of 10 tasks
Sladuca opened this issue Jan 21, 2021 · 4 comments

Sladuca commented Jan 21, 2021

Problem

There is no debugger support for solana BPF programs yet, so users are limited to print statements for debugging. That is painful because many solana programs use shared memory or hand-written struct packing that does tricky things with pointers (pointer arithmetic, shared memory references), all of which are much easier to inspect with a debugger.

Proposed Solution

Implement a GDB stub server for solana BPF. Implementing a stub server instead of a dedicated arch definition makes it easier to decouple the debugging logic from the VM itself and to debug programs running in the context of a node. It also gives a lot more flexibility for defining how stuff like single step, breakpoints, and watchpoints are implemented, and in particular we can do it however we like in the context of a running BPF instance.

Proposed implementation follows three steps:

  • Implement a gdb stub server running in a separate thread from BPF itself.

CURRENT STATUS: For this, I used the crate gdbstub to avoid reinventing wheels. I then defined a simple "request/reply" pattern between the two threads over a std::sync::mpsc channel so the gdb stub can ask the VM to do things for it, like "step a single instruction" or "set a breakpoint at some address". Currently, the debugger, if enabled, will block until a GDB client connects to it.
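For readers who want a picture of the "request/reply" pattern, here is a minimal sketch using only std::sync::mpsc. It is not the code from the branch: the Request and Reply enums, vm_loop, ask and demo_session are all hypothetical names, and the "VM" is a toy that just bumps a counter. Each request carries its own reply channel so the stub thread can block until the VM answers.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical requests the GDB stub thread can send to the VM thread.
enum Request {
    Step,
    SetBreakpoint(u64),
    ReadPc,
}

/// Hypothetical replies sent back by the VM thread.
#[derive(Debug, PartialEq)]
enum Reply {
    Stopped { pc: u64 },
    Ack,
    Pc(u64),
}

/// Every request travels with the channel its reply should be sent on.
type Envelope = (Request, Sender<Reply>);

/// Toy stand-in for the VM: services requests until the stub hangs up,
/// then returns the breakpoints it was asked to set.
fn vm_loop(rx: Receiver<Envelope>) -> Vec<u64> {
    let mut pc = 0u64;
    let mut breakpoints = Vec::new();
    for (req, reply_tx) in rx {
        let reply = match req {
            Request::Step => {
                pc += 1;
                Reply::Stopped { pc }
            }
            Request::SetBreakpoint(addr) => {
                breakpoints.push(addr);
                Reply::Ack
            }
            Request::ReadPc => Reply::Pc(pc),
        };
        // Ignore the error if the stub stopped listening for this reply.
        let _ = reply_tx.send(reply);
    }
    breakpoints
}

/// Stub-side helper: send one request and block until its reply arrives.
fn ask(vm: &Sender<Envelope>, req: Request) -> Reply {
    let (reply_tx, reply_rx) = channel();
    vm.send((req, reply_tx)).expect("VM thread is gone");
    reply_rx.recv().expect("VM dropped the reply channel")
}

/// Drives a short session the way the stub thread would.
fn demo_session() -> (Vec<Reply>, Vec<u64>) {
    let (tx, rx) = channel::<Envelope>();
    let vm = thread::spawn(move || vm_loop(rx));
    let replies = vec![
        ask(&tx, Request::SetBreakpoint(0x40)),
        ask(&tx, Request::Step),
        ask(&tx, Request::Step),
        ask(&tx, Request::ReadPc),
    ];
    drop(tx); // closing the request channel ends vm_loop
    (replies, vm.join().expect("VM thread panicked"))
}
```

Calling demo_session() sets one breakpoint, steps twice and reads the pc back. In a real stub the ask calls would be made from the gdbstub target callbacks instead.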

  • Implement debugger logic in BPF in a way that minimally affects performance when running without the debugger, including support for:
    • reading/writing register values
    • reading/writing memory addresses
    • single step
    • breakpoints:
    • watchpoint
    • tests for the previous items (still need to do)

CURRENT STATUS: This was basically implemented as a handler for the "request/reply" pattern mentioned in the previous item, though actual tests still need to be written. Initially it was conditionally compiled with the debug feature, but that ended up being annoying for writing tests, so instead I just have some if let statements gating debugger behavior in the loop. If that turns out to be too much performance overhead, it can be refactored to have less of an impact, probably using some kind of IoC injection. It's also worth noting that the breakpoints are pretty much useless for now: only single-step works from an actual GDB instance (the next item says why).
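A rough sketch of what "if let statements gating debugger behavior in the loop" can look like is below. Everything here is hypothetical (the Debugger struct, on_instruction, and the toy run interpreter are not rbpf's types); the point is only that when no debugger is attached, the per-instruction cost is a single Option check.

```rust
/// Hypothetical debugger state attached to a VM; passing `None` to `run`
/// means "no debugger, run at full speed".
#[derive(Default)]
struct Debugger {
    breakpoints: Vec<usize>,
    single_step: bool,
    /// pcs at which execution was handed to the debugger.
    stops: Vec<usize>,
}

impl Debugger {
    /// Records a stop if `pc` should hand control to the debugger.
    fn on_instruction(&mut self, pc: usize) {
        if self.single_step || self.breakpoints.contains(&pc) {
            // A real stub would block here and wait for the next GDB command.
            self.stops.push(pc);
        }
    }
}

/// Toy interpreter: each "instruction" just adds its value to an accumulator.
fn run(program: &[u64], mut debugger: Option<&mut Debugger>) -> u64 {
    let mut acc = 0u64;
    for (pc, insn) in program.iter().enumerate() {
        // The only debugger cost on the fast path is this `if let` check.
        if let Some(dbg) = debugger.as_deref_mut() {
            dbg.on_instruction(pc);
        }
        acc += insn;
    }
    acc
}
```

Because the hook takes an Option, tests can exercise both paths in the same build, which is what conditional compilation made awkward.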

CURRENT STATUS: If the existing support isn't working for our needs, we'd need to extend the existing target definition. The existing BPF support doesn't seem to be in any "commonly distributed" release of gdb, and I spent 4-ish hours trying to get it to build on my mac before falling back to x86 gdb to avoid wasting more time compiling GNU stuff on mac catalina. This "works" in the sense that execution stops at the right pc if I print it out in the VM while debugging a test program, but gdb doesn't understand the server's replies containing BPF register information because (obviously) x86 has different registers than BPF. Eventually we'll probably need to fork gdb and distribute it the same way we distribute our fork of the llvm toolchain.
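To make the register mismatch concrete, here is a small sketch of how a stub answers GDB's "read all registers" (g) request. The packet framing ($payload#checksum, with the checksum being the payload's byte sum modulo 256) is the standard GDB remote protocol; the register layout is an assumption for illustration (r0..r10 followed by pc, each 64-bit little-endian), and frame_packet and encode_registers are made-up helper names, not gdbstub APIs.

```rust
/// Frames a payload as a GDB remote protocol packet: `$<payload>#<checksum>`,
/// where the checksum is the sum of the payload bytes modulo 256, written as
/// two hex digits.
fn frame_packet(payload: &str) -> String {
    let checksum = payload.bytes().fold(0u8, |sum, b| sum.wrapping_add(b));
    format!("${}#{:02x}", payload, checksum)
}

/// Hex-encodes the register file for a `g` reply, one register after another
/// in target byte order (little-endian here).
/// ASSUMPTION: the layout is r0..r10 followed by pc, all 64 bits wide.
fn encode_registers(regs: &[u64; 11], pc: u64) -> String {
    let mut out = String::with_capacity(12 * 16);
    for reg in regs.iter().chain(std::iter::once(&pc)) {
        for byte in reg.to_le_bytes().iter() {
            out.push_str(&format!("{:02x}", byte));
        }
    }
    out
}
```

Under this assumed layout the reply is 12 registers of 16 hex digits each. A client built for x86 expects a different register set, so it cannot map these bytes onto registers it knows about.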

  • Patch the compilation toolchain such that DWARF symbols are properly relocated and can be read by the GDB client.

CURRENT STATUS: This is where the majority of my internship was spent, and as my internship comes to an end I unfortunately still haven't found a solution. This turned out to be a rather difficult issue that involved a lot of "code archaeology" (in the words of Matt Godbolt), manually annotating hexdumps of malformed ELF sections that couldn't be read by llvm-dwarfdump, and a lot of spelunking in the massive, messy codebases of the LLVM project, namely clang, lld, and llvm itself.

@jackcmay has been very helpful as far as providing suggestions and links to relevant and helpful documentation - I hope I didn't consume too much of his time. To make sure the minimum amount of information gets lost when I go back to school, I've added a somewhat extensive summary of what I did, what I found, and future directions I would take had I more time below. Part of me thinks I might spend some time on this afterwards because I'm still curious and it's open source, but in any case feel free to tag me in a comment on this issue if anyone has further questions about this in the future after my internship ends.

What I did
  • The first thing I did when attempting to add debug symbols was simply to add a -g flag to the compilation commands here in rbpf's tests, and ld.lld ended up having an aneurysm, screaming this:
         ld.lld: warning: udivmodti4.c:(.debug_info+0xD66B2): has non-ABS relocation R_BPF_64_32 against symbol ''
          ld.lld: warning: udivmodti4.c:(.debug_info+0xD66BE): has non-ABS relocation R_BPF_64_32 against symbol ''
          ld.lld: warning: udivmodti4.c:(.debug_info+0xD66C9): has non-ABS relocation R_BPF_64_32 against symbol ''
          ld.lld: warning: udivmodti4.c:(.debug_info+0xD66CD): has non-ABS relocation R_BPF_64_32 against symbol ''
          ld.lld: warning: udivmodti4.c:(.debug_info+0xD66D8): has non-ABS relocation R_BPF_64_32 against symbol ''
          ld.lld: warning: udivmodti4.c:(.debug_info+0xD66DC): has non-ABS relocation R_BPF_64_32 against symbol ''
... // many thousands of lines of this

          ld.lld: error: std.e0r84xw0-cgu.5:(.BTF+0x2C50): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.5:(.BTF+0x456A0): has non-ABS relocation R_BPF_NONE against symbol 'std::future::TLS_CX::__getit::__KEY::hb4bb6cd48e7f2fa0'
          ld.lld: warning: std.e0r84xw0-cgu.14:(.BTF+0x531A6): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: error: std.e0r84xw0-cgu.9:(.BTF+0x21C0): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.9:(.BTF+0x6D9A6): has non-ABS relocation R_BPF_NONE against symbol 'std::collections::hash::map::RandomState::new::KEYS::__getit::__KEY::h41aec933d7eaa529'
          ld.lld: warning: alloc.4qne3mtx-cgu.13:(.BTF+0x7C582): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: alloc.4qne3mtx-cgu.5:(.BTF+0x806B6): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.10:(.BTF+0x88D50): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.12:(.BTF+0x9300B): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.13:(.BTF+0x96EB6): has non-ABS relocation R_BPF_64_32 against symbol 'str.1'
          ld.lld: warning: core.d7muqx0w-cgu.15:(.BTF+0x9B3BC): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.3:(.BTF+0xA6858): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.3:(.BTF+0xA6864): has non-ABS relocation R_BPF_64_32 against symbol 'str.1'
          ld.lld: warning: core.d7muqx0w-cgu.4:(.BTF+0xAA65B): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.5:(.BTF+0xAE50D): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.6:(.BTF+0xB2A90): has non-ABS relocation R_BPF_64_32 against symbol 'str.0'
          ld.lld: warning: core.d7muqx0w-cgu.7:(.BTF+0xB8F50): has non-ABS relocation R_BPF_64_32 against symbol 'str.1'
          ld.lld: warning: core.d7muqx0w-cgu.7:(.BTF+0xB8F5C): has non-ABS relocation R_BPF_64_32 against symbol 'str.2'
          ld.lld: warning: core.d7muqx0w-cgu.7:(.BTF+0xB8F68): has non-ABS relocation R_BPF_64_32 against symbol 'str.3'
          ld.lld: error: std.e0r84xw0-cgu.1:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.1:(.BTF.ext+0x34): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.11:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.11:(.BTF.ext+0x7ECC): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.13:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.13:(.BTF.ext+0xAA24): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.15:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.15:(.BTF.ext+0xC3E4): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.0:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.0:(.BTF.ext+0x10AB4): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.2:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.2:(.BTF.ext+0x1479C): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: std.e0r84xw0-cgu.10:(.BTF.ext+0x34): unrecognized reloc R_BPF_NONE
          ld.lld: error: std.e0r84xw0-cgu.10:(.BTF.ext+0x18E1C): has non-ABS relocation R_BPF_NONE against symbol ''
          ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
  • My first thought was to just ignore the warnings and see if GDB could read the result, but it couldn't: it complained about an invalid pointer size in compunit header. That prompted me to go learn what a linker actually does, since at this point my understanding of the compilation process was the typical "intro systems" explanation (I hadn't taken a course on compilers before): "the compiler turns your C code into many object files and the linker does some black magic to fuse them all together into a single executable or shared library". Learning that in enough detail for stuff to actually make sense took almost a week, and even then relocations still seemed kinda magical.
  • Having understood stuff (kind of), I then added an option to solana's cargo-build-bpf command to not strip symbols, and I removed the --release flag from cargo, so that I could spend some time adding print statements to LLD and see what the existing BPF relocations were doing in the context in which users would actually build their binaries, using examples from solana-program-library. That ended up causing ld.lld to straight up fail because it didn't handle the R_BPF_NONE relocation, which apparently clang emits. I added a simple fix for that here, but then I got roughly the same ld.lld tantrum as above.
  • Next the "code archaeology" started in earnest, since I now knew of the existence and usage of commands like readelf and dwarfdump to inspect ELFs. Dumps of the shared objects produced by the linker were enormous (>100k lines long) and contained the same invalid pointer size in compunit header error, but dumps of the unlinked relocatable objects didn't, so now I was pretty sure it was an issue in the linker (though not entirely sure; see below for other possible culprits that I didn't inspect).
  • At this point @jackcmay suggested I return to the single-source C test programs in rbpf as the simplest possible case, so I wrote a "small but not trivial" buggy test C program and used that for all of my later investigations. Now the dumps were of a comprehensible size. I did some more spelunking in the lld codebase to get an overall idea of what it was doing before continuing, and the biggest thing I noticed was that almost all of solana's patches were relocation-related, so I figured it was probably a relocation support issue and looked into that specifically. While doing this, I realized that there are separate llvm-readelf and llvm-dwarfdump commands, whose output actually interpreted the DWARF sections for me, which was pretty nice.
  • I found some llvm development threads that mentioned a clang flag -X +dwarfris that prevented cross-section relocations from occurring in DWARF sections, and when I tried that it made ld.lld stop screaming, so I proceeded as if that was the more correct way as it limited the number of issues it could be. The debug sections went from having many R_BPF_64_32 relocations to having three R_BPF_64_64 relocations. But alas, GDB still couldn't read it, and when I llvm-dwarfdump'd it, the only significant issue was that some unexpected null bytes were prematurely terminating the .debug_info section - and looking at the offsets for the new R_BPF_64_64, I was pretty sure stuff that wasn't supposed to be null was being overwritten with null bytes. To confirm this, I ended up going deeper, pulling out the DWARF spec and trying to wrap my head around what it's doing so I could eventually look at hexdumps and see what exactly is being overwritten and where.
  • In a "donut" chat, @aeyakovenko and I talked a bit about dealing with these sorts of issues, and he mentioned that usually you just end up having to manually inspect the bytes of the "crufty old C structs" to see where they went wrong. This turned out to be very good advice: when I finally hexdump'd the ELFs into text files and manually annotated the .debug_info and .debug_abbrev sections for both the shared objects and the pre-link relocatable objects (.debug_abbrev was the same for both), I not only found exactly what was being malformed, but I also got a much more precise understanding of how DWARF and relocations work.
  • Now that I knew exactly what was going wrong (at least when including the -X +dwarfris clang flag), I spent the last few days digging around in solana/lld, adding print statements and trying to understand exactly what transformation the R_BPF_64_64 relocations were performing. At this point I'm pretty sure the problem is that R_BPF_64_64 is being used as an address relocation for addresses in .debug_info, while the linker actually applies it as a relocation of an lddw instruction, which is a bit different from a plain address.
What I found
  • Incorrect relocations are the culprit, so optimization level should be more or less irrelevant - for the sake of consistency with existing builds I kept it -O2
  • all issues are manifesting in the linker, not necessarily originating in it, as it could be the case that the relocation types emitted by clang are incorrect.
  • using the -X +dwarfris flag removes a vast majority of the issues and makes things very simple. I'm pretty sure we should use it, but it may be the case that a fix somewhere else will remove the need for it.
  • when the -X +dwarfris flag is included, R_BPF_64_64 relocations are being used to relocate plain addresses rather than lddw instructions. That would explain why a relocation in the first debugging information entry writes null bytes over part of the second debugging information entry in .debug_info. If this is the way forward, the mistake could very well be clang emitting R_BPF_64_64 relocations when they should have been something else, though I'm not entirely sure what that should be, or even whether such a relocation type is defined yet in the BPF ABI. But it could also still be an issue in the linker, where there may be other cases to consider when performing / interpreting an R_BPF_64_64 relocation.
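The last finding is easier to see with bytes. An lddw instruction occupies two 8-byte slots, and its 64-bit immediate is split across the 32-bit imm fields of those slots (bytes 4..8 and 12..16). The sketch below is a toy model, not lld's code: it assumes the linker patches R_BPF_64_64 in that lddw layout, and the function names and the 24-byte "section" are made up for illustration.

```rust
/// Patches a location using the `lddw` immediate layout: the low 32 bits go to
/// bytes 4..8 of the first instruction slot and the high 32 bits to bytes
/// 12..16, i.e. the `imm` field of the second slot.
/// ASSUMPTION: this models how the linker applies R_BPF_64_64.
fn apply_as_lddw(section: &mut [u8], offset: usize, value: u64) {
    let lo = (value as u32).to_le_bytes();
    let hi = ((value >> 32) as u32).to_le_bytes();
    section[offset + 4..offset + 8].copy_from_slice(&lo);
    section[offset + 12..offset + 16].copy_from_slice(&hi);
}

/// Patches a plain 8-byte little-endian address, which is what an address
/// field in .debug_info needs.
fn apply_as_abs64(section: &mut [u8], offset: usize, value: u64) {
    section[offset..offset + 8].copy_from_slice(&value.to_le_bytes());
}

/// Toy 24-byte "section": an 8-byte address field at offset 0, followed by
/// 16 bytes of 0xAA standing in for the data that comes after it.
fn toy_section() -> Vec<u8> {
    let mut section = vec![0u8; 8];
    section.extend_from_slice(&[0xAA; 16]);
    section
}
```

With a small address such as 0x1000, apply_as_abs64 touches only bytes 0..8, while apply_as_lddw puts the low half in the wrong place and writes four zero bytes at 12..16, past the end of the 8-byte field. That is the same shape as the stray null bytes seen in the following DIE.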
Next steps
  • case where we should use -X +dwarfris:
    • figure out whether or not R_BPF_64_64 is the correct relocation type for relocating addresses in .debug_info
    • figure out where/why clang emits an R_BPF_64_64 relocation type instead of something else
    • figure out what the relocations of addresses in .debug_info should be doing, if anything.
  • case where we shouldn't use -X +dwarfris:
    • manually annotate a hexdump of the .debug_info and .debug_abbrev sections of both the shared object and the pre-link relocatable object.
    • figure out what all of these R_BPF_64_32 relocations are supposed to accomplish, if anything
  • an entirely different approach that avoids the need for debug relocations in the linker altogether. Not sure how feasible this is, it's just a random idea I had.
Information about hexdumps
  • Naming for my hex dumps, which can be found here: dbi means "dump of .debug_info" and dba means "dump of .debug_abbrev". _so means it's from the shared object, while _o means it's from the pre-link relocatable object. ris is appended to dbi or dba for dumps that omitted the -X +dwarfris flag, though I haven't really spent much time on those.
  • The .debug_info section is basically a series of Debugging Information Entries (DIEs). Each DIE specifies 1) a corresponding entry in .debug_abbrev (via abbrev_index), which lists the attributes that DIE is supposed to have, and 2) the values themselves. DIEs can have "children", and a null byte following a DIE indicates the end of a sequence of DIEs at a particular level, so at the top level this indicates the end of the section. You can read more about DIEs in the DWARF spec on page 21.
  • The discrepancy can be seen pretty clearly by opening dbi_so and dbi_o side-by-side, as I've manually annotated most of each with respect to dba_o, which is identical to dba_so.
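When annotating the dumps by hand, the first thing to decode in each DIE is its abbreviation code, which DWARF stores as an unsigned LEB128 number; a code of 0 is the "null entry" that closes the current list of siblings. A minimal sketch of that decoding step is below. The helper names are made up, and this is not a full DIE parser: it does not read attribute values or .debug_abbrev.

```rust
/// Decodes one unsigned LEB128 value. Returns the value and the number of
/// bytes consumed, or None if the input ends mid-value or is too long for u64.
fn read_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in bytes.iter().enumerate() {
        let shift = 7 * (i as u32);
        if shift >= 64 {
            return None;
        }
        // The low 7 bits of each byte carry payload, least significant first.
        value |= u64::from(byte & 0x7f) << shift;
        // A clear high bit marks the last byte of the number.
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// A DIE starts with its abbreviation code; code 0 is a null entry that ends
/// the current sibling list.
fn is_null_entry(die_bytes: &[u8]) -> Option<bool> {
    read_uleb128(die_bytes).map(|(code, _)| code == 0)
}
```

This is why stray zero bytes are so destructive: a reader that lands on a 0x00 where an abbreviation code should be treats it as the end of the DIE list instead of reporting corrupt data.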
jackcmay added this to To do in Mainnet Beta Programs via automation Jan 21, 2021

Sladuca commented Feb 16, 2021

Just an update - I've looked into the third alternative option under next steps a bit more and I think I'm going to try that next - I have a feeling this will end up being a bit less error-prone, and it also has the added benefit of making gdbstub better for many use cases.

mvines added this to the The Future! milestone May 10, 2021
@stranzhay

was this ever worked on/finished ?

dmakarov self-assigned this May 20, 2022

jawilk commented May 20, 2022

There is a work in progress by @terorie and me building on top of this. We want to make it accessible in the browser but unfortunately that will still take some time.

For now, you could follow the steps outlined here to have basic debugging functionality with gdb. Getting this up and running is a bit cumbersome as of now but after #anza-xyz/llvm-project#38 lands in bpf-tools you would only need to compile a patched gdb and patch rbpf with the gdbstub crate to make it work.

If you have a project you want to debug today, you could either build rust with an llvm fork as mentioned in the repo, or share your program's code (if you are able/allowed to) so I can try to build it with debug info for you.

In any case, since this is kind of scattered around GitHub and I'm not sure the info provided in solana-poc-debugging-example is very clear, you could open an issue there and I can help get it running.

@stranzhay

@jawilk awesome tysm
