Use breakpad + symbolic to generate and interpret minidump-format core dumps #4202

Open
siddontang opened this issue Feb 13, 2019 · 7 comments
Labels
component/build Component: Build, Deployment, etc. component/server Component: Server difficulty/medium Difficulty: Medium. You need some kind of understanding of several components to work on this help wanted Help wanted. Contributions are very welcome! type/enhancement Type: Issue - Enhancement

Comments

@siddontang
Contributor

siddontang commented Feb 13, 2019

Edit: We're going to try to integrate breakpad + symbolic to generate compact "minidumps" (via breakpad), and interpret them offline (via symbolic). Next step is to prototype breakpad and symbolic on a toy project to learn how to use them.

Feature Request

Is your feature request related to a problem? Please describe:

Sometimes TiKV hits problems such as a segmentation fault and crashes outright. Unfortunately, our official Ansible deployment doesn't enable core dumps, because we worry that generating too many core dump files may exhaust disk space.

Even if we enable core dumps, the generated core files may be too large to send over the network, so we would have to debug on the user's machine directly (which, of course, is not allowed in most users' environments).

Describe the feature you'd like:

Mostly we only want to know the panic backtrace. Instead of a full core file, we can generate a minidump or just output the panic backtrace.
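
For the "just output the panic backtrace" option, a minimal sketch using only the Rust standard library might look like the following. It covers Rust panics only; a hard crash such as a segmentation fault never reaches a panic hook and would need a signal handler or breakpad. The names `install_backtrace_hook`, `last_report` and `LAST_REPORT` are hypothetical and are not TiKV code.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::sync::Mutex;

// Holds the most recent report so it can be inspected later.
// A real server would write the report to its log file instead.
static LAST_REPORT: Mutex<Option<String>> = Mutex::new(None);

/// Installs a panic hook that records the panic message, the source
/// location, and a backtrace.
pub fn install_backtrace_hook() {
    panic::set_hook(Box::new(|info| {
        // force_capture ignores RUST_BACKTRACE, so a trace is always collected.
        let backtrace = Backtrace::force_capture();
        let location = info
            .location()
            .map(|l| format!("{}:{}", l.file(), l.line()))
            .unwrap_or_else(|| "<unknown>".to_string());
        let report = format!("panic at {location}: {info}\n{backtrace}");
        eprintln!("{report}");
        *LAST_REPORT.lock().unwrap() = Some(report);
    }));
}

/// Returns a copy of the last report stored by the hook, if any.
pub fn last_report() -> Option<String> {
    LAST_REPORT.lock().unwrap().clone()
}
```

Calling `install_backtrace_hook()` once at startup is enough. Frame names in the backtrace are only as good as the symbols left in the binary.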

Teachability, Documentation, Adoption, Migration Strategy:

For minidumps, we can use https://github.com/google/breakpad; in Rust, we may try https://github.com/getsentry/symbolic.
Another way is to output the backtrace directly; see https://github.com/gby/libcrash and https://www.scribd.com/doc/3726406/Crash-N-Burn-Writing-Linux-application-fault-handlers.

/cc @ethercflow

@siddontang siddontang added the type/enhancement Type: Issue - Enhancement label Feb 13, 2019
@brson
Contributor

brson commented Feb 14, 2019

The op mentions panics, but this is really about hard crashes. That said, if we distributed builds with panic = abort we could treat crashes and panics the same, with minidump.

It seems like using breakpad + symbolic should work fine for tikv.

Stripping debuginfo for breakpad would mostly fix #4107, and help with #4150.

This seems promising.

@brson brson added this to To do in Improve compile times via automation Feb 14, 2019
@siddontang
Contributor Author

@brson

Do you think it is fine to let contributors help us? Of course, we can mentor them.

@brson
Contributor

brson commented Feb 16, 2019

@siddontang yes I think this could be an interesting issue for a contributor. It needs some more description of the next steps though - I don't think the original description is quite enough to get started. Feel free to write more about specifically what we need to do next, or else I'll come back and think about it later.

@siddontang
Contributor Author

@brson

I think we can try to use breakpad or symbolic directly. I browsed the breakpad source code and found that it already registers a signal handler to generate a minidump directly, but we should verify whether it works in Rust.

So I think we should first try these in a Rust project, then introduce them to TiKV and make sure they work there too.

After that, we can use breakpad to extract the debug symbols and save them to a separate file (we don't need to include the symbol file in the release tarball), which reduces the binary size. When users hit a crash, they only need to send us the minidump files, and we can debug them on our local machines with the symbol file.

@brson
Contributor

brson commented Feb 22, 2019

Thanks @siddontang. It sounds like the next step is to prototype integrating breakpad and symbolic into a toy rust project.

@siddontang do you have time to mentor? I understand if not; we can look for somebody else.

@brson brson added help wanted Help wanted. Contributions are very welcome! component/build Component: Build, Deployment, etc. difficulty/medium Difficulty: Medium. You need some kind of understanding of several components to work on this component/server Component: Server labels Feb 22, 2019
@brson brson changed the title from "consider generating minidump or outputting backtrace when receives abort signal" to "Use breakpad + symbolic to generate and interpret minidump-format core dumps" Feb 22, 2019
@siddontang
Contributor Author

Another thing I found is that the binary size on Linux is huge, nearly 300MB. I used bloaty:

bloaty/bloaty bin/tikv-server
     VM SIZE                         FILE SIZE
 --------------                   --------------
   0.0%       0 .debug_info         123Mi  41.4%
   0.0%       0 .debug_loc         67.0Mi  22.5%
   0.0%       0 .debug_str         27.5Mi   9.3%
   0.0%       0 .debug_ranges      22.2Mi   7.5%
  63.6%  17.8Mi .text              17.8Mi   6.0%
   0.0%       0 .debug_line        11.2Mi   3.8%
   0.0%       0 .debug_pubnames    7.56Mi   2.5%
   0.0%       0 .debug_pubtypes    7.32Mi   2.5%
   0.0%       0 .strtab            2.62Mi   0.9%
   8.3%  2.31Mi .data.rel.ro       2.31Mi   0.8%
   7.8%  2.19Mi .rela.dyn          2.19Mi   0.7%
   7.5%  2.11Mi .bss                    0   0.0%
   0.0%       0 .debug_abbrev      1.61Mi   0.5%
   5.0%  1.40Mi .rodata            1.40Mi   0.5%
   5.0%  1.38Mi .eh_frame          1.38Mi   0.5%
   0.0%       0 .symtab             971Ki   0.3%
   1.8%   501Ki .gcc_except_table   501Ki   0.2%
   0.0%       0 .debug_aranges      326Ki   0.1%
   0.8%   220Ki .eh_frame_hdr       220Ki   0.1%
   0.3%  89.6Ki [30 Others]         101Ki   0.0%
   0.0%       0 .debug_macro       94.1Ki   0.0%
 100.0%  28.0Mi TOTAL               297Mi 100.0%

It seems most of the space is occupied by debug info, so I ran strip --strip-debug tikv-server and the size dropped to 30MB.
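
As a sanity check on those numbers, the following sketch adds up the .debug_* rows of the bloaty table above and subtracts them from the total file size. The figures are copied from the table, which rounds them, so the result is only approximate. The constant and function names are hypothetical.

```rust
// File sizes in MiB for every .debug_* row of the bloaty output above.
// The two rows reported in KiB are converted to MiB.
const DEBUG_SECTIONS_MIB: &[(&str, f64)] = &[
    (".debug_info", 123.0),
    (".debug_loc", 67.0),
    (".debug_str", 27.5),
    (".debug_ranges", 22.2),
    (".debug_line", 11.2),
    (".debug_pubnames", 7.56),
    (".debug_pubtypes", 7.32),
    (".debug_abbrev", 1.61),
    (".debug_aranges", 326.0 / 1024.0),
    (".debug_macro", 94.1 / 1024.0),
];

// TOTAL row of the FILE SIZE column.
const TOTAL_FILE_MIB: f64 = 297.0;

/// Sum of all debug sections, in MiB.
pub fn debug_info_mib() -> f64 {
    DEBUG_SECTIONS_MIB.iter().map(|(_, mib)| mib).sum()
}

/// Estimated file size once the debug sections are removed, in MiB.
pub fn stripped_estimate_mib() -> f64 {
    TOTAL_FILE_MIB - debug_info_mib()
}

/// Share of the file taken up by debug sections, in percent.
pub fn debug_share_percent() -> f64 {
    debug_info_mib() / TOTAL_FILE_MIB * 100.0
}
```

The debug sections come to roughly 268 MiB, about 90% of the file, which leaves a little under 30 MiB and is consistent with the size observed after stripping.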

But the problem is that we would lose the backtrace, so this is not a good idea. Maybe we can separate out the debug info: most of the time we don't need it, and when we do, we can download it directly.
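
One common way to separate the debug info with GNU binutils is to copy it into a side file with objcopy --only-keep-debug, strip the binary, and then record a link to the side file with objcopy --add-gnu-debuglink. The sketch below only builds those three command lines from Rust. It assumes objcopy and strip are on PATH, and the `.debug` suffix and the function names are illustrative choices rather than anything TiKV's build does today.

```rust
use std::io;
use std::process::Command;

/// Builds the three binutils invocations that split `binary` into a
/// stripped executable plus a `<binary>.debug` side file.
pub fn split_debug_commands(binary: &str) -> Vec<Command> {
    let debug_file = format!("{binary}.debug");

    // 1. Copy only the debug sections into the side file.
    let mut keep = Command::new("objcopy");
    keep.arg("--only-keep-debug").arg(binary).arg(&debug_file);

    // 2. Remove the debug sections from the binary that gets shipped.
    let mut strip = Command::new("strip");
    strip.arg("--strip-debug").arg(binary);

    // 3. Record the side file's name so a debugger can locate it later.
    let mut link = Command::new("objcopy");
    link.arg(format!("--add-gnu-debuglink={debug_file}")).arg(binary);

    vec![keep, strip, link]
}

/// Runs the commands in order and stops at the first failure.
pub fn run_all(commands: Vec<Command>) -> io::Result<()> {
    for mut cmd in commands {
        let status = cmd.status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("{cmd:?} failed with {status}"),
            ));
        }
    }
    Ok(())
}
```

With this split, the release tarball carries only the stripped binary, and the `.debug` file can be published separately for whoever needs to debug a crash.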

@sticnarf
Contributor

I'm interested in this issue. Is there any mentor?

Here is my rough idea. To create a minidump, we create a binding to breakpad using bindgen, then install the exception handler when tikv starts. After the tikv-server binary is built, we run breakpad's dump_syms to create a symbol file. We save the symbol file and distribute the stripped binary.
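
To make the "save the symbol file" step concrete: a breakpad symbol file starts with a line of the form MODULE <os> <arch> <id> <name>, and breakpad's stackwalking tools conventionally look symbols up under <root>/<name>/<id>/<name>.sym. The sketch below parses that first line and derives the storage path. It is a sketch under those assumptions, and the type and function names are hypothetical.

```rust
use std::path::PathBuf;

/// Fields of the MODULE record on the first line of a breakpad .sym file.
#[derive(Debug, PartialEq)]
pub struct ModuleRecord {
    pub os: String,
    pub arch: String,
    pub id: String,
    pub name: String,
}

/// Parses a line such as `MODULE Linux x86_64 <id> tikv-server`.
/// Returns None if the line is not a complete MODULE record.
pub fn parse_module_line(line: &str) -> Option<ModuleRecord> {
    // The module name is the remainder of the line, so split at most 5 ways.
    let mut parts = line.trim_end().splitn(5, ' ');
    if parts.next()? != "MODULE" {
        return None;
    }
    Some(ModuleRecord {
        os: parts.next()?.to_string(),
        arch: parts.next()?.to_string(),
        id: parts.next()?.to_string(),
        name: parts.next()?.to_string(),
    })
}

/// Where the symbol file should be stored under the symbol root directory.
pub fn symbol_store_path(root: &str, module: &ModuleRecord) -> PathBuf {
    PathBuf::from(root)
        .join(&module.name)
        .join(&module.id)
        .join(format!("{}.sym", module.name))
}
```

Because the id changes with every build, keeping one directory per id lets symbol files for many releases sit side by side, so a minidump from any shipped version can be matched to its symbols.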

I have some concerns.

If we use breakpad to generate a minidump, is it safe to just remove the panic hook or even use panic = abort in release? There can be some logs in the buffer which are not flushed to disk. If it aborts, does it mean we will lose some logs?
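
On the log-flushing concern: Rust runs the panic hook before the process aborts, even when built with panic = abort, so a hook can flush buffered logs first. A hard crash such as a segmentation fault does not run the hook, so this does not help there. Below is a minimal sketch with an in-memory sink standing in for the log file. `SharedSink`, `Logger` and `install_flushing_hook` are hypothetical names and do not reflect TiKV's logger.

```rust
use std::io::{self, BufWriter, Write};
use std::panic;
use std::sync::{Arc, Mutex};

/// Stand-in for the log file: it only receives bytes once the
/// BufWriter in front of it is flushed or fills up.
#[derive(Clone, Default)]
pub struct SharedSink(pub Arc<Mutex<Vec<u8>>>);

impl Write for SharedSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

pub type Logger = Arc<Mutex<BufWriter<SharedSink>>>;

/// Installs a panic hook that writes the panic message to the logger and
/// flushes whatever is still buffered.
pub fn install_flushing_hook(logger: Logger) {
    panic::set_hook(Box::new(move |info| {
        // try_lock avoids a deadlock if the panicking thread holds the logger.
        if let Ok(mut writer) = logger.try_lock() {
            let _ = writeln!(writer, "fatal: {info}");
            let _ = writer.flush();
        }
        // A production hook could call std::process::abort() here so that
        // the crash handler still produces a minidump.
    }));
}
```

The `try_lock` call is a deliberate trade-off: if the logger is locked at the moment of the panic, the hook skips the flush rather than hanging the dying process.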

I am also afraid that it will increase compile time. It introduces a new C++ library we have to compile. Extracting symbols from binary takes some time too.

The advantage I can see is distributing a smaller binary and creating a smaller dump file. However, it does not seem helpful for debug builds, and Cargo has no profile-based dependency support. That means compile time for dev builds will increase while we can hardly benefit from it during development.


3 participants