Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Gankra · 2021-05-04T14:45:05Z

The feature work is done, see the milestone dashboard for remaining bugfixes

Original metabug:

This is a metabug for the first milestone of Kill Breakpad

Replace mozilla/minidump-stackwalk with rust-minidump's minidump-processor

Subtasks

Needed for strict parity but not necessarily a blocker

~~[ ] Implement an exploitability heuristic like Breakpad #25 - Implement an exploitability heuristic~~
Context size is invalid (0x66f) #243 - Don't choke on XSAVE CpuContexts
Extract LastErrorValue from Thread TEBs #156 - Grab the last-error-value from each thread's TEB
Compute cert info #171 - Compute cert info
Support Mac Crash info #175 - Support Mac Crash Info (mozilla extension)
Implement ARM pointer auth stripping #195 - ARM Pointer auth stripping
Implement MINIDUMP_STREAM_TYPE::LinuxCpuInfo #207 - Implement LinuxCpuInfo

Probably lots more to do, will expand as we find them.

luser · 2021-05-04T15:07:07Z

Output the crashing thread (is this just the requesting_thread? or is there another field for this data?)

If the minidump has an exception in it, the exception stream has a thread_id and context.

I think you're correct though, since the existing code uses requesting_thread and just calls it crashing_thread when there's an exception record available.

Gankra · 2021-05-04T15:08:10Z

Things I don't know off the top of my head:

What's the deal with minidump-stackwalk's pipe-separated output? Do we still need/want it?
What's the deal with minidump-stackwalk's input json file? What's the schema/purpose?
Does minidump-stackwalk produce anything other than JSON
- Does it output the full "processed crash" JSON in socorro, or just the json_dump subfield?
Is the symbols_cache/symbols_tmp machinery something we can/should reproduce, or will that be subsumed theoretically by just using symbolic?

Gankra · 2021-05-04T15:09:29Z

Initial schema sketch from when minidump-stackwalk was being developed

{
  "status": string, // OK | ERROR_* | SYMBOL_SUPPLIER_INTERRUPTED
  "system_info": {
    "os": string,
    "os_ver": string,
    "cpu_arch": string, // x86 | amd64 | arm | ppc | sparc
    "cpu_info": string,
    "cpu_count": int
  },
  "crash_info": {
    "type": string,
    "crash_address": string, // 0x[[:xdigit:]]+
    "crashing_thread": int // | null
  }
  "main_module": int, // index into modules
  "modules": [
         // zero or more
    {
      "base_addr": string, // 0x[[:xdigit:]]+
      "debug_file": string,
      "debug_id": string, // [[:xdigit:]]{33}
      "end_addr": string, // 0x[[:xdigit:]]+
      "filename": string,
      "version": string
    }
  ],
  "thread_count": int,
  "threads": [
    // for i in range(thread_count)
    {
      "frame_count": int,
      "frames": [
        // for i in range(frame_count)
        {
          "frame": int,
          "module": string,   // optional
          "function": string, // optional
          "file": string,     // optional
          "line": int,        // optional
          "offset": string,         // 0x[[:xdigit:]]+
          "module_offset": string,  // 0x[[:xdigit:]]+ , optional
          "function_offset": string // 0x[[:xdigit:]]+ , optional
        }
      ]
    }
  ],
  // repeated here for ease of searching
  // (max 10 frames)
  "crashing_thread": {
    "threads_index": int,
    "total_frames": int,
    "frames": [
      // for i in range(length)
      {
        // as per "frames" entries from "threads" above
      }
    ]
  }
}

willkg · 2021-05-17T20:21:22Z

What's the deal with minidump-stackwalk's pipe-separated output? Do we still need/want it?

That was an old format that Socorro doesn't use anymore. I don't think there are other users of minidump-stackwalk, so my vote is that we can ditch it. If someone wants it back, I think they can write a wrapper that converts the JSON output to their pipe-delimited monstrosity format.

What's the deal with minidump-stackwalk's input json file? What's the schema/purpose?

The input JSON file is the crash annotations sent with the crash report. We have a list of annotations at https://hg.mozilla.org/mozilla-central/file/tip/toolkit/crashreporter/CrashAnnotations.yaml but no schema. That's on my todo list because not having a schema is bonkers.

Does minidump-stackwalk produce anything other than JSON

Socorro runs minidump-stackwalk in the processor and only uses the JSON output. There's a bin/run_mdsw.sh file at https://github.com/mozilla-services/minidump-stackwalk/blob/main/bin/run_mdsw.sh which is lifted from how the Socorro processor runs it.

minidump-stackwalk has that pipe-delimited output format which we can ditch.

minidump-stackwalk also generates a lot of logging output to stderr which is helpful for debugging.

If I had my druthers, it'd be better for minidump-stackwalk to do its logging output to stdout/stderr in a conventional fashion and emit the JSON output into an output file specified in the command line arguments.

Does it output the full "processed crash" JSON in socorro, or just the json_dump subfield?

minidump-stackwalk outputs JSON and the Socorro processor sticks that in the processed crash in the "json_dump" field. Then the Socorro processor runs a bunch of other processing rules and extracts data from the json_dump structure and does all kinds of other exciting and helpful things to generate the other fields in the processed crash structure.

There's no schema for the processed crash, either. I've got that on my todo list, too.

Is the symbols_cache/symbols_tmp machinery something we can/should reproduce, or will that be subsumed theoretically by just using symbolic?

minidump-stackwalk spends the majority of its time downloading SYM files and parsing them. To reduce that time spent, it caches the SYM files on disk in a wildly irritating way. It's still spending all the time parsing them, though. I have a bug for investigating how not caching SYM files on disk affects minidump-stackwalk performance, but I haven't done it, yet. I'm guessing it affects it, but not much.

Let's talk about multiple-url support... minidump-stackwalk supports looking at multiple places for SYM files. That's great, but it doesn't keep track of where it found the SYM file when it caches the SYM file on disk. So then when it goes to populate the url for the SYM file, it's wrong in a bunch of cases. This is complicated by the fact we have a "private bucket" of SYM files and the url for that is something like http://localhost:8000/<debug_file>/<debug_id>/ because we run a proxy in the Socorro processor nodes that adds the AWS S3 auth header to requests so the Socorro processor can access the private symbols bucket. I think Symbolicator does this better and supports multiple sources and multiple kinds of sources with configurable authentication and all that. I don't think we should add all that to rust-minidump because it's a drag. If I add token support for downloading files, we could maybe make Tecken handle that. So then rust-minidump would support one single symbols source url with configurable api token sent as an auth header and everything could be groovy.

Back to parsing... Symbolic parses SYM files (which takes a long time) and then generates a symcache file. If we cache those on disk, then we get to skip the downloading and parsing steps and it's much faster. That's what Eliot does currently and I think that's what Symbolicator does, too.

Given that it'd be faster for all these systems if they didn't have to parse SYM files, I've been tossing around changing Tecken so that it generates a symcache file for every SYM file uploaded and stores that in S3. So then we'd change things to try to download the symcache file and if it's there, use that. If it's not or it's in an older symcache format, download the sym file, parse it, generate a symcache, and use that. symcache files are not small and I think they do change format periodically, so maybe it makes sense to just generate symcache files for often used files. That's a project for a different day.

I think for here, I'd cache the symbolic symcache file on disk because reusing that file where possible is a big win. minidump-stackwalk cached files, but didn't do anything to maintain or cleanup the cache--that was done by another process that runs on the Socorro processor nodes that I maintain. If caching symcache files on disk sounds arduous, then I'm game for thinking about other ideas, but I don't think I have time to rewrite Tecken upload for a while.

willkg · 2021-05-17T21:21:27Z

For the minidump-stackwalk JSON output schema, I think I want to spend some quality time documenting it officially in that repo. Then we can point to that as the "official schema".

Towards that end, I took your schema and worked on it and created a WIP PR: mozilla-services/minidump-stackwalk#32

Gankra · 2021-06-10T17:19:30Z

So much progress... so close........

Gankra · 2021-10-19T16:07:04Z

Ok after a bit of delay from illness and sentry deciding to pivot to other problems, I am reasonably confident that rust-minidump's minidump-stackwalk is ~ready to be seriously tested as a replacement for socorro's backend. I have published all the crates at version 0.9.0 to make it easy to grab the latest build and also to indicate we're close to a 1.0.

Some useful links:

minidump-stackwalk's README has a setup and usage guide.
socc-pair is a tool I built for diffing rust-minidump and socorro's processed json more intelligently. It can be used to quickly check how we compare with only a crash id. I've been sucessfully using it to refine our behaviour!

TODO:

convince ourselves that there aren't any serious regressions/problems
convince ourselves that performance is adequate
deploy to socorro????

I'm not really sure how to evaluate performance here. I could do some synthetic benchmarking but I feel like caching of sym files under a real workload must be such a huge part of the performance story that we should be testing in something resembling an actual production workload?

Gankra · 2021-11-11T03:34:18Z

With the 0.9.2 release published, I am tentatively declaring the rust-minidump side of the equation complete. I will now be focusing more on testing and hardening, and not feature development. I am hanging the issues I find off the socorro parity milestone so I don't have to manually manage the "dashboard".

If you find any compatibility issues of your own, you can also file them against it (if github allows that, idk).

Gankra mentioned this issue May 4, 2021

Project Metabug: Kill Breakpad getsentry/symbolic#375

Closed

42 tasks

Gankra mentioned this issue May 10, 2021

Replace breakpad-symbols with something based on symbolic #159

Closed

Gankra mentioned this issue Nov 11, 2021

0.9.2 release #292

Merged

Gankra closed this as completed Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Gankra commented May 4, 2021 •

edited

Loading

luser commented May 4, 2021

Gankra commented May 4, 2021

Gankra commented May 4, 2021

willkg commented May 17, 2021

willkg commented May 17, 2021

Gankra commented Jun 10, 2021

Gankra commented Oct 19, 2021

Gankra commented Nov 11, 2021

Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Comments

Gankra commented May 4, 2021 • edited Loading

Subtasks

Needed for strict parity but not necessarily a blocker

luser commented May 4, 2021

Gankra commented May 4, 2021

Gankra commented May 4, 2021

willkg commented May 17, 2021

willkg commented May 17, 2021

Gankra commented Jun 10, 2021

Gankra commented Oct 19, 2021

Gankra commented Nov 11, 2021

Gankra commented May 4, 2021 •

edited

Loading