Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metabug: Make minidump-processor good enough to replace mozilla's minidump-stackwalk #153

Closed
22 tasks done
Gankra opened this issue May 4, 2021 · 8 comments
Closed
22 tasks done

Comments

@Gankra
Copy link
Collaborator

Gankra commented May 4, 2021

The feature work is done, see the milestone dashboard for remaining bugfixes

Original metabug:


This is a metabug for the first milestone of Kill Breakpad

Replace mozilla/minidump-stackwalk with rust-minidump's minidump-processor


Subtasks


Needed for strict parity but not necessarily a blocker


Probably lots more to do, will expand as we find them.

@luser
Copy link
Collaborator

luser commented May 4, 2021

Output the crashing thread (is this just the requesting_thread? or is there another field for this data?)

If the minidump has an exception in it, the exception stream has a thread_id and context.

I think you're correct though, since the existing code uses requesting_thread and just calls it crashing_thread when there's an exception record available.

@Gankra
Copy link
Collaborator Author

Gankra commented May 4, 2021

Things I don't know off the top of my head:

  • What's the deal with minidump-stackwalk's pipe-separated output? Do we still need/want it?
  • What's the deal with minidump-stackwalk's input json file? What's the schema/purpose?
  • Does minidump-stackwalk produce anything other than JSON
    • Does it output the full "processed crash" JSON in socorro, or just the json_dump subfield?
  • Is the symbols_cache/symbols_tmp machinery something we can/should reproduce, or will that be subsumed theoretically by just using symbolic?

@Gankra
Copy link
Collaborator Author

Gankra commented May 4, 2021

Initial schema sketch from when minidump-stackwalk was being developed
{
  "status": string, // OK | ERROR_* | SYMBOL_SUPPLIER_INTERRUPTED
  "system_info": {
    "os": string,
    "os_ver": string,
    "cpu_arch": string, // x86 | amd64 | arm | ppc | sparc
    "cpu_info": string,
    "cpu_count": int
  },
  "crash_info": {
    "type": string,
    "crash_address": string, // 0x[[:xdigit:]]+
    "crashing_thread": int // | null
  }
  "main_module": int, // index into modules
  "modules": [
         // zero or more
    {
      "base_addr": string, // 0x[[:xdigit:]]+
      "debug_file": string,
      "debug_id": string, // [[:xdigit:]]{33}
      "end_addr": string, // 0x[[:xdigit:]]+
      "filename": string,
      "version": string
    }
  ],
  "thread_count": int,
  "threads": [
    // for i in range(thread_count)
    {
      "frame_count": int,
      "frames": [
        // for i in range(frame_count)
        {
          "frame": int,
          "module": string,   // optional
          "function": string, // optional
          "file": string,     // optional
          "line": int,        // optional
          "offset": string,         // 0x[[:xdigit:]]+
          "module_offset": string,  // 0x[[:xdigit:]]+ , optional
          "function_offset": string // 0x[[:xdigit:]]+ , optional
        }
      ]
    }
  ],
  // repeated here for ease of searching
  // (max 10 frames)
  "crashing_thread": {
    "threads_index": int,
    "total_frames": int,
    "frames": [
      // for i in range(length)
      {
        // as per "frames" entries from "threads" above
      }
    ]
  }
}

@willkg
Copy link
Contributor

willkg commented May 17, 2021

What's the deal with minidump-stackwalk's pipe-separated output? Do we still need/want it?

That was an old format that Socorro doesn't use anymore. I don't think there are other users of minidump-stackwalk, so my vote is that we can ditch it. If someone wants it back, I think they can write a wrapper that converts the JSON output to their pipe-delimited monstrosity format.

What's the deal with minidump-stackwalk's input json file? What's the schema/purpose?

The input JSON file is the crash annotations sent with the crash report. We have a list of annotations at https://hg.mozilla.org/mozilla-central/file/tip/toolkit/crashreporter/CrashAnnotations.yaml but no schema. That's on my todo list because not having a schema is bonkers.

Does minidump-stackwalk produce anything other than JSON

Socorro runs minidump-stackwalk in the processor and only uses the JSON output. There's a bin/run_mdsw.sh file at https://github.com/mozilla-services/minidump-stackwalk/blob/main/bin/run_mdsw.sh which is lifted from how the Socorro processor runs it.

minidump-stackwalk has that pipe-delimited output format which we can ditch.

minidump-stackwalk also generates a lot of logging output to stderr which is helpful for debugging.

If I had my druthers, it'd be better for minidump-stackwalk to do its logging output to stdout/stderr in a conventional fashion and emit the JSON output into an output file specified in the command line arguments.

Does it output the full "processed crash" JSON in socorro, or just the json_dump subfield?

minidump-stackwalk outputs JSON and the Socorro processor sticks that in the processed crash in the "json_dump" field. Then the Socorro processor runs a bunch of other processing rules and extracts data from the json_dump structure and does all kinds of other exciting and helpful things to generate the other fields in the processed crash structure.

There's no schema for the processed crash, either. I've got that on my todo list, too.

Is the symbols_cache/symbols_tmp machinery something we can/should reproduce, or will that be subsumed theoretically by just using symbolic?

minidump-stackwalk spends the majority of its time downloading SYM files and parsing them. To reduce that time spent, it caches the SYM files on disk in a wildly irritating way. It's still spending all the time parsing them, though. I have a bug for investigating how not caching SYM files on disk affects minidump-stackwalk performance, but I haven't done it, yet. I'm guessing it affects it, but not much.

Let's talk about multiple-url support... minidump-stackwalk supports looking at multiple places for SYM files. That's great, but it doesn't keep track of where it found the SYM file when it caches the SYM file on disk. So then when it goes to populate the url for the SYM file, it's wrong in a bunch of cases. This is complicated by the fact we have a "private bucket" of SYM files and the url for that is something like http://localhost:8000/<debug_file>/<debug_id>/ because we run a proxy in the Socorro processor nodes that adds the AWS S3 auth header to requests so the Socorro processor can access the private symbols bucket. I think Symbolicator does this better and supports multiple sources and multiple kinds of sources with configurable authentication and all that. I don't think we should add all that to rust-minidump because it's a drag. If I add token support for downloading files, we could maybe make Tecken handle that. So then rust-minidump would support one single symbols source url with configurable api token sent as an auth header and everything could be groovy.

Back to parsing... Symbolic parses SYM files (which takes a long time) and then generates a symcache file. If we cache those on disk, then we get to skip the downloading and parsing steps and it's much faster. That's what Eliot does currently and I think that's what Symbolicator does, too.

Given that it'd be faster for all these systems if they didn't have to parse SYM files, I've been tossing around changing Tecken so that it generates a symcache file for every SYM file uploaded and stores that in S3. So then we'd change things to try to download the symcache file and if it's there, use that. If it's not or it's in an older symcache format, download the sym file, parse it, generate a symcache, and use that. symcache files are not small and I think they do change format periodically, so maybe it makes sense to just generate symcache files for often used files. That's a project for a different day.

I think for here, I'd cache the symbolic symcache file on disk because reusing that file where possible is a big win. minidump-stackwalk cached files, but didn't do anything to maintain or cleanup the cache--that was done by another process that runs on the Socorro processor nodes that I maintain. If caching symcache files on disk sounds arduous, then I'm game for thinking about other ideas, but I don't think I have time to rewrite Tecken upload for a while.

@willkg
Copy link
Contributor

willkg commented May 17, 2021

For the minidump-stackwalk JSON output schema, I think I want to spend some quality time documenting it officially in that repo. Then we can point to that as the "official schema".

Towards that end, I took your schema and worked on it and created a WIP PR: mozilla-services/minidump-stackwalk#32

@Gankra
Copy link
Collaborator Author

Gankra commented Jun 10, 2021

So much progress... so close........

@Gankra
Copy link
Collaborator Author

Gankra commented Oct 19, 2021

Ok after a bit of delay from illness and sentry deciding to pivot to other problems, I am reasonably confident that rust-minidump's minidump-stackwalk is ~ready to be seriously tested as a replacement for socorro's backend. I have published all the crates at version 0.9.0 to make it easy to grab the latest build and also to indicate we're close to a 1.0.

Some useful links:

  • minidump-stackwalk's README has a setup and usage guide.
  • socc-pair is a tool I built for diffing rust-minidump and socorro's processed json more intelligently. It can be used to quickly check how we compare with only a crash id. I've been sucessfully using it to refine our behaviour!

TODO:

  • convince ourselves that there aren't any serious regressions/problems
  • convince ourselves that performance is adequate
  • deploy to socorro????

I'm not really sure how to evaluate performance here. I could do some synthetic benchmarking but I feel like caching of sym files under a real workload must be such a huge part of the performance story that we should be testing in something resembling an actual production workload?

@Gankra Gankra mentioned this issue Nov 11, 2021
@Gankra
Copy link
Collaborator Author

Gankra commented Nov 11, 2021

With the 0.9.2 release published, I am tentatively declaring the rust-minidump side of the equation complete. I will now be focusing more on testing and hardening, and not feature development. I am hanging the issues I find off the socorro parity milestone so I don't have to manually manage the "dashboard".

If you find any compatibility issues of your own, you can also file them against it (if github allows that, idk).

@Gankra Gankra closed this as completed Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants