Use faulthandler instead of isys signal handlers #4350

Merged

VladimirSlavik
Contributor

This significantly changes the outputs, but eventually provides the same kinds of information, only in a different order.

Previously:

  • Print that a signal was received, and which one.
  • Log to syslog that this happened.
  • Print userland stack trace of main process.
  • Save core dump:
    • always to /tmp/anaconda.core.<PID>
    • always, independent of system settings
    • somehow very slowly - maybe forking python and running gcore isn't the most performant idea
    • file size ca. 1 GB
  • Journal gets only the syslog message.

Now:

  • Print that a fatal error happened, no mention of signals.
  • No unique message in syslog and journal to identify that "anaconda" crashed.
  • Print python stack of all the threads in the main process.
  • Save core dump:
    • to default location
    • according to system settings
    • needs invoking coredumpctl manually to work with the actual core dump
    • blazing fast saving compared to previous state
    • size ca. 430 MB, initially stored compressed to ca. 60 MB
    • apparently requires more debuginfo downloads, system can run out of space or memory
  • Print userland stack traces of all threads to journal.

In both cases, using the (coredumpctl+)gdb+debuginfod+dnf toolchain leads to a successful interactive debugging session.
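
For readers unfamiliar with the module, here is a minimal sketch of what enabling Python's standard `faulthandler` looks like. It is not the code from this PR; the helper names `enable_crash_tracebacks` and `dump_all_thread_stacks` are made up for illustration. On a fatal signal, `faulthandler` prints "Fatal Python error: ..." plus the Python stacks, and the core dump itself is left to the system.

```python
import faulthandler
import sys
import tempfile


def enable_crash_tracebacks(stream=sys.stderr):
    """Install faulthandler's fatal-signal handlers, dumping all threads.

    Hypothetical helper; `stream` must have a real file descriptor and
    stay open for as long as faulthandler is enabled.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


def dump_all_thread_stacks():
    """Return the current Python stacks of all threads as text.

    This only previews the traceback format; nothing crashes here.
    """
    with tempfile.TemporaryFile(mode="w+") as tmp:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        tmp.seek(0)
        return tmp.read()


if __name__ == "__main__":
    enable_crash_tracebacks()
    print(dump_all_thread_stacks())
```

The traceback format is the terse "most recent call first" listing, which is why the shell output differs from a regular Python `Traceback`.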

@VladimirSlavik
Contributor Author

/kickstart-test --testtype smoke

Contributor

@rvykydal rvykydal left a comment

Looks good to me, thank you!
I wonder if we need to add or update strings in the log monitor of the kickstart tests. It may be worth checking, but it is not blocking the PR.

@VladimirSlavik
Contributor Author

Certainly, and livemedia-creator too, and who knows how many other places.

@VladimirSlavik
Contributor Author

VladimirSlavik commented Sep 21, 2022

A few more details:

  • Looks like we might have to watch for a line like "Process [0-9]+ (anaconda) of user [0-9]+ dumped core" - not sure where: kickstart tests, lorax? Fortunately the logging "filter" is mostly shared.
  • The signal is mentioned in coredumpctl listing (by name instead of number).
  • The "good enough" userland stack traces (same as before) are visible with coredumpctl info (...) even without going into gdb and loading all the debuginfos.
  • coredumpctl info without query gives the details of the last recorded core dump, which should be enough in most cases.

I will amend the commit message with that.

This changes how the core dump support works, as well as the intermediate
outputs. However, it eventually provides the same kinds of information, only
in different places and order.

Previously:
- Print that a signal was received, and which one.
- Log to syslog that this happened.
- Print userland stack trace of main process main thread.
- Save core dump:
  - always to /tmp/anaconda.core.<PID>
  - always, independent of system settings
  - somehow very slowly - maybe forking python and running gcore isn't the
    most performant idea
  - file size ca. 1 GB
- Journal gets only the syslog message.

Now:
- Print that a fatal error happened, no mention of signals.
- No unique message in syslog to identify that "anaconda" crashed.
- Print python stack of all the threads in the main process.
- Save core dump:
  - to default location
  - according to system settings
  - needs invoking coredumpctl manually to work with the actual core dump
  - blazing fast saving compared to previous state
  - initially stored compressed to ca. 60 MB, exports to ca. 430 MB
  - apparently requires more debuginfo downloads for loading in gdb, system
    can run out of space or memory
- Journal gets more:
  - unique message that "Process <PID> (anaconda) of user <UID> dumped core."
  - userland stack traces of *all* threads
- To analyze, use coredumpctl:
  - `coredumpctl` with no arguments lists the PID, signal, and executable
  - `coredumpctl info` (with no query) shows metadata and the same information
    as journal, in pager

In both cases, using the (coredumpctl+)gdb+debuginfod+dnf toolchain leads to
a successful interactive debugging session.
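
The coredumpctl steps above could be scripted roughly as follows. This is a sketch only: the helpers `coredumpctl_command` and `run_coredumpctl` are hypothetical names, and matching on the command name `anaconda` is an assumption about what is convenient, not something this change sets up.

```python
import shutil
import subprocess


def coredumpctl_command(action="info", match="anaconda"):
    """Build a coredumpctl command line.

    `match` may be a PID, command name, or executable path; pass None
    to use no query, in which case `info` shows the last recorded dump.
    """
    if action not in ("list", "info"):
        raise ValueError(f"unsupported action: {action}")
    # --no-pager keeps the output usable from scripts.
    command = ["coredumpctl", "--no-pager", action]
    if match:
        command.append(match)
    return command


def run_coredumpctl(action="info", match="anaconda"):
    """Run coredumpctl and return its stdout, or None if it is missing."""
    if shutil.which("coredumpctl") is None:
        return None
    result = subprocess.run(
        coredumpctl_command(action, match),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit just means no matching dumps
    )
    return result.stdout


if __name__ == "__main__":
    print(run_coredumpctl("list", None))
```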

@VladimirSlavik
Contributor Author

Based on the above, I think we can merge this and not lose anything.

@VladimirSlavik
Contributor Author

/kickstart-test --testtype smoke

@VladimirSlavik VladimirSlavik added the release note required Write a release note for this change. label Sep 26, 2022
Contributor

@poncovka poncovka left a comment

Thank you!

@VladimirSlavik VladimirSlavik merged commit 49a2d6d into rhinstaller:master Sep 26, 2022
@VladimirSlavik VladimirSlavik deleted the master-isys-to-faulthandler branch September 26, 2022 12:59
@VladimirSlavik
Contributor Author

Looks like we might have to watch for a line like "Process [0-9]+ (anaconda) of user [0-9]+ dumped core" - not sure where: kickstart tests, lorax?

@bcl just a heads up that this might change how fatal errors look...

@bcl
Contributor

bcl commented Sep 26, 2022

The code that lmc uses is here - https://github.com/weldr/lorax/blob/f33-branch/src/pylorax/monitor.py#L38

So as long as the new output has either 'Traceback' or 'Call Trace:' in it lmc will catch it.

@VladimirSlavik
Contributor Author

VladimirSlavik commented Sep 27, 2022

Unfortunately, the new output contains neither of those.

  • In shell: Fatal Python error: (...)
  • In syslog: CRIT systemd-coredumpctl:Process [0-9]+ (anaconda) of user [0-9]+ dumped core.
  • In journal: systemd-coredumpctl\[[0-9]+\]: \[■\] Process [0-9]+ \(anaconda\) of user [0-9]+ dumped core.

I'll make a PR.
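
As a rough idea of what such a log check could look like, here is a sketch of a line matcher covering both the old and the new strings. It is not the lorax or kickstart-tests code; `CRASH_PATTERNS` and `find_crash_lines` are hypothetical names, and the sample lines in the usage part are invented for illustration.

```python
import re

# Patterns taken from the strings discussed in this thread.
CRASH_PATTERNS = [
    re.compile(r"Traceback"),            # regular Python exceptions
    re.compile(r"Call Trace:"),          # kernel traces
    re.compile(r"Fatal Python error:"),  # faulthandler output in shell
    re.compile(r"Process [0-9]+ \(anaconda\) of user [0-9]+ dumped core"),
]


def find_crash_lines(lines):
    """Return the lines that match any known crash pattern."""
    return [
        line for line in lines
        if any(pattern.search(line) for pattern in CRASH_PATTERNS)
    ]


if __name__ == "__main__":
    # Invented sample log lines, for illustration only.
    sample = [
        "Fatal Python error: Segmentation fault",
        "Process 1234 (anaconda) of user 0 dumped core.",
        "INFO anaconda: starting installation",
    ]
    for hit in find_crash_lines(sample):
        print(hit)
```

Matching only the "Process ... dumped core" part keeps the check independent of the syslog or journal prefix.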

@VladimirSlavik VladimirSlavik removed the release note required Write a release note for this change. label Apr 27, 2023
Labels
f38 Fedora 38
4 participants