segmentation fault reading certain queue files #2654
Comments
There is a tool for recreating the queue file indexes (.qi files) if they get
corrupted. Writing them in such a way that they cannot be corrupted would
involve a lot of very expensive fsync() and fdatasync() calls, probably causing
you to lose logs because you can't write them fast enough.
Now, it would be good for rsyslog not to segfault if there are problems with the
queue or .qi files, so this ticket is still valid, but you need to understand
the limits.
|
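What is this tool called? You are probably referring to `tools/recover_qi.pl` - I will try it out on the faulty queue files that I have collected. Ideally, there would be an automated way of detecting that a queue recovery is required, or even an option that would direct rsyslogd to do unattended and optimistic recovery.
I agree that it should not segfault - rsyslog would ideally just skip the invalid messages and try to recover (or exit cleanly). I managed to create a more detailed backtrace on my computer. It seems to be failing when attempting to parse a psyTAG attribute, but I haven't yet been able to dig further.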
|
On Wed, 18 Apr 2018, Sandro Gießl wrote:
How is this tool called? You are probably referring to `tools/recover_qi.pl` -
I will try it out on the faulty queue files that I have collected. Ideally,
there would be an automated way of detecting that a queue recovery is
required, or even an option that would direct rsyslogd to do unattended and
optimistic recovery.
That's the tool.
Unfortunately it can take a long time to run, so I don't think it's best to run
it (or something like it) automatically.
We possibly should try to auto-create a 'corrupt_queue_files_XXX' subdirectory
and move the entire set of queue files into that directory for future recovery,
meanwhile starting up and creating new files as needed.
I seem to remember running into a case where the queue was corrupt, but not
badly enough to cause a segfault, and having rsyslog ignore the queue. This is
not as good as moving the queue files as it means that you can't write anything
new to the queue either.
I had a case where a destination with a queue was down for several weeks, and I
had to move and compress the queue files a few times to keep things running in
the meantime. I ended up creating a variation of my config that read from the
queue in a different directory and sent them out.
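The quarantine idea from this comment (move the whole set of queue files into a `corrupt_queue_files_XXX` subdirectory and let rsyslog start with fresh files) can be done by hand today. The following is only a minimal sketch of that manual workaround, not something rsyslog provides: the function name is hypothetical, and it assumes that all files of one queue share the configured queue file name as a prefix (as with `uniqName.00000003` and `uniqName.qi` in this report) and that rsyslogd is stopped while it runs.

```python
import shutil
import time
from pathlib import Path


def quarantine_queue_files(spool_dir, queue_prefix):
    """Move every file named <queue_prefix>.* into a new subdirectory.

    Hypothetical helper. Returns the quarantine directory as a Path,
    or None if no file matched the prefix.
    """
    spool = Path(spool_dir)
    # ASSUMPTION: data files and the .qi index share the same prefix,
    # e.g. uniqName.00000003 and uniqName.qi.
    files = sorted(p for p in spool.glob(queue_prefix + ".*") if p.is_file())
    if not files:
        return None
    # A timestamp stands in for the "XXX" part of the directory name.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = spool / f"corrupt_queue_files_{stamp}"
    target.mkdir()  # raises FileExistsError rather than mixing two file sets
    for path in files:
        shutil.move(str(path), str(target / path.name))
    return target
```

For example, `quarantine_queue_files("/var/spool/rsyslog", "uniqName")` would leave the spool directory empty for that queue, so that new queue files can be created on the next start while the old set stays available for later recovery attempts.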
|
Coming back to this old ticket, we've seen a similar case in which a corrupted disk queue's data causes rsyslog to segfault on startup. This is unexpected. Trying to rebuild the queue index file made no difference: the recovery tool runs successfully, but rsyslogd still segfaults due to corrupted queue data. A minimal test case is the following:
From visual inspection it's apparent that on line 9 of the file, there's a corrupted record where
Skipping corrupted queue data or files is preferable to failing to start rsyslogd completely in our case, and others may share a similar need. What slowed down identifying the problem is that, although rsyslog's error messages are generally useful (e.g. if a file permission issue exists), in this case there's simply a segfault without any error message. One has to enable debugging output to get to the problem. An error message identifying the likely cause (corrupted queue data) and a filename would be useful in normal operation. |
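Until rsyslog reports this itself, a rough external check can at least name the file and record to look at. The sketch below looks for the symptom described in the original report (attributes such as tRcvdAt duplicated within one object). It is a heuristic and rests on assumptions about the spool file layout that should be verified against your own files: that each record starts with a line beginning with `<Obj:`, ends with a line beginning with `>End`, and that each attribute is a line of the form `+name:...`. Message text containing line breaks could produce false positives. The function name is hypothetical.

```python
import sys
from collections import Counter


def find_duplicate_properties(lines):
    """Yield (record_start_line, property_name, count) for duplicated properties.

    ASSUMPTION: records are delimited by lines starting with "<Obj:" and ">End",
    and each property is a line starting with "+name:".
    """
    start, seen = None, Counter()

    def duplicates():
        # Properties that occur more than once in the current record.
        return [(start, name, n) for name, n in seen.items() if n > 1]

    for lineno, line in enumerate(lines, start=1):
        if line.startswith("<Obj:"):
            if start is not None:
                # Previous record was never closed; still report it.
                yield from duplicates()
            start, seen = lineno, Counter()
        elif start is not None and line.startswith("+"):
            seen[line[1:].split(":", 1)[0]] += 1
        elif start is not None and line.startswith(">End"):
            yield from duplicates()
            start = None
    if start is not None:
        # File ended inside a record.
        yield from duplicates()


if __name__ == "__main__":
    for filename in sys.argv[1:]:
        with open(filename, errors="replace") as fh:
            for start, name, n in find_duplicate_properties(fh):
                print(f"{filename}: record at line {start}: {name} appears {n} times")
```

Saved under any name, the script takes the queue data files as arguments and prints one line per suspicious attribute. It does not repair anything; it only helps decide which files to move aside.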
I also just ran into this problem and would wish for rsyslog to handle this more gracefully, i.e. skip the corrupt data or at least produce a more instructional error message. |
pls create a new issue - I do not think this one still applies. Specify the rsyslog version. |
A new issue is opened here: #5319 |
Expected behavior
rsyslogd uses a queue for messages cached during offline operation. It normally plays them back. Queue files which are corrupted should be handled well, attempting to recover as much of the log's content as possible.
Actual behavior
Occasionally, rsyslog creates queue files which look slightly wrong (see separate report), and which lead to segmentation faults of rsyslog. This is reproduced both in 8.33 on armv7 as well as in 8.34 (official x86 rsyslog docker container), see below.
GDB output (rsyslog 8.33 on armv7):
Steps to reproduce the behavior
Reproduce in official rsyslog docker container (rsyslog 8.34 -- container ):
See relevant configuration attached, etc/rsyslogd.conf and var/spool/rsyslog/uniqName.{00000003,qi}:
rsyslog_segfault_20180417.tar.gz
The SIGSEGV is most likely caused by a corrupted spool message entry (cf. the uniqName.00000003 file: the last object's attributes, tRcvdAt etc., are duplicated).
Environment