segmentation fault reading certain queue files #2654

sgiessl · 2018-04-18T13:55:57Z

Expected behavior

rsyslogd uses a queue for messages cached during offline operation. it is normally playing them back. queue files which are corrupted should be handled well, attempting to recover as much of the log's content as possible.

Actual behavior

Occasionally, rsyslog creates queue files which look slightly wrong (see separate report), and which lead to segmentation faults of rsyslog. This is reproduced both in 8.33 on armv7 as well as in 8.34 (official x86 rsyslog docker container), see below.

GDB output (rsyslog 8.33 on armv7):

Starting program: /usr/sbin/rsyslogd -n -iNONE
[...]
Thread 2 "rs:action 1 que" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 23372]
0x0004e578 in rsCStrDestruct ()
(gdb) bt
#0  0x0004e578 in rsCStrDestruct ()
#1  0x0005f2c0 in varDestruct ()
#2  0x000499f0 in MsgDeserialize ()
#3  0x000563dc in objDeserializeWithMethods ()
#4  0x00062250 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Steps to reproduce the behavior

Reproduce in official rsyslog docker container (rsyslog 8.34 -- container ):

sandro@pc-444:~/DatenNoBackup/rsyslog_crash$ docker run --rm -ti -v $PWD:/data --entrypoint /bin/sh rsyslog/syslog_appliance_alpine
/home/appliance # cd /data
/data # cp 
Dockerfile     config.tar.gz  etc/           run.sh         var/
/data # cp var/spool/rsyslog/ -r /var/spool/
/data # rsyslogd  -n -f etc/rsyslog.conf -i/tmp/rsyslogpid
Segmentation fault (core dumped)

See relevant configuration attached, etc/rsyslogd.conf and var/spool/rsyslog/uniqName.{00000003,qi}:
rsyslog_segfault_20180417.tar.gz

The SIGSEGV most likely is caused by a corrupted spool message entry (cf. uniqName.00000003 file, the last object's attributes tRcvdAt etc. are duplicated.)

Environment

rsyslog version: 8.34.0 / 8.33.0
platform: x86 (docker) / armv7l GNU/Linux (custom embedded linux)
for configuration questions/issues, include rsyslog.conf:

The text was updated successfully, but these errors were encountered:

davidelang · 2018-04-18T19:33:54Z

There is a tool for recreating the queue file indexes (.qi files) if they get corrupted, writing them in such a way that they cannot be corrupted would involve a lot of very expensive fsync() and fdatasync() calls, probably causing you to loose logs because you can't write them fast enough. now, it would be good for rsyslog to not segfault if there are problems with the queue or .qij files, so this ticket is still valid, but you need to understand the limits.

sgiessl · 2018-04-18T21:11:18Z

How is this tool called? You are probably referring to tools/recover_qi.pl - I will try it out on the faulty queue files that I have collected. Ideally, there would be an automated way of detecting that a queue recovery is required, or even an option that would direct rsyslogd to do unattended and optimistic recovery.

I agree about that it should not segfault - rsyslog would ideally just skip the invalid messages and try to recover (or exit cleanly). I managed to create a more detailed backtrace on my computer. It seems like it's failing when attempting to parse an psyTAG attribute, but I wasn't yet able to dig further.

Thread 2 "rs:action 1 que" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5d48700 (LWP 16187)]
rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0002638) at stringbuf.c:220
220             free((*ppThis)->pBuf);
(gdb) bt
#0  rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0002638) at stringbuf.c:220
#1  0x00005555555b1460 in varDestruct (ppThis=0x7ffff5d47a40) at var.c:72
#2  0x0000555555597518 in MsgDeserialize (pMsg=0x7ffff00023a0, pStrm=pStrm@entry=0x55555582fbe0) at msg.c:1425
#3  0x00005555555a61bc in objDeserializeWithMethods (ppObj=0x7ffff5d47b80, pszTypeExpected=pszTypeExpected@entry=0x5555555d284e "msg", lenTypeExpected=lenTypeExpected@entry=3, 
    pStrm=0x55555582fbe0, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x555555594610 <msgConstructForDeserializer>, objConstructFinalize=0x0, 
    objDeserialize=0x555555597460 <MsgDeserialize>) at obj.c:930
#4  0x00005555555b7dea in qDeqDisk (pThis=0x5555558287b0, ppMsg=<optimized out>) at queue.c:992
#5  0x00005555555b9127 in qqueueDeq (ppMsg=0x7ffff5d47b80, pThis=0x5555558287b0) at queue.c:1125
#6  DequeueConsumableElements (pSkippedMsgs=0x7ffff5d47bd4, piRemainingQueueSize=<synthetic pointer>, pWti=0x55555582c570, pThis=0x5555558287b0) at queue.c:1743
#7  DequeueConsumable (pSkippedMsgs=0x7ffff5d47bd4, pWti=0x55555582c570, pThis=0x5555558287b0) at queue.c:1797
#8  DequeueForConsumer (pThis=pThis@entry=0x5555558287b0, pWti=pWti@entry=0x55555582c570, pSkippedMsgs=pSkippedMsgs@entry=0x7ffff5d47bd4) at queue.c:1943
#9  0x00005555555b97d7 in ConsumerReg (pThis=0x5555558287b0, pWti=0x55555582c570) at queue.c:1999
#10 0x00005555555b4b52 in wtiWorker (pThis=pThis@entry=0x55555582c570) at wti.c:369
#11 0x00005555555b2585 in wtpWorker (arg=0x55555582c570) at wtp.c:415
#12 0x00007ffff79a07fc in start_thread (arg=0x7ffff5d48700) at pthread_create.c:465
#13 0x00007ffff6aa8b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

davidelang · 2018-04-18T21:44:08Z

On Wed, 18 Apr 2018, Sandro Gießl wrote: How is this tool called? You are probably referring to `tools/recover_qi.pl` - I will try it out on the faulty queue files that I have collected. Ideally, there would be an automated way of detecting that a queue recovery is required, or even an option that would direct rsyslogd to do unattended and optimistic recovery.

That the tool. unfortunantly it can take a long time to run, so I don't think it's best to run it (or something like it) automatically. We possibly should try to auto-create a 'corrupt_queue_files_XXX' subdirectory and move the entire set of queue files into that directory for future recovery, meanwhile starting up and creating new files as needed. I seem to remember running into a case where the queue was corrupt, but not badly enough to cause a segfault, and having rsyslog ignore the queue. This is not as good as moving the queue files as it means that you can't write anything new to the queue either. I had a case where a destination with a queue was down for several weeks, and I had to move and compress the queue files a few times to keep things running in the meantime. I ended up creating a variation of my config that read from the queue in a different directory and sent them out.

thinkst-marco · 2021-01-25T10:58:58Z

Coming back to this old ticket, we've seen a similar case in which a corrupted disk queue's data causes rsyslog to segfault on startup. This is unexpected.

Trying to rebuild the queue index file made no difference, the recover tool runs successfully but rsyslogd still segfaults due to corrupted queue data.

A minimal test case is the following:

rsyslogd.conf contains a queue definition like this:

main_queue(
  queue.filename="main_queue"
  queue.type="linkedlist"
)

/var/spool/rsyslog contains a file called main_queue.00000603 with just this content:

<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:2:17:
+msgFlags:2:2:36:
+ttGenTime:2:10:1605245965:
+tRcvdAt:3:35:2:2020:11:13:5:39:25:518552:6:+:0:0:
+tTIMESTAMP:3:35:2:2020:11:13:5:39:25:518552:6:+:0:0:
+pszTAG:1:24:PPPPPPPPPPPPPPPP[16812<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:2:17:
+msgFlags:2:2:36:
+ttGenTime:2:10:1605247094:
+tRcvdAt:3:35:2:2020:11:13:5:58:14:624070:6:+:0:0:
+tTIMESTAMP:3:35:2:2020:11:13:5:58:14:624070:6:+:0:0:
+pszTAG:1:24:PPPPPPPPPPPPPPPP[18268]::
+pszRawMsg:1:504:<142>Nov 13 05:58:14 PPPPPPPPPPPPPPPP[18268]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:
+pszInputName:1:8:imuxsock:
+pszRcvFrom:1:12:ZZZZZZZZ-com:
+pszRcvFromIP:1:9:127.0.0.1:
+offMSG:2:2:45:
>End
.

From visual inspection it's apparent that on line 9 of the file, there's a corrupted record where <Obj:1:msg:1: starts a new record before the previous record had been completed. This particular corruption was due to an unrelated disk space issue.

The receover tool is run to create the queue index:

$ recover_qi.pl -w /var/spool/rsyslog -f main_queue -d 8 > /var/spool/rsyslog/main_queue.q

Start Rsyslog with rsyslogd -dn, and it produces this output:

$ rsyslogd -dn
[ debugging output ]
0857.660115834:main Q[DA]:Reg/w0: obj.c: error property name: 'pszTAG'
0857.660159423:main Q[DA]:Reg/w0: obj.c: error var type: '1'
0857.660201556:main Q[DA]:Reg/w0: obj.c: error len: '24'
Segmentation fault (core dumped)

Skipping corrupted queue data or files is preferable to failing to start rsyslogd completely in our case, and others may share a similar need.

Something which slowed down identifying the problem is that Rsyslog's error messages in general are useful (e.g. if a file permission issue exists), but in this case there's simply a segfault without any error message. One has to enable debugging output to get to the problem. An error message identifying the likely cause (corrupted queue data) and a filename would be useful in normal operation.

OneToughMonkey · 2022-05-19T12:43:30Z

I also just ran into this problem and would wish for rsyslog to handle this more gracefully, i.e. skip the corrupt data or at least produce a more instructional error message.

rgerhards · 2022-05-19T12:48:39Z

pls create a new issue - I do not think this one still applies. Specify the rsyslog version.

tblume · 2024-03-01T13:28:58Z

A new issue is opened here: #5319

sgiessl mentioned this issue Apr 18, 2018

rsyslogd creates corrupted queue files #2655

Open

rgerhards closed this as completed May 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

segmentation fault reading certain queue files #2654

segmentation fault reading certain queue files #2654

sgiessl commented Apr 18, 2018 •

edited

davidelang commented Apr 18, 2018 via email

sgiessl commented Apr 18, 2018

davidelang commented Apr 18, 2018 via email

thinkst-marco commented Jan 25, 2021

OneToughMonkey commented May 19, 2022

rgerhards commented May 19, 2022

tblume commented Mar 1, 2024

segmentation fault reading certain queue files #2654

segmentation fault reading certain queue files #2654

Comments

sgiessl commented Apr 18, 2018 • edited

Expected behavior

Actual behavior

Steps to reproduce the behavior

Environment

davidelang commented Apr 18, 2018 via email

sgiessl commented Apr 18, 2018

davidelang commented Apr 18, 2018 via email

thinkst-marco commented Jan 25, 2021

OneToughMonkey commented May 19, 2022

rgerhards commented May 19, 2022

tblume commented Mar 1, 2024

sgiessl commented Apr 18, 2018 •

edited