Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault reading certain queue files #2654

Closed
sgiessl opened this issue Apr 18, 2018 · 7 comments
Closed

segmentation fault reading certain queue files #2654

sgiessl opened this issue Apr 18, 2018 · 7 comments

Comments

@sgiessl
Copy link

sgiessl commented Apr 18, 2018

Expected behavior

rsyslogd uses a queue for messages cached during offline operation. it is normally playing them back. queue files which are corrupted should be handled well, attempting to recover as much of the log's content as possible.

Actual behavior

Occasionally, rsyslog creates queue files which look slightly wrong (see separate report), and which lead to segmentation faults of rsyslog. This is reproduced both in 8.33 on armv7 as well as in 8.34 (official x86 rsyslog docker container), see below.

GDB output (rsyslog 8.33 on armv7):

Starting program: /usr/sbin/rsyslogd -n -iNONE
[...]
Thread 2 "rs:action 1 que" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 23372]
0x0004e578 in rsCStrDestruct ()
(gdb) bt
#0  0x0004e578 in rsCStrDestruct ()
#1  0x0005f2c0 in varDestruct ()
#2  0x000499f0 in MsgDeserialize ()
#3  0x000563dc in objDeserializeWithMethods ()
#4  0x00062250 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Steps to reproduce the behavior

Reproduce in official rsyslog docker container (rsyslog 8.34 -- container ):

sandro@pc-444:~/DatenNoBackup/rsyslog_crash$ docker run --rm -ti -v $PWD:/data --entrypoint /bin/sh rsyslog/syslog_appliance_alpine
/home/appliance # cd /data
/data # cp 
Dockerfile     config.tar.gz  etc/           run.sh         var/
/data # cp var/spool/rsyslog/ -r /var/spool/
/data # rsyslogd  -n -f etc/rsyslog.conf -i/tmp/rsyslogpid
Segmentation fault (core dumped)

See relevant configuration attached, etc/rsyslogd.conf and var/spool/rsyslog/uniqName.{00000003,qi}:
rsyslog_segfault_20180417.tar.gz

The SIGSEGV most likely is caused by a corrupted spool message entry (cf. uniqName.00000003 file, the last object's attributes tRcvdAt etc. are duplicated.)

Environment

  • rsyslog version: 8.34.0 / 8.33.0
  • platform: x86 (docker) / armv7l GNU/Linux (custom embedded linux)
  • for configuration questions/issues, include rsyslog.conf:
@davidelang
Copy link
Contributor

davidelang commented Apr 18, 2018 via email

@sgiessl
Copy link
Author

sgiessl commented Apr 18, 2018

How is this tool called? You are probably referring to tools/recover_qi.pl - I will try it out on the faulty queue files that I have collected. Ideally, there would be an automated way of detecting that a queue recovery is required, or even an option that would direct rsyslogd to do unattended and optimistic recovery.

I agree about that it should not segfault - rsyslog would ideally just skip the invalid messages and try to recover (or exit cleanly). I managed to create a more detailed backtrace on my computer. It seems like it's failing when attempting to parse an psyTAG attribute, but I wasn't yet able to dig further.

Thread 2 "rs:action 1 que" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5d48700 (LWP 16187)]
rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0002638) at stringbuf.c:220
220             free((*ppThis)->pBuf);
(gdb) bt
#0  rsCStrDestruct (ppThis=ppThis@entry=0x7ffff0002638) at stringbuf.c:220
#1  0x00005555555b1460 in varDestruct (ppThis=0x7ffff5d47a40) at var.c:72
#2  0x0000555555597518 in MsgDeserialize (pMsg=0x7ffff00023a0, pStrm=pStrm@entry=0x55555582fbe0) at msg.c:1425
#3  0x00005555555a61bc in objDeserializeWithMethods (ppObj=0x7ffff5d47b80, pszTypeExpected=pszTypeExpected@entry=0x5555555d284e "msg", lenTypeExpected=lenTypeExpected@entry=3, 
    pStrm=0x55555582fbe0, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x555555594610 <msgConstructForDeserializer>, objConstructFinalize=0x0, 
    objDeserialize=0x555555597460 <MsgDeserialize>) at obj.c:930
#4  0x00005555555b7dea in qDeqDisk (pThis=0x5555558287b0, ppMsg=<optimized out>) at queue.c:992
#5  0x00005555555b9127 in qqueueDeq (ppMsg=0x7ffff5d47b80, pThis=0x5555558287b0) at queue.c:1125
#6  DequeueConsumableElements (pSkippedMsgs=0x7ffff5d47bd4, piRemainingQueueSize=<synthetic pointer>, pWti=0x55555582c570, pThis=0x5555558287b0) at queue.c:1743
#7  DequeueConsumable (pSkippedMsgs=0x7ffff5d47bd4, pWti=0x55555582c570, pThis=0x5555558287b0) at queue.c:1797
#8  DequeueForConsumer (pThis=pThis@entry=0x5555558287b0, pWti=pWti@entry=0x55555582c570, pSkippedMsgs=pSkippedMsgs@entry=0x7ffff5d47bd4) at queue.c:1943
#9  0x00005555555b97d7 in ConsumerReg (pThis=0x5555558287b0, pWti=0x55555582c570) at queue.c:1999
#10 0x00005555555b4b52 in wtiWorker (pThis=pThis@entry=0x55555582c570) at wti.c:369
#11 0x00005555555b2585 in wtpWorker (arg=0x55555582c570) at wtp.c:415
#12 0x00007ffff79a07fc in start_thread (arg=0x7ffff5d48700) at pthread_create.c:465
#13 0x00007ffff6aa8b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@davidelang
Copy link
Contributor

davidelang commented Apr 18, 2018 via email

@thinkst-marco
Copy link
Contributor

Coming back to this old ticket, we've seen a similar case in which a corrupted disk queue's data causes rsyslog to segfault on startup. This is unexpected.

Trying to rebuild the queue index file made no difference, the recover tool runs successfully but rsyslogd still segfaults due to corrupted queue data.

A minimal test case is the following:

  1. rsyslogd.conf contains a queue definition like this:
main_queue(
  queue.filename="main_queue"
  queue.type="linkedlist"
)
  1. /var/spool/rsyslog contains a file called main_queue.00000603 with just this content:
<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:2:17:
+msgFlags:2:2:36:
+ttGenTime:2:10:1605245965:
+tRcvdAt:3:35:2:2020:11:13:5:39:25:518552:6:+:0:0:
+tTIMESTAMP:3:35:2:2020:11:13:5:39:25:518552:6:+:0:0:
+pszTAG:1:24:PPPPPPPPPPPPPPPP[16812<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:2:17:
+msgFlags:2:2:36:
+ttGenTime:2:10:1605247094:
+tRcvdAt:3:35:2:2020:11:13:5:58:14:624070:6:+:0:0:
+tTIMESTAMP:3:35:2:2020:11:13:5:58:14:624070:6:+:0:0:
+pszTAG:1:24:PPPPPPPPPPPPPPPP[18268]::
+pszRawMsg:1:504:<142>Nov 13 05:58:14 PPPPPPPPPPPPPPPP[18268]: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:
+pszInputName:1:8:imuxsock:
+pszRcvFrom:1:12:ZZZZZZZZ-com:
+pszRcvFromIP:1:9:127.0.0.1:
+offMSG:2:2:45:
>End
.

From visual inspection it's apparent that on line 9 of the file, there's a corrupted record where <Obj:1:msg:1: starts a new record before the previous record had been completed. This particular corruption was due to an unrelated disk space issue.

  1. The receover tool is run to create the queue index:
$ recover_qi.pl -w /var/spool/rsyslog -f main_queue -d 8 > /var/spool/rsyslog/main_queue.q
  1. Start Rsyslog with rsyslogd -dn, and it produces this output:
$ rsyslogd -dn
[ debugging output ]
0857.660115834:main Q[DA]:Reg/w0: obj.c: error property name: 'pszTAG'
0857.660159423:main Q[DA]:Reg/w0: obj.c: error var type: '1'
0857.660201556:main Q[DA]:Reg/w0: obj.c: error len: '24'
Segmentation fault (core dumped)

Skipping corrupted queue data or files is preferable to failing to start rsyslogd completely in our case, and others may share a similar need.

Something which slowed down identifying the problem is that Rsyslog's error messages in general are useful (e.g. if a file permission issue exists), but in this case there's simply a segfault without any error message. One has to enable debugging output to get to the problem. An error message identifying the likely cause (corrupted queue data) and a filename would be useful in normal operation.

@OneToughMonkey
Copy link

I also just ran into this problem and would wish for rsyslog to handle this more gracefully, i.e. skip the corrupt data or at least produce a more instructional error message.

@rgerhards
Copy link
Member

pls create a new issue - I do not think this one still applies. Specify the rsyslog version.

@tblume
Copy link
Contributor

tblume commented Mar 1, 2024

A new issue is opened here: #5319

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants