varnishncsa stops consuming logs after Assert error in vslc_vsm_next() #1873

Closed
jtru opened this Issue Mar 8, 2016 · 5 comments

Projects

None yet

5 participants

@jtru
jtru commented Mar 8, 2016

I originally files this in the trac system over at varnish-cache.org as Ticket no. 1866, because I did not realize you had migrated the issue tracking to GH. Sorry :) My original report follows.

Last night, our logging pipe showed no web traffic data since our varnishncsa process had suddenly stopped to consume the varnishd shared memory log. The only output documenting the failure we got was captured on stderr and read:

Mar 07 11:03:59 web-front-01 varnishncsa-syslog-wrapper[889]: Assert error in vslc_vsm_next(), vsl_cursor.c line 130:
Mar 07 11:03:59 web-front-01 varnishncsa-syslog-wrapper[889]: Condition((t) != 0) not true.```

Cursiously, that did not cause the process to terminate - it just stopped relaying log data. We're using varnichncsa to relay access logs to other hosts via syslog for analysis. For that purpose, we have a systemd service that wraps the varnishncsa binary like so:

#!/bin/bash
export LC_ALL=C

FORMAT='%{VCL_Log:vxid}x %{X-BE}o %{Host}i %h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" TIME=%D %{Varnish:time_firstbyte}x %{Varnish:handling}x XFP=%{X-Forwarded-Proto}i CT=%{Content-Type}o XFF=%{X-Forwarded-For}i'

TAG="varnishncsa"
PRIORITY="local0.info"

/usr/bin/varnishncsa -g request -F "${FORMAT}" | /usr/local/bin/logger.pl -t "${TAG}" -p "${PRIORITY}"

(The pipe's consumer, logger.pl, is a tiny Perl script that works around a peculiarity of systemd, which would make all logs show up in the system's journal, as well as relaying elsewhere via syslog.)

The box we have this running on receives 1500reqs/s, and is equipped with ECC RDIMM (with EDAC not reporting any errors). The hardware was tested for reliability extensively before having been put into service, so I think we can safely rule out memory problems. We're running Debian Jessie with the latest backports Kernel (4.3+70bpo8+1, 4.3.0-0.bpo.1-amd64).

We were still running Varnish 4.1.1 when the issue appeared, and upgraded to 4.1.2 today. I'll update this ticket if the problem resurfaces (and also if everything stays calm for a significant amount of time).

@mbgrydeland
Contributor

This looks like the logger got stuck on IO writing the log to disk. Could you try to increase the Varnish log buffer size (the -l argument to varnishd) and see if that improves the situation?

@jtru
jtru commented Mar 31, 2016

The logger.pl we pipe into actually merely forwards the collected data to a syslog receiver (we use a Perl implementation of our own because /usr/bin/logger has a nasty habit of silently cutting off messages if they exceed a certain number of bytes) via UDP, and doesn't write to something that could block.

We increased the VSL buffer to 120M instead of the default 80M, and haven't seen the issue appear since.

@nigoroll
Contributor

@jtru update?

@jtru
jtru commented May 11, 2016

@nigoroll No further occurrences of this problem so far.

@nigoroll nigoroll closed this in f91c6f9 May 11, 2016
@nigoroll
Contributor
nigoroll commented May 11, 2016 edited

The most likely scenario here is that we have a vsm overrun and failed to handle it properly. Increasing the vsm size is the correct way to address this, but still the vsm reader should handle the overrun gracefully - which it should now with the fix I just pushed.

@lkarsten lkarsten pushed a commit that referenced this issue Jun 1, 2016
@nigoroll nigoroll + Lasse Karstensen check for overrun before looking at the next vsm record
Fixes #1873
5ddcf6a
@lkarsten lkarsten pushed a commit that referenced this issue Jun 1, 2016
Lasse Karstensen Add 1873 to changelog.
Ref: #1873
0cea407
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment