varnishncsa stops consuming logs after Assert error in vslc_vsm_next() #1873

Closed
jtru opened this Issue Mar 8, 2016 · 5 comments

@jtru

jtru commented Mar 8, 2016

I originally filed this in the Trac system over at varnish-cache.org as ticket no. 1866, because I did not realize you had migrated issue tracking to GitHub. Sorry :) My original report follows.

Last night, our logging pipe showed no web traffic data because our varnishncsa process had suddenly stopped consuming the varnishd shared memory log. The only output documenting the failure was captured on stderr and read:

Mar 07 11:03:59 web-front-01 varnishncsa-syslog-wrapper[889]: Assert error in vslc_vsm_next(), vsl_cursor.c line 130:
Mar 07 11:03:59 web-front-01 varnishncsa-syslog-wrapper[889]: Condition((t) != 0) not true.

Curiously, that did not cause the process to terminate - it just stopped relaying log data. We're using varnishncsa to relay access logs to other hosts via syslog for analysis. For that purpose, we have a systemd service that wraps the varnishncsa binary like so:

#!/bin/bash
# Force a consistent locale for varnishncsa's output.
export LC_ALL=C

# NCSA-style log format with additional Varnish-specific fields appended.
FORMAT='%{VCL_Log:vxid}x %{X-BE}o %{Host}i %h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" TIME=%D %{Varnish:time_firstbyte}x %{Varnish:handling}x XFP=%{X-Forwarded-Proto}i CT=%{Content-Type}o XFF=%{X-Forwarded-For}i'

TAG="varnishncsa"
PRIORITY="local0.info"

# Group records per request and pipe the formatted lines to our syslog relay.
/usr/bin/varnishncsa -g request -F "${FORMAT}" | /usr/local/bin/logger.pl -t "${TAG}" -p "${PRIORITY}"

(The pipe's consumer, logger.pl, is a tiny Perl script that works around a peculiarity of systemd which would otherwise make all the log output show up in the system's journal in addition to being relayed elsewhere via syslog.)
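For context, a minimal sketch of how such a wrapper script could be hooked in as a systemd service. The unit below is an illustration only: apart from the varnishncsa-syslog-wrapper name (taken from the journal lines above), the install path and all other directives are assumptions, not our actual unit file.

[Unit]
Description=Relay the varnishd shared memory log to syslog via varnishncsa
After=varnish.service
Requires=varnish.service

[Service]
# Hypothetical install path for the wrapper script shown above.
ExecStart=/usr/local/bin/varnishncsa-syslog-wrapper
Restart=on-failure

[Install]
WantedBy=multi-user.target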

The box we have this running on receives 1500 req/s and is equipped with ECC RDIMMs (with EDAC not reporting any errors). The hardware was tested extensively for reliability before being put into service, so I think we can safely rule out memory problems. We're running Debian Jessie with the latest backports kernel (4.3+70bpo8+1, 4.3.0-0.bpo.1-amd64).

We were still running Varnish 4.1.1 when the issue appeared, and upgraded to 4.1.2 today. I'll update this ticket if the problem resurfaces (and also if everything stays calm for a significant amount of time).

@mbgrydeland

Contributor

mbgrydeland commented Mar 14, 2016

This looks like the logger got stuck on IO writing the log to disk. Could you try to increase the Varnish log buffer size (the -l argument to varnishd) and see if that improves the situation?
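A rough sketch of where that option goes (the listen address, VCL path and storage size below are placeholders, not taken from this setup); depending on how varnishd is started it ends up in DAEMON_OPTS in /etc/default/varnish or on the ExecStart line of the varnish systemd unit:

/usr/sbin/varnishd \
    -a :80 \
    -f /etc/varnish/default.vcl \
    -s malloc,1G \
    -l 120m    # shared memory log (VSL) size; raise from the 80m default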

@jtru


jtru commented Mar 31, 2016

The logger.pl we pipe into actually merely forwards the collected data to a syslog receiver via UDP (we use a Perl implementation of our own because /usr/bin/logger has a nasty habit of silently truncating messages that exceed a certain number of bytes), and doesn't write to anything that could block.

We increased the VSL buffer to 120M instead of the default 80M, and haven't seen the issue appear since.
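In case it is useful to others, the value actually in effect can be double-checked through the management interface; a quick sanity check (assuming varnishadm can reach the running instance) would be something like:

# Show the configured VSL buffer size; after passing -l 120m this should report 120M.
varnishadm param.show vsl_space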

@nigoroll

Contributor

nigoroll commented May 11, 2016

@jtru update?

@jtru


jtru commented May 11, 2016

@nigoroll No further occurrences of this problem so far.

@nigoroll nigoroll closed this in f91c6f9 May 11, 2016

@nigoroll

Contributor

nigoroll commented May 11, 2016

The most likely scenario here is that we had a VSM overrun which was not handled properly. Increasing the VSM size is the correct way to address this, but the VSM reader should still handle the overrun gracefully - which it should now, with the fix I just pushed.
