Crash in 4.1.5 with "Assert error in SES_RxReInit(), cache/cache_session.c line 179" #2219

Closed
akuktin opened this Issue Feb 17, 2017 · 6 comments

@akuktin

akuktin commented Feb 17, 2017

We just had two crashes with varnish-4.1.5, installed from the packages in the repo at http://repo.varnish-cache.org/ubuntu. Syslog reports:

Feb 17 15:00:19 lbs01-pa varnishd[26949]: Child (26953) Last panic at: Fri, 17 Feb 2017 14:00:19 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.13.0-63-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 275547.556234 (mono), 1487340017.329933 (real)
Backtrace:
  0x433f42: varnishd() [0x433f42]
  0x43adca: varnishd(SES_RxReInit+0xba) [0x43adca]
  0x45dc74: varnishd(VPX_Proto_Sess+0x1c4) [0x45dc74]
  0x44a419: varnishd() [0x44a419]
  0x44a88b: varnishd() [0x44a88b]
  0x7fe514af6184: libpthread.so.0(+0x8184) [0x7fe514af6184]
  0x7fe51482337d: libc.so.6(clone+0x6d) [0x7fe51482337d]

"
Feb 17 15:20:21 lbs01-pa varnishd[26949]: Child (29560) Last panic at: Fri, 17 Feb 2017 14:20:21 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.13.0-63-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 276660.009976 (mono), 1487341129.783674 (real)
Backtrace:
  0x433f42: varnishd() [0x433f42]
  0x43adca: varnishd(SES_RxReInit+0xba) [0x43adca]
  0x45dc74: varnishd(VPX_Proto_Sess+0x1c4) [0x45dc74]
  0x44a419: varnishd() [0x44a419]
  0x44a88b: varnishd() [0x44a88b]
  0x7fe514af6184: libpthread.so.0(+0x8184) [0x7fe514af6184]
  0x7fe51482337d: libc.so.6(clone+0x6d) [0x7fe51482337d]

"

This is a new installation, running on Ubuntu 14.04. The crashes may or may not be related to bug #1834. They certainly look exactly the same and happened in the same function that was patched between 4.1.4 and 4.1.5.

We don't have any concrete steps to reproduce the bug. The crashes, both these two and the ones we had with #1834, happened during peak hours as well as off-peak. One thing we have noticed is that this time one of our Varnish backends was non-administratively down for its own unrelated reasons. Maybe the bug only gets triggered when there is a down backend?

We are available for additional queries and can perform interesting tasks to track the bug down, if need be.

@akuktin

akuktin commented Feb 17, 2017

In addition, we realized that, for unrelated reasons, the down backend also had an IP conflict when the two crashes happened. So the backend was down, then came back up with a conflicting IP, and Varnish crashed twice during that period.

@akuktin

akuktin commented Feb 27, 2017

We have since put our backends into intensive care, rebooting the virtualization hosts every few days, and it's been almost a week without Varnish crashes. This really isn't a tenable solution: if Varnish crashes along with a single bad backend, it isn't much of an HA load balancer.

In the last crash we managed to recover a probably intact VSM shared-memory file. Unfortunately, it seems it will be less than useful, because Varnish only logs AFTER it is done handling a request, either client-side or backend-side. We were hoping the VSM would tell us what kinds of requests were being handled around the time of the crash (was it a GET or a POST?).

We're currently running tests with various combinations of failing backends in a non-production environment, assuming that a GET is enough to trigger the bug. Our current understanding is that a backend that is simply down is not a problem, but the act of a backend GOING down, with all that entails, could be what causes it. Could there be a race condition in a particularly sensitive part of Varnish?

@Dridi

Member

Dridi commented Feb 27, 2017

Since @mbgrydeland is absent, I will at least try to share a couple of cents on this issue. I took part in diagnosing #1834, whose fix introduced the assertion you are triggering. The short story is that 1) Varnish assumes that backends are trustworthy, and 2) you probably just need to increase a workspace [1] to work around the bug.

The weird thing is that you don't get anything in the panic message; that is where we would expect to see what kind of request triggered the transaction. Regarding logs, they are indeed committed when a transaction ends. Maybe a failsafe commit of the logs during a panic could be done, I don't know. The lack of information in the panic message could indicate a problem on the client side, which doesn't bode well.

The test case for #1834 fails on the 4.1 branch, but with a different error than yours, so this area will need more eyeballing.

[1] It looks like you should increase workspace_client.
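
For anyone who wants to try that workaround while waiting for a proper fix: workspace_client is an ordinary runtime parameter, so it can be raised either on the varnishd command line (e.g. -p workspace_client=128k) or on a running instance with "varnishadm param.set workspace_client 128k". The 128k is only an illustrative value, and a param.set change does not survive a restart of varnishd, so the -p form is the one to keep in your startup options.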

@Dridi

Member

Dridi commented Feb 28, 2017

I had a quick look at the test case and it appears that too many moving parts make it hard to back-port to 4.1, so that's probably why @mbgrydeland hasn't.

@fgsch

Member

fgsch commented Apr 9, 2017

PING @mbgrydeland. I can reproduce this.

@stbc

stbc commented Apr 11, 2017

Last night we got exactly the same:
Child (3286) Last panic at: Tue, 11 Apr 2017 00:15:34 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.16.0-4-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 813504.842682 (mono), 1491869732.832655 (real)
Backtrace:
  0x434391: varnishd() [0x434391]
  0x43b2ca: varnishd(SES_RxReInit+0xba) [0x43b2ca]
  0x45e92c: varnishd(VPX_Proto_Sess+0x1cc) [0x45e92c]
  0x44acc2: varnishd() [0x44acc2]
  0x44b14b: varnishd() [0x44b14b]
  0x7f6f99250064: libpthread.so.0(+0x8064) [0x7f6f99250064]
  0x7f6f98f8562d: libc.so.6(clone+0x6d) [0x7f6f98f8562d]

"

mbgrydeland added a commit that referenced this issue Jun 15, 2017

Remember to reset the workspace
Original test case by Federico G. Schwindt

Fixes: #2219

Conflicts:
	bin/varnishd/proxy/cache_proxy_proto.c
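
For readers trying to connect the commit title to the panics above: the failing assertion compares the number of pipelined bytes carried over to the next request against the room left between the start of the receive buffer and the end of the session workspace, so any workspace that is taken and never given back shrinks that room. The following is only a self-contained toy model of that arithmetic; it is not the Varnish source and not the actual fix in cache_proxy_proto.c, and every name in it except the ws->r/rxbuf_b idea taken from the panic message is made up.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Toy model of a session workspace and its receive buffer.  This is NOT
 * Varnish code; it only mirrors the shape of the failing check
 * "l <= htc->ws->r - htc->rxbuf_b" from the panic message.
 */
struct toy_ws {
	char	 buf[64];
	char	*rxbuf_b;	/* start of the receive buffer */
	char	*r;		/* end of the usable workspace */
};

/* The check the panic reports from SES_RxReInit(): the pipelined bytes
 * carried over to the next request must still fit in the workspace. */
static void
rx_reinit(const struct toy_ws *ws, ptrdiff_t pipelined)
{
	assert(pipelined <= ws->r - ws->rxbuf_b);	/* "panics" otherwise */
}

int
main(void)
{
	struct toy_ws ws;

	ws.rxbuf_b = ws.buf;
	ws.r = ws.buf + sizeof ws.buf;
	rx_reinit(&ws, 40);	/* fine: 40 pipelined bytes fit in 64 */

	/*
	 * Simulate an earlier step (e.g. PROXY protocol parsing) that took
	 * a chunk of workspace and never gave it back: the receive buffer
	 * now starts later, and the same 40 bytes no longer fit.
	 */
	ws.rxbuf_b += 32;
	printf("re-running the check with less workspace...\n");
	rx_reinit(&ws, 40);	/* the assert fires, like the reported panic */
	return (0);
}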

ibreger added a commit to thomsonreuters/varnish-cache that referenced this issue Jun 27, 2017

Remember to reset the workspace
Original test case by Federico G. Schwindt

Fixes: varnishcache#2219

Conflicts:
	bin/varnishd/proxy/cache_proxy_proto.c

wmfgerrit pushed a commit to wikimedia/operations-debs-varnish4 that referenced this issue Jun 29, 2017

4.1.7-1wm1: new upstream, new counters
Package varnish 4.1.7, add counters for transient storage.

Introduce a new counter for shortlived objects creation,
cache_shortlived, and another one for uncacheable responses,
cache_uncacheable. They should provide insights when it comes to
monitoring transient storage usage.

Changes in 4.1.7:

 - Correctly honor nuke_limit parameter
   varnishcache/varnish-cache#1764
 - Prevent storage backends name collisions
   varnishcache/varnish-cache#2321
 - varnishstat -1 -f field inclusion glob doesn't allow VBE backend fields
   varnishcache/varnish-cache#2022
 - Health probes fail when HTTP response does not contain reason phrase
   varnishcache/varnish-cache#2069
 - "varnishstat -f MAIN.sess_conn -1" produces empty output
   varnishcache/varnish-cache#2118
 - Remember to reset workspace
   varnishcache/varnish-cache#2219
 - Rework and fix varnishstat counter filtering
   varnishcache/varnish-cache#2320
 - Docfix: Only root can jail
   varnishcache/varnish-cache#2329
 - Don't panic on a null ban
 - Add extra locking to protect the pools list and refcounts
 - Add -vsl_catchup to varnishtest
 - Add record-prefix support to varnishncsa

Bug: T164768
Ref: https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst#varnish-cache-417-2017-06-28
Change-Id: I8a8f3a8103feb83b1a55a6788ea6c5d12963b4f5

mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Jun 7, 2018

Make HTC_RxStuff reserve a byte for NUL character again and fix maxbytes
handling

This partly reverts the fix for varnishcache#2285, as that fix was misdiagnosed. The
original fix for varnishcache#2285 also introduced a bug similar to varnishcache#2219, where the
proxy code would not find space to insert a NUL character.

The real problem in varnishcache#2285 (and the duplicate varnishcache#2624) appears during an H/2
prior-knowledge upgrade, and stems from the difference in the maxbytes argument to
HTC_RxStuff between the H/1 and H/2 code paths. If the pipelined bytes
exceeded the maxbytes input on the second run, we would run into an assert.
exceeded the maxbytes input on the second run we would run into an assert.

(varnishcache#2624 has received a workaround in the H/2 code that perhaps should be
reverted).

Changes in this patch:

HTC_RxStuff will now reserve space to insert a NUL character for those
completion functions that need it (HTTP/1 and Proxy_V1).

HTC_RxStuff will no longer assert on seeing pipelined data exceeding the
maxbytes input.
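
The "reserve a byte for the NUL" part is a standard C buffering discipline rather than anything Varnish-specific. The sketch below illustrates the idea in isolation; it is not the actual HTC_RxStuff() code, and all names in it are made up.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Generic sketch of "reserve a byte for the NUL": when filling a buffer
 * that a text parser will later treat as a C string, never let the payload
 * occupy the last byte, so termination is always possible.  Not Varnish code.
 */
static ssize_t
read_line_terminated(int fd, char *buf, size_t bufsize)
{
	size_t used = 0;
	ssize_t n;

	while (used < bufsize - 1) {		/* keep one byte spare */
		n = read(fd, buf + used, bufsize - 1 - used);
		if (n <= 0)
			break;			/* EOF or error */
		used += (size_t)n;
		if (memchr(buf, '\n', used) != NULL)
			break;			/* got a complete line */
	}
	buf[used] = '\0';			/* always room for this */
	return ((ssize_t)used);
}

int
main(void)
{
	char line[128];

	if (read_line_terminated(0, line, sizeof line) > 0)
		printf("read: %s", line);
	return (0);
}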
