
Crash in 4.1.5 with "Assert error in SES_RxReInit(), cache/cache_session.c line 179" #2219

Closed
akuktin opened this issue Feb 17, 2017 · 6 comments
@akuktin akuktin commented Feb 17, 2017

We just had two crashes with varnish-4.1.5 installed from the packages at http://repo.varnish-cache.org/ubuntu. Syslog reports:

Feb 17 15:00:19 lbs01-pa varnishd[26949]: Child (26953) Last panic at: Fri, 17 Feb 2017 14:00:19 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.13.0-63-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 275547.556234 (mono), 1487340017.329933 (real)
Backtrace:
  0x433f42: varnishd() [0x433f42]
  0x43adca: varnishd(SES_RxReInit+0xba) [0x43adca]
  0x45dc74: varnishd(VPX_Proto_Sess+0x1c4) [0x45dc74]
  0x44a419: varnishd() [0x44a419]
  0x44a88b: varnishd() [0x44a88b]
  0x7fe514af6184: libpthread.so.0(+0x8184) [0x7fe514af6184]
  0x7fe51482337d: libc.so.6(clone+0x6d) [0x7fe51482337d]

"
Feb 17 15:20:21 lbs01-pa varnishd[26949]: Child (29560) Last panic at: Fri, 17 Feb 2017 14:20:21 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.13.0-63-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 276660.009976 (mono), 1487341129.783674 (real)
Backtrace:
  0x433f42: varnishd() [0x433f42]
  0x43adca: varnishd(SES_RxReInit+0xba) [0x43adca]
  0x45dc74: varnishd(VPX_Proto_Sess+0x1c4) [0x45dc74]
  0x44a419: varnishd() [0x44a419]
  0x44a88b: varnishd() [0x44a88b]
  0x7fe514af6184: libpthread.so.0(+0x8184) [0x7fe514af6184]
  0x7fe51482337d: libc.so.6(clone+0x6d) [0x7fe51482337d]

"

This is a new installation, running on ubuntu-14.04. The crashes may or may not be related to bug #1834. They certainly look exactly the same and happened in the same function, which was patched between 4.1.4 and 4.1.5.

We don't have any concrete steps to reproduce the bug. The crashes, both these two and the ones we had with #1834, happened during both peak and off-peak hours. One thing we have noticed is that this time one of our varnish backends was non-administratively down for its own unrelated reasons. Maybe the bug only gets triggered when there is a down backend?

We are available for additional queries and can run further diagnostics to help track the bug down, if need be.

@akuktin akuktin commented Feb 17, 2017

In addition, we realized that, for unrelated reasons, there was an IP address conflict on the down backend when the two crashes happened. So the backend was down, then came up with a conflict, and varnish crashed twice during that period.

@akuktin akuktin commented Feb 27, 2017

We have since put our backends in intensive care, rebooting the virtualization hosts every few days, and it's been almost a week without varnish crashes. This really isn't a tenable solution. If varnish crashes together with a single bad backend, it isn't much of an HA load balancer.

In the last crash, we managed to recover a probably intact vsm shared memory file. Unfortunately, it seems it will be less than useful, because varnish only logs AFTER it is done handling a request, either client-side or backend-side. We were hoping the vsm would tell us what kinds of requests were being handled around the time of the crash (was it a GET or a POST?).

We're currently running tests with various combinations of failing backends in a non-production environment. We're assuming a GET is enough to trigger the bug. The current understanding is that a backend that is simply down is not a problem, but the act of a backend going down, with all that entails, could be what causes it. Could there be a race condition in a particularly sensitive part of varnish?

@Dridi Dridi commented Feb 27, 2017

Since @mbgrydeland is absent I will at least try to share a couple of cents on this issue. I took part in the diagnosis of #1834, which introduced the assertion you are triggering. The short story is that 1) Varnish assumes that backends are trustworthy, and 2) you probably just need to increase a [1] workspace to work around the bug.

The weird thing is that you don't get anything in the panic message; that is where we would expect to see what kind of request triggered the transaction. Regarding logs, they are indeed committed when a transaction ends. Maybe a failsafe commit of the logs during a panic could be added, I don't know. The lack of information in the panic message could indicate a problem on the client side, which doesn't bode well.

The test case for #1834 fails on the 4.1 branch, but with a different error than yours, so we'll need more eyes on this area.

[1] It looks like you should increase workspace_client.
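For anyone hitting this before a fix lands, the workaround can be applied through varnishd's runtime parameters; `workspace_client` is a real parameter, but the sizes below are illustrative only, assuming a default-ish configuration, and should be tuned to your traffic:

```shell
# Inspect the current value (defaults differ between versions).
varnishadm param.show workspace_client

# Raise it at runtime on a live instance:
varnishadm param.set workspace_client 128k

# Or persistently, on the varnishd command line:
varnishd ... -p workspace_client=128k
```

Note that runtime changes via `varnishadm param.set` only apply to new sessions and are lost on restart, so the `-p` form is the one to put in your init scripts.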

@Dridi Dridi commented Feb 28, 2017

I had a quick look at the test case and it appears that too many moving parts make it hard to back-port to 4.1, so that's probably why @mbgrydeland hasn't.

@fgsch fgsch commented Apr 9, 2017

PING @mbgrydeland. I can reproduce this.

@stbc stbc commented Apr 11, 2017

Last night we got exactly the same:
Child (3286) Last panic at: Tue, 11 Apr 2017 00:15:34 GMT
"Assert error in SES_RxReInit(), cache/cache_session.c line 179:
  Condition(l <= htc->ws->r - htc->rxbuf_b) not true.
thread = (cache-worker)
version = varnish-4.1.5 revision 2c82b1c
ident = Linux,3.16.0-4-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 813504.842682 (mono), 1491869732.832655 (real)
Backtrace:
  0x434391: varnishd() [0x434391]
  0x43b2ca: varnishd(SES_RxReInit+0xba) [0x43b2ca]
  0x45e92c: varnishd(VPX_Proto_Sess+0x1cc) [0x45e92c]
  0x44acc2: varnishd() [0x44acc2]
  0x44b14b: varnishd() [0x44b14b]
  0x7f6f99250064: libpthread.so.0(+0x8064) [0x7f6f99250064]
  0x7f6f98f8562d: libc.so.6(clone+0x6d) [0x7f6f98f8562d]

"

mbgrydeland added a commit that referenced this issue Jun 15, 2017
Original test case by Federico G. Schwindt

Fixes: #2219

Conflicts:
	bin/varnishd/proxy/cache_proxy_proto.c
ibreger added a commit to Refinitiv/varnish-cache that referenced this issue Jun 27, 2017
Original test case by Federico G. Schwindt

Fixes: varnishcache#2219

Conflicts:
	bin/varnishd/proxy/cache_proxy_proto.c
wmfgerrit pushed a commit to wikimedia/operations-debs-varnish4 that referenced this issue Jun 29, 2017
Package varnish 4.1.7, add counters for transient storage.

Introduce a new counter for shortlived objects creation,
cache_shortlived, and another one for uncacheable responses,
cache_uncacheable. They should provide insights when it comes to
monitoring transient storage usage.

Changes in 4.1.7:

 - Correctly honor nuke_limit parameter
   varnishcache/varnish-cache#1764
 - Prevent storage backends name collisions
   varnishcache/varnish-cache#2321
 - varnishstat -1 -f field inclusion glob doesn't allow VBE backend fields
   varnishcache/varnish-cache#2022
 - Health probes fail when HTTP response does not contain reason phrase
   varnishcache/varnish-cache#2069
 - "varnishstat -f MAIN.sess_conn -1" produces empty output
   varnishcache/varnish-cache#2118
 - Remember to reset workspace
   varnishcache/varnish-cache#2219
 - Rework and fix varnishstat counter filtering
   varnishcache/varnish-cache#2320
 - Docfix: Only root can jail
   varnishcache/varnish-cache#2329
 - Don't panic on a null ban
 - Add extra locking to protect the pools list and refcounts
 - Add -vsl_catchup to varnishtest
 - Add record-prefix support to varnishncsa

Bug: T164768
Ref: https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst#varnish-cache-417-2017-06-28
Change-Id: I8a8f3a8103feb83b1a55a6788ea6c5d12963b4f5
mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Jun 7, 2018
handling

This partly reverses the fix for varnishcache#2285, as that fix was misdiagnosed. The
original fix for varnishcache#2285 also introduced a bug similar to varnishcache#2219, where the
proxy code could not find space to insert a NUL character.

The real problem in varnishcache#2285 (and the duplicate varnishcache#2624) appears during H/2
prior knowledge upgrade, and the difference in maxbytes argument to
HTC_RxStuff between the H/1 and H/2 code paths. If the pipelined bytes
exceeded the maxbytes input on the second run we would run into an assert.

(varnishcache#2624 has received a workaround in the H/2 code that perhaps should be
reverted).

Changes in this patch:

HTC_RxStuff will now reserve space to insert a NUL character for those
completion functions that need it (HTTP/1 and Proxy_V1).

HTC_RxStuff will no longer assert on seeing pipelined data exceeding the
maxbytes input.
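The reservation described in that commit message can be sketched as follows. This is an illustrative model only, not the actual `HTC_RxStuff` code; the function name `rx_space` and its parameters are invented for the example:

```c
#include <stddef.h>

/* Sketch of the idea in the patch: when receiving into a buffer for a
 * completion function that NUL-terminates its input (HTTP/1, PROXYv1),
 * reserve one byte up front so there is always room for the terminator,
 * no matter how much pipelined data arrives. */
static size_t
rx_space(const char *buf_end, const char *rx_end, int needs_nul)
{
	size_t space = (size_t)(buf_end - rx_end);

	if (needs_nul) {
		if (space == 0)
			return (0);
		space -= 1;	/* keep one byte free for the '\0' */
	}
	return (space);
}
```

In this model a 16-byte buffer offers only 15 bytes to a NUL-terminating protocol, so the terminator's slot can never be consumed by received data, which is the failure mode the proxy code ran into.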
dmatetelki added a commit to dmatetelki/varnish-cache that referenced this issue Mar 14, 2019
Original test case by Federico G. Schwindt

Fixes: varnishcache#2219