WS_Assert(), cache/cache_ws.c line 59 #1834

Closed
bsdphk opened this Issue Mar 4, 2016 · 13 comments

Projects

None yet

9 participants

@bsdphk
Contributor
bsdphk commented Mar 4, 2016

Old ticket imported from Trac
See archived copy here: https://varnish-cache.org/trac/ticket/1834

@nigoroll
Contributor

@lkarsten close?

@lkarsten
Contributor
lkarsten commented May 27, 2016 edited

I don't think we've squashed this one. I asked the reporter to check if it still happens with 4.1HEAD, timeout in 10d if nothing heard.

@lkarsten lkarsten self-assigned this May 30, 2016
@porcospino
porcospino commented Jun 16, 2016 edited

Just had it happen on 4.1.2
Last panic at: Thu, 16 Jun 2016 12:02:44 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.2 revision 0d7404e
ident = Linux,3.13.0-87-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x433976: varnishd() [0x433976]
0x44a5ab: varnishd(WS_Assert+0x18b) [0x44a5ab]
0x44af71: varnishd(WS_ReleaseP+0x11) [0x44af71]
0x43a865: varnishd(SES_RxStuff+0x205) [0x43a865]
0x44ea78: varnishd(HTTP1_Session+0x178) [0x44ea78]
0x439d0d: varnishd(SES_Proto_Req+0x5d) [0x439d0d]
0x4493d1: varnishd() [0x4493d1]
0x44983b: varnishd() [0x44983b]
0x7f897fec9184: libpthread.so.0(+0x8184) [0x7f897fec9184]
0x7f897fbf637d: libc.so.6(clone+0x6d) [0x7f897fbf637d]
req = 0x7f89660ec020 {
vxid = 0, step = R_STP_RESTART,
req_body = R_BODY_INIT,
restarts = 0, esi_level = 0,
sp = 0x7f8957ad2a20 {
fd = 25, vxid = 29036,
client = 192.168.69.189 59741,
step = S_STP_H1NEWREQ,
},
ws = 0x7f89660ec200 {
id = "req",
{s,f,r,e} = {0x7f89660ee000,+32768,+57336,+57336},
},
http_conn = 0x7f89660ec128 {
fd = 25,
doclose = NULL,
ws = 0x7f89660ec200,
{rxbuf_b, rxbuf_e} = {0x7f89660f6000, 0x7f89660fdfe3},
{pipeline_b, pipeline_e} = {(nil), (nil)},
content_length = 0,
body_status = none,
first_byte_timeout = 0.000000,
between_bytes_timeout = 0.000000,
},
http[req] = 0x7f89660ec298 {
ws[] = (nil),
hdrs {
},
},
flags = {
},
},

@rnsanchez
Contributor

We do see this panic on our custom 4.1 branch (some internal functionalities on top of upstream/4.1). As suggested in the old ticket (Trac), we're checking whether 96 KB for both client and backend workspaces help.

Last panic at: Thu, 16 Jun 2016 06:00:27 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
  Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.3-dev revision 96c00a5
ident = Linux,4.4.11,x86_64,-junix,-sfile,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x434502: pan_ic+0x152
  0x44b7c6: WS_Assert+0x176
  0x44c171: WS_ReleaseP+0x11
  0x43b56d: SES_RxStuff+0x1fd
  0x4500c2: HTTP1_Session+0x1b2
  0x43aa2d: SES_Proto_Req+0x5d
  0x44a62a: WRK_Thread+0x47a
  0x44aa7b: pool_thread+0x2b
  0x7f33fa0ca654: libpthread.so.0(+0x7654) [0x7f33fa0ca654]
  0x7f33f9e007dd: libc.so.6(clone+0x6d) [0x7f33f9e007dd]
req = 0x7c459fa47020 {
      vxid = 0, step = R_STP_RESTART,
      req_body = R_BODY_INIT,
      restarts = 0, esi_level = 0,
      sp = 0x7c45a844d620 {
        fd = 114, vxid = 67983,
        client = 186.250.8.9 11882,
        t_open = 1466056827.205797,
        t_idle = 1466056827.205797,
    step = S_STP_H1NEWREQ,
  },
  ws = 0x7c459fa47208 {
    id = \"req\",
    {s,f,r,e} = {0x7c459fa49008,+32768,+57328,+57328},
  },
  http_conn = 0x7c459fa47130 {
    fd = 114,
    doclose = NULL,
    ws = 0x7c459fa47208,
    {rxbuf_b, rxbuf_e} = {0x7c459fa51008, 0x7c459fa58feb},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = 0,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7c459fa472a0 {
    ws[] = (nil),
    hdrs {
    },
  },
  flags = {
  },
},

" 
@lkarsten
Contributor

@rnsanchez and @porcospino - Can you please try the 4.1.3-beta1 version and see if that helps? See https://www.varnish-cache.org/releases/index.html#releases .

There isn't a specific fix for this in there, but it may be that one of the many other fixed bugs is provoking this.

@rnsanchez
Contributor

@lkarsten I need to review the commits up to 4.1.3-beta1 before rolling out to this server, but I will keep you posted.

@lkarsten
Contributor

@rnsanchez Is still current? Did you rebase your branch to 4.1.3? Do you still see it with the raised workspace sizes you describe in #2008 ?

@rnsanchez
Contributor

@lkarsten we're still using 96 KB, and our deployed branch is based on commit 1d73aa7. We have not seen anything specific to this ticket since new binaries were deployed (around July 9th).

Sorry for the delay.

@bsdphk
Contributor
bsdphk commented Sep 12, 2016

timing this out.

@bsdphk bsdphk closed this Sep 12, 2016
@hermunn
Contributor
hermunn commented Sep 13, 2016

backport review: Timed out, nothing to backport

@Dridi Dridi reopened this Nov 25, 2016
@Dridi
Member
Dridi commented Nov 25, 2016

We have a theory and a likely culprit for this problem. Basically, if planets^Wpointers align, session initialization [1,2] may overwrite the end-of-workspace marker with a NUL char in 4.1. This is also true in current master [3] and possibly even more unsafe for H2. This is basically because we disregard the result of the workspace reservation.

[1] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L154-L157
[2] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L174-L185
[3] https://github.com/varnishcache/varnish-cache/blob/3f0b634/bin/varnishd/cache/cache_session.c#L193-L204

@mbgrydeland
Contributor
mbgrydeland commented Nov 25, 2016 edited

FWIW, the attached test case triggers the assertion on my machine. I am not suggesting this test case for inclusion, as the sizes used is waaaay to fragile to be useful as a regression test. The size of the bogus allocation needs to be exactly large enough to have zero WS left at the right point in order to trigger the bug. I ended up on 208 bytes by trial and success ;-)
Edit: 192 is the magic number on my computer for Varnish 4.1 HEAD.
hunch-ws.vtc.txt

@mbgrydeland mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Nov 29, 2016
@mbgrydeland mbgrydeland Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit could write a single '\0' NUL character outside of the
workspace when its called and there is zero bytes left in the
workspace. This would trigger the workspace canary causing subsequent
assertion.

Fix by releaving HTC_RxInit of adding the '\0' character.

HTC_RxStuff now returns HTC_S_Overflow early if the available buffer
space is zero. The '\0' character is inserted just before calling the
completion check function.

Also fix an off-by-one error on the http_{req|resp}_size calculations,
where the maximum number of bytes accepted was one less than the
paramter indicated. c00039.vtc and c00040.vtc has been edited to
reflect that and to be more expressive about the sizes they generate.

Fixes: #1834
809f4d8
@mbgrydeland mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Nov 29, 2016
@mbgrydeland mbgrydeland Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit could write a single '\0' NUL character outside of the
workspace when its called and there is zero bytes left in the
workspace. This would trigger the workspace canary causing subsequent
assertion.

Fix by releaving HTC_RxInit of adding the '\0' character.

HTC_RxStuff now returns HTC_S_Overflow early if the available buffer
space is zero. The '\0' character is inserted just before calling the
completion check function.

Also fix an off-by-one error on the http_{req|resp}_size calculations,
where the maximum number of bytes accepted was one less than the
paramter indicated. c00039.vtc and c00040.vtc has been edited to
reflect that and to be more expressive about the sizes they generate.

Fixes: #1834
fdd0869
@mbgrydeland mbgrydeland added a commit that closed this issue Dec 5, 2016
@mbgrydeland mbgrydeland Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit could write a single '\0' NUL character outside of the
workspace when its called and there is zero bytes left in the
workspace. This would trigger the workspace canary causing subsequent
assertion.

Fix by releaving HTC_RxInit of adding the '\0' character.

HTC_RxStuff now returns HTC_S_Overflow early if the available buffer
space is zero. The '\0' character is inserted just before calling the
completion check function.

Also fix an off-by-one error on the http_{req|resp}_size calculations,
where the maximum number of bytes accepted was one less than the
paramter indicated. c00039.vtc and c00040.vtc has been edited to
reflect that and to be more expressive about the sizes they generate.

Fixes: #1834
4bcc1c2
@mbgrydeland mbgrydeland added a commit that referenced this issue Dec 5, 2016
@mbgrydeland mbgrydeland Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit and HTC_RxReInit could write a single '\0' NUL character
outside of the workspace when its called and there is zero bytes left
in the workspace. This would trigger the workspace canary causing
subsequent assertion.

Fix by releaving HTC_RxInit and HTC_RxReInit of adding the '\0'
character.

HTC_RxStuff and V1F_FetchRespHdr returns HTC_S_OVERFLOW if the
available buffer space is zero. Both make sure to insert the '\0'
character just before calling the completion check function.

Note that this fix does not change the fact that we have exchausted
the workspace and are unable to continue. Varnishd will panic
nonetheless, but at least we have not stepped out of our boundries.

Ref: #1834
c1fa324
@hermunn
Contributor
hermunn commented Jan 27, 2017

FYI: The final commit, c1fa324 is a backport to 4.1. Thanks again @mbgrydeland!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment