
WS_Assert(), cache/cache_ws.c line 59 #1834

Closed
bsdphk opened this issue Mar 4, 2016 · 13 comments
Comments

@bsdphk
Contributor

bsdphk commented Mar 4, 2016

Old ticket imported from Trac
See archived copy here: https://varnish-cache.org/trac/ticket/1834

@nigoroll
Member

@lkarsten close?

@lkarsten
Contributor

lkarsten commented May 27, 2016

I don't think we've squashed this one. I asked the reporter to check if it still happens with 4.1 HEAD; timeout in 10 days if nothing is heard.

@lkarsten lkarsten self-assigned this May 30, 2016
@porcospino

porcospino commented Jun 16, 2016

Just had it happen on 4.1.2
Last panic at: Thu, 16 Jun 2016 12:02:44 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.2 revision 0d7404e
ident = Linux,3.13.0-87-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x433976: varnishd() [0x433976]
0x44a5ab: varnishd(WS_Assert+0x18b) [0x44a5ab]
0x44af71: varnishd(WS_ReleaseP+0x11) [0x44af71]
0x43a865: varnishd(SES_RxStuff+0x205) [0x43a865]
0x44ea78: varnishd(HTTP1_Session+0x178) [0x44ea78]
0x439d0d: varnishd(SES_Proto_Req+0x5d) [0x439d0d]
0x4493d1: varnishd() [0x4493d1]
0x44983b: varnishd() [0x44983b]
0x7f897fec9184: libpthread.so.0(+0x8184) [0x7f897fec9184]
0x7f897fbf637d: libc.so.6(clone+0x6d) [0x7f897fbf637d]
req = 0x7f89660ec020 {
vxid = 0, step = R_STP_RESTART,
req_body = R_BODY_INIT,
restarts = 0, esi_level = 0,
sp = 0x7f8957ad2a20 {
fd = 25, vxid = 29036,
client = 192.168.69.189 59741,
step = S_STP_H1NEWREQ,
},
ws = 0x7f89660ec200 {
id = "req",
{s,f,r,e} = {0x7f89660ee000,+32768,+57336,+57336},
},
http_conn = 0x7f89660ec128 {
fd = 25,
doclose = NULL,
ws = 0x7f89660ec200,
{rxbuf_b, rxbuf_e} = {0x7f89660f6000, 0x7f89660fdfe3},
{pipeline_b, pipeline_e} = {(nil), (nil)},
content_length = 0,
body_status = none,
first_byte_timeout = 0.000000,
between_bytes_timeout = 0.000000,
},
http[req] = 0x7f89660ec298 {
ws[] = (nil),
hdrs {
},
},
flags = {
},
},

@rnsanchez
Contributor

We do see this panic on our custom 4.1 branch (some internal functionality on top of upstream/4.1). As suggested in the old ticket (Trac), we're checking whether 96 KB for both client and backend workspaces helps.

Last panic at: Thu, 16 Jun 2016 06:00:27 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
  Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.3-dev revision 96c00a5
ident = Linux,4.4.11,x86_64,-junix,-sfile,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x434502: pan_ic+0x152
  0x44b7c6: WS_Assert+0x176
  0x44c171: WS_ReleaseP+0x11
  0x43b56d: SES_RxStuff+0x1fd
  0x4500c2: HTTP1_Session+0x1b2
  0x43aa2d: SES_Proto_Req+0x5d
  0x44a62a: WRK_Thread+0x47a
  0x44aa7b: pool_thread+0x2b
  0x7f33fa0ca654: libpthread.so.0(+0x7654) [0x7f33fa0ca654]
  0x7f33f9e007dd: libc.so.6(clone+0x6d) [0x7f33f9e007dd]
req = 0x7c459fa47020 {
      vxid = 0, step = R_STP_RESTART,
      req_body = R_BODY_INIT,
      restarts = 0, esi_level = 0,
      sp = 0x7c45a844d620 {
        fd = 114, vxid = 67983,
        client = 186.250.8.9 11882,
        t_open = 1466056827.205797,
        t_idle = 1466056827.205797,
    step = S_STP_H1NEWREQ,
  },
  ws = 0x7c459fa47208 {
    id = \"req\",
    {s,f,r,e} = {0x7c459fa49008,+32768,+57328,+57328},
  },
  http_conn = 0x7c459fa47130 {
    fd = 114,
    doclose = NULL,
    ws = 0x7c459fa47208,
    {rxbuf_b, rxbuf_e} = {0x7c459fa51008, 0x7c459fa58feb},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = 0,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7c459fa472a0 {
    ws[] = (nil),
    hdrs {
    },
  },
  flags = {
  },
},

" 

@lkarsten
Contributor

@rnsanchez and @porcospino - Can you please try the 4.1.3-beta1 version and see if that helps? See https://www.varnish-cache.org/releases/index.html#releases .

There isn't a specific fix for this in there, but it may be that one of the many other bugs fixed was what provoked this.

@rnsanchez
Contributor

@lkarsten I need to review the commits up to 4.1.3-beta1 before rolling out to this server, but I will keep you posted.

@lkarsten
Contributor

@rnsanchez Is this still current? Did you rebase your branch onto 4.1.3? Do you still see it with the raised workspace sizes you describe in #2008?

@rnsanchez
Contributor

@lkarsten we're still using 96 KB, and our deployed branch is based on commit 1d73aa7. We have not seen anything specific to this ticket since new binaries were deployed (around July 9th).

Sorry for the delay.

@bsdphk
Contributor Author

bsdphk commented Sep 12, 2016

timing this out.

@hermunn
Member

hermunn commented Sep 13, 2016

backport review: Timed out, nothing to backport

@Dridi
Member

Dridi commented Nov 25, 2016

We have a theory and a likely culprit for this problem. If planets^Wpointers align, session initialization [1,2] may overwrite the end-of-workspace marker with a NUL char in 4.1. This is also true in current master [3] and possibly even more unsafe for H2. The root cause is that we disregard the result of the workspace reservation.

[1] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L154-L157
[2] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L174-L185
[3] https://github.com/varnishcache/varnish-cache/blob/3f0b634/bin/varnishd/cache/cache_session.c#L193-L204
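
To illustrate the failure mode, here is a toy model with made-up helper names, not Varnish source (the {s,f,r,e} fields mirror the workspace layout shown in the panics above): if the code ignores how many bytes a reservation actually yielded and writes a terminating NUL anyway, that NUL lands on the canary byte at ws->e when the workspace is full, which is exactly the *ws->e == 0x15 condition WS_Assert() later finds violated.

    /* Toy model only -- not Varnish code.  Shows how ignoring the size of
     * a workspace reservation can clobber the end-of-workspace canary. */
    #include <assert.h>

    #define WS_CANARY 0x15

    struct toy_ws {
        char *s, *f, *r, *e;    /* start, free, reservation, end (canary) */
    };

    /* Hand out whatever is left; the return value says how much that is. */
    static unsigned
    toy_reserve(struct toy_ws *ws, char **p)
    {
        *p = ws->f;
        return (unsigned)(ws->e - ws->f);
    }

    /* BUG: the reservation size is ignored.  With zero bytes left, buf
     * points at ws->e and the NUL overwrites the 0x15 canary. */
    static void
    buggy_rx_init(struct toy_ws *ws)
    {
        char *buf;

        (void)toy_reserve(ws, &buf);
        buf[0] = '\0';
    }

    int
    main(void)
    {
        char mem[65];
        struct toy_ws ws = { mem, mem + 64, mem + 64, mem + 64 };

        *ws.e = WS_CANARY;
        buggy_rx_init(&ws);            /* workspace already completely full */
        assert(*ws.e == WS_CANARY);    /* the check from the panics; fails */
        return (0);
    }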

@mbgrydeland
Contributor

mbgrydeland commented Nov 25, 2016

FWIW, the attached test case triggers the assertion on my machine. I am not suggesting this test case for inclusion, as the sizes used are way too fragile to be useful as a regression test. The size of the bogus allocation needs to be exactly large enough to leave zero workspace at the right point in order to trigger the bug. I ended up on 208 bytes by trial and success ;-)
Edit: 192 is the magic number on my computer for Varnish 4.1 HEAD.
hunch-ws.vtc.txt

mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Nov 29, 2016
HTC_RxInit could write a single '\0' NUL character outside of the
workspace when it is called and there are zero bytes left in the
workspace. This would trip the workspace canary, causing a subsequent
assertion failure.

Fix by relieving HTC_RxInit of adding the '\0' character.

HTC_RxStuff now returns HTC_S_Overflow early if the available buffer
space is zero. The '\0' character is inserted just before calling the
completion check function.

Also fix an off-by-one error in the http_{req|resp}_size calculations,
where the maximum number of bytes accepted was one less than the
parameter indicated. c00039.vtc and c00040.vtc have been edited to
reflect that and to be more expressive about the sizes they generate.

Fixes: varnishcache#1834
mbgrydeland added a commit that referenced this issue Dec 5, 2016
HTC_RxInit and HTC_RxReInit could write a single '\0' NUL character
outside of the workspace when they are called and there are zero bytes
left in the workspace. This would trip the workspace canary, causing a
subsequent assertion failure.

Fix by relieving HTC_RxInit and HTC_RxReInit of adding the '\0'
character.

HTC_RxStuff and V1F_FetchRespHdr return HTC_S_OVERFLOW if the
available buffer space is zero. Both make sure to insert the '\0'
character just before calling the completion check function.

Note that this fix does not change the fact that we have exhausted
the workspace and are unable to continue. Varnishd will panic
nonetheless, but at least we have not stepped out of our boundaries.

Ref: #1834
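
Roughly, the shape of the fix described above, again as a simplified sketch rather than the actual patch (the struct and status names below are condensed; only HTC_RxInit, HTC_RxStuff and HTC_S_OVERFLOW come from the commit message): the init path no longer writes the NUL, and the receive path reports overflow while there is still room to stop safely, writing the terminator only once it knows the byte fits.

    /* Sketch of the fix, not the actual diff. */

    enum toy_htc_status { TOY_HTC_S_OK, TOY_HTC_S_OVERFLOW };

    struct toy_htc {
        char *rxbuf_b, *rxbuf_e;    /* received bytes so far */
        char *buf_end;              /* end of the reserved buffer */
    };

    /* After the fix: init only records pointers; it no longer writes a
     * '\0', so a zero-byte reservation cannot push a write past the
     * workspace. */
    void
    fixed_rx_init(struct toy_htc *htc, char *buf, unsigned len)
    {
        htc->rxbuf_b = htc->rxbuf_e = buf;
        htc->buf_end = buf + len;
    }

    /* The receive loop now bails out with an overflow status when no
     * buffer space remains, and NUL-terminates just before invoking the
     * completion check, when the terminator is known to fit inside the
     * reservation. */
    enum toy_htc_status
    fixed_rx_stuff(struct toy_htc *htc)
    {
        if (htc->rxbuf_e >= htc->buf_end)
            return (TOY_HTC_S_OVERFLOW);

        /* ... read more bytes into [rxbuf_e, buf_end) here ... */

        *htc->rxbuf_e = '\0';
        return (TOY_HTC_S_OK);
    }
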
@hermunn
Member

hermunn commented Jan 27, 2017

FYI: The final commit, c1fa324, is a backport to 4.1. Thanks again @mbgrydeland!
