WS_Assert(), cache/cache_ws.c line 59 #1834

Closed
bsdphk opened this Issue Mar 4, 2016 · 13 comments

9 participants
Contributor

bsdphk commented Mar 4, 2016

Old ticket imported from Trac
See archived copy here: https://varnish-cache.org/trac/ticket/1834

Contributor

nigoroll commented May 27, 2016

@lkarsten close?

Contributor

lkarsten commented May 27, 2016

I don't think we've squashed this one. I asked the reporter to check if it still happens with 4.1HEAD, timeout in 10d if nothing heard.

lkarsten self-assigned this May 30, 2016

porcospino commented Jun 16, 2016

Just had it happen on 4.1.2
Last panic at: Thu, 16 Jun 2016 12:02:44 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.2 revision 0d7404e
ident = Linux,3.13.0-87-generic,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x433976: varnishd() [0x433976]
0x44a5ab: varnishd(WS_Assert+0x18b) [0x44a5ab]
0x44af71: varnishd(WS_ReleaseP+0x11) [0x44af71]
0x43a865: varnishd(SES_RxStuff+0x205) [0x43a865]
0x44ea78: varnishd(HTTP1_Session+0x178) [0x44ea78]
0x439d0d: varnishd(SES_Proto_Req+0x5d) [0x439d0d]
0x4493d1: varnishd() [0x4493d1]
0x44983b: varnishd() [0x44983b]
0x7f897fec9184: libpthread.so.0(+0x8184) [0x7f897fec9184]
0x7f897fbf637d: libc.so.6(clone+0x6d) [0x7f897fbf637d]
req = 0x7f89660ec020 {
vxid = 0, step = R_STP_RESTART,
req_body = R_BODY_INIT,
restarts = 0, esi_level = 0,
sp = 0x7f8957ad2a20 {
fd = 25, vxid = 29036,
client = 192.168.69.189 59741,
step = S_STP_H1NEWREQ,
},
ws = 0x7f89660ec200 {
id = "req",
{s,f,r,e} = {0x7f89660ee000,+32768,+57336,+57336},
},
http_conn = 0x7f89660ec128 {
fd = 25,
doclose = NULL,
ws = 0x7f89660ec200,
{rxbuf_b, rxbuf_e} = {0x7f89660f6000, 0x7f89660fdfe3},
{pipeline_b, pipeline_e} = {(nil), (nil)},
content_length = 0,
body_status = none,
first_byte_timeout = 0.000000,
between_bytes_timeout = 0.000000,
},
http[req] = 0x7f89660ec298 {
ws[] = (nil),
hdrs {
},
},
flags = {
},
},

Contributor

rnsanchez commented Jun 16, 2016

We do see this panic on our custom 4.1 branch (some internal functionality on top of upstream/4.1). As suggested in the old ticket (Trac), we're checking whether 96 KB for both client and backend workspaces helps.

Last panic at: Thu, 16 Jun 2016 06:00:27 GMT
"Assert error in WS_Assert(), cache/cache_ws.c line 59:
  Condition(*ws->e == 0x15) not true.
thread = (cache-worker)
version = varnish-4.1.3-dev revision 96c00a5
ident = Linux,4.4.11,x86_64,-junix,-sfile,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x434502: pan_ic+0x152
  0x44b7c6: WS_Assert+0x176
  0x44c171: WS_ReleaseP+0x11
  0x43b56d: SES_RxStuff+0x1fd
  0x4500c2: HTTP1_Session+0x1b2
  0x43aa2d: SES_Proto_Req+0x5d
  0x44a62a: WRK_Thread+0x47a
  0x44aa7b: pool_thread+0x2b
  0x7f33fa0ca654: libpthread.so.0(+0x7654) [0x7f33fa0ca654]
  0x7f33f9e007dd: libc.so.6(clone+0x6d) [0x7f33f9e007dd]
req = 0x7c459fa47020 {
      vxid = 0, step = R_STP_RESTART,
      req_body = R_BODY_INIT,
      restarts = 0, esi_level = 0,
      sp = 0x7c45a844d620 {
        fd = 114, vxid = 67983,
        client = 186.250.8.9 11882,
        t_open = 1466056827.205797,
        t_idle = 1466056827.205797,
    step = S_STP_H1NEWREQ,
  },
  ws = 0x7c459fa47208 {
    id = \"req\",
    {s,f,r,e} = {0x7c459fa49008,+32768,+57328,+57328},
  },
  http_conn = 0x7c459fa47130 {
    fd = 114,
    doclose = NULL,
    ws = 0x7c459fa47208,
    {rxbuf_b, rxbuf_e} = {0x7c459fa51008, 0x7c459fa58feb},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = 0,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7c459fa472a0 {
    ws[] = (nil),
    hdrs {
    },
  },
  flags = {
  },
},

" 
Contributor

lkarsten commented Jun 17, 2016

@rnsanchez and @porcospino - Can you please try the 4.1.3-beta1 version and see if that helps? See https://www.varnish-cache.org/releases/index.html#releases.

There isn't a specific fix for this in there, but it may be that one of the many other bugs fixed in that release was provoking this.

Contributor

rnsanchez commented Jun 17, 2016

@lkarsten I need to review the commits up to 4.1.3-beta1 before rolling out to this server, but I will keep you posted.

Contributor

lkarsten commented Aug 25, 2016

@rnsanchez Is this still current? Did you rebase your branch to 4.1.3? Do you still see it with the raised workspace sizes you describe in #2008?

Contributor

rnsanchez commented Aug 25, 2016

@lkarsten we're still using 96 KB, and our deployed branch is based on commit 1d73aa7. We have not seen anything specific to this ticket since new binaries were deployed (around July 9th).

Sorry for the delay.

Contributor

bsdphk commented Sep 12, 2016

timing this out.

Contributor

hermunn commented Sep 13, 2016

backport review: Timed out, nothing to backport

Dridi reopened this Nov 25, 2016

Member

Dridi commented Nov 25, 2016

We have a theory and a likely culprit for this problem. Basically, if planets^Wpointers align, session initialization [1,2] may overwrite the end-of-workspace marker with a NUL char in 4.1. This is also true in current master [3] and possibly even more unsafe for H2. The root cause is that we disregard the result of the workspace reservation.

[1] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L154-L157
[2] https://github.com/varnishcache/varnish-cache/blob/9dc2ec5/bin/varnishd/cache/cache_session.c#L174-L185
[3] https://github.com/varnishcache/varnish-cache/blob/3f0b634/bin/varnishd/cache/cache_session.c#L193-L204

Contributor

mbgrydeland commented Nov 25, 2016

FWIW, the attached test case triggers the assertion on my machine. I am not suggesting this test case for inclusion, as the sizes used are waaaay too fragile to be useful as a regression test. The size of the bogus allocation needs to be exactly large enough to have zero WS left at the right point in order to trigger the bug. I ended up on 208 bytes by trial and success ;-)
Edit: 192 is the magic number on my computer for Varnish 4.1 HEAD.
hunch-ws.vtc.txt

mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Nov 29, 2016

Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit could write a single '\0' NUL character outside of the
workspace when it is called with zero bytes left in the
workspace. This would trigger the workspace canary, causing a
subsequent assertion.

Fix by relieving HTC_RxInit of adding the '\0' character.

HTC_RxStuff now returns HTC_S_Overflow early if the available buffer
space is zero. The '\0' character is inserted just before calling the
completion check function.

Also fix an off-by-one error in the http_{req|resp}_size calculations,
where the maximum number of bytes accepted was one less than the
parameter indicated. c00039.vtc and c00040.vtc have been edited to
reflect that and to be more expressive about the sizes they generate.

Fixes: #1834

mbgrydeland added a commit that referenced this issue Dec 5, 2016

Fix buffer overflow in HTC_RxInit on workspace exhaustion
HTC_RxInit and HTC_RxReInit could write a single '\0' NUL character
outside of the workspace when called with zero bytes left in the
workspace. This would trigger the workspace canary, causing a
subsequent assertion.

Fix by relieving HTC_RxInit and HTC_RxReInit of adding the '\0'
character.

HTC_RxStuff and V1F_FetchRespHdr now return HTC_S_OVERFLOW if the
available buffer space is zero. Both make sure to insert the '\0'
character just before calling the completion check function.

Note that this fix does not change the fact that we have exhausted
the workspace and are unable to continue. Varnishd will panic
nonetheless, but at least we have not stepped outside our boundaries.

Ref: #1834
Contributor

hermunn commented Jan 27, 2017

FYI: The final commit, c1fa324 is a backport to 4.1. Thanks again @mbgrydeland!
