panic vmod_rr_resolve() round_robin.c line 75 (be) != NULL #2024

Closed
nigoroll opened this Issue Jul 25, 2016 · 8 comments

Projects

None yet

5 participants

@nigoroll
Contributor

panic after 6 minutes running on 1ce9cfc - no unusual activity

Panic at: Mon, 25 Jul 2016 20:22:06 GMT
"Assert error in vmod_rr_resolve(), round_robin.c line 75:
  Condition((be) != NULL) not true.
thread = (cache-worker)
version = varnish-trunk revision 1ce9cfc
ident = -jsolaris,-smalloc,-smalloc,-hcritbit,ports
Backtrace:
  454990: pan_backtrace+0x1d
  454d52: pan_ic+0x25e
  ffffbf7ff4a05ffa: libvmod_directors.so'vmod_rr_resolve+0x27b [0xffffbf7ff4a05ffa]
  434795: vdi_resolve+0x14b
  43493c: VDI_GetHdr+0xde
  43e386: vbf_stp_startfetch+0x299
  441242: vbf_fetch_thread+0x36f
  472692: Pool_Work_Thread+0x4d6
  471395: WRK_Thread+0x299
  4727c2: pool_thread+0x96
busyobj = f417030 {
  ws = f4170b0 {
    id = \"bo\",
    {s, f, r, e} = {f41ff68, +512, 0, +225440},
  },
  retries = 0, failed = 0, flags = {do_stream},
  director_req = 8af310 {
    vcl_name = php_rr,
    type = round-robin {
    },
  },
  http[bereq] = f417678 {
    ws[bo] = f4170b0,
    hdrs {
      \"GET\",
      \"/*redacted*\",
      \"HTTP/1.1\",
      \"Host: *redacted*\",
      \"*redacted*\",
      \"*redacted*\",
      \"Via: 1.1 *redacted*\",
      \"Accept-Encoding: gzip\",
      \"Surrogate-Capability: *redacted*=%22Surrogate/1.0 ESI/1.0%22\",
      \"X-Forwarded-For: *redacted*\",
      \"X-Varnish: 1376745\",
    },
  },
  http[beresp] = f417af0 {
    ws[bo] = f4170b0,
    hdrs {
    },
  },
  objcore[fetch] = f973250 {
    refcnt = 2,
    flags = {busy},
    exp_flags = {},
    boc = f95b670 {
      refcnt = 2,
      state = req_done,
      vary = 0,
      stevedore_priv = 0,
    },
    exp = {0.000000, 0.000000, 0.000000, 0.000000}
    objhead = f9c1ad0,
    stevedore = 0,
  },
  vcl = {
    busy = 70
    discard = 0,
    state = auto,
    temp = warm,
    conf = {
      srcname = {
        \"/etc/varnish/default.vcl\",
        \"Builtin\",
      },
    },
  },
},

"

@nigoroll
Contributor

Interesting. This looks exactly like #2005 and #1898 but the platform is different: The other issues were for linux and the one I had a detailed look at was KVM so I suspected this to be related - but this panic was on SmartOS, so there is no hypervisor to blame.

@nigoroll nigoroll added a commit that closed this issue Jul 25, 2016
@nigoroll nigoroll close a potential race which could cause an out-of-bounds array access
We're only holding a read lock on the director, but we're updating
the nxt member concurrently. This should be acceptable as a performance
tradeoff - the only consequence is that round-robin is not strictly
going around - it may occasionally skip a backend or hand out the same
multiple times in a row.

the race is:

	thread	code

	A:	rr->nxt %= rr->vd->n_backend;
	// rr->nxt == rr->vd->n_backend - 1
	B:	rr->nxt++;
	// rr->nxt == rr->vd->n_backend
	A:	be = rr->vd->backend[nxt];
	// BOOM

should fix #2024
52d8dd9
@nigoroll nigoroll closed this in 52d8dd9 Jul 25, 2016
@nigoroll nigoroll self-assigned this Jul 25, 2016
@nigoroll
Contributor

this got introduced with be69a03

@fgsch
Member
fgsch commented Jul 25, 2016

I'm on my mobile so cannot check the code but my first thought is whether
we should consider reverting this change.

Any reason not to?

On 25 Jul 2016 10:09 p.m., "Nils Goroll" notifications@github.com wrote:

this got introduced with be69a03
be69a03


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#2024 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAOweeuFBUcO7mifbZqWkY9-X-3W112kks5qZSYUgaJpZM4JUgxP
.

@Dridi
Member
Dridi commented Jul 25, 2016

@fgsch I think this change was prep work for allowing add/delete operations after the init phase, enabling dynamic backends in the built-in directors.

@nigoroll
Contributor

I think we should keep the reader/writer lock to improve concurrency with the potentially higher amount of traffic for dynamic backends - plus many directors will actually only issue read access.
If anything I'd vote for rw access of the rr director.

@Dridi
Member
Dridi commented Jul 25, 2016

Summoning @slimhazard

@nigoroll
Contributor

@Dridi the original code aquired a mutex always, so that would also avoid the race. But @slimhazard (whom I pinged by email also) intended to improve concurrency. This was something we discussed at the Rotterdam VDD IIUC

@hermunn hermunn added a commit that referenced this issue Aug 10, 2016
@nigoroll @hermunn nigoroll + hermunn close a potential race which could cause an out-of-bounds array access
We're only holding a read lock on the director, but we're updating
the nxt member concurrently. This should be acceptable as a performance
tradeoff - the only consequence is that round-robin is not strictly
going around - it may occasionally skip a backend or hand out the same
multiple times in a row.

the race is:

	thread	code

	A:	rr->nxt %= rr->vd->n_backend;
	// rr->nxt == rr->vd->n_backend - 1
	B:	rr->nxt++;
	// rr->nxt == rr->vd->n_backend
	A:	be = rr->vd->backend[nxt];
	// BOOM

should fix #2024
9edb2b6
@hermunn
Contributor
hermunn commented Aug 10, 2016

Backport review: Backported as 9edb2b6

@hermunn hermunn added a commit that referenced this issue Aug 10, 2016
@hermunn hermunn Add #2024 to the changelog 4292ce9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment