New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic vmod_rr_resolve() round_robin.c line 75 (be) != NULL #2024

Closed
nigoroll opened this Issue Jul 25, 2016 · 12 comments

Comments

Projects
None yet
6 participants
@nigoroll
Contributor

nigoroll commented Jul 25, 2016

panic after 6 minutes running on 1ce9cfc - no unusual activity

Panic at: Mon, 25 Jul 2016 20:22:06 GMT
"Assert error in vmod_rr_resolve(), round_robin.c line 75:
  Condition((be) != NULL) not true.
thread = (cache-worker)
version = varnish-trunk revision 1ce9cfc
ident = -jsolaris,-smalloc,-smalloc,-hcritbit,ports
Backtrace:
  454990: pan_backtrace+0x1d
  454d52: pan_ic+0x25e
  ffffbf7ff4a05ffa: libvmod_directors.so'vmod_rr_resolve+0x27b [0xffffbf7ff4a05ffa]
  434795: vdi_resolve+0x14b
  43493c: VDI_GetHdr+0xde
  43e386: vbf_stp_startfetch+0x299
  441242: vbf_fetch_thread+0x36f
  472692: Pool_Work_Thread+0x4d6
  471395: WRK_Thread+0x299
  4727c2: pool_thread+0x96
busyobj = f417030 {
  ws = f4170b0 {
    id = \"bo\",
    {s, f, r, e} = {f41ff68, +512, 0, +225440},
  },
  retries = 0, failed = 0, flags = {do_stream},
  director_req = 8af310 {
    vcl_name = php_rr,
    type = round-robin {
    },
  },
  http[bereq] = f417678 {
    ws[bo] = f4170b0,
    hdrs {
      \"GET\",
      \"/*redacted*\",
      \"HTTP/1.1\",
      \"Host: *redacted*\",
      \"*redacted*\",
      \"*redacted*\",
      \"Via: 1.1 *redacted*\",
      \"Accept-Encoding: gzip\",
      \"Surrogate-Capability: *redacted*=%22Surrogate/1.0 ESI/1.0%22\",
      \"X-Forwarded-For: *redacted*\",
      \"X-Varnish: 1376745\",
    },
  },
  http[beresp] = f417af0 {
    ws[bo] = f4170b0,
    hdrs {
    },
  },
  objcore[fetch] = f973250 {
    refcnt = 2,
    flags = {busy},
    exp_flags = {},
    boc = f95b670 {
      refcnt = 2,
      state = req_done,
      vary = 0,
      stevedore_priv = 0,
    },
    exp = {0.000000, 0.000000, 0.000000, 0.000000}
    objhead = f9c1ad0,
    stevedore = 0,
  },
  vcl = {
    busy = 70
    discard = 0,
    state = auto,
    temp = warm,
    conf = {
      srcname = {
        \"/etc/varnish/default.vcl\",
        \"Builtin\",
      },
    },
  },
},

"

@nigoroll

This comment has been minimized.

Show comment
Hide comment
@nigoroll

nigoroll Jul 25, 2016

Contributor

Interesting. This looks exactly like #2005 and #1898 but the platform is different: The other issues were for linux and the one I had a detailed look at was KVM so I suspected this to be related - but this panic was on SmartOS, so there is no hypervisor to blame.

Contributor

nigoroll commented Jul 25, 2016

Interesting. This looks exactly like #2005 and #1898 but the platform is different: The other issues were for linux and the one I had a detailed look at was KVM so I suspected this to be related - but this panic was on SmartOS, so there is no hypervisor to blame.

@nigoroll

This comment has been minimized.

Show comment
Hide comment
@nigoroll

nigoroll Jul 25, 2016

Contributor

this got introduced with be69a03

Contributor

nigoroll commented Jul 25, 2016

this got introduced with be69a03

@fgsch

This comment has been minimized.

Show comment
Hide comment
@fgsch

fgsch Jul 25, 2016

Member

I'm on my mobile so cannot check the code but my first thought is whether
we should consider reverting this change.

Any reason not to?

On 25 Jul 2016 10:09 p.m., "Nils Goroll" notifications@github.com wrote:

this got introduced with be69a03
be69a03


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#2024 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAOweeuFBUcO7mifbZqWkY9-X-3W112kks5qZSYUgaJpZM4JUgxP
.

Member

fgsch commented Jul 25, 2016

I'm on my mobile so cannot check the code but my first thought is whether
we should consider reverting this change.

Any reason not to?

On 25 Jul 2016 10:09 p.m., "Nils Goroll" notifications@github.com wrote:

this got introduced with be69a03
be69a03


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#2024 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAOweeuFBUcO7mifbZqWkY9-X-3W112kks5qZSYUgaJpZM4JUgxP
.

@Dridi

This comment has been minimized.

Show comment
Hide comment
@Dridi

Dridi Jul 25, 2016

Member

@fgsch I think this change was prep work for allowing add/delete operations after the init phase, enabling dynamic backends in the built-in directors.

Member

Dridi commented Jul 25, 2016

@fgsch I think this change was prep work for allowing add/delete operations after the init phase, enabling dynamic backends in the built-in directors.

@nigoroll

This comment has been minimized.

Show comment
Hide comment
@nigoroll

nigoroll Jul 25, 2016

Contributor

I think we should keep the reader/writer lock to improve concurrency with the potentially higher amount of traffic for dynamic backends - plus many directors will actually only issue read access.
If anything I'd vote for rw access of the rr director.

Contributor

nigoroll commented Jul 25, 2016

I think we should keep the reader/writer lock to improve concurrency with the potentially higher amount of traffic for dynamic backends - plus many directors will actually only issue read access.
If anything I'd vote for rw access of the rr director.

@Dridi

This comment has been minimized.

Show comment
Hide comment
@Dridi

Dridi Jul 25, 2016

Member

Summoning @slimhazard

Member

Dridi commented Jul 25, 2016

Summoning @slimhazard

@nigoroll

This comment has been minimized.

Show comment
Hide comment
@nigoroll

nigoroll Jul 25, 2016

Contributor

@Dridi the original code aquired a mutex always, so that would also avoid the race. But @slimhazard (whom I pinged by email also) intended to improve concurrency. This was something we discussed at the Rotterdam VDD IIUC

Contributor

nigoroll commented Jul 25, 2016

@Dridi the original code aquired a mutex always, so that would also avoid the race. But @slimhazard (whom I pinged by email also) intended to improve concurrency. This was something we discussed at the Rotterdam VDD IIUC

hermunn added a commit that referenced this issue Aug 10, 2016

close a potential race which could cause an out-of-bounds array access
We're only holding a read lock on the director, but we're updating
the nxt member concurrently. This should be acceptable as a performance
tradeoff - the only consequence is that round-robin is not strictly
going around - it may occasionally skip a backend or hand out the same
multiple times in a row.

the race is:

	thread	code

	A:	rr->nxt %= rr->vd->n_backend;
	// rr->nxt == rr->vd->n_backend - 1
	B:	rr->nxt++;
	// rr->nxt == rr->vd->n_backend
	A:	be = rr->vd->backend[nxt];
	// BOOM

should fix #2024
@hermunn

This comment has been minimized.

Show comment
Hide comment
@hermunn

hermunn Aug 10, 2016

Contributor

Backport review: Backported as 9edb2b6

Contributor

hermunn commented Aug 10, 2016

Backport review: Backported as 9edb2b6

@rnsanchez

This comment has been minimized.

Show comment
Hide comment
@rnsanchez

rnsanchez Sep 18, 2017

Contributor

@hermunn: any chances b8290db is going to be backported to 4.1?

Contributor

rnsanchez commented Sep 18, 2017

@hermunn: any chances b8290db is going to be backported to 4.1?

@hermunn

This comment has been minimized.

Show comment
Hide comment
@hermunn

hermunn Sep 19, 2017

Contributor

@rnsanchez I will work on backporting this week, and make sure this is considered.

Contributor

hermunn commented Sep 19, 2017

@rnsanchez I will work on backporting this week, and make sure this is considered.

@rnsanchez

This comment has been minimized.

Show comment
Hide comment
@rnsanchez

rnsanchez Sep 19, 2017

Contributor

Great news! 4.1 is hitting this issue from time to time in production.

Contributor

rnsanchez commented Sep 19, 2017

Great news! 4.1 is hitting this issue from time to time in production.

@hermunn

This comment has been minimized.

Show comment
Hide comment
@hermunn

hermunn Sep 19, 2017

Contributor

The patch b8290db has been back ported as 1ec38ef.

Contributor

hermunn commented Sep 19, 2017

The patch b8290db has been back ported as 1ec38ef.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment