Panic: Assert error in VBE_Delete() #2008

Closed
rnsanchez opened this Issue Jul 6, 2016 · 21 comments

@rnsanchez
Contributor
rnsanchez commented Jul 6, 2016 edited
Last panic at: Wed, 06 Jul 2016 16:59:20 GMT
"Assert error in VBE_Delete(), cache/cache_backend_cfg.c line 249:
  Condition((be->vsc) == 0) not true.
thread = (cache-main)
version = varnish-4.1.3-beta2 revision 43aa5d8
Backtrace:
  0x434532: pan_ic+0x152
  0x414a84: varnishd() [0x414a84]
  0x414ce5: VBE_Poll+0x75
  0x41b423: cli_cb_before+0x33
  0x7f3de351edde: libvarnish.so(+0x6dde) [0x7f3de351edde]
  0x7f3de351f417: libvarnish.so(+0x7417) [0x7f3de351f417]
  0x7f3de35245b1: libvarnish.so(VLU_Fd+0xc1) [0x7f3de35245b1]
  0x7f3de351ffa7: libvarnish.so(VCLS_Poll+0x147) [0x7f3de351ffa7]
  0x41b531: CLI_Run+0x41
  0x430a7b: child_main+0x14b

This is a custom branch rebased on top of 45c8bcbd708556de1e0bb9d24d1430e202d6a289 running on Slackware64/14.2 (freshly installed), Linux kernel 4.4.14 (stock, no external modules).

Please advise whether you need more information.

@Dridi
Member
Dridi commented Jul 7, 2016 edited

I can't find the commit you mentioned, so more details about this custom branch could help, especially the nature of your changes, any VMODs in use, and what you are doing with backends.

@rnsanchez
Contributor

I'm sorry, it seems the commit I mentioned got a new hash after a merge/rebase; here is the correct one: 0577f3f

VMOD in use: named, for dynamic backends pointing at Amazon servers. We also run with a low TTL (3 seconds) and a larger workspace (96 KB) to avoid the panics we had with #1834 (we are not yet allowed to lower it back to 64 KB). Our branch has 25 small-ish commits for byte-accounting billing, traffic shaping, and API keys.

Let me know if you need more info.

@Dridi Dridi self-assigned this Jul 7, 2016
@Dridi Dridi removed the more-info-needed label Jul 7, 2016
@Dridi
Member
Dridi commented Jul 7, 2016

Assigning to myself, this looks like a dynamic backend issue as I suspected :)

@lkarsten lkarsten pushed a commit that referenced this issue Jul 7, 2016
Lasse Karstensen Stop incorrectly reporting overflow on zeroed ws.
Ref: #2008
11692e7
@lkarsten
Contributor
lkarsten commented Jul 7, 2016

Sorry, wrong issue referenced in 11692e7; it concerns #1999.

@Dridi
Member
Dridi commented Jul 11, 2016

@rnsanchez I currently suspect a bug in vmod-named, but you mention byte-accounting among your 25 commits, and this panic is related to backend counters.

So now I also suspect a bug in your fork, especially since the assertion reported at line 249 in your panic message is at line 216 in the commit you branched from:

https://github.com/varnishcache/varnish-cache/blob/0577f3f/bin/varnishd/cache/cache_backend_cfg.c#L192-L219

@rnsanchez
Contributor

@Dridi thanks for the pointer, I will investigate on our side and report back.

@rnsanchez
Contributor

We simply duplicate some attributes from ReqAcct, for convenience. What we did notice is that the problem is triggered by the following sequence:

  1. vcl.load t1 /New.vcl
  2. vcl.use t1
  3. vcl.state t0 cold

The only VMOD we use is libvmod-named. We just deployed without the last step, in case this is race-related (setting the VCL cold too early). To simplify things further, we rebased our repo, so recent fixes are included as well.

It takes some time to trigger (10~15k dynamic backends and then a VCL change), maybe a couple of days. We'll report back with a backtrace (with symbols) as soon as it is reproduced.

@Dridi
Member
Dridi commented Jul 26, 2016

Can you please confirm you upgraded to 4.1.3?

Judging by the previous panic, it doesn't look like changing the temperature is what triggers it. It happens during the VCL and backend cleanup performed before every CLI command, so any command sent via varnishadm could trigger it, and in your case the VCL isn't cold/cooling yet.

There must be a race somewhere between vmod-named deleting a backend and VBE_Poll being called at the same time by the CLI thread.

@Dridi
Member
Dridi commented Jul 27, 2016 edited

I've been looking hard at vmod-named, especially at how backends may or may not be deleted at discard time (with different expectations from Varnish), and couldn't find anything. So I had a look at the VRT_delete_backend function provided by Varnish and found a missing lock acquisition when the backend is moved to the cool backends list.

Below is a simple patch that fixes the locking issue and respects the other locking requirements, but it doesn't solve your problem.

diff --git i/bin/varnishd/cache/cache_backend_cfg.c w/bin/varnishd/cache/cache_backend_cfg.c
index ad96330..06f5fe8 100644
--- i/bin/varnishd/cache/cache_backend_cfg.c
+++ w/bin/varnishd/cache/cache_backend_cfg.c
@@ -157,9 +157,11 @@ VRT_delete_backend(VRT_CTX, struct director **dp)
        be->admin_health = vbe_ah_deleted;
        be->health_changed = VTIM_real();
        be->cooled = VTIM_real() + 60.;
+       Lck_Unlock(&be->mtx);
+       Lck_Lock(&backends_mtx);
        VTAILQ_REMOVE(&backends, be, list);
        VTAILQ_INSERT_TAIL(&cool_backends, be, list);
-       Lck_Unlock(&be->mtx);
+       Lck_Unlock(&backends_mtx);

        // NB. The backend is still usable for the ongoing transactions,
        // this is why we don't bust the director's magic number.

The VBE_Poll function guarantees a COLD event on a warm backend before calling VBE_Delete, and when a VCL cools down, all its backends invariably get a COLD event. All of this happens sequentially on the CLI thread, so no locking is needed there.

A backend, on the other hand, can be created outside of the CLI thread, and it reads the unguarded VCL temperature to decide whether to emit a WARM event. I think we have our race; the question now is how to close it. At first glance it could be covered by a lock in cache_backend_cfg.c, used only in this file. See the sketch below.
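
To make the idea concrete, here is a minimal pthread-based model of that file-private lock, not the actual Varnish code: every name in it (temp_guard_mtx, vcl_temp, backend_event, ...) is invented for illustration. The point is simply that the CLI thread's temperature change and the WARM event fired at backend creation take the same lock, so the unguarded read goes away.

#include <pthread.h>

enum temp { TEMP_COLD, TEMP_WARM };

/* File-private guard, analogous to a lock living only in cache_backend_cfg.c. */
static pthread_mutex_t temp_guard_mtx = PTHREAD_MUTEX_INITIALIZER;
static enum temp vcl_temp = TEMP_WARM;

static void
backend_event(enum temp ev)
{
	(void)ev;	/* stand-in for sending a COLD/WARM event to a backend */
}

/* CLI thread: cools the VCL and its backends under the guard. */
static void
cli_set_cold(void)
{
	pthread_mutex_lock(&temp_guard_mtx);
	vcl_temp = TEMP_COLD;
	backend_event(TEMP_COLD);
	pthread_mutex_unlock(&temp_guard_mtx);
}

/* Worker thread: a dynamic backend is created and checks the temperature. */
static void
worker_new_backend(void)
{
	pthread_mutex_lock(&temp_guard_mtx);
	if (vcl_temp == TEMP_WARM)
		backend_event(TEMP_WARM);	/* no longer racing with the CLI thread */
	pthread_mutex_unlock(&temp_guard_mtx);
}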

@rnsanchez
Contributor

Thanks @Dridi, I'm passing this info along. We're trying to trigger the panic reliably, in the hope of being able to confirm an eventual fix.

@Dridi
Member
Dridi commented Jul 27, 2016

On second glance, the VCL's lock could fit too.

@Dridi
Member
Dridi commented Jul 27, 2016 edited

After one more stare at the code, I came to the conclusion that we need a rwlock in struct vcl. The temp field is only ever written by the CLI thread, but it is read during backend creation as explained above, and an event may be fired depending on those unguarded reads.

Moreover, the temp field is used in assertions outside of the CLI thread, which are even more dangerous unguarded reads. I need to verify the correctness of four temperature assertions that look incomplete or temporary.

I'm concerned about the potential size of the write-lock critical sections given the current shape of the temperature code, although it might be just fine because only one (leaf) branch is involved at a time, and further staring at the code may show nothing scary.
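
A minimal pthread-based model of what such a rwlock scheme could look like (invented names; struct vcl_sketch is not the real struct vcl): the CLI thread is the sole writer of the temperature and takes the write lock, every other thread takes the read lock before reading it, and when both locks are needed temp_rwl is acquired before vcl_mtx.

#include <pthread.h>

/* Sketch only: a miniature of struct vcl, not the real one. */
struct vcl_sketch {
	pthread_rwlock_t	temp_rwl;	/* guards reads/writes of the temperature */
	pthread_mutex_t		vcl_mtx;	/* acquired after temp_rwl when both are needed */
	int			warm;
};

/* CLI thread only: the sole writer of the temperature. */
static void
cli_set_temp(struct vcl_sketch *vcl, int warm)
{
	pthread_rwlock_wrlock(&vcl->temp_rwl);
	vcl->warm = warm;
	/* ...fire the matching COLD/WARM events from the CLI thread here... */
	pthread_rwlock_unlock(&vcl->temp_rwl);
}

/* Any other thread: reads the temperature under the read lock. */
static int
vcl_is_warm(struct vcl_sketch *vcl)
{
	int warm;

	pthread_rwlock_rdlock(&vcl->temp_rwl);
	warm = vcl->warm;
	pthread_rwlock_unlock(&vcl->temp_rwl);
	return (warm);
}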

I'll be working on this when I come back from holidays, I'm dumping some notes here to help my future self.

Off-topic note to future self: there is probably room for a condvar on vcl_mtx.

@Dridi Dridi added a commit that referenced this issue Jul 28, 2016
@Dridi Dridi Lock when a backend is moved between lists
Refs #2008
d398402
@rnsanchez
Contributor

@Dridi we're in the process of updating our branches to include d398402. It takes a while to trigger the panic (usually more than 10k backends).

@Dridi
Member
Dridi commented Aug 25, 2016

Thanks for the confirmation, I will follow the other lead documented above anyway. However, the master branch now includes support for VCL labels and VCL switching (not sure we have a fancy name for the latter), so fixing this for both 5.0 and 4.1 will take twice as long once I get to it.

@rnsanchez
Contributor

@Dridi: we have more info. After rebasing our local branch to match 4.1 (02df97a) and cherry-picking your fix (d398402), the panic still happens.

The panic usually happens when the third column in vcl.list (the busy count) reaches zero. I was told that varnishadm felt frozen for quite a while when it happened, and that the uptime did not increase until the panic had completely flushed out. It was not related to load, as the machine had barely any traffic.

Here is a minimal VCL used for testing:

vcl 4.0;

import named;

probe probe_D__B_147223793380_s3 {
    .request = 
        "HEAD / HTTP/1.1"
        "Host: testbucket.s3.amazonaws.com"
        "Connection: close";
    .initial  = 0;
    .interval = 3s;
    .timeout  = 2m;
    .window   = 2;
    .expected_response = 200;
    .threshold = 1;
}

sub vcl_init {
    # Director: named director
    new D___B_147223793380_s3 = named.director(
        port = "80",
        usage_timeout = 300s,
        first_lookup_timeout = 2s,
        probe = probe_D__B_147223793380_s3,
        ttl = 60s
    );
}

sub vcl_recv {
    set req.backend_hint = D___B_147223793380_s3.backend("testbucket.s3.amazonaws.com");
}

Commands used for the test:

varnish> vcl.list
200        
active     auto/warm         20 p2

varnish> vcl.load p3 /tmp/singular/0826-1472237933/0826-1472237933.vcl 
200        
VCL compiled.

varnish> vcl.use p3
200        
VCL 'p3' now active
varnish> vcl.discard p2
200        

varnish> vcl.list
200        
discarded  auto/busy         16 p2
active     auto/warm          8 p3

Panic:

panic.show 
200        
Last panic at: Fri, 26 Aug 2016 15:54:53 GMT
"Assert error in VBP_Remove(), cache/cache_backend_probe.c line 587:
  Condition(vt->heap_idx == BINHEAP_NOIDX) not true.
thread = (cache-main)
version = varnish-4.1.3 revision 92d62e5
ident = Linux,4.6.3,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x43e5ab: pan_backtrace+0x1d
  0x43e9b6: pan_ic+0x2a7
  0x416272: VBP_Remove+0x17c
  0x4139cd: VBE_Delete+0x82
  0x4145e4: VBE_Poll+0x86
  0x41cf0e: cli_cb_before+0x71
  0x7f6f654a9072: libvarnish.so(+0x8072) [0x7f6f654a9072]
  0x7f6f654a964e: libvarnish.so(+0x864e) [0x7f6f654a964e]
  0x7f6f654b05e8: libvarnish.so(+0xf5e8) [0x7f6f654b05e8]
  0x7f6f654b07af: libvarnish.so(VLU_Fd+0xe3) [0x7f6f654b07af]

"
@Dridi Dridi added a commit to Dridi/varnish-cache that referenced this issue Aug 31, 2016
@Dridi Dridi Non volatile reads/writes use the temperature lock
Acquiring a read lock on the CLI thread is unnecessary since a write
lock may only be acquired on that very thread.

The locking order when both are required is:

    (struct vcl).temp_rwl => vcl_mtx

Fixes #2008
8c8820b
@Dridi
Member
Dridi commented Aug 31, 2016

Hello @rnsanchez, the last panic you reported is a new one. Please open a new issue if you get this panic again after fixing the first one.

For the first panic, please give #2068 a spin.

@hermunn hermunn added a commit to hermunn/varnish-cache that referenced this issue Sep 8, 2016
@hermunn Lasse Karstensen + hermunn Stop incorrectly reporting overflow on zeroed ws.
Ref: #2008
adee6ec
@hermunn hermunn added a commit to hermunn/varnish-cache that referenced this issue Sep 8, 2016
@Dridi @hermunn Dridi + hermunn Lock when a backend is moved between lists
Refs #2008
09d9876
@hermunn hermunn added a commit to hermunn/varnish-cache that referenced this issue Sep 9, 2016
@hermunn Lasse Karstensen + hermunn Stop incorrectly reporting overflow on zeroed ws.
Ref: #2008
5df35e8
@hermunn hermunn added a commit to hermunn/varnish-cache that referenced this issue Sep 9, 2016
@Dridi @hermunn Dridi + hermunn Lock when a backend is moved between lists
Refs #2008
73b33d6
@Dridi Dridi added a commit that closed this issue Sep 9, 2016
@Dridi Dridi Non volatile reads/writes use the temperature lock
Acquiring a read lock on the CLI thread is unnecessary since a write
lock may only be acquired on that very thread.

The locking order when both are required is:

    (struct vcl).temp_rwl => vcl_mtx

Fixes #2008
6f360c9
@Dridi Dridi closed this in 6f360c9 Sep 9, 2016
@Dridi Dridi added bug 4.1 varnishd and removed more-info-needed labels Sep 9, 2016
@Dridi
Member
Dridi commented Sep 9, 2016

Reopening until I land a fix in the 4.1 branch too. I'd like to thank @rnsanchez for reporting and my past self for making his future easier.

@Dridi Dridi reopened this Sep 9, 2016
@Dridi Dridi added a commit that referenced this issue Sep 12, 2016
@Dridi Dridi Non volatile reads/writes use the temperature lock
Acquiring a read lock on the CLI thread is unnecessary since a write
lock may only be acquired on that very thread.

The locking order when both are required is:

    (struct vcl).temp_rwl => vcl_mtx

Fixes #2008

Conflicts:
	bin/varnishd/cache/cache_vcl.c
9ae3861
@Dridi
Member
Dridi commented Sep 12, 2016

@rnsanchez please test the current 4.1 branch and let me know whether it fixes your crash.

@rnsanchez
Contributor

@Dridi: I've got confirmation that the current 4.1 branch fixes the problem. We are no longer able to reproduce it.

Thank you!

@Dridi
Member
Dridi commented Sep 13, 2016

Great, thanks for confirming.

@Dridi Dridi closed this Sep 13, 2016
@hermunn
Contributor
hermunn commented Sep 23, 2016

Backport review: Fixed in 4.1 in 9ae3861.
