Panic: Assert error in ban_mark_completed(), cache/cache_ban.c line 176 #3006

Closed
Jacketbg opened this issue May 29, 2019 · 21 comments
@Jacketbg

Jacketbg commented May 29, 2019

This issue looks identical to #2770 and #2779. However, we are running version 6.2.0 (the latest release as of today) and it is still happening, although it's marked as fixed in the official changelog of 6.0.0. We get a couple of panics per day, each resulting in a restart of the child process, which of course leads to a complete freeze of the service for 30-40 seconds.

I will try to provide as much info as I can, but let me know if there's anything else that might be helpful.

We are using Varnish exclusively with Magento 1.9.3.3 + Turpentine.

Here's a daily graph of our Cache performance. Each drop to 0 corresponds to a Panic log entry:

[image: varnish]

Here's the complete message from our syslog:

May 29 10:03:54 varnish varnishd[18718]: Child (11644) Panic at: Wed, 29 May 2019 07:03:54 GMT
Assert error in ban_mark_completed(), cache/cache_ban.c line 176:
  Condition((b)->magic == 0x700b08ea) not true.
version = varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54, vrt api = 9.0
ident = Linux,3.10.0-957.12.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1311157.412429 (mono), 1559113432.877476 (real)
Backtrace:
  0x43cf0b: /usr/sbin/varnishd() [0x43cf0b]
  0x4a0002: /usr/sbin/varnishd(VAS_Fail+0x42) [0x4a0002]
  0x41defc: /usr/sbin/varnishd() [0x41defc]
  0x42115d: /usr/sbin/varnishd(ban_lurker+0x94d) [0x42115d]
  0x45c4f7: /usr/sbin/varnishd() [0x45c4f7]
  0x7f7fb9c4cdd5: /lib64/libpthread.so.0(+0x7dd5) [0x7f7fb9c4cdd5]
  0x7f7fb9975ead: /lib64/libc.so.6(clone+0x6d) [0x7f7fb9975ead]
errno = 110 (Connection timed out)
thread = (ban-lurker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f7fac400000,
  stack_top = 0x7f7facc00000,
  stack_size = 8388608,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
  std = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  directors = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
},

Our VCL is generated entirely by Turpentine. It's attached here:

varnish.vcl.txt

Some additional info about OS and versions:

 [root@varnish ~]# cat /etc/centos-release
 CentOS Linux release 7.6.1810 (Core)
 [root@varnish ~]# uname -a
 Linux varnish 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
 [root@varnish ~]# varnishd -V
 varnishd (varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54)
 Copyright (c) 2006 Verdens Gang AS
 Copyright (c) 2006-2019 Varnish Software AS
 [root@varnish ~]# free -m
               total        used        free      shared  buff/cache   available
 Mem:          15885        1117       10722         840        4045       13471
 Swap:             0           0

The VM has 1 CPU and 16 GB RAM.

Varnish command line:
/usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a :6081 -T 192.168.103.104:6082 -S /etc/varnish/secret -s malloc,10G -p thread_pool_add_delay=2 -p thread_pools=4 -p thread_pool_min=200 -p thread_pool_max=4000 -p vcc_allow_inline_c=on -p feature=+esi_ignore_other_elements

Even though there is no swap configured on the VM, there is no trace of the OOM killer in dmesg, and Varnish usually crashes before even reaching its 10 GB memory limit.
There is no CPU starvation on the VM either, so this looks completely unrelated to system load.

I'm available if more information is needed.

@Jacketbg Jacketbg changed the title Assert error in ban_mark_completed(), cache/cache_ban.c line 176 Panic: Assert error in ban_mark_completed(), cache/cache_ban.c line 176 May 29, 2019
@nigoroll
Member

Thank you for the report.
Can you please check whether this is also triggered after issuing
varnishadm param.set ban_dups off
Please keep in mind that this does not survive a management process restart, so if you restart varnishd externally, you might want to add it to the command line as -p ban_dups=off.
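Both forms side by side, for reference (the "..." stands for the rest of your existing varnishd command line):

  # at runtime, via the CLI (lost on a management process restart)
  varnishadm param.set ban_dups off

  # or persistently, on the varnishd command line
  /usr/sbin/varnishd ... -p ban_dups=off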

@Jacketbg
Author

Thanks. I've set the parameter and will watch for new panics. We usually have at least one per day, sometimes more, so I will report back tomorrow 👍

@nigoroll
Member

I think I now understand the issue and am working on a test case. If my understanding is correct, this is unrelated to ban_dups.

@Jacketbg
Author

Alright. I'll be happy to provide more data if needed.

nigoroll added a commit that referenced this issue May 29, 2019
93d8050 made execution of
ban_lurker_test_ban() conditional on bd != b, which effectively caused
objects hanging off bans below request bans not to get tested against
relevant bans.

Because object bans (from the obans list) are marked completed, the
skipped objects would also never get evaluated against the relevant
bans at lookup time unless they were evaluated in request context. So,
in effect, we would simply fail to test some bans.

Fixes #3007

Maybe related to #3006
@nigoroll
Member

I am not sure whether this is the complete fix yet, but 44ea36e should also work with 6.0, 6.1 and 6.2.

If you can, please test and report.

I am still trying to make up my mind if there are other possible causes.

@Jacketbg
Author

That's great news. I'm a bit concerned about testing, though, since it's a production system. We have a 1:1 development deployment, but the issue doesn't occur there, of course. Unless there's a way to force it somehow?
What would be the safest way to test this? I'm only installing binaries from the (official, maybe?) CentOS repo, so I'm not very experienced with installing Varnish from source, but I can try. Should I just check out the 6.2 branch and compile?

@nigoroll
Member

Yes, this should work: check out 6.2, apply the changes from 44ea36e (just the two lines from cache_ban_lurker.c suffice), run it.
We are all just humans, but I would be pretty laid back about the risk of this fix.
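Roughly, those steps would look like this (a sketch assuming a standard autotools build; the cherry-pick may not apply cleanly, in which case applying the two-line change to cache_ban_lurker.c by hand is enough):

  git clone https://github.com/varnishcache/varnish-cache.git
  cd varnish-cache
  git checkout 6.2
  git cherry-pick 44ea36e   # or edit cache_ban_lurker.c by hand
  ./autogen.sh
  ./configure               # add any distribution-specific flags here
  make
  sudo make install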

@Jacketbg
Author

Got it! I’ll obtain a maintenance window to deploy and will report back.

@Jacketbg
Author

I compiled it and it's already running on the dev environment. Sorry to bother you with this, but is there a way to see the current binary's configure options? I think the CentOS RPM uses some non-default ones. I had to add these two to make it start without complaining about missing directories and files, but I'm afraid I might be missing something less obvious:

./configure --localstatedir=/var/lib/ --libdir=/usr/lib64/

I'm using RPMs from packagecloud.io.
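If that repository publishes source RPMs, one generic way to recover the flags the package was built with is to inspect the spec file (yumdownloader is part of yum-utils; the file names below are only a guess):

  yumdownloader --source varnish              # only works if the repo ships a .src.rpm
  rpm2cpio varnish-*.src.rpm | cpio -idmv '*.spec'
  grep -A 10 '%configure' *.spec              # flags passed to ./configure by the package build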

@Jacketbg
Author

The patched binary is now on our production system. We usually have at least one crash in 24 hours, so I will check back tomorrow and will let you know.

@Jacketbg
Author

Jacketbg commented May 31, 2019

Unfortunately it crashed again:

May 30 23:03:30 varnish varnishd[24397]: Child (24407) Panic at: Thu, 30 May 2019 20:03:30 GMT
Assert error in ban_mark_completed(), cache/cache_ban.c line 176:
  Condition((b)->magic == 0x700b08ea) not true.
version = varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54, vrt api = 9.0
ident = Linux,3.10.0-957.12.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1444332.856700 (mono), 1559246608.321747 (real)
Backtrace:
  0x43be2b: /usr/sbin/varnishd() [0x43be2b]
  0x49cbf2: /usr/sbin/varnishd(VAS_Fail+0x42) [0x49cbf2]
  0x41d82c: /usr/sbin/varnishd() [0x41d82c]
  0x4209f4: /usr/sbin/varnishd(ban_lurker+0x9f4) [0x4209f4]
  0x45aae4: /usr/sbin/varnishd() [0x45aae4]
  0x7f28c4903dd5: /lib64/libpthread.so.0(+0x7dd5) [0x7f28c4903dd5]
  0x7f28c462cead: /lib64/libc.so.6(clone+0x6d) [0x7f28c462cead]
errno = 110 (Connection timed out)
thread = (ban-lurker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f28b7000000,
  stack_top = 0x7f28b7800000,
  stack_size = 8388608,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
  std = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  directors = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
},

I have some more graphs, in case they give any clues:

[images: varnish-2 (2), varnish-2 (1), varnish-2, varnish (1), varnish (2), rrdgraph (2), rrdgraph (3)]

@nigoroll
Member

nigoroll commented May 31, 2019

Thank you. I have thought about this more, and the fact that the issue is still present actually matches my updated expectation.
Would you be able to share your varnishadm ban.list output from as close before a crash as possible? Anonymizing it is fine.

@Jacketbg
Author

Sure. It crashes at random intervals, so I will set up a crontab entry to generate a new file every minute. I see that the output is about 800 lines, ending with [response was truncated]. Can I do something to get more lines, or will this be enough?
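Something like this would do for the per-minute snapshots (paths and naming are just an example; note that % has to be escaped in crontab):

  * * * * * /usr/bin/varnishadm ban.list > /var/tmp/ban.list.$(date +\%M) 2>&1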

@nigoroll
Member

It would be helpful to get the full output; adjusting the cli_limit parameter should help.
I would also recommend increasing cli_timeout substantially, because working on long CLI output may delay varnishd's response to ping commands, and we want to avoid additional crashes due to a missing CLI response. cli_timeout should be at least 60 seconds plus the runtime of the ban.list command (wall-clock time).
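For example (the values are only placeholders; pick a cli_limit large enough for the whole list):

  varnishadm param.set cli_limit 10000000    # bytes of CLI response allowed before truncation
  varnishadm param.set cli_timeout 120       # seconds; must cover the ban.list runtime plus margin
  varnishadm ban.list                        # should now return the complete list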

@nigoroll
Member

BTW, you may also send your output in private to varnish-support@uplex.de

@Jacketbg
Author

Jacketbg commented May 31, 2019

Got it. I've set cli_limit to 10000000 and cli_timeout to 60. Now I get around 7k lines without the truncation message. I will send the output to that address once we have a crash.

I just saw something else! Someone previously added this to the crontab:

2 * * * * /usr/bin/varnishadm  "ban "req.url \~ ^/\$""

I don't really know who added it or why... Looking at the logs, I can see that all crashes happen around the 3rd to 5th minute of the hour, which looks very suspicious now. Could this be related? This entry doesn't exist on the dev environment, and maybe that's why it never crashes there. I moved it to the 30th minute, so we'll see from the next crash whether the pattern changes.

@nigoroll
Member

Yes, in fact I am very sure that the issue is related to req bans.
It would be very helpful if you could keep the cronjob in place until the next crash; after that, removing it could help reduce the risk of further crashes.

@Jacketbg
Author

Jacketbg commented May 31, 2019

We just crashed again, and as predicted it happened exactly at 14:31:04, so it's definitely related to the above crontab entry. I have a snapshot of varnishadm ban.list taken 4 seconds prior to the crash; I've sent it to the provided email. Apparently the crontab entry was put there to invalidate the front page of the store every hour, because of some specific widgets that were not invalidating it.

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue May 31, 2019
Background: when the ban lurker has finished working the bottom of the
ban list, conceptually we mark all bans it has evaluated as completed
and then remove the tail of the ban list, which has no references any
more.

Yet, for efficiency, we first remove the tail and then mark completed
only those bans which we did not remove. Doing so depends on knowing
where in the (obans) list of bans to be completed the new tail of the
ban list lies after pruning.

5dd54f8 was intended to solve this,
but the fix was incomplete (and also unnecessarily complicated): For
example when a duplicate ban was issued, ban_lurker_test_ban() could
remove a ban from the obans list which later happens to become the new
ban tail.

We now - hopefully - solve the problem for real by properly cleaning
the obans list when we prune the ban list.

Fixes varnishcache#3006
Fixes varnishcache#2779
Fixes varnishcache#2556 for real (5dd54f8 was
incomplete)
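To illustrate the ordering problem described above with a self-contained example (this is not Varnish code: the struct, the reuse of the magic value, and the simulated free are made up for the illustration; the program aborts on the assert on purpose):

#include <assert.h>
#include <stdlib.h>

#define BAN_MAGIC 0x700b08eaU

struct ban {
	unsigned	magic;
	int		completed;
};

int
main(void)
{
	/* Two bans that the lurker has just evaluated in this pass. */
	struct ban *newer = malloc(sizeof *newer);
	struct ban *older = malloc(sizeof *older);
	assert(newer != NULL && older != NULL);
	newer->magic = older->magic = BAN_MAGIC;
	newer->completed = older->completed = 0;

	/* "obans": the bans to be marked completed after the pass. */
	struct ban *obans[] = { newer, older };

	/*
	 * Step 1: prune the unreferenced tail of the ban list; here "older"
	 * is the tail and goes away.  The buggy code path forgot to remove
	 * it from obans as well.  (Zeroing the magic stands in for free().)
	 */
	older->magic = 0;

	/*
	 * Step 2: mark the remaining obans completed.  Because "older" was
	 * never dropped from obans, the loop still visits it and trips the
	 * magic check -- the same shape of failure as the
	 * ban_mark_completed() assert in this panic.
	 */
	for (size_t i = 0; i < 2; i++) {
		assert(obans[i]->magic == BAN_MAGIC);	/* fails for "older" */
		obans[i]->completed = 1;
	}

	free(newer);
	free(older);
	return (0);
}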
@nigoroll
Member

nigoroll commented May 31, 2019

You might want to give nigoroll@cd774ca a try. Meanwhile, I will try to come up with a specific test.
If you test, please put the ban req.url cronjob back in place, at least until you see another crash.

@Jacketbg
Author

Thank you so much for looking into this! I will try to deploy this over the weekend and will get back to you with results.

@nigoroll
Member

As I now have a test case which proves that at least my hypothesis about the cause of the bug holds, I have pushed the fix to master and closed the issue.
Please reopen it should you see this resurfacing with the fix in place.

rezan pushed a commit to rezan/varnish-cache that referenced this issue Jul 11, 2019
rezan pushed a commit to rezan/varnish-cache that referenced this issue Jul 11, 2019
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Oct 2, 2019
hermunn pushed a commit to hermunn/varnish-cache that referenced this issue Nov 3, 2020
rezan pushed a commit that referenced this issue Apr 21, 2021