Panic: Assert error in ban_mark_completed(), cache/cache_ban.c line 176 #3006

Closed
Jacketbg opened this issue May 29, 2019 · 21 comments
@Jacketbg

Jacketbg commented May 29, 2019

This issue looks identical to #2770 and #2779. However, we are running version 6.2.0 (the latest release as of today) and it is still happening, although it's marked as fixed in the official changelog of 6.0.0. We get a couple of panics per day, each resulting in a restart of the child process, which of course leads to a complete freeze of the service for 30-40 seconds.

I will try to provide as much info as I can, but let me know if there's anything else that might be helpful.

We are using Varnish exclusively with Magento 1.9.3.3 + Turpentine.

Here's a daily graph of our Cache performance. Each drop to 0 corresponds to a Panic log entry:

[image: varnish]

Here's the complete message from our syslog:

May 29 10:03:54 varnish varnishd[18718]: Child (11644) Panic at: Wed, 29 May 2019 07:03:54 GMT
Assert error in ban_mark_completed(), cache/cache_ban.c line 176:
  Condition((b)->magic == 0x700b08ea) not true.
version = varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54, vrt api = 9.0
ident = Linux,3.10.0-957.12.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1311157.412429 (mono), 1559113432.877476 (real)
Backtrace:
  0x43cf0b: /usr/sbin/varnishd() [0x43cf0b]
  0x4a0002: /usr/sbin/varnishd(VAS_Fail+0x42) [0x4a0002]
  0x41defc: /usr/sbin/varnishd() [0x41defc]
  0x42115d: /usr/sbin/varnishd(ban_lurker+0x94d) [0x42115d]
  0x45c4f7: /usr/sbin/varnishd() [0x45c4f7]
  0x7f7fb9c4cdd5: /lib64/libpthread.so.0(+0x7dd5) [0x7f7fb9c4cdd5]
  0x7f7fb9975ead: /lib64/libc.so.6(clone+0x6d) [0x7f7fb9975ead]
errno = 110 (Connection timed out)
thread = (ban-lurker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f7fac400000,
  stack_top = 0x7f7facc00000,
  stack_size = 8388608,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
  std = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  directors = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
},

Our VCL is generated entirely by Turpentine. It's attached here:

varnish.vcl.txt

Some additional info about OS and versions:

 [root@varnish ~]# cat /etc/centos-release
 CentOS Linux release 7.6.1810 (Core)
 [root@varnish ~]# uname -a
 Linux varnish 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
 [root@varnish ~]# varnishd -V
 varnishd (varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54)
 Copyright (c) 2006 Verdens Gang AS
 Copyright (c) 2006-2019 Varnish Software AS
 [root@varnish ~]# free -m
               total        used        free      shared  buff/cache   available
 Mem:          15885        1117       10722         840        4045       13471
 Swap:             0           0

The VM has 1 CPU and 16 GB RAM.

Varnish command line:
/usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a :6081 -T 192.168.103.104:6082 -S /etc/varnish/secret -s malloc,10G -p thread_pool_add_delay=2 -p thread_pools=4 -p thread_pool_min=200 -p thread_pool_max=4000 -p vcc_allow_inline_c=on -p feature=+esi_ignore_other_elements

Even though there is no swap configured on the VM, there is no trace of the OOM killer in dmesg, and Varnish usually crashes before even reaching its 10 GB memory limit.
There is no CPU starvation on the VM either, so this looks completely unrelated to system load.

I'm available if more information is needed.

@Jacketbg Jacketbg changed the title Assert error in ban_mark_completed(), cache/cache_ban.c line 176 Panic: Assert error in ban_mark_completed(), cache/cache_ban.c line 176 May 29, 2019
@nigoroll
Member

Thank you for the report.
Can you please check whether this is also triggered after issuing
varnishadm param.set ban_dups off
Please keep in mind that this does not survive a management process restart, so if you restart varnishd externally, you might want to add it to the command line as -p ban_dups=off.
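Both forms side by side, for reference (the "..." stands for the rest of your existing varnishd command line):

  # at runtime, via the CLI (lost on a management process restart)
  varnishadm param.set ban_dups off

  # or persistently, on the varnishd command line
  /usr/sbin/varnishd ... -p ban_dups=off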

@Jacketbg
Author

Thanks. I've set the parameter and will watch for new panics. We usually have at least one per day, sometimes more, so I will report back tomorrow 👍

@nigoroll
Member

I think I now understand the issue and am working on a test case. If my understanding is correct, this is unrelated to ban_dups.

@Jacketbg
Author

Alright. I'll be happy to provide more data if needed.

nigoroll added a commit that referenced this issue May 29, 2019
93d8050 made execution of
ban_lurker_test_ban() conditional on bd != b, which effectively caused
objects hanging off bans below request bans not to get tested against
relevant bans.

Because object bans (from the obans list) are marked completed, the
skipped objects would also never get evaluated against the relevant
bans at lookup time unless they were evaluated in request context. So,
in effect, we would simply fail to test some bans.

Fixes #3007

Maybe related to #3006
@nigoroll
Member

I am not sure whether this is the complete fix yet, but 44ea36e should also work with 6.0, 6.1 and 6.2.

If you can, please test and report.

I am still trying to make up my mind if there are other possible causes.

@Jacketbg
Author

That's great news. I'm a bit concerned about testing, though, since it's a production system. We have a 1:1 development deployment, but the issue doesn't occur there, of course. Unless there's a way to force it somehow?
What would be the safest way to test this? I'm only installing binaries from the (official, maybe?) CentOS repo, so I'm not very experienced with installing Varnish from source, but I can try. Should I just check out the 6.2 branch and compile?

@nigoroll
Member

Yes, this should work: check out 6.2, apply the changes from 44ea36e (just the two lines from cache_ban_lurker.c suffice), run it.
We are all just humans, but I would be pretty laid back about the risk of this fix.
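Roughly, those steps would look like this (a sketch assuming a standard autotools build; the cherry-pick may not apply cleanly, in which case applying the two-line change to cache_ban_lurker.c by hand is enough):

  git clone https://github.com/varnishcache/varnish-cache.git
  cd varnish-cache
  git checkout 6.2
  git cherry-pick 44ea36e   # or edit cache_ban_lurker.c by hand
  ./autogen.sh
  ./configure               # add any distribution-specific flags here
  make
  sudo make install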

@Jacketbg
Author

Got it! I’ll obtain a maintenance window to deploy and will report back.

@Jacketbg
Author

I compiled it and it's already running on the dev environment. Sorry to bother you with this, but is there a way to see the current binary's configure options? I think the CentOS RPM uses some non-default ones. I had to add these two to make it start without complaining about missing directories and files, but I'm afraid I might be missing something less obvious:

./configure --localstatedir=/var/lib/ --libdir=/usr/lib64/

I'm using RPMs from packagecloud.io.
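If that repository publishes source RPMs, one generic way to recover the flags the package was built with is to inspect the spec file (yumdownloader is part of yum-utils; the file names below are only a guess):

  yumdownloader --source varnish              # only works if the repo ships a .src.rpm
  rpm2cpio varnish-*.src.rpm | cpio -idmv '*.spec'
  grep -A 10 '%configure' *.spec              # flags passed to ./configure by the package build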

@Jacketbg
Author

The patched binary is now on our production system. We usually have at least one crash in 24 hours, so I will check back tomorrow and will let you know.

@Jacketbg
Author

Jacketbg commented May 31, 2019

Unfortunately it crashed again:

May 30 23:03:30 varnish varnishd[24397]: Child (24407) Panic at: Thu, 30 May 2019 20:03:30 GMT
Assert error in ban_mark_completed(), cache/cache_ban.c line 176:
  Condition((b)->magic == 0x700b08ea) not true.
version = varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54, vrt api = 9.0
ident = Linux,3.10.0-957.12.1.el7.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1444332.856700 (mono), 1559246608.321747 (real)
Backtrace:
  0x43be2b: /usr/sbin/varnishd() [0x43be2b]
  0x49cbf2: /usr/sbin/varnishd(VAS_Fail+0x42) [0x49cbf2]
  0x41d82c: /usr/sbin/varnishd() [0x41d82c]
  0x4209f4: /usr/sbin/varnishd(ban_lurker+0x9f4) [0x4209f4]
  0x45aae4: /usr/sbin/varnishd() [0x45aae4]
  0x7f28c4903dd5: /lib64/libpthread.so.0(+0x7dd5) [0x7f28c4903dd5]
  0x7f28c462cead: /lib64/libc.so.6(clone+0x6d) [0x7f28c462cead]
errno = 110 (Connection timed out)
thread = (ban-lurker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f28b7000000,
  stack_top = 0x7f28b7800000,
  stack_size = 8388608,
}
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
  std = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  directors = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
},

I have some more graphs, in case they give any clues:

[images: varnish-2 (2), varnish-2 (1), varnish-2, varnish (1), varnish (2), rrdgraph (2), rrdgraph (3)]

@nigoroll
Member

nigoroll commented May 31, 2019

Thank you. I have thought about this more, and the fact that the issue is still present actually matches my updated expectation.
Would you be able to share your varnishadm ban.list output from as close before a crash as possible? Anonymizing it is fine.

@Jacketbg
Author

Sure. It crashes at random intervals, so I will set up a crontab entry to generate a new file every minute. I see that the output is about 800 lines, ending with [response was truncated]. Can I do something to get more lines, or will this be enough?
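Something like this would do for the per-minute snapshots (paths and naming are just an example; note that % has to be escaped in crontab):

  * * * * * /usr/bin/varnishadm ban.list > /var/tmp/ban.list.$(date +\%M) 2>&1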

@nigoroll
Member

It would be helpful to get the full output; adjusting the cli_limit parameter should help.
I would also recommend increasing cli_timeout substantially, because working on long CLI output may delay varnishd's response to ping commands, and we want to avoid additional crashes due to a missing CLI response. cli_timeout should be at least 60 seconds plus the runtime of the ban.list command (wall-clock time).
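For example (the values are only placeholders; pick a cli_limit large enough for the whole list):

  varnishadm param.set cli_limit 10000000    # bytes of CLI response allowed before truncation
  varnishadm param.set cli_timeout 120       # seconds; must cover the ban.list runtime plus margin
  varnishadm ban.list                        # should now return the complete list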

@nigoroll
Member

BTW, you may also send your output in private to varnish-support@uplex.de

@Jacketbg
Author

Jacketbg commented May 31, 2019

Got it. I've set cli_limit to 10000000 and cli_timeout to 60. Now I get around 7k lines without the truncation message. I will send the output to that address once we have a crash.

I just saw something else! Someone previously added this to the crontab:

2 * * * * /usr/bin/varnishadm  "ban "req.url \~ ^/\$""

I don't really know who added it or why... Looking at the logs, I can see that all crashes happen around the 3rd to 5th minute of the hour, which looks very suspicious now. Could this be related? This entry doesn't exist on the dev environment, and maybe that's why it never crashes there. I moved it to the 30th minute, so we'll see from the next crash whether the pattern changes.

@nigoroll
Member

Yes, in fact I am very sure that the issue is related to req bans.
It would be very helpful if you could keep the cronjob in place until the next crash; after that, removing it could help reduce the risk of further crashes.

@Jacketbg
Author

Jacketbg commented May 31, 2019

We just crashed again, and as predicted it happened exactly at 14:31:04, so it's definitely related to the above crontab entry. I have a snapshot of varnishadm ban.list taken 4 seconds prior to the crash; I've sent it to the provided email. Apparently the crontab entry was put there to invalidate the front page of the store every hour, because of some specific widgets that were not invalidating it.

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue May 31, 2019
Background: when the ban lurker has finished working the bottom of the
ban list, conceptually we mark all bans it has evaluated as completed
and then remove the tail of the ban list, which has no references any
more.

Yet, for efficiency, we first remove the tail and then mark completed
only those bans which we did not remove. Doing so depends on knowing
where in the (obans) list of bans to be completed the new tail of the
ban list lies after pruning.

5dd54f8 was intended to solve this,
but the fix was incomplete (and also unnecessarily complicated): For
example when a duplicate ban was issued, ban_lurker_test_ban() could
remove a ban from the obans list which later happens to become the new
ban tail.

We now - hopefully - solve the problem for real by properly cleaning
the obans list when we prune the ban list.

Fixes varnishcache#3006
Fixes varnishcache#2779
Fixes varnishcache#2556 for real (5dd54f8 was
incomplete)
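To illustrate the ordering problem described above with a self-contained example (this is not Varnish code: the struct, the reuse of the magic value, and the simulated free are made up for the illustration; the program aborts on the assert on purpose):

#include <assert.h>
#include <stdlib.h>

#define BAN_MAGIC 0x700b08eaU

struct ban {
	unsigned	magic;
	int		completed;
};

int
main(void)
{
	/* Two bans that the lurker has just evaluated in this pass. */
	struct ban *newer = malloc(sizeof *newer);
	struct ban *older = malloc(sizeof *older);
	assert(newer != NULL && older != NULL);
	newer->magic = older->magic = BAN_MAGIC;
	newer->completed = older->completed = 0;

	/* "obans": the bans to be marked completed after the pass. */
	struct ban *obans[] = { newer, older };

	/*
	 * Step 1: prune the unreferenced tail of the ban list; here "older"
	 * is the tail and goes away.  The buggy code path forgot to remove
	 * it from obans as well.  (Zeroing the magic stands in for free().)
	 */
	older->magic = 0;

	/*
	 * Step 2: mark the remaining obans completed.  Because "older" was
	 * never dropped from obans, the loop still visits it and trips the
	 * magic check -- the same shape of failure as the
	 * ban_mark_completed() assert in this panic.
	 */
	for (size_t i = 0; i < 2; i++) {
		assert(obans[i]->magic == BAN_MAGIC);	/* fails for "older" */
		obans[i]->completed = 1;
	}

	free(newer);
	free(older);
	return (0);
}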
@nigoroll
Member

nigoroll commented May 31, 2019

You might want to give nigoroll@cd774ca a try. Meanwhile, I will try to come up with a specific test.
If you test, please put the ban req.url cronjob back in place, at least until you see another crash.

@Jacketbg
Author

Thank you so much for looking into this! I will try to deploy this over the weekend and will get back to you with results.

@nigoroll
Member

As I now have a test case which proves that at least my hypothesis about the cause of the bug holds, I have pushed the fix to master and closed the issue.
Please reopen it should you see this resurfacing with the fix in place.

rezan pushed a commit to rezan/varnish-cache that referenced this issue Jul 11, 2019
rezan pushed a commit to rezan/varnish-cache that referenced this issue Jul 11, 2019
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Oct 2, 2019
hermunn pushed a commit to hermunn/varnish-cache that referenced this issue Nov 3, 2020
rezan pushed a commit that referenced this issue Apr 21, 2021