Panic: Assert error in ban_mark_completed(), cache/cache_ban.c line 176 #3006
Comments
Thank you for the report.
Thanks. I've set the parameter and will watch out for new panics. We usually have at least one per day, sometimes more, so I will report back on this tomorrow 👍
I hope I have now understood the issue and am working on a test case. If my understanding is correct, this is unrelated to
Alright. I'll be happy to provide more data if needed.
93d8050 made execution of ban_lurker_test_ban() conditional on bd != b, which effectively caused objects hanging off bans below request bans to not get tested against relevant bans. Because object bans (from the obans list) are being marked completed, the objects which were skipped would also be missed to get evaluated against the relevant bans at lookup time unless they were evaluated in request context. So, in effect, we would simply miss to test bans. Fixes #3007 Maybe related to #3006
I am not sure if this is the complete fix already, but 44ea36e should also work with 6.0, 6.1 and 6.2. If you can, please test and report. I am still trying to make up my mind if there are other possible causes.
That’s great news. I’m a bit concerned about testing though, since it’s a production system. We have a 1:1 development deployment, but the issue doesn’t occur there, of course. Unless there’s a way to force it somehow?
Yes, this should work: check out 6.2, apply the changes from 44ea36e (just the two lines from
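For reference, the build-and-patch procedure suggested above could be sketched roughly as follows. This is only a sketch: the release tag name, the install prefix, and whether commit 44ea36e is reachable from the main repository (rather than the author's fork) are assumptions here.

```shell
# Sketch: build a patched Varnish 6.2 from source (tag name and prefix are assumptions)
git clone https://github.com/varnishcache/varnish-cache.git
cd varnish-cache
git checkout varnish-6.2.0    # assumed release tag
git cherry-pick 44ea36e       # apply the two-line fix; may require adding the author's fork as a remote
sh autogen.sh
./configure --prefix=/usr
make -j"$(nproc)"
sudo make install
```

On a packaged system, building a matching RPM from the distribution's spec file with the patch applied would avoid clobbering files owned by the package manager.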
Got it! I’ll obtain a maintenance window to deploy and will report back. |
I compiled it and it's already working on the dev environment. Sorry to bother you with this, but is there a way to see the current binary's configure options? I think the CentOS RPM has some non-default ones. I had to add those two to make it start without complaining about missing directories and files, but I'm afraid I might be missing something less obvious:
I'm using RPMs from packagecloud.io.
The patched binary is now on our production system. We usually have at least one crash in 24 hours, so I will check back tomorrow and let you know.
Unfortunately it crashed again:
I have some more graphs if they can give any clues:
Thank you. I had thought about this more, and actually the fact that the issue is still present matches my updated expectation.
Sure. It crashes at random intervals, so I will set up a crontab to generate a new file every minute. I see that the output is about 800 lines ending with
It would be helpful to get the full output. Adjusting the
BTW, you may also send your output in private to varnish-support@uplex.de
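A minimal crontab entry for capturing the panic once a minute could look like the following. This is a sketch in `/etc/cron.d` style; the output path and filename scheme are my own choices, and note that `%` must be escaped in crontab lines. `varnishadm panic.show` prints the most recent panic, if any.

```shell
# /etc/cron.d/varnish-panic (sketch; path and naming are assumptions)
# Dump the most recent panic to a timestamped file every minute
* * * * * root varnishadm panic.show > /var/log/varnish/panic-$(date +\%Y\%m\%d-\%H\%M).txt 2>/dev/null
```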
Got it, I've set it. I just saw something else! Someone previously added this to the crontab:
I don't really know who added it or why... Looking at the logs I can see that all crashes are happening around the 3rd to 5th minute of the hour, which looks very suspicious now. Could this be related? I don't see this on the dev environment, and maybe that's why it never crashes there. I moved it to start at the 30th minute, so we'll see from the next crash whether the pattern changes.
Yes, in fact I am very sure that the issue is related to req bans.
We just crashed again and, as predicted, it happened exactly at
background: When the ban lurker has finished working the bottom of the ban list, conceptually we mark all bans it has evaluated as completed and then remove the tail of the ban list which has no references any more. Yet, for efficiency, we first remove the tail and then mark only those bans completed, which we did not remove. Doing so depends on knowing where in the (obans) list of bans to be completed is the new tail of the bans list after pruning. 5dd54f8 was intended to solve this, but the fix was incomplete (and also unnecessarily complicated): For example when a duplicate ban was issued, ban_lurker_test_ban() could remove a ban from the obans list which later happens to become the new ban tail. We now - hopefully - solve the problem for real by properly cleaning the obans list when we prune the ban list. Fixes varnishcache#3006 Fixes varnishcache#2779 Fixes varnishcache#2556 for real (5dd54f8 was incomplete)
You might want to give nigoroll@cd774ca a try. Meanwhile I will try to come up with a specific test.
Thank you so much for looking into this! I will try to deploy this over the weekend and will get back to you with results. |
As I do now have a test case which proves that at least my hypothesis on the cause of the bug holds, I have pushed the fix to master and thus closed the issue. |
This issue looks identical to #2770 and #2779. However, we are running version 6.2.0 (latest release as of today) and it is still happening, although it's marked as fixed in the official changelog of 6.0.0. We get a couple of panics per day, each resulting in a restart of the child process, which of course leads to a complete freeze of the service for 30-40 seconds. I will try to provide as much info as I can, but let me know if there's anything else that might be helpful.
We are using Varnish exclusively with Magento 1.9.3.3 + Turpentine.
Here's a daily graph of our cache performance. Each drop to 0 corresponds to a panic log entry:
Here's the complete message from our syslog:
Our VCL is completely generated by Turpentine. It's attached here:
varnish.vcl.txt
Some additional info about OS and versions:
The VM has 1 CPU and 16 GB of RAM.
Varnish command line:
/usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a :6081 -T 192.168.103.104:6082 -S /etc/varnish/secret -s malloc,10G -p thread_pool_add_delay=2 -p thread_pools=4 -p thread_pool_min=200 -p thread_pool_max=4000 -p vcc_allow_inline_c=on -p feature=+esi_ignore_other_elements
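To double-check which of these -p overrides are actually in effect on the running child, the management interface can be queried. This assumes the -T address and -S secret file from the command line above and a reachable management port:

```shell
# List only the parameters that differ from their built-in defaults
varnishadm -T 192.168.103.104:6082 -S /etc/varnish/secret param.show changed
```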
Even though there is no swap configured on the VM, we see no trace of the OOM killer in dmesg, and Varnish usually crashes even before reaching its maximum of 10 GB of memory. There is no CPU starvation on the VM either, so it looks completely unrelated to system load.
I'm available if more information is needed.