varnishadm CLI behaving weirdly #2010
Also reported in #2013 by @mwethington. Adding it here to keep them informed.
The child_poker lives in the manager process, and sends pings to the child. With this patch we check that we actually get a PONG back, and not some random data. If problems are detected, we kill the child. Related to: varnishcache#2010
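The ping check described in the commit message above can be sketched roughly like this. This is an illustrative sketch, not the actual Varnish source; the exact reply format ("PONG ...") and the 200 status code are assumptions based on how the CLI ping normally answers.

```c
#include <string.h>

/*
 * Rough sketch of the poker's sanity check (illustrative only).
 * The mgt process sends "ping" over the CLI pipe; a healthy child
 * answers with status 200 and a reply starting with "PONG".
 * Anything else means the pipe is out of sync and the child
 * should be killed and restarted.
 */
static int
child_reply_is_pong(unsigned status, const char *reply)
{
	if (status != 200)
		return (0);
	return (strncmp(reply, "PONG", 4) == 0);
}
```

If the pipe carries leftover bytes from an earlier request (for example, a backend list), the check fails even though the status looks plausible, which is exactly the out-of-sync condition the patch wants to detect.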
I have worked with the code a lot during the last days, and I have not been able to reproduce the error or understand how it can happen. My pull request #2019 may give us some data from people who experience the problem, and will also somewhat remedy it by shutting down the worker process when it gets out of sync.
I got comments from some colleagues of mine. This has been seen before, but usually only when there is a version mismatch involving libvarnishapi. Please try to upgrade, and verify that you have matching versions.
Do we need libvarnishapi1 installed? Or can we just uninstall that?
dpkg-query -l | grep varni shows: libvarnishapi1 4.1.2-2. I forced it to 4.1.2 for now: apt-get -y install apt-transport-https
Here is where I am at with 4.1.2: /usr/lib# ls -l varn varnish:
I need to run it more with 4.1.3 in dev/test before I push it to prod again. But when I upgrade, it looks pretty clean to me: /usr/lib# ls -l varn varnish:
Wait a sec... Shouldn't we be upgrading the lib? libvarnishapi.so.1.0.4 to libvarnishapi.so.1.0.5?
@mwethington, thanks a lot for checking the versions and running the commands, and many thanks for testing this. I will now have a look at the changes between 4.1.2 and 4.1.3 and see if anything crops up. If this problem manifests itself in 4.1.2, please update this ticket.
As you can see, I have added some error checking, both in master and in 4.1. If anyone with this problem can test the current 4.1 branch, it might be helpful for understanding what is going on and finding a fix. Note that when this problem happens, an automatic restart will be issued. If this happens, please update us with the error message here. The message will probably be in the syslog, depending on varnish's parameters.
Pål, did you see this? https://www.varnish-cache.org/lists/pipermail/varnish-dev/2016-January/008758.html
Did not see it, but I will read the mail carefully, study the code and see if I can make a patch. Thanks a lot, Dridi!
After running varnishd patched with dee325d for some time, here is the syslog output. As a side note, I have a Nagios probe that queries the list of backends every 5 minutes. That's probably why the backend list ends up there.
From the output @THCL has provided, it seems that either the mgt process is unable to terminate the child, or it has the wrong pid. In 4.1 HEAD, this can be found in
The code in master looks exactly the same. Some error handling on the result of kill() is in order, it seems. I'll have a chat with some of the others before committing anything.
The wonderful @mbgrydeland has had a look at things (this issue, #2019, #2026), and now we have a hypothesis: there is a bug in the Varnish jail implementation that makes the management process unable to terminate the child process. When the child gets swamped and does not respond properly, the management process decides to kill it and wait for it to be restarted automatically. In such cases, some bytes will be in transit between the mgt process and the child. When the kill fails, the pipes survive with bytes in the buffer, and thus the two processes are out of sync. This makes the mgt process try to kill it again, but that never works because of the bug.
The attached patch should confirm this, and I hope @thlc can give the patch a spin and report the result. It is made for current 4.1, but it is easily cherry-picked into master as well. Meanwhile, we will try to figure out how the jail implementation should be fixed. 0001-Add-error-checking-to-confirm-bug.patch.txt
Note that the decision to kill the child is made when the mgt process does not get a timely response. This can happen if there is not enough memory and the kernel starts swapping out stuff, but there may also be other kinds of resource starvation in action. If you have any data on the health of the system at the time of the crash, it is always welcome.
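The kind of error checking the confirmation patch adds can be sketched like this. The names are hypothetical, not the actual patch; the point is that a failing kill(2), for example EPERM caused by the jail dropping privileges, must not be silently ignored.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative sketch of error-checked child termination (not the
 * actual Varnish patch).  If kill(2) fails -- for instance with
 * EPERM because the jailed mgt process is not allowed to signal the
 * child -- report it instead of silently assuming the child is gone,
 * which would leave stale bytes in the CLI pipes and keep the two
 * processes out of sync.
 */
static int
kill_child_checked(pid_t pid, int sig)
{
	if (kill(pid, sig) != 0) {
		fprintf(stderr, "kill(%ld) failed: %s\n",
		    (long)pid, strerror(errno));
		return (-1);
	}
	return (0);
}
```

Checking the return value here is what turns a silent hang-and-retry loop into a loggable event, which is exactly what the confirmation patch is after.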
I am deploying the patch on one of our production servers. I should be back with the output shortly. Thanks to both of you!
Excerpt of the varnishd syslog output with the patch:
Oct 3 14:03:21 vext3vnlx-rbx01 varnishd[26904]: Unexpected reply from ping: 200 Backend name Admin Probe
boot.nur01 probe Healthy 5/5
boot.arz01 probe Healthy 5/5
boot.agk01 probe Healthy 5/5
boot.agk02 probe Healthy 5/5
boot.agk01_html probe Healthy 5/5
boot.agk02_html probe Healthy 5/5
boot.ds01 probe Healthy 4/4
Hope this helps! -Thomas
Thanks, @thlc! This means that @mbgrydeland's analysis was spot on. I will have a chat with him about this before moving forward.
It would be great if we added a varnishadm backend.listdetailed that would list out:
- type of backend clustering (round robin, saint, etc.)
- IP address, if you have it
- messages per line, like rights to kill
- the process id
- number of threads
- etc.
Bill Bell
I think this is off-topic for this issue, and this is not something we can do unless we change the rules. The existing
There seems to be an error in the varnish jail design, which makes the mgt process unable to kill the child process. To confirm this, add some error checking to the relevant code. Related to: varnishcache#2010
A new jail level, JAIL_MASTER_KILL, is introduced. The mgt process takes this level before killing the child process. Fixes: varnishcache#2010
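The fix in the commit messages above can be pictured roughly like this. This is a simplified model under stated assumptions: the enum and function names mirror the commit message, but the real implementation lives in Varnish's mgt jail code and differs in detail (real jails change uids or other credentials when the level changes).

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Simplified model of the fix (illustrative, not the real code).
 * The jail normally keeps the mgt process at a low privilege level;
 * the JAIL_MASTER_KILL level is entered just long enough to deliver
 * the signal to the child, then dropped again.
 */
enum jail_master_e {
	JAIL_MASTER_LOW,
	JAIL_MASTER_KILL	/* new level: may signal the child */
};

static enum jail_master_e jail_level = JAIL_MASTER_LOW;

static void
mgt_jail_master(enum jail_master_e lvl)
{
	/* In the real jails this switches credentials; here we
	 * merely record the level for illustration. */
	jail_level = lvl;
}

static int
mgt_kill_child(pid_t pid, int sig)
{
	int r;

	mgt_jail_master(JAIL_MASTER_KILL);
	r = kill(pid, sig);
	mgt_jail_master(JAIL_MASTER_LOW);
	return (r);
}
```

Bracketing the kill with an enter/leave pair keeps the privileged window as small as possible, which is the usual design choice for jail-style privilege separation.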
Backport review: Backported as described in #2109.
Hello, the same thing is happening in Varnish 5.0.0 (specifically varnish_5.0.0-1_amd64.deb), although I don't know what triggers this condition. Thanks.
Entering commands in the varnishadm CLI tool results in weird, unpredictable behavior, as shown by the following extract:
Restarting the varnishd instance "fixed" the problem. This is the first time it has happened to me, so I can't describe how to reproduce it.
As suggested by scn on the IRC channel, I'm opening an issue so it can be investigated.
The problem was encountered using varnish-4.1.3-beta1 running on Debian 7 (wheezy). The sources were compiled from the Debian source package provided on Jenkins.