
varnishadm CLI behaving weirdly #2010

Closed

thlc opened this issue Jul 7, 2016 · 22 comments

thlc commented Jul 7, 2016

Entering commands in the varnishadm CLI tool results in weird, unpredictable behavior, as shown by the following extract:

# varnishadm
200        
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,3.2.0-4-amd64,x86_64,-junix,-sfile,-smalloc,-hcritbit
varnish-4.1.3-beta1 revision ce824e2
 
Type 'help' for command list.
Type 'quit' to close CLI session.
 
varnish> backend.list
200        
vcl.show [-v] <configname>
backend.list [-p] [<backend_expression>]
backend.set_health <backend_expression> <state>
ban <field> <operator> <arg> [&& <field> <oper> <arg> ...]
ban.list
 
varnish> backend.list
200        
PONG 1467895240 1.0
varnish>
varnish> backend.list
200        
PONG 1467895243 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
PONG 1467895246 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
PONG 1467895285 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5

Restarting the varnishd instance "fixed" the problem. This is the first time it has happened to me, so I can't describe how to reproduce it.

As suggested by scn on the IRC channel, I'm opening an issue so it can be investigated.

@mithrandir | looks like the CLI is out of sync
@scn | please file a bug. i've seen this before and I think it is just editline/readline acting up, but would be good to squash it for good.

The problem was encountered using varnish-4.1.3-beta1 running on Debian 7 (wheezy). The sources were compiled from the Debian source package provided on Jenkins.

@lkarsten
Contributor

Also reported in #2013 by @mwethington . Adding here to keep him/her informed.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Jul 14, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: varnishcache#2010
@hermunn
Member

hermunn commented Jul 15, 2016

I have worked with the code a lot during the last days, and I have not been able to reproduce the error or understand how it can happen.

My pull request #2019 may give us some data from people that experience the problem, and will also somewhat remedy it by shutting down the worker process when it gets out of sync.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Jul 15, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: varnishcache#2010
@hermunn
Member

hermunn commented Jul 15, 2016

I got comments from some colleagues of mine. This has been seen before, but usually only when there is a mismatch between libvarnishapi (libvarnishapi1 in Debian) and varnishadm (part of varnish). Our packages should prevent such a mismatch, but if something goes wrong in apt/dpkg, the packages can become out of sync.

Please try to upgrade, and verify that you have matching versions.

@mwethington

Do we need libvarnishapi1 installed? Or can we just uninstall that?

@mwethington

dpkg-query -l | grep varn

ii  libvarnishapi1  4.1.2-2~jessie  amd64  shared libraries for Varnish
ii  varnish         4.1.2-2~jessie  amd64  state of the art, high-performance web accelerator

Then I just ran "apt-get update; apt-get -y -o Dpkg::Options::="--force-confnew" install varnish" and got 4.1.3... and then I get the issue.

I forced it to 4.1.2 for now:

apt-get -y install apt-transport-https
curl https://repo.varnish-cache.org/GPG-key.txt | apt-key add -
echo "deb https://repo.varnish-cache.org/debian/ jessie varnish-4.1" > /etc/apt/sources.list.d/varnish-cache.list
apt-get -y update
#apt-get -y -o Dpkg::Options::="--force-confnew" install varnish
apt-get --force-yes -y -o Dpkg::Options::="--force-confnew" install varnish=4.1.2-2~jessie libvarnishapi1=4.1.2-2~jessie

@mwethington

Here is where I am at with 4.1.2

/usr/lib# ls -l varn*
lrwxrwxrwx 1 root root 22 Apr 15 13:49 libvarnishapi.so.1 -> libvarnishapi.so.1.0.4
-rw-r--r-- 1 root root 170928 Apr 15 13:49 libvarnishapi.so.1.0.4

varnish:
total 332
-rw-r--r-- 1 root root 5944 Apr 15 13:49 libvarnishcompat.so
-rw-r--r-- 1 root root 93360 Apr 15 13:49 libvarnish.so
-rw-r--r-- 1 root root 155248 Apr 15 13:49 libvcc.so
-rw-r--r-- 1 root root 76176 Apr 15 13:49 libvgz.so
drwxr-xr-x 2 root root 4096 Jul 14 22:36 vmods
root@ip-10-248-1-170:/usr/lib# ls -l varnish/vmods
total 60
-rw-r--r-- 1 root root 30968 Apr 15 13:49 libvmod_directors.so
-rw-r--r-- 1 root root 27056 Apr 15 13:49 libvmod_std.so

@mwethington

mwethington commented Jul 17, 2016

I need to run it more with 4.1.3 in dev/test before I push it to prod again.

But when I upgrade, it looks pretty clean to me:

/usr/lib# ls -l varn*
lrwxrwxrwx 1 root root 22 Jul 6 12:32 libvarnishapi.so.1 -> libvarnishapi.so.1.0.4
-rw-r--r-- 1 root root 175024 Jul 6 12:32 libvarnishapi.so.1.0.4

varnish:
total 332
-rw-r--r-- 1 root root 5944 Jul 6 12:32 libvarnishcompat.so
-rw-r--r-- 1 root root 93360 Jul 6 12:32 libvarnish.so
-rw-r--r-- 1 root root 155248 Jul 6 12:32 libvcc.so
-rw-r--r-- 1 root root 76176 Jul 6 12:32 libvgz.so
drwxr-xr-x 2 root root 4096 Jul 17 18:44 vmods
root@ip-10-248-1-170:/usr/lib# ls -ltr varnish/vmods/
total 60
-rw-r--r-- 1 root root 27072 Jul 6 12:32 libvmod_std.so
-rw-r--r-- 1 root root 30968 Jul 6 12:32 libvmod_directors.so

@mwethington

Wait a sec... Shouldn't we be upgrading the lib? libvarnishapi.so.1.0.4 to libvarnishapi.so.1.0.5?

@hermunn
Member

hermunn commented Jul 18, 2016

@mwethington, thanks a lot for checking the versions and running the commands, and many thanks for testing this. I will now have a look at changes between 4.1.2 and 4.1.3, and see if anything crops up.

If this problem manifests itself in 4.1.2, please update this ticket.

hermunn added a commit that referenced this issue Jul 27, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: #2010
@hermunn
Member

hermunn commented Jul 27, 2016

As you can see, I have added some error checking, both in master and in 4.1.

If anyone with this problem can test the current 4.1 branch, it might help us understand what is going on and find a fix. Note that when this problem happens, an automatic restart will be issued. If this happens, please post the error message here. The message will probably be in the syslog, depending on varnishd's parameters.

@dridi
Member

dridi commented Jul 31, 2016

@hermunn
Copy link
Member

hermunn commented Aug 1, 2016

Did not see it, but I will read the mail carefully, study the code and see if I can make a patch. Thanks a lot, Dridi!

@thlc
Author

thlc commented Sep 15, 2016

After running varnishd patched with dee325d for some time, here is the syslog output.

As a side note, I have a Nagios probe that queries the list of backends every 5 minutes. That's probably why the backend list ends up there.

output.txt

@hermunn
Member

hermunn commented Sep 26, 2016

From the output @thlc has provided, it seems that either the mgt process is unable to terminate the child, or that it has the wrong pid. In 4.1 HEAD, this can be found in mgt_child.c from line 587:

void
MGT_Child_Cli_Fail(void)
{
       if (child_state != CH_RUNNING)
               return;
       if (child_pid < 0)
               return;
       MGT_complain(C_ERR, "Child (%jd) not responding to CLI, killing it.",
           (intmax_t)child_pid);
       if (MGT_FEATURE(FEATURE_NO_COREDUMP))
               (void)kill(child_pid, SIGKILL);
       else
               (void)kill(child_pid, SIGQUIT);
}

The code in master looks exactly the same. Some error handling on the result of kill() is in order, it seems. I'll have a chat with some of the others before committing anything.

@hermunn
Member

hermunn commented Sep 28, 2016

The wonderful @mbgrydeland has had a look at things (this issue, #2019, #2026), and now we have a hypothesis: there is a bug in the Varnish jail implementation that makes the management process unable to terminate the child process. When the child gets swamped and does not respond properly, the management process decides to kill it and waits for it to be restarted automatically. In such cases, some bytes will be in transit between the mgt process and the child. When the kill fails, the pipes survive with bytes in the buffer, and thus the two processes are out of sync. This makes the mgt process try to kill it again, but that never works because of the bug.

The attached patch should confirm this, and I hope @thlc can give the patch a spin and report the result. It is made for current 4.1, but it is easily cherry-picked into master as well. Meanwhile, we will try to figure out how the jail implementation should be fixed.

0001-Add-error-checking-to-confirm-bug.patch.txt

Note that the decision to kill the child is made when the mgt process does not get a timely response. This can happen if there is not enough memory and the kernel starts swapping, but there may also be other kinds of resource starvation at work. If you have any data on the health of the system at the time of the crash, it is always welcome.

@thlc
Author

thlc commented Sep 29, 2016

I am deploying the patch on one of our production servers. I should be back with the output shortly.

Thanks to both of you!

@thlc
Author

thlc commented Oct 3, 2016

Excerpt of the varnishd syslog output with the patch:

Oct  3 14:03:21 vext3vnlx-rbx01 varnishd[26904]: Unexpected reply from ping: 200 Backend name                   Admin      Probe#012boot.nur01                     probe      Healthy 5/5#012boot.arz01                     probe      Healthy 5/5#012boot.agk01                     probe      Healthy 5/5#012boot.agk02                     probe      Healthy 5/5#012boot.agk01_html                probe      Healthy 5/5#012boot.agk02_html                probe      Healthy 5/5#012boot.ds01                      probe      Healthy 4/4
Oct  3 14:03:21 vext3vnlx-rbx01 varnishd[26904]: Child (26922) not responding to CLI. Tried to kill it, but did not have the rights.
Oct  3 14:03:51 vext3vnlx-rbx01 varnishd[26904]: Unexpected reply from ping: 200 Backend name                   Admin      Probe#012boot.nur01                     probe      Healthy 5/5#012boot.arz01                     probe      Healthy 5/5#012boot.agk01                     probe      Healthy 5/5#012boot.agk02                     probe      Healthy 5/5#012boot.agk01_html                probe      Healthy 5/5#012boot.agk02_html                probe      Healthy 5/5#012boot.ds01                      probe      Healthy 4/4
Oct  3 14:03:51 vext3vnlx-rbx01 varnishd[26904]: Child (26922) not responding to CLI. Tried to kill it, but did not have the rights.

Hope this helps!

-Thomas

@hermunn
Member

hermunn commented Oct 3, 2016

Thanks, @thlc! This means that @mbgrydeland's analysis was spot on. I will have a chat with him about this before moving forward.

@mwethington

It would be great if we added a

varnishadm backend.listdetailed

that would list out:

Type of backend clustering (round robin, saint, etc.)

IP address, if you have it

Messages per line, like rights to kill

The process id

Number of threads

Etc.


@dridi
Member

dridi commented Oct 3, 2016

It would be great if we added a varnishadm backend.listdetailed

I think this is off-topic for this issue, and it is not something we can do unless we change the rules. The existing backend.list command is from the varnish-cli, and directors belong in VMODs. Modules are currently not allowed to fiddle with the CLI.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Oct 4, 2016
There seems to be an error in the varnish jail design, which makes the
mgt process unable to kill the child process. To confirm this, add some
error checking to the relevant code.

Related to: varnishcache#2010
hermunn added a commit to hermunn/varnish-cache that referenced this issue Oct 4, 2016
A new jail level, JAIL_MASTER_KILL, is introduced. The mgt process
takes this level before killing the child process.

Fixes: varnishcache#2010
hermunn added a commit that referenced this issue Oct 12, 2016
There seems to be an error in the varnish jail design, which makes the
mgt process unable to kill the child process. To confirm this, add some
error checking to the relevant code.

Related to: #2010
hermunn added a commit that referenced this issue Oct 12, 2016
A new jail level, JAIL_MASTER_KILL, is introduced. The mgt process
takes this level before killing the child process.

Fixes: #2010
@hermunn
Member

hermunn commented Oct 12, 2016

Backport review: Backported as described in #2109.

@zstyblik

Hello,

the same thing is happening in Varnish 5.0.0 (specifically varnish_5.0.0-1_amd64.deb), although I don't know what triggers this condition.
Note: I know the fix didn't make it into the referenced build, but it would be nice to get a new one with the fix :)

Thanks.
