
varnishadm CLI behaving weirdly #2010

Closed

thlc opened this issue Jul 7, 2016 · 22 comments

thlc commented Jul 7, 2016

Entering commands in the varnishadm CLI tool results in weird, unpredictable behavior, as shown by the following extract:

# varnishadm
200        
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,3.2.0-4-amd64,x86_64,-junix,-sfile,-smalloc,-hcritbit
varnish-4.1.3-beta1 revision ce824e2
 
Type 'help' for command list.
Type 'quit' to close CLI session.
 
varnish> backend.list
200        
vcl.show [-v] <configname>
backend.list [-p] [<backend_expression>]
backend.set_health <backend_expression> <state>
ban <field> <operator> <arg> [&& <field> <oper> <arg> ...]
ban.list
 
varnish> backend.list
200        
PONG 1467895240 1.0
varnish>
varnish> backend.list
200        
PONG 1467895243 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
PONG 1467895246 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5
varnish> backend.list
200        
PONG 1467895285 1.0
varnish> backend.list
200        
Backend name                   Admin      Probe
boot.nur01                     probe      Healthy 5/5
boot.arz01                     probe      Healthy 5/5
boot.agk01                     probe      Healthy 5/5
boot.agk02                     probe      Healthy 5/5
boot.agk01_html                probe      Healthy 5/5
boot.agk02_html                probe      Healthy 5/5

Restarting the varnishd instance "fixed" the problem. This is the first time it has happened to me, so I can't describe how to reproduce it.

As suggested by scn on the IRC channel, I'm opening an issue so it can be investigated.

@mithrandir | looks like the CLI is out of sync
@scn | please file a bug. i've seen this before and I think it is just editline/readline acting up, but would be good to squash it for good.

The problem was encountered using varnish-4.1.3-beta1 running on Debian 7 (wheezy). The sources were compiled from the Debian source package provided on Jenkins.

@lkarsten
Contributor

Also reported in #2013 by @mwethington . Adding here to keep him/her informed.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Jul 14, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: varnishcache#2010
@hermunn
Member

hermunn commented Jul 15, 2016

I have worked with the code a lot during the last days, and I have not been able to reproduce the error or understand how it can happen.

My pull request #2019 may give us some data from people that experience the problem, and will also somewhat remedy it by shutting down the worker process when it gets out of sync.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Jul 15, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: varnishcache#2010
@hermunn
Member

hermunn commented Jul 15, 2016

I got comments from some colleagues of mine. This has been seen before, but usually only when there is a mismatch between libvarnishapi (libvarnishapi1 in Debian) and varnishadm (part of varnish). Our packages should prevent such a mismatch, but if something goes wrong in apt/dpkg, the packages can become out of sync.

Please try to upgrade, and verify that you have matching versions.

@mwethington

Do we need libvarnishapi1 installed? Or can we just uninstall that?

@mwethington

dpkg-query -l | grep varn

ii  libvarnishapi1  4.1.2-2~jessie  amd64  shared libraries for Varnish
ii  varnish         4.1.2-2~jessie  amd64  state of the art, high-performance web accelerator

Then I just ran "apt-get update; apt-get -y -o Dpkg::Options::="--force-confnew" install varnish" and got 4.1.3... and then I get the issue.

I forced it to 4.1.2 for now:

apt-get -y install apt-transport-https
curl https://repo.varnish-cache.org/GPG-key.txt | apt-key add -
echo "deb https://repo.varnish-cache.org/debian/ jessie varnish-4.1" > /etc/apt/sources.list.d/varnish-cache.list
apt-get -y update
#apt-get -y -o Dpkg::Options::="--force-confnew" install varnish
apt-get --force-yes -y -o Dpkg::Options::="--force-confnew" install varnish=4.1.2-2~jessie libvarnishapi1=4.1.2-2~jessie

@mwethington

Here is where I am at with 4.1.2

/usr/lib# ls -l varn*
lrwxrwxrwx 1 root root 22 Apr 15 13:49 libvarnishapi.so.1 -> libvarnishapi.so.1.0.4
-rw-r--r-- 1 root root 170928 Apr 15 13:49 libvarnishapi.so.1.0.4

varnish:
total 332
-rw-r--r-- 1 root root 5944 Apr 15 13:49 libvarnishcompat.so
-rw-r--r-- 1 root root 93360 Apr 15 13:49 libvarnish.so
-rw-r--r-- 1 root root 155248 Apr 15 13:49 libvcc.so
-rw-r--r-- 1 root root 76176 Apr 15 13:49 libvgz.so
drwxr-xr-x 2 root root 4096 Jul 14 22:36 vmods
root@ip-10-248-1-170:/usr/lib# ls -l varnish/vmods
total 60
-rw-r--r-- 1 root root 30968 Apr 15 13:49 libvmod_directors.so
-rw-r--r-- 1 root root 27056 Apr 15 13:49 libvmod_std.so

@mwethington

mwethington commented Jul 17, 2016

I need to run it more with 4.1.3 in dev/test before I push it to prod again.

But when I upgrade, it looks pretty clean to me:

/usr/lib# ls -l varn*
lrwxrwxrwx 1 root root 22 Jul 6 12:32 libvarnishapi.so.1 -> libvarnishapi.so.1.0.4
-rw-r--r-- 1 root root 175024 Jul 6 12:32 libvarnishapi.so.1.0.4

varnish:
total 332
-rw-r--r-- 1 root root 5944 Jul 6 12:32 libvarnishcompat.so
-rw-r--r-- 1 root root 93360 Jul 6 12:32 libvarnish.so
-rw-r--r-- 1 root root 155248 Jul 6 12:32 libvcc.so
-rw-r--r-- 1 root root 76176 Jul 6 12:32 libvgz.so
drwxr-xr-x 2 root root 4096 Jul 17 18:44 vmods
root@ip-10-248-1-170:/usr/lib# ls -ltr varnish/vmods/
total 60
-rw-r--r-- 1 root root 27072 Jul 6 12:32 libvmod_std.so
-rw-r--r-- 1 root root 30968 Jul 6 12:32 libvmod_directors.so

@mwethington

Wait a sec... Shouldn't we be upgrading the lib? libvarnishapi.so.1.0.4 to libvarnishapi.so.1.0.5?

@hermunn
Member

hermunn commented Jul 18, 2016

@mwethington, thanks a lot for checking the versions and running the commands, and many thanks for testing this. I will now have a look at changes between 4.1.2 and 4.1.3, and see if anything crops up.

If this problem manifests itself in 4.1.2, please update this ticket.

hermunn added a commit that referenced this issue Jul 27, 2016
The child_poker lives in the manager process, and sends pings to the
child. With this patch we check that we actually get a PONG back, and
not some random data. If problems are detected, we kill the child.

Related to: #2010
@hermunn
Member

hermunn commented Jul 27, 2016

As you can see, I have added some error checking, both in master and in 4.1.

If anyone with this problem can test the current 4.1 branch, it might help us understand what is going on and find a fix. Note that when this problem happens, an automatic restart will be issued. If this happens, please post the error message here. The message will probably be in the syslog, depending on varnishd's parameters.

@dridi
Member

dridi commented Jul 31, 2016

@hermunn
Copy link
Member

hermunn commented Aug 1, 2016

Did not see it, but I will read the mail carefully, study the code and see if I can make a patch. Thanks a lot, Dridi!

@thlc
Author

thlc commented Sep 15, 2016

After running varnishd patched with dee325d for some time, here is the syslog output.

As a side note, I have a Nagios probe that queries the list of backends every 5 minutes. That's probably why the backend list ends up there.

output.txt

@hermunn
Member

hermunn commented Sep 26, 2016

From the output @thlc has provided, it seems that either the mgt process is unable to terminate the child, or that it has the wrong pid. In 4.1 HEAD, this can be found in mgt_child.c from line 587:

void
MGT_Child_Cli_Fail(void)
{
       if (child_state != CH_RUNNING)
               return;
       if (child_pid < 0)
               return;
       MGT_complain(C_ERR, "Child (%jd) not responding to CLI, killing it.",
           (intmax_t)child_pid);
       if (MGT_FEATURE(FEATURE_NO_COREDUMP))
               (void)kill(child_pid, SIGKILL);
       else
               (void)kill(child_pid, SIGQUIT);
}

The code in master looks exactly the same. Some error handling on the result of kill() is in order, it seems. I'll have a chat with some of the others before committing anything.

@hermunn
Member

hermunn commented Sep 28, 2016

The wonderful @mbgrydeland has had a look at things (this issue, #2019, #2026), and now we have a hypothesis: there is a bug in the Varnish jail implementation that makes the management process unable to terminate the child process. When the child gets swamped and does not respond properly, the management process decides to kill it and waits for it to be restarted automatically. In such cases, some bytes will be in transit between the mgt process and the child. When the kill fails, the pipes survive with bytes in the buffer, and thus the two processes are out of sync. This makes the mgt process try to kill it again, but that never works because of the bug.

The attached patch should confirm this, and I hope @thlc can give the patch a spin and report the result. It is made for current 4.1, but it is easily cherry-picked into master as well. Meanwhile, we will try to figure out how the jail implementation should be fixed.

0001-Add-error-checking-to-confirm-bug.patch.txt

Note that the decision to kill the child is made when the mgt process does not get a timely response. This can happen if there is not enough memory and the kernel starts swapping, but there may also be other kinds of resource starvation at work. If you have any data on the health of the system at the time of the crash, it is always welcome.

@thlc
Author

thlc commented Sep 29, 2016

I am deploying the patch on one of our production servers. I should be back with the output shortly.

Thanks to both of you!

@thlc
Author

thlc commented Oct 3, 2016

Excerpt of the varnishd syslog output with the patch:

Oct  3 14:03:21 vext3vnlx-rbx01 varnishd[26904]: Unexpected reply from ping: 200 Backend name                   Admin      Probe#012boot.nur01                     probe      Healthy 5/5#012boot.arz01                     probe      Healthy 5/5#012boot.agk01                     probe      Healthy 5/5#012boot.agk02                     probe      Healthy 5/5#012boot.agk01_html                probe      Healthy 5/5#012boot.agk02_html                probe      Healthy 5/5#012boot.ds01                      probe      Healthy 4/4
Oct  3 14:03:21 vext3vnlx-rbx01 varnishd[26904]: Child (26922) not responding to CLI. Tried to kill it, but did not have the rights.
Oct  3 14:03:51 vext3vnlx-rbx01 varnishd[26904]: Unexpected reply from ping: 200 Backend name                   Admin      Probe#012boot.nur01                     probe      Healthy 5/5#012boot.arz01                     probe      Healthy 5/5#012boot.agk01                     probe      Healthy 5/5#012boot.agk02                     probe      Healthy 5/5#012boot.agk01_html                probe      Healthy 5/5#012boot.agk02_html                probe      Healthy 5/5#012boot.ds01                      probe      Healthy 4/4
Oct  3 14:03:51 vext3vnlx-rbx01 varnishd[26904]: Child (26922) not responding to CLI. Tried to kill it, but did not have the rights.

Hope this helps!

-Thomas

@hermunn
Member

hermunn commented Oct 3, 2016

Thanks, @thlc! This means that @mbgrydeland's analysis was spot on. I will have a chat with him about this before moving forward.

@mwethington

It would be great if we added a

varnishadm backend.listdetailed

that would list out:

Type of backend clustering (round robin, saint, etc.)

IP address, if you have it

Messages per line, like rights to kill

The process id

Number of threads

Etc.


@dridi
Member

dridi commented Oct 3, 2016

It would be great if we added a varnishadm backend.listdetailed

I think this is off-topic for this issue, and it is not something we can do unless we change the rules. The existing backend.list command is from the varnish-cli, and directors belong in VMODs. Modules are currently not allowed to fiddle with the CLI.

hermunn added a commit to hermunn/varnish-cache that referenced this issue Oct 4, 2016
There seems to be an error in the varnish jail design, which makes the
mgt process unable to kill the child process. To confirm this, add some
error checking to the relevant code.

Related to: varnishcache#2010
hermunn added a commit to hermunn/varnish-cache that referenced this issue Oct 4, 2016
A new jail level, JAIL_MASTER_KILL, is introduced. The mgt process
takes this level before killing the child process.

Fixes: varnishcache#2010
hermunn added a commit that referenced this issue Oct 12, 2016
There seems to be an error in the varnish jail design, which makes the
mgt process unable to kill the child process. To confirm this, add some
error checking to the relevant code.

Related to: #2010
hermunn added a commit that referenced this issue Oct 12, 2016
A new jail level, JAIL_MASTER_KILL, is introduced. The mgt process
takes this level before killing the child process.

Fixes: #2010
@hermunn
Member

hermunn commented Oct 12, 2016

Backport review: Backported as described in #2109.

@zstyblik

Hello,

the same thing is happening in Varnish 5.0.0 (specifically varnish_5.0.0-1_amd64.deb), although I don't know what triggers this condition.
Note: I know the fix didn't make it into the referenced build, but it would be nice to get a new one with the fix :)

Thanks.
