Limit watchdog to highest priority only #3537

mbgrydeland · 2021-03-01T10:26:42Z

The watchdog mechanism currently triggers when any queueing is happening,
regardless of the priority. Strictly speaking it is only the backend
fetches that are critical to get executed, and this prevents the thread
limits to be used as limits on the amount of work the Varnish instance
should handle.

This can be especially important for instances with H/2 enabled, as these
connections will be holding threads for extended periods of time, possibly
triggering the watchdog in benign situations.

This patch limits the watchdog to only trigger for no queue development
on the highest priority queue.

The watchdog mechanism currently triggers when any queueing is happening, regardless of the priority. Strictly speaking it is only the backend fetches that are critical to get executed, and this prevents the thread limits to be used as limits on the amount of work the Varnish instance should handle. This can be especially important for instances with H/2 enabled, as these connections will be holding threads for extended periods of time, possibly triggering the watchdog in benign situations. This patch limits the watchdog to only trigger for no queue development on the highest priority queue.

nigoroll

Can you maybe elaborate on the case which motivated this PR?

I got doubts if we are heading in the right direction with the watchdog. First of all, to my best understanding, when the watchdog was introduced, it was not much else than a bandaid to work around the underlying root cause, which is the fact that tasks require other tasks to complete. I believe that with the queue priorities, we handle this case appropriately.
I agree that a varnish instance may work just fine and still trigger the watchdog, so I would argue that we could also remove it altogether.

Regarding this PR, I am not quite sure yet. We would activate the watchdog only for queuing on the highest priority, yet still we would count any dequeuing as progress, no matter the priority. Which scenario does this avoid and for which scenario is it still helpful?

bin/varnishd/cache/cache.h

mbgrydeland · 2021-03-01T12:01:54Z

The test case demonstrates a setup where the watchdog triggers for a benign traffic pattern. It is very constructed of course, with a small set of threads that are completely tasked with being H/2 session threads. But still the traffic is completely legitimate on a small scale, and Varnish behaves as is programmed just with strict limits on the resources available. But instead of working while severly limited in the throughput, the watchdog triggers, which I think is wrong.

This points to an area of Varnish that is not functioning very well, and that is how to limit the amount of work the given Varnish instance should accept and do. The only mechanism available to tune that is thread_pool_max, and it is not really working well at all. There are situations where the max limit makes it so that the work already comitted to can't move forward because more threads are needed. H/2 is particularly susceptive to this, as this test case shows.

For further developments in this area I wonder if we need to move towards treating thread_pool_max as a soft limit. As in once reached, breaks are applied (not accepting new connections, or h/2 streams, possibly starting to hang up live connections to free up threads). But still allow new threads to be created when necessary to make progress.

This came up while investingating a case where the watchdog triggers for no obvious good reason. There is more to the story, and this isn't the root cause nor the solution, and we are still working through some H/2 issues found. But still, Varnish should have just worked slowly instead of restarting.

nigoroll · 2021-03-01T12:29:00Z

Thank you @mbgrydeland for the context information. I take this as confirmation to my previous points.
Regarding thread_pool_max as a soft limit: In my mind, this is

orthogonal to the watchdog, as thread creation could still fail
effectively equivalent to the reserve

As, ultimately, there will always be a hard resource limit, I think we can just keep the current thread limits and either remove the watchdog or tame it one way or the other.

mbgrydeland · 2021-03-01T13:45:19Z

Regarding this PR, I am not quite sure yet. We would activate the watchdog only for queuing on the highest priority, yet still we would count any dequeuing as progress, no matter the priority. Which scenario does this avoid and for which scenario is it still helpful?

Since the highest priority is always dequeued first, there is no other type of dequeueing that will count as progress. So with this patch, the watchdog is only active and involved with prio zero.

I am also very much in doubt of the whole watchdog mechanism. The reasoning for keeping it I guess is because of H/1 and waitinglists. When a conn goes on waitinglist, there is no thread associated with it any longer, and therefor no actor to implement a timeout. If the backend fetch that is needed to resolve the waitinglist is queued but never scheduled, that effectively becomes a deadlock. The watchdog I guess still serves as a 'eventually fail' for such a scenario

Some sort of waitinglist timeout may also be a building block in this general area.

mbgrydeland · 2021-03-01T17:42:18Z

Bugwash decision:

15:31:33 < slink> we could still introduce thread_pool_watchdog=0s to deactivate and make that the default
15:31:40 < martin> yeah
15:32:16 < slink> any other opinions?
15:32:41 < dridi> i'm with martin overall, we've discussed it already before he submitted the pr
15:32:44 < phk> so what's the disposition of this ticket ? Think more ?
15:33:26 < martin> my position is that it removes the watchdog from triggering for a benign case, and is an
improvement
15:33:36 < martin> but there is more work to be done in this area
15:35:00 < slink> so merge PR now and either add watchdog deactivation or remove watchdog later?
15:35:07 < hermunn> Merge it and come back to the watchdog after 15 March?
15:35:11 < martin> sgtm
15:35:18 < slink> phk?
15:35:28 < phk> I'm +0
15:35:30 < dridi> hermunn: after march, freeze month
15:35:38 < slink> ok then
15:35:39 < hermunn> dridi: agreed.

nigoroll reviewed Mar 1, 2021

View reviewed changes

bin/varnishd/cache/cache.h Show resolved Hide resolved

mbgrydeland merged commit e4ea445 into varnishcache:master Mar 1, 2021

rezan mentioned this pull request Apr 14, 2021

[LTS Backport] Watchdog priority #3591

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit watchdog to highest priority only #3537

Limit watchdog to highest priority only #3537

mbgrydeland commented Mar 1, 2021

nigoroll left a comment

mbgrydeland commented Mar 1, 2021

nigoroll commented Mar 1, 2021

mbgrydeland commented Mar 1, 2021

mbgrydeland commented Mar 1, 2021

Limit watchdog to highest priority only #3537

Limit watchdog to highest priority only #3537

Conversation

mbgrydeland commented Mar 1, 2021

nigoroll left a comment

Choose a reason for hiding this comment

mbgrydeland commented Mar 1, 2021

nigoroll commented Mar 1, 2021

mbgrydeland commented Mar 1, 2021

mbgrydeland commented Mar 1, 2021