Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit watchdog to highest priority only #3537

Merged

Conversation

mbgrydeland
Copy link
Contributor

The watchdog mechanism currently triggers when any queueing is happening,
regardless of the priority. Strictly speaking it is only the backend
fetches that are critical to get executed, and this prevents the thread
limits to be used as limits on the amount of work the Varnish instance
should handle.

This can be especially important for instances with H/2 enabled, as these
connections will be holding threads for extended periods of time, possibly
triggering the watchdog in benign situations.

This patch limits the watchdog to only trigger for no queue development
on the highest priority queue.

The watchdog mechanism currently triggers when any queueing is happening,
regardless of the priority. Strictly speaking it is only the backend
fetches that are critical to get executed, and this prevents the thread
limits to be used as limits on the amount of work the Varnish instance
should handle.

This can be especially important for instances with H/2 enabled, as these
connections will be holding threads for extended periods of time, possibly
triggering the watchdog in benign situations.

This patch limits the watchdog to only trigger for no queue development
on the highest priority queue.
Copy link
Member

@nigoroll nigoroll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe elaborate on the case which motivated this PR?

I got doubts if we are heading in the right direction with the watchdog. First of all, to my best understanding, when the watchdog was introduced, it was not much else than a bandaid to work around the underlying root cause, which is the fact that tasks require other tasks to complete. I believe that with the queue priorities, we handle this case appropriately.
I agree that a varnish instance may work just fine and still trigger the watchdog, so I would argue that we could also remove it altogether.

Regarding this PR, I am not quite sure yet. We would activate the watchdog only for queuing on the highest priority, yet still we would count any dequeuing as progress, no matter the priority. Which scenario does this avoid and for which scenario is it still helpful?

bin/varnishd/cache/cache.h Show resolved Hide resolved
@mbgrydeland
Copy link
Contributor Author

The test case demonstrates a setup where the watchdog triggers for a benign traffic pattern. It is very constructed of course, with a small set of threads that are completely tasked with being H/2 session threads. But still the traffic is completely legitimate on a small scale, and Varnish behaves as is programmed just with strict limits on the resources available. But instead of working while severly limited in the throughput, the watchdog triggers, which I think is wrong.

This points to an area of Varnish that is not functioning very well, and that is how to limit the amount of work the given Varnish instance should accept and do. The only mechanism available to tune that is thread_pool_max, and it is not really working well at all. There are situations where the max limit makes it so that the work already comitted to can't move forward because more threads are needed. H/2 is particularly susceptive to this, as this test case shows.

For further developments in this area I wonder if we need to move towards treating thread_pool_max as a soft limit. As in once reached, breaks are applied (not accepting new connections, or h/2 streams, possibly starting to hang up live connections to free up threads). But still allow new threads to be created when necessary to make progress.

This came up while investingating a case where the watchdog triggers for no obvious good reason. There is more to the story, and this isn't the root cause nor the solution, and we are still working through some H/2 issues found. But still, Varnish should have just worked slowly instead of restarting.

@nigoroll
Copy link
Member

nigoroll commented Mar 1, 2021

Thank you @mbgrydeland for the context information. I take this as confirmation to my previous points.
Regarding thread_pool_max as a soft limit: In my mind, this is

  • orthogonal to the watchdog, as thread creation could still fail
  • effectively equivalent to the reserve

As, ultimately, there will always be a hard resource limit, I think we can just keep the current thread limits and either remove the watchdog or tame it one way or the other.

@mbgrydeland
Copy link
Contributor Author

Regarding this PR, I am not quite sure yet. We would activate the watchdog only for queuing on the highest priority, yet still we would count any dequeuing as progress, no matter the priority. Which scenario does this avoid and for which scenario is it still helpful?

Since the highest priority is always dequeued first, there is no other type of dequeueing that will count as progress. So with this patch, the watchdog is only active and involved with prio zero.

I am also very much in doubt of the whole watchdog mechanism. The reasoning for keeping it I guess is because of H/1 and waitinglists. When a conn goes on waitinglist, there is no thread associated with it any longer, and therefor no actor to implement a timeout. If the backend fetch that is needed to resolve the waitinglist is queued but never scheduled, that effectively becomes a deadlock. The watchdog I guess still serves as a 'eventually fail' for such a scenario

Some sort of waitinglist timeout may also be a building block in this general area.

@mbgrydeland
Copy link
Contributor Author

Bugwash decision:

15:31:33 < slink> we could still introduce thread_pool_watchdog=0s to deactivate and make that the default
15:31:40 < martin> yeah
15:32:16 < slink> any other opinions?
15:32:41 < dridi> i'm with martin overall, we've discussed it already before he submitted the pr
15:32:44 < phk> so what's the disposition of this ticket ? Think more ?
15:33:26 < martin> my position is that it removes the watchdog from triggering for a benign case, and is an
improvement
15:33:36 < martin> but there is more work to be done in this area
15:35:00 < slink> so merge PR now and either add watchdog deactivation or remove watchdog later?
15:35:07 < hermunn> Merge it and come back to the watchdog after 15 March?
15:35:11 < martin> sgtm
15:35:18 < slink> phk?
15:35:28 < phk> I'm +0
15:35:30 < dridi> hermunn: after march, freeze month
15:35:38 < slink> ok then
15:35:39 < hermunn> dridi: agreed.

@mbgrydeland mbgrydeland merged commit e4ea445 into varnishcache:master Mar 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants