Limit watchdog to highest priority only #3537
The watchdog mechanism currently triggers when any queueing is happening, regardless of priority. Strictly speaking, only the backend fetches are critical to get executed, and triggering on any queue prevents the thread limits from being used as limits on the amount of work the Varnish instance should handle. This can be especially important for instances with H/2 enabled, as these connections hold threads for extended periods of time, possibly triggering the watchdog in benign situations. This patch limits the watchdog to trigger only when there is no queue development on the highest priority queue.
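A minimal sketch of the check described above, assuming a per-pool counter of highest-priority dequeues. All names here (`pool_sketch`, `watchdog_should_fire`, `PRIO_HIGHEST`, the field names) are hypothetical and are not taken from the Varnish source; the patch itself may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical priorities; index 0 is the highest priority. */
enum { PRIO_HIGHEST = 0, PRIO_COUNT = 3 };

/* Hypothetical per-pool bookkeeping, not Varnish's actual struct. */
struct pool_sketch {
	size_t		queued[PRIO_COUNT];	/* tasks waiting, per priority */
	unsigned long	dequeued_hi;		/* dequeues from the highest priority */
	unsigned long	last_seen_hi;		/* dequeued_hi at the previous check */
	double		stalled_since;		/* start of current stall, < 0 if none */
};

/*
 * Return true only if the highest priority queue has been non-empty
 * and has seen no dequeue for longer than `limit` seconds.
 * Queueing on lower priorities alone never fires the watchdog.
 */
bool
watchdog_should_fire(struct pool_sketch *p, double now, double limit)
{
	bool progressed;

	assert(p != NULL);
	progressed = (p->dequeued_hi != p->last_seen_hi);
	p->last_seen_hi = p->dequeued_hi;

	if (p->queued[PRIO_HIGHEST] == 0 || progressed) {
		p->stalled_since = -1.0;	/* no stall in progress */
		return (false);
	}
	if (p->stalled_since < 0.0)
		p->stalled_since = now;		/* stall first observed now */
	return (now - p->stalled_since > limit);
}
```

With this shape, an instance whose threads are all held by long-lived sessions keeps queueing lower-priority work without tripping the watchdog; only a stalled highest-priority queue counts.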
Can you maybe elaborate on the case which motivated this PR?
I have doubts about whether we are heading in the right direction with the watchdog. First of all, to the best of my understanding, when the watchdog was introduced, it was not much more than a band-aid to work around the underlying root cause, which is the fact that tasks require other tasks to complete. I believe that with the queue priorities, we handle this case appropriately.
I agree that a varnish instance may work just fine and still trigger the watchdog, so I would argue that we could also remove it altogether.
Regarding this PR, I am not quite sure yet. We would activate the watchdog only for queuing on the highest priority, yet still we would count any dequeuing as progress, no matter the priority. Which scenario does this avoid and for which scenario is it still helpful?
The test case demonstrates a setup where the watchdog triggers for a benign traffic pattern. It is very constructed of course, with a small set of threads that are completely tasked with being H/2 session threads. But still the traffic is completely legitimate on a small scale, and Varnish behaves as programmed, just with strict limits on the resources available. But instead of working with severely limited throughput, the watchdog triggers, which I think is wrong.

This points to an area of Varnish that is not functioning very well, and that is how to limit the amount of work the given Varnish instance should accept and do. The only mechanism available to tune that is

For further developments in this area I wonder if we need to move towards treating

This came up while investigating a case where the watchdog triggers for no obvious good reason. There is more to the story, and this isn't the root cause nor the solution, and we are still working through some H/2 issues found. But still, Varnish should have just worked slowly instead of restarting.
Thank you @mbgrydeland for the context information. I take this as confirmation of my previous points.
As, ultimately, there will always be a hard resource limit, I think we can just keep the current thread limits and either remove the watchdog or tame it one way or the other.
Since the highest priority is always dequeued first, there is no other type of dequeueing that will count as progress. So with this patch, the watchdog is only active and involved with prio zero.

I am also very much in doubt about the whole watchdog mechanism. The reason for keeping it, I guess, is H/1 and waitinglists. When a conn goes on a waitinglist, there is no thread associated with it any longer, and therefore no actor to implement a timeout. If the backend fetch that is needed to resolve the waitinglist is queued but never scheduled, that effectively becomes a deadlock. The watchdog, I guess, still serves as an 'eventually fail' for such a scenario. Some sort of waitinglist timeout may also be a building block in this general area.
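The point about dequeue order can be sketched as follows. This is an illustration under assumed names (`prio_queues`, `dequeue_next`, `SKETCH_PRIO_COUNT` are all hypothetical), not the actual Varnish scheduler: as long as the prio zero queue is non-empty, every dequeue comes from it, so no lower-priority dequeue can be mistaken for progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of priorities; index 0 is the highest. */
enum { SKETCH_PRIO_COUNT = 3 };

/* Hypothetical queue lengths per priority, standing in for real task queues. */
struct prio_queues {
	size_t	len[SKETCH_PRIO_COUNT];
};

/*
 * Take one task from the highest non-empty priority.
 * Returns the priority it was taken from, or -1 if all queues are empty.
 */
int
dequeue_next(struct prio_queues *q)
{
	int prio;

	assert(q != NULL);
	for (prio = 0; prio < SKETCH_PRIO_COUNT; prio++) {
		if (q->len[prio] > 0) {
			q->len[prio]--;
			return (prio);
		}
	}
	return (-1);
}
```

Lower priorities are only served once prio zero has drained, which is why a watchdog keyed on prio zero queueing only ever observes prio zero dequeues as progress.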
Bugwash decision: