
Govern thread creation by queue length instead of dry signals #2942

Merged
mbgrydeland merged 3 commits into varnishcache:master from the master-dryfix branch on Nov 18, 2019

Conversation

@mbgrydeland (Contributor)

When getidleworker signals the pool herder to spawn a thread, it increments the dry counter, and the herder resets dry again when spawning a single thread. In many cases this creates only a single thread even though the herder was signaled dry multiple times, and it may cause a situation where the cache acceptor is queued and no new thread is created. Together with long-lasting tasks (i.e. endless pipelines), and all other tasks having higher priority, this prevents the cache acceptor from being rescheduled.

c00096.vtc demonstrates how this can lock up.

To fix this, spawn threads if we have queued tasks and we are below the thread maximum level.
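A minimal sketch of the changed breed decision, assuming a simplified pool structure; the field names (lqueue, nthr, dry) follow the description above, and the rest is illustrative rather than the actual cache_wrk.c change.

#include <pthread.h>

/* Simplified stand-in for the real struct pool in cache_wrk.c. */
struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	nthr;	/* threads currently in the pool */
	unsigned	lqueue;	/* tasks waiting in the queue */
	unsigned	dry;	/* the old signal counter */
};

static int
herder_should_breed(const struct pool_sketch *pp, unsigned thread_pool_max)
{
	/*
	 * Old behaviour: breed only when a dry signal was recorded; the
	 * herder then wiped the counter, so several signals could collapse
	 * into a single new thread.
	 *
	 *	return (pp->dry > 0);
	 *
	 * New behaviour: breed as long as tasks are queued and we are
	 * below the configured maximum.
	 */
	return (pp->lqueue > 0 && pp->nthr < thread_pool_max);
}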

@dridi (Member) commented Mar 13, 2019

Your patch's test case is failing on Travis CI.

@nigoroll (Member)

Besides the test case, this sounds very reasonable to me, in particular because we still have wthread_add_delay: if there is still a queue after the delay, we really should start more threads.

@dridi (Member) commented Mar 13, 2019

FYI: combining this patch set with @nigoroll's patch from #2796, I can still lock Varnish up with h2load.

Edit: less frequently it would seem.

@mbgrydeland (Contributor, Author)

@dridi What do the relevant Varnish counters look like during that test? Do you see threads_limited increase?

@mbgrydeland (Contributor, Author) commented Mar 13, 2019

Hmm, looking at the Travis log, it seems my patch is not doing the trick well enough: one of the thread-eating backend fetches was not started properly.

On my local machine, the test case failed before the patch and completed fine after it. But it is likely that the conditions are not exactly the same; the initial conditions in particular are hard to control. Varnish starts one acceptor thread for each listening interface of the host, which makes the number of idle threads at the start of the test case dependent on the number of interfaces.

mbgrydeland added a commit that referenced this pull request Apr 23, 2019
It seems that these test cases were suffering under the problem that #2942
addresses. Set a minimum thread pool so that there will be adequate
threads available before the test begins.
nigoroll added a commit that referenced this pull request Apr 23, 2019
(i hope)

Trouble here is that in pool_herder(), we access pp->dry unprotected, so
we might see an old value, thus we might breed more than wthread_min
even if the dry condition does not exist any more.

So for the vtc, we need to wait until wthread_timeout has passed and the
surplus thread has been kissed to death.

Notice that this does not change with #2942 because there the same
unprotected access happens to lqueue.
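A hedged sketch of the unsynchronized read described in this commit message; the real pool_herder() in bin/varnishd/cache/cache_wrk.c is more involved, and the helper below is invented for illustration.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	dry;	/* bumped by getidleworker() when signalling */
	unsigned	nthr;
};

static void
herder_iteration(struct pool_sketch *pp, unsigned wthread_min)
{
	/*
	 * pp->dry is read here without holding pp->mtx, so the herder may
	 * observe a stale non-zero value after the dry condition has
	 * already been resolved, and breed a thread beyond wthread_min.
	 */
	if (pp->nthr < wthread_min || pp->dry > 0) {
		/* spawn_thread(pp);  -- invented placeholder */
		pthread_mutex_lock(&pp->mtx);
		pp->nthr++;
		pp->dry = 0;	/* the reset mentioned in the PR description */
		pthread_mutex_unlock(&pp->mtx);
	}
}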
@mbgrydeland (Contributor, Author)

Rebased and force-pushed to see if #2986 was the cause of Travis' failures

@dridi (Member) commented Apr 26, 2019

@mbgrydeland, it may be related to #2796 too. See 6826070: the test case mentioned in its commit log exhibits the same behavior without @nigoroll's change, but less frequently.

@nigoroll (Member) left a review comment

Other than that, I think it makes a lot of sense.

Review comment on bin/varnishd/cache/cache_wrk.c (outdated, resolved)
rezan pushed a commit to rezan/varnish-cache that referenced this pull request Jul 11, 2019
It seems that these test cases were suffering under the problem that varnishcache#2942
addresses. Set a minimum thread pool so that there will be adequate
threads available before the test begins.
rezan pushed a commit to rezan/varnish-cache that referenced this pull request Jul 11, 2019
(i hope)

Trouble here is that in pool_herder(), we access pp->dry unprotected, so
we might see an old value, thus we might breed more than wthread_min
even if the dry condition does not exist any more.

So for the vtc, we need to wait until wthread_timeout has passed and the
surplus thread has been kissed to death.

Notice that this does not change with varnishcache#2942 because there the same
unprotected access happens to lqueue.
@nigoroll (Member) commented Oct 2, 2019

@mbgrydeland close now and revisit later?

@dridi (Member) commented Oct 25, 2019

Looking at the CI problem, I tried to stress c96 locally and ran into a different failure:

**   top  === varnish v1 -expect MAIN.threads == 10
---- v1   Not true: MAIN.threads (14) == 10 (10)

@dridi (Member) commented Oct 25, 2019

I was also able to reproduce the CI failure locally. In both cases, simply running c96 with -n100 and -j32 is enough to eventually run into either failure.

@dridi (Member) commented Oct 25, 2019

With this change I'm able to improve the test case's readability and stabilize the timeout problem:

diff --git a/bin/varnishtest/tests/c00096.vtc b/bin/varnishtest/tests/c00096.vtc
--- a/bin/varnishtest/tests/c00096.vtc
+++ b/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
        txresp -body "All good"
 } -start
 
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
                        set bereq.backend = stest;

The addition is the flush_head debug bit, needed to make rxresphdrs more reliable in varnishtest. I will try to understand the expectation failure.

@dridi (Member) commented Oct 25, 2019

Final diff to pass the test after multiple runs with -n1000 -j40:

diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
        txresp -body "All good"
 } -start
 
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
                        set bereq.backend = stest;
@@ -39,7 +41,10 @@ varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_poo
        }
 } -start
 
-varnish v1 -expect MAIN.threads == 10
+# NB: we might go above 10 threads when early tasks are submitted to
+# the pool since at least one idle thread must be kept in the pool
+# reserve.
+varnish v1 -expect MAIN.threads >= 10
 
 client c1 {
        txreq -url /1

What we really need here instead of -expect is a synchronization point:

varnish v1 -await 3s MAIN.threads == 10

If you think this would make sense, I can attempt a patch. Otherwise I couldn't find a problem with the non-VTC part of the pull request besides what @nigoroll already mentioned.

@mbgrydeland (Contributor, Author)

Thank you @dridi and @nigoroll for looking after this.

@dridi (Member) commented Oct 25, 2019

@mbgrydeland and I agreed that I would update the pull request with the suggested diffs. It was also rebased against current master.

@dridi requested a review from nigoroll on October 25, 2019 at 15:57
@dridi (Member) commented Oct 25, 2019

Following up on my offline discussion with @mbgrydeland regarding -expect vs -await: there is currently a grace period of roughly 5 seconds (50 times 100 ms) for the counter to show up, not for it to match the expected value. So we currently don't have the -await behavior I'm suggesting. I will open a separate pull request for that.

@dridi (Member) commented Oct 26, 2019

Actually no, @mbgrydeland was right, so I'm wondering what may take up to 21 threads even after 5 seconds.

@dridi (Member) commented Oct 27, 2019

I looked a bit closer and realized that at this point only one task is really added to the pool, but there is a window between the moment a thread is created and the moment it goes idle. If that single acceptor task is added during that window, we may create *dozens* of threads beyond thread_pool_min, because until we have two idle threads (one in the reserve with this configuration, and one not subject to the reserve constraint) we may keep creating more.

So this fixes a bug (try the test case on the master branch if you have any doubts) but at the same time introduces a creation race.

With this diff, the test case is very stable:

diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -31,6 +31,7 @@ server stest {
 
 varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
 varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -arg "-p thread_pool_add_delay=0.01"
 varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
@@ -41,10 +42,7 @@ varnish v1 -vcl+backend {
        }
 } -start
 
-# NB: we might go above 10 threads when early tasks are submitted to
-# the pool since at least one idle thread must be kept in the pool
-# reserve.
-varnish v1 -expect MAIN.threads >= 10
+varnish v1 -expect MAIN.threads == 10
 
 client c1 {
        txreq -url /1

I think the bug fix might be incomplete, but I'm not sure how to close the race.

@dridi (Member) commented Oct 27, 2019

More digging: I thought the same race would exist with dry signals, considering that this patch is not ground-breaking. Looking closer at the code, we have the same kind of logic where we increment dry but never decrement it; instead we wipe it, which may lead to the bug @mbgrydeland fixed.

So the same race exists, and I was able to exhibit it:

varnishtest "over-breeding"

varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
varnish v1 -vcl {
        backend be none;
} -start

varnish v1 -expect MAIN.threads == 10

If you stress this VTC enough, the last line may fail because 11 threads are accounted for. Relying on the queue length instead opens the gates and increases the opportunity for over-breeding, and in some cases I observed 50 threads created for one acceptor task.

My suggestion is that, in addition to governing thread creation by queue length, new threads should start as idle and go to sleep if the queue is empty. Alternatively, we can dismiss over-breeding as a problem since we technically always stay within the min/max bounds.

@dridi (Member) commented Oct 28, 2019

I think there is more to it.

I tried to make the (struct pool).nidle counter eventually consistent: it is incremented before the thread is created and decremented if that failed. Once a thread goes through its first task lookup, it adjusts nidle accordingly (but only for the very first iteration). With that, I lowered the probability of over-breeding by an order of magnitude, but it still happened.

Looking closer, I believe we need to always guard accesses to nidle, nthr and lqueue, which is not always the case, so a data race exists.
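For reference, a rough sketch of that experiment under the same simplifying assumptions as the earlier sketches (invented names, not a patch against varnish-cache): nidle is bumped optimistically before pthread_create() and corrected on failure, and the worker adjusts it once at its first task lookup.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	nidle;	/* idle threads, counted optimistically */
	unsigned	lqueue;	/* queued tasks */
};

static void *worker_sketch(void *priv);

static void
breed_one(struct pool_sketch *pp)
{
	pthread_t tid;

	pthread_mutex_lock(&pp->mtx);
	pp->nidle++;		/* assume the newborn thread will go idle */
	pthread_mutex_unlock(&pp->mtx);

	if (pthread_create(&tid, NULL, worker_sketch, pp) != 0) {
		pthread_mutex_lock(&pp->mtx);
		pp->nidle--;	/* creation failed, undo the increment */
		pthread_mutex_unlock(&pp->mtx);
	}
}

static void *
worker_sketch(void *priv)
{
	struct pool_sketch *pp = priv;

	/*
	 * First task lookup only: if there is queued work, the thread does
	 * not go idle, so the optimistic increment is taken back here.
	 */
	pthread_mutex_lock(&pp->mtx);
	if (pp->lqueue > 0)
		pp->nidle--;
	pthread_mutex_unlock(&pp->mtx);
	/* ... regular task loop omitted ... */
	return (NULL);
}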

@nigoroll (Member) previously approved these changes Oct 28, 2019 and left a comment

as discussed:

(14:50:47) slink: I would think that occasionally over/underbreeding would be better than introducing more lock contention?!
(14:52:47) dridi: that's my suggestion
(14:53:08) slink: ok, we agree. Is the patch ready as-is?
(14:53:23) dridi: we already have a window for over-breeding, but that window increases with 2942
(14:53:31) slink: yes
(14:53:42) slink: and I do not see an issue with that
(14:53:57) slink: I see that underbreeding is far worse than overbreeding
(14:54:06) dridi: if that's the consensus, then 2942 is ready in my opinion
(14:54:06) slink: and IIUC the PR fixes that

@dridi (Member) commented Oct 28, 2019

I still would like to give @mbgrydeland an opportunity to have a look before merging if that's fine with you.

@nigoroll (Member) commented Nov 5, 2019

@mbgrydeland if this will be reworked, could we close the PR in the meantime?

@mbgrydeland (Contributor, Author)

I had a quick look at the overbreeding effect that this patch introduces. It is the result of the herder not taking into account the threads it has already created: unless these new threads have already dequeued tasks by the herder's next loop iteration, it will continue to create more.

One way to fix this race would be for the herder to dequeue tasks as part of thread creation and hand them to the newly created threads as their first task; each thread would become a regular pool worker once that first task is finished. That way lqueue would be decremented by the herder, and it would have a completely accurate picture when deciding whether to create another thread. This solution would require some refactoring, though, and I don't really like it.

A simpler approach would be for the herder to keep track of how many newly created threads have not yet reached the first task-dequeue point in Pool_Work_Thread(). The herder would do newborn++ in pool_breed(), and Pool_Work_Thread() would do newborn-- once and only once. The breed condition would then change from 'lqueue > 0' to 'lqueue - newborn > 0'. Some locking changes would also be needed so that lqueue and newborn are tested under the lock, instead of without it as the current patch does.
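A hedged sketch of the newborn idea, using the same simplified pool as the earlier sketches; the function names only indicate roughly where each piece would live (pool_breed(), Pool_Work_Thread()), and the real code would differ.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	lqueue;		/* queued tasks */
	unsigned	nthr;		/* threads in the pool */
	unsigned	newborn;	/* bred, but not yet at first dequeue */
};

/* Herder side, roughly the decision made before calling pool_breed(). */
static int
herder_breed_decision(struct pool_sketch *pp, unsigned thread_pool_max)
{
	int breed;

	pthread_mutex_lock(&pp->mtx);	/* test lqueue and newborn under the lock */
	breed = (pp->lqueue > pp->newborn && pp->nthr < thread_pool_max);
	if (breed) {
		pp->newborn++;		/* newborn++ in pool_breed() */
		pp->nthr++;
	}
	pthread_mutex_unlock(&pp->mtx);
	return (breed);
}

/* Worker side, at the first task dequeue in Pool_Work_Thread(). */
static void
first_dequeue(struct pool_sketch *pp)
{
	pthread_mutex_lock(&pp->mtx);
	pp->newborn--;			/* once and only once per thread */
	pthread_mutex_unlock(&pp->mtx);
}

Writing the condition as 'lqueue > newborn' rather than 'lqueue - newborn > 0' avoids unsigned underflow but is otherwise the same test.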

@dridi (Member) commented Nov 6, 2019

This is similar to my experiment, except that I went for lqueue > newborn and repurposed nidle instead of introducing a new field. Either way, you reached the same conclusions with regard to locking.

How do we proceed? And who does the implementation? (@mbgrydeland, please tell me you are planning to do this =)

@nigoroll (Member) commented Nov 6, 2019

This change fixes a serious underbreeding problem (which, in fact, I ran into just yesterday).
My take is that we should fix this and not care about the occasional overbreeding, because fixing it would add more locking overhead.

So, back to first principles, why is overbreeding a problem?

@mbgrydeland (Contributor, Author)

So, back to first principles, why is overbreeding a problem?

Not a problem per se, but it is something that is more likely to happen when the system is busy than when it is not, which would make the system even busier and over-breeding even more likely.

@dridi (Member) commented Nov 6, 2019

So, back to first principles, why is overbreeding a problem?

Another reason is that we should avoid creating spurious VSC spikes, which may result in puzzling graphs or misleading alerts from monitoring systems.

@dridi self-requested a review on November 6, 2019 at 13:30
@nigoroll (Member) commented Nov 6, 2019

I see your points, but in this case I really lean towards avoiding the additional complications, and, in particular, the additional pp->mtx ops.

If there are no other objections, I will go with this. But my personal point of view is that this rather exceptional situation, mostly seen in artificial testing, should not be handled with additional overhead.
Regarding the argument that creating even more threads in an overload situation is a bad idea: this is exactly what I had in mind when adding the sched_yield(). If the machine is actually saturated, we will simply not get back to the herder in the first place; conversely, if we do, it means we have CPUs available to create more threads.

@nigoroll self-requested a review on November 6, 2019 at 14:03
@nigoroll dismissed their stale review on November 6, 2019 at 14:04

Abstaining my vote given the new overbreeding-avoidance code, as explained

@nigoroll requested a review from bsdphk and removed the request for nigoroll on November 6, 2019 at 14:05
@nigoroll (Member) commented Nov 6, 2019

@mbgrydeland @dridi To make progress here, because, as mentioned, this PR is important: can we maybe split out the overbreed-avoidance and agree on the patch without the last commit first?

@mbgrydeland (Contributor, Author)

I haven't been involved in all the discussions on this and maybe I misunderstood the previous comments, but I thought sched_yield() was found not to solve the overbreeding problem? If it does, then I have no problem using that instead of the proposed patch.

On the pp->mtx concern: I do not think one more lock/unlock per MIN(thread_add_delay, thread_timeout) is a problem.

@nigoroll (Member) commented Nov 6, 2019

@mbgrydeland the sched_yield() does improve on it, but does not solve it as far as @dridi's test is concerned.

My concern regarding pp->mtx is that it is central to all scheduling functions, so we should simply avoid any additional load on it; in other words, we should have a very good reason to increase its use. In particular, our newly created threads need pp->mtx to begin useful work, so grabbing it one more time in the herder will increase the chances of further delays.
Note that wthread_add_delay defaults to 0.

Also, please rebase on master so that the sched_yield() is in your branch.

mbgrydeland and others added 3 commits November 13, 2019 16:48
When getidleworker signals the pool herder to spawn a thread, it increases
the dry counter, and the herder resets dry again when spawning a single
thread. This will in many cases only create a single thread even though
the herder was signaled dry multiple times, and may cause a situation
where the cache acceptor is queued and no new thread is created. Together
with long lasting tasks (ie endless pipelines), and all other tasks having
higher priority, this will prevent the cache acceptor from being
rescheduled.

c00096.vtc demonstrates how this can lock up.

To fix this, spawn threads if we have queued tasks and we are below the
thread maximum level.
Fix a small memory leak when failing to spawn a new thread.
@mbgrydeland (Contributor, Author)

Force-pushed a rebase on master and removed the overbreeding patch.

@mbgrydeland (Contributor, Author)

I think this PR is good to go at this point. Marking for bugwash for that purpose.

@mbgrydeland merged commit 3660622 into varnishcache:master on Nov 18, 2019
@mbgrydeland deleted the master-dryfix branch on November 18, 2019 at 14:32
dridi added a commit that referenced this pull request Oct 28, 2020
Spotted on CircleCI:

    ---- v1    Not true: MAIN.threads (26) == 10 (10)

Refs #2942
dridi added a commit that referenced this pull request Feb 23, 2021
Since the removal of dry signals, pools will spin when they run out of
threads and increment MAIN.threads_limited at a very high rate. That
spike in CPU consumption will also have detrimental effects on useful
tasks.

This change introduces a 1s delay when the pool is saturated. This
allows to periodically exercise the watchdog check. We could wait until
the watchdog times out but we would also miss potential updates to the
thread_pool_watchdog parameter.

To avoid spurious increments of MAIN.threads_limited we only take task
submissions into account and ignore timeouts of the condvar.

Refs #2942
Refs #3531
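A hedged sketch of the saturation delay described in that commit message, again over a simplified pool; the condvar and statistic names below are stand-ins, and the actual change in cache_wrk.c differs.

#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	pthread_cond_t	herder_cond;	/* signalled on task submission */
	unsigned	lqueue;
	uint64_t	threads_limited;
};

/* Called by the herder when the pool is saturated (at thread_pool_max). */
static void
herder_wait_saturated(struct pool_sketch *pp)
{
	struct timespec ts;
	int err;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += 1;		/* wait at most 1s so the watchdog check still runs */

	pthread_mutex_lock(&pp->mtx);
	err = pthread_cond_timedwait(&pp->herder_cond, &pp->mtx, &ts);
	if (err != ETIMEDOUT && pp->lqueue > 0)
		pp->threads_limited++;	/* count task submissions, not condvar timeouts */
	pthread_mutex_unlock(&pp->mtx);
}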