
Govern thread creation by queue length instead of dry signals #2942

Merged
mbgrydeland merged 3 commits into varnishcache:master from the master-dryfix branch on Nov 18, 2019

Conversation

@mbgrydeland (Contributor)

When getidleworker signals the pool herder to spawn a thread, it increments the dry counter, and the herder resets dry again when spawning a single thread. In many cases this creates only a single thread even though the herder was signaled dry multiple times, and it may cause a situation where the cache acceptor is queued and no new thread is created. Together with long-lasting tasks (i.e. endless pipelines), and all other tasks having higher priority, this prevents the cache acceptor from being rescheduled.

c00096.vtc demonstrates how this can lock up.

To fix this, spawn threads if we have queued tasks and we are below the thread maximum level.
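A minimal sketch of the changed breed decision, assuming a simplified pool structure; the field names (lqueue, nthr, dry) follow the description above, and the rest is illustrative rather than the actual cache_wrk.c change.

#include <pthread.h>

/* Simplified stand-in for the real struct pool in cache_wrk.c. */
struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	nthr;	/* threads currently in the pool */
	unsigned	lqueue;	/* tasks waiting in the queue */
	unsigned	dry;	/* the old signal counter */
};

static int
herder_should_breed(const struct pool_sketch *pp, unsigned thread_pool_max)
{
	/*
	 * Old behaviour: breed only when a dry signal was recorded; the
	 * herder then wiped the counter, so several signals could collapse
	 * into a single new thread.
	 *
	 *	return (pp->dry > 0);
	 *
	 * New behaviour: breed as long as tasks are queued and we are
	 * below the configured maximum.
	 */
	return (pp->lqueue > 0 && pp->nthr < thread_pool_max);
}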

@dridi (Member) commented Mar 13, 2019

Your patch's test case is failing on Travis CI.

@nigoroll (Member)

Besides the test case, this sounds very reasonable to me, in particular because we still have wthread_add_delay: if there is still a queue after the delay, we really should start more threads.

@dridi (Member) commented Mar 13, 2019

FYI: combining this patch set with @nigoroll's patch from #2796, I can still lock Varnish up with h2load.

Edit: less frequently it would seem.

@mbgrydeland (Contributor, Author)

@dridi What do the relevant Varnish counters look like during that test? Do you see threads_limited increase?

@mbgrydeland (Contributor, Author) commented Mar 13, 2019

Hmm, looking at the Travis log, it seems my patch is not doing the trick well enough: one of the thread-eating backend fetches was not started properly.

On my local machine, the test case failed before the patch and completed fine after it. But it is likely that the conditions are not exactly the same; the initial conditions in particular are hard to control. Varnish starts one acceptor thread for each listening interface of the host, which makes the number of idle threads at the start of the test case dependent on the number of interfaces.

mbgrydeland added a commit that referenced this pull request Apr 23, 2019
It seems that these test cases were suffering under the problem that #2942
addresses. Set a minimum thread pool so that there will be adequate
threads available before the test begins.
nigoroll added a commit that referenced this pull request Apr 23, 2019
(i hope)

Trouble here is that in pool_herder(), we access pp->dry unprotected, so
we might see an old value, thus we might breed more than wthread_min
even if the dry condition does not exist any more.

So for the vtc, we need to wait until wthread_timeout has passed and the
surplus thread has been kissed to death.

Notice that this does not change with #2942 because there the same
unprotected access happens to lqueue.
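A hedged sketch of the unsynchronized read described in this commit message; the real pool_herder() in bin/varnishd/cache/cache_wrk.c is more involved, and the helper below is invented for illustration.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	dry;	/* bumped by getidleworker() when signalling */
	unsigned	nthr;
};

static void
herder_iteration(struct pool_sketch *pp, unsigned wthread_min)
{
	/*
	 * pp->dry is read here without holding pp->mtx, so the herder may
	 * observe a stale non-zero value after the dry condition has
	 * already been resolved, and breed a thread beyond wthread_min.
	 */
	if (pp->nthr < wthread_min || pp->dry > 0) {
		/* spawn_thread(pp);  -- invented placeholder */
		pthread_mutex_lock(&pp->mtx);
		pp->nthr++;
		pp->dry = 0;	/* the reset mentioned in the PR description */
		pthread_mutex_unlock(&pp->mtx);
	}
}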
@mbgrydeland (Contributor, Author)

Rebased and force-pushed to see if #2986 was the cause of Travis' failures

@dridi (Member) commented Apr 26, 2019

@mbgrydeland, it may be related to #2796 too. See 6826070: the test case mentioned in its commit log exhibits the same behavior without @nigoroll's change, but less frequently.

@nigoroll (Member) left a review comment

Other than that, I think it makes a lot of sense.

Review comment on bin/varnishd/cache/cache_wrk.c (outdated, resolved)
rezan pushed a commit to rezan/varnish-cache that referenced this pull request Jul 11, 2019
It seems that these test cases were suffering under the problem that varnishcache#2942
addresses. Set a minimum thread pool so that there will be adequate
threads available before the test begins.
rezan pushed a commit to rezan/varnish-cache that referenced this pull request Jul 11, 2019
(i hope)

Trouble here is that in pool_herder(), we access pp->dry unprotected, so
we might see an old value, thus we might breed more than wthread_min
even if the dry condition does not exist any more.

So for the vtc, we need to wait until wthread_timeout has passed and the
surplus thread has been kissed to death.

Notice that this does not change with varnishcache#2942 because there the same
unprotected access happens to lqueue.
@nigoroll (Member) commented Oct 2, 2019

@mbgrydeland close now and revisit later?

@dridi (Member) commented Oct 25, 2019

Looking at the CI problem, I tried to stress c96 locally and ran into a different failure:

**   top  === varnish v1 -expect MAIN.threads == 10
---- v1   Not true: MAIN.threads (14) == 10 (10)

@dridi (Member) commented Oct 25, 2019

I was also able to reproduce the CI failure locally. In both cases, simply running c96 with -n100 and -j32 is enough to eventually run into either failure.

@dridi (Member) commented Oct 25, 2019

With this change I'm able to improve the test case's readability and stabilize the timeout problem:

diff --git a/bin/varnishtest/tests/c00096.vtc b/bin/varnishtest/tests/c00096.vtc
--- a/bin/varnishtest/tests/c00096.vtc
+++ b/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
        txresp -body "All good"
 } -start
 
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
                        set bereq.backend = stest;

The addition is the flush_head debug bit, needed to make rxresphdrs more reliable in varnishtest. I will try to understand the expectation failure.

@dridi (Member) commented Oct 25, 2019

Final diff to pass the test after multiple runs with -n1000 -j40:

diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
        txresp -body "All good"
 } -start
 
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
                        set bereq.backend = stest;
@@ -39,7 +41,10 @@ varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_poo
        }
 } -start
 
-varnish v1 -expect MAIN.threads == 10
+# NB: we might go above 10 threads when early tasks are submitted to
+# the pool since at least one idle thread must be kept in the pool
+# reserve.
+varnish v1 -expect MAIN.threads >= 10
 
 client c1 {
        txreq -url /1

What we really need here instead of -expect is a synchronization point:

varnish v1 -await 3s MAIN.threads == 10

If you think this would make sense, I can attempt a patch. Otherwise I couldn't find a problem with the non-VTC part of the pull request besides what @nigoroll already mentioned.

@mbgrydeland (Contributor, Author)

Thank you @dridi and @nigoroll for looking after this.

@dridi (Member) commented Oct 25, 2019

@mbgrydeland and I agreed that I would update the pull request with the suggested diffs. It was also rebased against current master.

@dridi requested a review from nigoroll on October 25, 2019 at 15:57
@dridi (Member) commented Oct 25, 2019

Following up on my offline discussion with @mbgrydeland regarding -expect vs -await: there is currently a grace period of roughly 5 seconds (50 times 100 ms) for the counter to show up, not for it to match the expected value. So we currently don't have the -await behavior I'm suggesting. I will open a separate pull request for that.

@dridi (Member) commented Oct 26, 2019

Actually no, @mbgrydeland was right, so I'm wondering what may take up to 21 threads even after 5 seconds.

@dridi (Member) commented Oct 27, 2019

I looked a bit closer and realized that at this point only one task is really added to the pool, but there is a window between the moment a thread is created and the moment it goes idle. If that single acceptor task is added during that window, we may create *dozens* of threads beyond thread_pool_min, because until we have two idle threads (one in the reserve with this configuration, and one not subject to the reserve constraint) we may keep creating more.

So this fixes a bug (try the test case on the master branch if you have any doubts) but at the same time introduces a creation race.

With this diff, the test case is very stable:

diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -31,6 +31,7 @@ server stest {
 
 varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
 varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -arg "-p thread_pool_add_delay=0.01"
 varnish v1 -vcl+backend {
        sub vcl_backend_fetch {
                if (bereq.url == "/test") {
@@ -41,10 +42,7 @@ varnish v1 -vcl+backend {
        }
 } -start
 
-# NB: we might go above 10 threads when early tasks are submitted to
-# the pool since at least one idle thread must be kept in the pool
-# reserve.
-varnish v1 -expect MAIN.threads >= 10
+varnish v1 -expect MAIN.threads == 10
 
 client c1 {
        txreq -url /1

I think the bug fix might be incomplete, but I'm not sure how to close the race.

@dridi (Member) commented Oct 27, 2019

More digging: I thought the same race would exist with dry signals, considering that this patch is not ground-breaking. Looking closer at the code, we have the same kind of logic where we increment dry but never decrement it; instead we wipe it, which may lead to the bug @mbgrydeland fixed.

So the same race exists, and I was able to exhibit it:

varnishtest "over-breeding"

varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
varnish v1 -vcl {
        backend be none;
} -start

varnish v1 -expect MAIN.threads == 10

If you stress this VTC enough, the last line may fail because 11 threads are accounted for. Relying on the queue length instead opens the gates and increases the opportunity for over-breeding, and in some cases I observed 50 threads created for one acceptor task.

My suggestion is that, in addition to governing thread creation by queue length, new threads should start as idle and go to sleep if the queue is empty. Alternatively, we can dismiss over-breeding as a problem since we technically always stay within the min/max bounds.

@dridi (Member) commented Oct 28, 2019

I think there is more to it.

I tried to make the (struct pool).nidle counter eventually consistent: it is incremented before the thread is created and decremented if that failed. Once a thread goes through its first task lookup, it adjusts nidle accordingly (but only for the very first iteration). With that, I lowered the probability of over-breeding by an order of magnitude, but it still happened.

Looking closer, I believe we need to always guard accesses to nidle, nthr and lqueue, which is not always the case, so a data race exists.
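For reference, a rough sketch of that experiment under the same simplifying assumptions as the earlier sketches (invented names, not a patch against varnish-cache): nidle is bumped optimistically before pthread_create() and corrected on failure, and the worker adjusts it once at its first task lookup.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	nidle;	/* idle threads, counted optimistically */
	unsigned	lqueue;	/* queued tasks */
};

static void *worker_sketch(void *priv);

static void
breed_one(struct pool_sketch *pp)
{
	pthread_t tid;

	pthread_mutex_lock(&pp->mtx);
	pp->nidle++;		/* assume the newborn thread will go idle */
	pthread_mutex_unlock(&pp->mtx);

	if (pthread_create(&tid, NULL, worker_sketch, pp) != 0) {
		pthread_mutex_lock(&pp->mtx);
		pp->nidle--;	/* creation failed, undo the increment */
		pthread_mutex_unlock(&pp->mtx);
	}
}

static void *
worker_sketch(void *priv)
{
	struct pool_sketch *pp = priv;

	/*
	 * First task lookup only: if there is queued work, the thread does
	 * not go idle, so the optimistic increment is taken back here.
	 */
	pthread_mutex_lock(&pp->mtx);
	if (pp->lqueue > 0)
		pp->nidle--;
	pthread_mutex_unlock(&pp->mtx);
	/* ... regular task loop omitted ... */
	return (NULL);
}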

@nigoroll (Member) previously approved these changes Oct 28, 2019 and left a comment

as discussed:

(14:50:47) slink: I would think that occasionally over/underbreeding would be better than introducing more lock contention?!
(14:52:47) dridi: that's my suggestion
(14:53:08) slink: ok, we agree. Is the patch ready as-is?
(14:53:23) dridi: we already have a window for over-breeding, but that window increases with 2942
(14:53:31) slink: yes
(14:53:42) slink: and I do not see an issue with that
(14:53:57) slink: I see that underbreeding is far worse than overbreeding
(14:54:06) dridi: if that's the consensus, then 2942 is ready in my opinion
(14:54:06) slink: and IIUC the PR fixes that

@dridi (Member) commented Oct 28, 2019

I still would like to give @mbgrydeland an opportunity to have a look before merging if that's fine with you.

@nigoroll (Member) commented Nov 5, 2019

@mbgrydeland if this will be reworked, could we close the PR in the meantime?

@mbgrydeland (Contributor, Author)

I had a quick look at the overbreeding effect that this patch introduces. It is the result of the herder not taking into account the threads it has already created: unless these new threads have already dequeued tasks by the herder's next loop iteration, it will continue to create more.

One way to fix this race would be for the herder to dequeue tasks as part of thread creation and hand them to the newly created threads as their first task; each thread would become a regular pool worker once that first task is finished. That way lqueue would be decremented by the herder, and it would have a completely accurate picture when deciding whether to create another thread. This solution would require some refactoring, though, and I don't really like it.

A simpler approach would be for the herder to keep track of how many newly created threads have not yet reached the first task-dequeue point in Pool_Work_Thread(). The herder would do newborn++ in pool_breed(), and Pool_Work_Thread() would do newborn-- once and only once. The breed condition would then change from 'lqueue > 0' to 'lqueue - newborn > 0'. Some locking changes would also be needed so that lqueue and newborn are tested under the lock, instead of without it as the current patch does.
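A hedged sketch of the newborn idea, using the same simplified pool as the earlier sketches; the function names only indicate roughly where each piece would live (pool_breed(), Pool_Work_Thread()), and the real code would differ.

#include <pthread.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	unsigned	lqueue;		/* queued tasks */
	unsigned	nthr;		/* threads in the pool */
	unsigned	newborn;	/* bred, but not yet at first dequeue */
};

/* Herder side, roughly the decision made before calling pool_breed(). */
static int
herder_breed_decision(struct pool_sketch *pp, unsigned thread_pool_max)
{
	int breed;

	pthread_mutex_lock(&pp->mtx);	/* test lqueue and newborn under the lock */
	breed = (pp->lqueue > pp->newborn && pp->nthr < thread_pool_max);
	if (breed) {
		pp->newborn++;		/* newborn++ in pool_breed() */
		pp->nthr++;
	}
	pthread_mutex_unlock(&pp->mtx);
	return (breed);
}

/* Worker side, at the first task dequeue in Pool_Work_Thread(). */
static void
first_dequeue(struct pool_sketch *pp)
{
	pthread_mutex_lock(&pp->mtx);
	pp->newborn--;			/* once and only once per thread */
	pthread_mutex_unlock(&pp->mtx);
}

Writing the condition as 'lqueue > newborn' rather than 'lqueue - newborn > 0' avoids unsigned underflow but is otherwise the same test.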

@dridi (Member) commented Nov 6, 2019

This is similar to my experiment, except that I went for lqueue > newborn and repurposed nidle instead of introducing a new field. Either way, you reached the same conclusions with regard to locking.

How do we proceed? And who does the implementation? (@mbgrydeland, please tell me you are planning to do this =)

@nigoroll (Member) commented Nov 6, 2019

This change fixes a serious underbreeding problem (which, in fact, I ran into just yesterday).
My take is that we should fix this and not care about the occasional overbreeding, because fixing it would add more locking overhead.

So, back to first principles, why is overbreeding a problem?

@mbgrydeland (Contributor, Author)

So, back to first principles, why is overbreeding a problem?

Not a problem per se, but it is something that is more likely to happen when the system is busy than when it is not, which would make the system even busier and over-breeding even more likely.

@dridi (Member) commented Nov 6, 2019

So, back to first principles, why is overbreeding a problem?

Another reason is that we should avoid creating spurious VSC spikes, which may result in puzzling graphs or misleading alerts from monitoring systems.

@dridi self-requested a review on November 6, 2019 at 13:30
@nigoroll (Member) commented Nov 6, 2019

I see your points, but in this case I really lean towards avoiding the additional complications, and, in particular, the additional pp->mtx ops.

If there are no other objections, I will go with this. But my personal point of view is that this rather exceptional situation, mostly seen in artificial testing, should not be handled with additional overhead.
Regarding the argument that creating even more threads in an overload situation is a bad idea: this is exactly what I had in mind when adding the sched_yield(). If the machine is actually saturated, we will simply not get back to the herder in the first place; conversely, if we do, it means we have CPUs available to create more threads.

@nigoroll self-requested a review on November 6, 2019 at 14:03
@nigoroll dismissed their stale review on November 6, 2019 at 14:04

Abstaining my vote given the new overbreeding-avoidance code, as explained

@nigoroll requested a review from bsdphk and removed the request for nigoroll on November 6, 2019 at 14:05
@nigoroll (Member) commented Nov 6, 2019

@mbgrydeland @dridi To make progress here, because, as mentioned, this PR is important: can we maybe split out the overbreed-avoidance and agree on the patch without the last commit first?

@mbgrydeland (Contributor, Author)

I haven't been involved in all the discussions on this and maybe I misunderstood the previous comments, but I thought sched_yield() was found not to solve the overbreeding problem? If it does, then I have no problem using that instead of the proposed patch.

On the pp->mtx concern: I do not think one more lock/unlock per MIN(thread_add_delay, thread_timeout) is a problem.

@nigoroll (Member) commented Nov 6, 2019

@mbgrydeland the sched_yield() does improve on it, but does not solve it as far as @dridi's test is concerned.

My concern regarding pp->mtx is that it is central to all scheduling functions, so we should simply avoid any additional load on it; in other words, we should have a very good reason to increase its use. In particular, our newly created threads need pp->mtx to begin useful work, so grabbing it one more time in the herder will increase the chances of further delays.
Note that wthread_add_delay defaults to 0.

Also, please rebase on master so that the sched_yield() is in your branch.

mbgrydeland and others added 3 commits November 13, 2019 16:48
When getidleworker signals the pool herder to spawn a thread, it increases
the dry counter, and the herder resets dry again when spawning a single
thread. This will in many cases only create a single thread even though
the herder was signaled dry multiple times, and may cause a situation
where the cache acceptor is queued and no new thread is created. Together
with long lasting tasks (ie endless pipelines), and all other tasks having
higher priority, this will prevent the cache acceptor from being
rescheduled.

c00096.vtc demonstrates how this can lock up.

To fix this, spawn threads if we have queued tasks and we are below the
thread maximum level.
Fix a small memory leak when failing to spawn a new thread.
@mbgrydeland (Contributor, Author)

Force-pushed a rebase on master and removed the overbreeding patch.

@mbgrydeland (Contributor, Author)

I think this PR is good to go at this point. Marking for bugwash for that purpose.

@mbgrydeland merged commit 3660622 into varnishcache:master on Nov 18, 2019
@mbgrydeland deleted the master-dryfix branch on November 18, 2019 at 14:32
dridi added a commit that referenced this pull request Oct 28, 2020
Spotted on CircleCI:

    ---- v1    Not true: MAIN.threads (26) == 10 (10)

Refs #2942
dridi added a commit that referenced this pull request Feb 23, 2021
Since the removal of dry signals, pools will spin when they run out of
threads and increment MAIN.threads_limited at a very high rate. That
spike in CPU consumption will also have detrimental effects on useful
tasks.

This change introduces a 1s delay when the pool is saturated. This
allows to periodically exercise the watchdog check. We could wait until
the watchdog times out but we would also miss potential updates to the
thread_pool_watchdog parameter.

To avoid spurious increments of MAIN.threads_limited we only take task
submissions into account and ignore timeouts of the condvar.

Refs #2942
Refs #3531
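A hedged sketch of the saturation delay described in that commit message, again over a simplified pool; the condvar and statistic names below are stand-ins, and the actual change in cache_wrk.c differs.

#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

struct pool_sketch {
	pthread_mutex_t	mtx;
	pthread_cond_t	herder_cond;	/* signalled on task submission */
	unsigned	lqueue;
	uint64_t	threads_limited;
};

/* Called by the herder when the pool is saturated (at thread_pool_max). */
static void
herder_wait_saturated(struct pool_sketch *pp)
{
	struct timespec ts;
	int err;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += 1;		/* wait at most 1s so the watchdog check still runs */

	pthread_mutex_lock(&pp->mtx);
	err = pthread_cond_timedwait(&pp->herder_cond, &pp->mtx, &ts);
	if (err != ETIMEDOUT && pp->lqueue > 0)
		pp->threads_limited++;	/* count task submissions, not condvar timeouts */
	pthread_mutex_unlock(&pp->mtx);
}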