Govern thread creation by queue length instead of dry signals #2942
Conversation
Your patch's test case is failing on Travis CI. |
besides the test case, this sounds very reasonable to me, in particular because we still have the |
@dridi What do the relevant varnish counters look like during that test? Do you see threads_limited increase? |
Hmm, looking at the Travis log it seems my patch is not doing the trick well enough: one of the thread-eating backend fetches was not started properly. On my local machine I had the test case failing before the patch and completing fine after, but it is likely that the conditions are not exactly the same. |
It seems that these test cases were suffering from the problem that #2942 addresses. Set a minimum thread pool size so that there will be adequate threads available before the test begins.
(i hope) Trouble here is that in pool_herder(), we access pp->dry unprotected, so we might see an old value and thus breed more threads than wthread_min even if the dry condition no longer exists. So for the vtc, we need to wait until wthread_timeout has passed and the surplus thread has been kissed to death. Notice that this does not change with #2942, because there the same unprotected access happens to lqueue.
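To make the race easier to picture, here is a minimal sketch of the dry-driven herder logic under discussion; the struct, function and parameter names are simplified stand-ins, not the actual cache_wrk.c code:

#include <pthread.h>

/* Simplified stand-in for a worker pool; names are illustrative only. */
struct pool {
    pthread_mutex_t mtx;
    pthread_cond_t  herder_cond;
    unsigned        dry;    /* bumped when no idle worker was found */
    unsigned        nthr;   /* worker threads currently in the pool */
};

static void
pool_breed(struct pool *pp)
{
    pp->nthr++;    /* stub; the real code pthread_create()s one worker */
}

/* One herder iteration: pp->dry is read without holding pp->mtx, so the
 * herder can act on a stale value and keep breeding after the dry
 * condition is already gone.  Surplus threads above wthread_min only go
 * away after wthread_timeout, which is why the vtc has to wait that long. */
static void
pool_herder_iteration(struct pool *pp, unsigned wthread_min, unsigned wthread_max)
{
    if (pp->nthr < wthread_min || (pp->dry && pp->nthr < wthread_max)) {
        pool_breed(pp);
        pp->dry = 0;    /* every pending dry signal is wiped at once */
    }
}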
Force-pushed from 709a5c5 to 87d15a7
Rebased and force-pushed to see if #2986 was the cause of Travis' failures |
@mbgrydeland, it may be related to #2796 too. See 6826070 as the test case mentioned in the commit log exhibits the same behavior without @nigoroll's change, but less frequently. |
Other than that, I think it makes a lot of sense.
@mbgrydeland close now and revisit later? |
Looking at the CI problem, I tried to stress c96 locally and ran into a different failure: |
I was also able to reproduce the CI failure locally. In both cases simply running c96 with |
With this change I'm able to improve the test case readability, and stabilize the timeout problem:
diff --git a/bin/varnishtest/tests/c00096.vtc b/bin/varnishtest/tests/c00096.vtc
--- a/bin/varnishtest/tests/c00096.vtc
+++ b/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
txresp -body "All good"
} -start
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
sub vcl_backend_fetch {
if (bereq.url == "/test") {
set bereq.backend = stest;
The addition is the |
Final diff to pass the test after multiple runs with
diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -29,7 +29,9 @@ server stest {
txresp -body "All good"
} -start
-varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_pool_min=10" -vcl+backend {
+varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
+varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -vcl+backend {
sub vcl_backend_fetch {
if (bereq.url == "/test") {
set bereq.backend = stest;
@@ -39,7 +41,10 @@ varnish v1 -arg "-p debug=+syncvsl" -arg "-p thread_pools=1" -arg "-p thread_poo
}
} -start
-varnish v1 -expect MAIN.threads == 10
+# NB: we might go above 10 threads when early tasks are submitted to
+# the pool since at least one idle thread must be kept in the pool
+# reserve.
+varnish v1 -expect MAIN.threads >= 10
client c1 {
txreq -url /1
What we really need here instead of
If you think this would make sense, I can attempt a patch. Otherwise I couldn't find a problem with the non-VTC part of the pull request besides what @nigoroll already mentioned. |
With @mbgrydeland we agreed that I would update the pull request with the suggested diffs. This was also rebased against current master. |
Following up on my offline discussion regarding |
Actually no, @mbgrydeland was right, so I'm wondering what may take up to 21 threads even after 5 seconds. |
I looked a bit closer and realized that at this point only one task is really added to the pool, but there is a window between the moment a thread is created and the moment it goes idle. If that single acceptor task is added during that window, we may create *dozens* of threads beyond So this fixes a bug (try the test case on the master branch if you have any doubts) but at the same time introduces a creation race. With this diff, the test case is very stable:
diff --git i/bin/varnishtest/tests/c00096.vtc w/bin/varnishtest/tests/c00096.vtc
--- i/bin/varnishtest/tests/c00096.vtc
+++ w/bin/varnishtest/tests/c00096.vtc
@@ -31,6 +31,7 @@ server stest {
varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
+varnish v1 -arg "-p thread_pool_add_delay=0.01"
varnish v1 -vcl+backend {
sub vcl_backend_fetch {
if (bereq.url == "/test") {
@@ -41,10 +42,7 @@ varnish v1 -vcl+backend {
}
} -start
-# NB: we might go above 10 threads when early tasks are submitted to
-# the pool since at least one idle thread must be kept in the pool
-# reserve.
-varnish v1 -expect MAIN.threads >= 10
+varnish v1 -expect MAIN.threads == 10
client c1 {
txreq -url /1
I think the bug fix might be incomplete, but I'm not sure how to close the race. |
More digging: I thought the same race would exist with dry signals, considering that this patch is not ground-breaking. Looking closer at the code, we have the same kind of logic where we increment dry but never decrement it. Instead we wipe it, which may lead to the bug @mbgrydeland fixed. So the same race exists, and I was able to exhibit it:
varnishtest "over-breeding"
varnish v1 -arg "-p debug=+syncvsl -p debug=+flush_head"
varnish v1 -arg "-p thread_pools=1 -p thread_pool_min=10"
varnish v1 -vcl {
backend be none;
} -start
varnish v1 -expect MAIN.threads == 10
If you stress this VTC enough, the last line may fail because 11 threads are accounted for. Relying on the queue length instead opens the gates and increases the opportunity for over-breeding; in some cases I observed 50 threads created for one acceptor task. My suggestion is that in addition to governing thread creation by queue length, new threads should start as idle and go to sleep if the queue is empty. Alternatively, we can dismiss over-breeding as a problem since we technically always stay within the min/max bounds. |
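For reference, the increment-then-wipe pattern described above can be sketched roughly as follows; the function names and locking details are approximations of what the commit message later in this thread describes, not the real source:

#include <pthread.h>

struct pool {
    pthread_mutex_t mtx;
    pthread_cond_t  herder_cond;
    unsigned        dry;    /* dry signals accumulated so far */
    unsigned        nthr;
};

static void
pool_breed(struct pool *pp)
{
    pp->nthr++;    /* stub; the real code pthread_create()s one worker */
}

/* Task submission side: each time getidleworker finds no idle thread,
 * dry is incremented and the herder is signalled. */
static void
pool_no_idle_worker(struct pool *pp)
{
    pp->dry++;
    pthread_cond_signal(&pp->herder_cond);
}

/* Herder side: one thread is bred per wake-up and *all* accumulated dry
 * signals are wiped rather than decremented.  Several misses can thus
 * collapse into a single new thread (the lost wake-up @mbgrydeland's
 * patch addresses), and a miss racing with the wipe can leave the pool
 * one thread above thread_pool_min, the 11-thread failure shown above. */
static void
pool_herder_dry(struct pool *pp)
{
    if (pp->dry) {
        pool_breed(pp);
        pp->dry = 0;
    }
}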
I think there is more to it. I tried to make the Looking closer, I believe we need to always guard accesses to |
as discussed:
(14:50:47) slink: I would think that occasionally over/underbreeding would be better than introducing more lock contention?!
(14:52:47) dridi: that's my suggestion
(14:53:08) slink: ok, we agree. Is the patch ready as-is?
(14:53:23) dridi: we already have a window for over-breeding, but that window increases with 2942
(14:53:31) slink: yes
(14:53:42) slink: and I do not see an issue with that
(14:53:57) slink: I see that underbreeding is far worse than overbreeding
(14:54:06) dridi: if that's the consensus, then 2942 is ready in my opinion
(14:54:06) slink: and IIUC the PR fixes that
I still would like to give @mbgrydeland an opportunity to have a look before merging if that's fine with you. |
@mbgrydeland if this will be reworked, could we close the PR in the meantime? |
I had a quick look at the overbreeding effect that this patch introduces. It is the result of the herder not taking into account the threads it has already created: unless these new threads have already dequeued tasks by the herder's next loop iteration, it will continue to create more.

One way to fix this race would be for the herder to dequeue tasks as part of thread creation and hand each of them to the newly created thread as its first task, the thread becoming a regular pool worker once that task is finished. That way lqueue would be decremented by the herder itself, and it would have a complete and correct picture when deciding whether to create another thread. This solution would require some refactoring, though, and I don't really like it.

A simpler approach would be for the herder to keep track of how many newly created threads have not yet reached the first task dequeue point in Pool_Work_Thread(). The herder does newborn++ in pool_breed, and Pool_Work_Thread does newborn-- once and only once. The breed condition would then change from 'lqueue > 0' to 'lqueue - newborn > 0'. Some locking changes would also be needed for the tests of lqueue and newborn, instead of testing them unlocked as the current patch does. |
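A minimal sketch of that newborn idea, with illustrative names rather than a drop-in patch; the point is that lqueue and newborn are only touched under pp->mtx, and the breed condition becomes "more queued tasks than threads still on their way to the dequeue point":

#include <assert.h>
#include <pthread.h>

struct pool {
    pthread_mutex_t mtx;
    unsigned        lqueue;     /* tasks waiting in the queue */
    unsigned        newborn;    /* threads bred, not yet dequeuing */
    unsigned        nthr;
};

static void
pool_breed(struct pool *pp)
{
    pp->nthr++;    /* stub; the real code pthread_create()s one worker */
}

/* Herder side: only breed for demand not already covered by a freshly
 * created thread.  'lqueue > newborn' is the unsigned-safe spelling of
 * 'lqueue - newborn > 0'. */
static void
pool_herder_maybe_breed(struct pool *pp, unsigned wthread_max)
{
    pthread_mutex_lock(&pp->mtx);
    if (pp->lqueue > pp->newborn && pp->nthr < wthread_max) {
        pp->newborn++;
        pool_breed(pp);
    }
    pthread_mutex_unlock(&pp->mtx);
}

/* Worker side: the first time a new thread reaches the dequeue point in
 * Pool_Work_Thread(), it stops being a newborn -- once and only once. */
static void
pool_work_thread_first_dequeue(struct pool *pp)
{
    pthread_mutex_lock(&pp->mtx);
    assert(pp->newborn > 0);
    pp->newborn--;
    pthread_mutex_unlock(&pp->mtx);
}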
This is similar to my experiment, except that I went for How do we proceed? And who does the implementation? (@mbgrydeland, please tell me you are planning to do this =) |
This change fixes a serious underbreeding problem (which, in fact, I ran into just yesterday). So, back to first principles, why is overbreeding a problem? |
Not a problem per se, but something that is more likely to happen when the system is busy than when it is not, which would make the system even busier and over-breeding even more likely. |
One reason is also that we should avoid creating spurious VSC spikes that may result in puzzling graphs or misleading alerts from monitoring systems. |
I see your points, but in this case I really lean towards avoiding the additional complications, and, in particular, the additional If there are no other objections, I will go with this. But my personal point of view is that avoiding this rather exceptional situation, mostly seen with artificial testing, should not be handled with additional overhead. |
Abstaining from voting on the new overbreeding-avoidance code, as explained.
@mbgrydeland @dridi To make progress here, because, as mentioned, this PR is important: can we maybe split out the overbreed-avoidance and agree on the patch without the last commit first? |
I haven't been involved in all the discussions on this and maybe I misunderstood the previous comments, but I thought sched_yield() was found to not solve the overbreeding problem? If it does, then I have no problem using that instead of the proposed patch. On the pp->mtx concern, I do not think one more lock/unlock per MIN(thread_add_delay, thread_timeout) is a problem. |
@mbgrydeland my concern regarding Also, please rebase on master to see the |
When getidleworker signals the pool herder to spawn a thread, it increases the dry counter, and the herder resets dry again when spawning a single thread. This will in many cases only create a single thread even though the herder was signaled dry multiple times, and may cause a situation where the cache acceptor is queued and no new thread is created. Together with long lasting tasks (ie endless pipelines), and all other tasks having higher priority, this will prevent the cache acceptor from being rescheduled. c00096.vtc demonstrates how this can lock up. To fix this, spawn threads if we have queued tasks and we are below the thread maximum level.
Fix a small memory leak when failing to spawn a new thread.
Force-pushed from dc69b44 to 51a2f89
Force update: rebased on master and removed the overbreeding patch |
I think this PR is good to go at this point. Marking for bugwash for that purpose. |
Spotted on CircleCI:
---- v1 Not true: MAIN.threads (26) == 10 (10)
Refs #2942
Since the removal of dry signals, pools will spin when they run out of threads and increment MAIN.threads_limited at a very high rate. That spike in CPU consumption will also have detrimental effects on useful tasks.

This change introduces a 1s delay when the pool is saturated. This allows the watchdog check to be exercised periodically. We could wait until the watchdog times out, but we would also miss potential updates to the thread_pool_watchdog parameter.

To avoid spurious increments of MAIN.threads_limited we only take task submissions into account and ignore timeouts of the condvar.

Refs #2942
Refs #3531
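As a rough illustration of what that commit describes, using plain pthreads rather than Varnish's locking wrappers; names and details are approximations:

#include <errno.h>
#include <pthread.h>
#include <time.h>

struct pool {
    pthread_mutex_t mtx;
    pthread_cond_t  herder_cond;
    unsigned        lqueue;
    unsigned        nthr;
    unsigned        threads_limited;    /* stand-in for the VSC counter */
};

/* When tasks are queued but the pool already sits at thread_pool_max,
 * sleep on the herder condvar for at most one second instead of
 * spinning.  A wake-up caused by an actual task submission counts as
 * threads_limited; a plain timeout does not, avoiding the spurious
 * increments mentioned above while still waking up often enough to run
 * the watchdog check and pick up thread_pool_watchdog changes. */
static void
pool_herder_saturated(struct pool *pp, unsigned wthread_max)
{
    struct timespec ts;

    pthread_mutex_lock(&pp->mtx);
    if (pp->lqueue > 0 && pp->nthr >= wthread_max) {
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 1;
        if (pthread_cond_timedwait(&pp->herder_cond, &pp->mtx, &ts) != ETIMEDOUT)
            pp->threads_limited++;
    }
    pthread_mutex_unlock(&pp->mtx);
}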
When getidleworker signals the pool herder to spawn a thread, it increases the dry counter, and the herder resets dry again when spawning a single thread. This will in many cases only create a single thread even though the herder was signaled dry multiple times, and may cause a situation where the cache acceptor is queued and no new thread is created. Together with long lasting tasks (ie endless pipelines), and all other tasks having higher priority, this will prevent the cache acceptor from being rescheduled.
c00096.vtc demonstrates how this can lock up.
To fix this, spawn threads if we have queued tasks and we are below the thread maximum level.
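Put together, the change in this PR amounts to the herder deciding on queue length rather than on dry signals. A rough, simplified sketch of the new breed condition (again not the actual cache_wrk.c code):

#include <pthread.h>

struct pool {
    pthread_mutex_t mtx;
    pthread_cond_t  herder_cond;
    unsigned        lqueue;    /* queued tasks, including a waiting acceptor */
    unsigned        nthr;
};

static void
pool_breed(struct pool *pp)
{
    pp->nthr++;    /* stub; the real code pthread_create()s one worker */
}

/* New herder condition: as long as tasks are queued and the pool is
 * below thread_pool_max, keep breeding.  The queue length itself is the
 * demand signal, so a queued acceptor task can no longer be starved by
 * a lost dry signal.  (Waiting between iterations, thread_pool_timeout
 * culling and the locking details are omitted here.) */
static void
pool_herder_iteration(struct pool *pp, unsigned wthread_min, unsigned wthread_max)
{
    if (pp->nthr < wthread_min ||
        (pp->lqueue > 0 && pp->nthr < wthread_max))
        pool_breed(pp);
}

Compared with the dry-based sketch earlier in the thread, the only difference is which demand signal is consulted: lqueue only drops when a task is actually dequeued, so pending demand cannot be wiped away, at the cost of the over-breeding window discussed above.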