Thread leak? #2623

Closed
xcir opened this issue Mar 20, 2018 · 24 comments

xcir commented Mar 20, 2018

[graph: thread count over time]
ESI: enabled, HTTP/2: enabled
I disabled HTTP/2 after noticing the thread count increasing (only Hitch was restarted), but the count still seems to increase gradually.

[graph: thread count over time]
This is another instance (ESI: enabled + HTTP/2: enabled).

[graph: thread count over time]
ESI: disabled, HTTP/2: enabled
This instance is not using ESI. The thread count did not increase.

Your Environment

  • Version used: Varnish 6.0.0 + Hitch 1.4.6
  • Ubuntu 16.04 (4.13.0-19-generic)
xcir commented Mar 20, 2018

I'm testing other conditions (ESI + HTTP/1 + Varnish restart), but it will take a few days for the results to come out.

xcir commented Mar 23, 2018

ESI: enabled + h2: disabled + Varnish restart
-> No increase.

ESI: disabled + h2: enabled
-> After my last post, the thread count increased on a few instances.
[graph: thread count over time]
This graph (ESI: disabled + h2: enabled) covers 5 days of running.

I think this issue is in the HTTP/2 code.

xcir commented Mar 26, 2018

Hitch was stopped; only health-check requests (HTTP) remained.
The instance was removed from the cluster at 20:30 (JST).

[graph: thread count over time]
ESI: enabled, h2: enabled, thread_pools: 2, thread_pool_min: 200, MAIN.threads: 527

[graph: thread count over time]
ESI: disabled, h2: enabled, thread_pools: 2, thread_pool_min: 150, MAIN.threads: 344

xcir commented Mar 27, 2018

All times are UTC+0900 (JST).

[graph: custom counters over time]
I added some counters for this check:
master...1c804ea
nidle_delta = nidle_incr - nidle_decr (idle thread count)
nidle_active = MAIN.threads - nidle_delta (active thread count)
h2_refcnt_delta = h2_refcnt_incr - h2_refcnt_decr (active h2 stream count)
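
As a minimal sketch, the derivations amount to the following (the helper names here are illustrative, not from the branch; sampling of the raw counters is omitted):

```c
#include <stdint.h>

/* Derived gauges from the monotonic counters added in the branch. */
static uint64_t
idle_threads(uint64_t nidle_incr, uint64_t nidle_decr)
{
	return (nidle_incr - nidle_decr);	/* nidle_delta */
}

static uint64_t
active_threads(uint64_t threads, uint64_t nidle_incr, uint64_t nidle_decr)
{
	return (threads - idle_threads(nidle_incr, nidle_decr));	/* nidle_active */
}
```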

h2_refcnt_delta and nidle_active increased simultaneously at 10:40-11:00.
This instance was removed from the cluster (and Hitch restarted) at 17:00-17:30, but h2_refcnt_delta did not decrease by much.

xcir commented Apr 6, 2018

I'm bisecting to find the commit that causes this...

Leaked:
0d3044d
[graph: thread count over time]
https://github.com/xcir/varnish-cache/tree/tmp/snap/1003-0d3044d

Not leaked:
3a86dd8
[graph: thread count over time]
https://github.com/xcir/varnish-cache/tree/tmp/snap/0817-3a86dd8

From 00:50 (JST): removed the instance, set thread_pool_timeout=10 and thread_pool_min=10.

xcir commented Apr 10, 2018

I think the thread leak began with 0d3044d.

@daghf daghf self-assigned this Apr 30, 2018
@Francois-v3

I also have a thread leak under Ubuntu 16.04 (4.4.0-98-generic) with Varnish 6.0.0 and HAProxy 1.8 with h2 enabled. The same config with very little h2 traffic shows no leak.
[graph: thread count over time]
Threads go back down when the child restarts...

@Francois-v3

In addition, I observe that by changing the communication between HAProxy and Varnish from TCP to UDS, the thread leak becomes at least very small:
[graph: thread count over time]

@Francois-v3

Finally, I still have a thread leak, but only above 2500-3000 requests per second.
[graph: thread count over time]


ehocdet commented May 24, 2018

We also have a thread leak since h2 was activated.
With frequent vcl load/discard we see a lot of VCLs stalled in the "discarded cold/busy" state.

n_vcl grows with h2 enabled and stops growing without it. The n_vcl metric on 4 servers:
[graph: n_vcl over time]


daghf commented May 30, 2018

Hey

I've now pushed fixes for two separate potential thread leak situations (a56469c and 4b73c5f).

The first relates to failing POST bodies, and the second to our handling of PRIORITY frames.

Could I ask you to please give master a spin to see if you still have a thread leak?


xcir commented May 31, 2018

I'm testing Varnish 6.0.0 plus the bin/varnishd/http2/* changes.
Please tell me if you need any other commits included.
varnish-6.0.0...xcir:tmp/snap/v600-apply-h2commit


xcir commented May 31, 2018

[graph: thread count over time]
varnish-6.0.0...xcir:b418fe871acc1798026f86e10904c01807403d8f
From 16:45 (JST): removed the instance, set thread_pool_timeout=10 and thread_pool_min=10.

Leaked...

@daghf daghf added the c=H/2 label May 31, 2018

ehocdet commented Jun 1, 2018

Testing with Varnish 6.0.0 and all the h2 changes from master: same issue.
Leaks seen via stalled VCLs (n_vcl):
[graph: n_vcl over time]


daghf commented Jun 1, 2018

Thanks for the updates - looks like we still have a thread leak situation happening somewhere. I'll keep looking.


daghf commented Jun 15, 2018

I've plugged yet another thread leak, this one also relating to request bodies (f690631).

If we failed to schedule a thread for a stream, the h/2 session thread would get stuck indefinitely if a DATA frame was received on that stream prior to us cleaning it up.

Since the h/2 session thread is also responsible for waking up any streams currently waiting for window updates (since 0d3044d), those threads would also be leaked, exacerbating the situation.
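
For illustration, a minimal pthread sketch of that stuck-session scenario (simplified types and names, not the actual Varnish code): the session thread parks on the shared condvar waiting for the DATA frame to be consumed, but no stream thread was ever scheduled to consume it.

```c
#include <pthread.h>

struct h2_sess_sketch {
	pthread_mutex_t	mtx;
	pthread_cond_t	cond;		/* shared for DATA handover */
	int		data_pending;
};

/* Session (rx) thread: hand a DATA frame to the stream thread. If the
 * stream thread was never scheduled, data_pending is never cleared and
 * we block here indefinitely. */
static void
rx_data_handover(struct h2_sess_sketch *h2)
{
	pthread_mutex_lock(&h2->mtx);
	h2->data_pending = 1;
	pthread_cond_signal(&h2->cond);
	while (h2->data_pending)
		pthread_cond_wait(&h2->cond, &h2->mtx);
	pthread_mutex_unlock(&h2->mtx);
}
```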

Please test 👍


xcir commented Jun 18, 2018

[graph: thread count over time]

c7c7f52 + added metrics
From 18:20 (JST): removed the instance, set thread_pool_timeout=10 and thread_pool_min=10.

Leaked...

xcir commented Jun 18, 2018


daghf commented Jun 19, 2018

Thanks a lot @xcir, that was hugely helpful.

I'm pretty confident I've gotten to the bottom of this now. Will push a fix asap.

@daghf daghf closed this as completed in c8bfae7 Jun 19, 2018

daghf commented Jun 19, 2018

I'm pretty sure this is fixed now, but please reopen if you still see problems.


xcir commented Jun 19, 2018

Leaked again...
[graph: thread count over time]
From 23:25 (JST): replaced Varnish (ed50971 -> 677320e).
From 02:15 (JST): removed the instance, set thread_pool_timeout=10 and thread_pool_min=10.

backtrace
https://gist.github.com/xcir/6d0dc200a11ba987f53a770a1f80ff3f

backtrace (h2=0x7f63194affc0)
https://gist.github.com/xcir/bbf50720fc2fa14038f009c54c37dd19

@daghf daghf reopened this Jun 20, 2018
@daghf daghf closed this as completed in ace8e8f Jun 22, 2018
Dridi pushed a commit to Dridi/varnish-cache that referenced this issue Jun 22, 2018
Previously we've been incorrectly transitioning to CLOS_REM on
END_HEADERS, which prevents us from seeing if a client incorrectly
transmitted a DATA frame on a closed stream.

This slightly complicates things in that we can now be in state OPEN
with an inactive hpack decoding state, and we need to check on
cleanup whether that has already been finalized.

This would be simpler if the h/2 spec had split the OPEN state in two
parts, with an extra state transition on END_HEADERS.

Again big thanks to @xcir for his help in diagnosing this.

Fixes: varnishcache#2623
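
A minimal sketch of the corrected transition (illustrative names, not the actual Varnish code): remote half-close happens only on END_STREAM, so a DATA frame on a half-closed stream is detected as a protocol error.

```c
enum h2_state_sketch { S_IDLE, S_OPEN, S_CLOS_REM, S_CLOSED };

struct h2_stream_sketch {
	enum h2_state_sketch	state;
};

#define F_END_STREAM	0x01u

/* END_HEADERS alone leaves the stream OPEN; only END_STREAM moves it
 * to CLOS_REM (half-closed, remote). */
static void
headers_complete(struct h2_stream_sketch *sp, unsigned flags)
{
	sp->state = (flags & F_END_STREAM) ? S_CLOS_REM : S_OPEN;
}

/* A DATA frame is only legal while the stream is OPEN. */
static int
rx_data(const struct h2_stream_sketch *sp)
{
	return (sp->state == S_OPEN ? 0 : -1);	/* -1: protocol error */
}
```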

xcir commented Jun 23, 2018

[graph: thread count over time]

It leaked again...

backtrace + VSL
https://gist.github.com/xcir/336098bb0d2a93552d232ff4516f90a1

working Varnish (tmp/snap/0618-trunk):
xcir@d2378c9

@daghf daghf reopened this Jun 26, 2018
@daghf daghf closed this as completed in 51e6ded Jun 27, 2018
Dridi pushed a commit to Dridi/varnish-cache that referenced this issue Jun 27, 2018
The current flow control code's use of h2->cond is racy.

h2->cond is already used for handing over a DATA frame to a stream
thread. In the event that we have both streams waiting on this condvar
for window updates and at the same time the rxthread gets signaled for a
DATA frame, we could end up waking up the wrong thread and the rxthread
gets stuck forever.

This commit addresses this by using a separate condvar for window
updates.

An alternative would be to always issue a broadcast on h2->cond instead
of signal, but I found this approach much cleaner.

Probably fixes: varnishcache#2623
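
A minimal pthread sketch of the fix (illustrative names and simplified state, not the actual Varnish code): window-update waiters get their own condvar, so signaling a DATA handover can no longer wake the wrong thread.

```c
#include <pthread.h>

struct h2_flow_sketch {
	pthread_mutex_t	mtx;
	pthread_cond_t	cond;		/* DATA-frame handover only */
	pthread_cond_t	winup_cond;	/* window updates (the fix) */
	long		window;
};

/* Stream thread: block until the send window opens. */
static void
wait_for_window(struct h2_flow_sketch *h2)
{
	pthread_mutex_lock(&h2->mtx);
	while (h2->window <= 0)
		pthread_cond_wait(&h2->winup_cond, &h2->mtx);
	pthread_mutex_unlock(&h2->mtx);
}

/* Session (rx) thread: a WINDOW_UPDATE frame arrived. */
static void
rx_window_update(struct h2_flow_sketch *h2, long incr)
{
	pthread_mutex_lock(&h2->mtx);
	h2->window += incr;
	pthread_cond_broadcast(&h2->winup_cond);
	pthread_mutex_unlock(&h2->mtx);
}
```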

xcir commented Jul 1, 2018

I've been running commit 812b136 for 4 days. No leak!
I think this issue is fixed.

dmatetelki pushed a commit to dmatetelki/varnish-cache that referenced this issue Mar 14, 2019
If a client mistakenly sent a DATA frame on a stream where it already
transmitted an END_STREAM, it would lead to the rxthread sitting around
indefinitely.

Big thanks to @xcir for his help in diagnosing this.

Fixes: varnishcache#2623