Successive HTTP requests in lws 2.4 exhausting server ah pool #1249
Comments
What happens if you just increase the amount of ah available with max_http_header_pool to suit your situation? 4 (the default) has no special meaning.
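For reference, the pool size is set at context-creation time. A minimal sketch (assumes `#include <libwebsockets.h>` and a `protocols` table defined elsewhere; the value 16 is just an illustrative choice, not a recommendation):

```c
struct lws_context_creation_info info;
struct lws_context *context;

memset(&info, 0, sizeof info);
info.port = 7681;
info.protocols = protocols;       /* your protocol table */
info.max_http_header_pool = 16;   /* raise from the default to cover
                                     concurrent browser connections */

context = lws_create_context(&info);
```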
|
Increasing the max_http_header_pool to a value bigger than the number of requests works around the problem as well. But I think that's just hiding the fact that resources are somehow not being freed up correctly. |
I see. Well, I will try to reproduce it tomorrow. attack.sh which is in the tree does pretty much what you describe and I tested v2.4 with that without problems before its original release. What are you using to do the tests, also lws or something else? |
Thanks for your help. Yes, the unit tests are using lws for both the client and server side. This can be part of the reason, but your question got me thinking. I just tested again our server inside a more complex application with Firefox as a client, and observed similar issues of ah pool exhaustion (since lws 2.4 only). I'm sure that the reference lws server has been well tested so there must be a subtle difference in our server code that leads to this behavior since v2.4; Unfortunately, I've been looking into this for a long time already and I really can't figure out what that could be. In case you would like to run the tests I've been using:
|
Hm, the fact that there's no lws logging cited and, if I understand it, no problem with the lws test apps either, along with doing bisect-type a/b testing on lws as "the problem" without actually identifying what the problem really is, reminds me of other 'issues' where the problem turned out to be that the http callback code cut-and-pasted from lws samples into the user app doesn't deal with the end of http transactions properly. Http2 requires this to work, because a single connection may carry many ongoing and completed transactions; even http/1.1 pipelining needs it to work. In v2.4 there's a 'dummy' callback exported that you can either use with mounts and rely on completely to get this boilerplate right for you, or have your callback call through to. If this code came from an older lws, you should review it against the http stuff in the dummy callback in lib/context.c and/or the same code in the v2.4 test app. |
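For illustration, the end-of-transaction boilerplate being referred to might look like the following non-runnable sketch (modeled loosely on the dummy callback's http handling; `all_sent` is a hypothetical flag standing in for the app's own send-state tracking, and the actual buffer handling is elided):

```c
case LWS_CALLBACK_HTTP_WRITEABLE:
    /* ...write the next chunk of the response body here... */
    if (all_sent) {
        /* lws_http_transaction_completed() returns nonzero if the
         * connection cannot be kept alive for the next transaction */
        if (lws_http_transaction_completed(wsi))
            return -1;   /* close the connection */
        return 0;        /* keep it open for pipelined requests */
    }
    lws_callback_on_writable(wsi);   /* more body still to send */
    return 0;
```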
Fair enough, I'll try and write down a simpler reproducer from scratch this week. Hopefully it will turn out that I missed something obvious. Thank you for the pointers, but excluding the CGI/PROXY parts (which we're not using), the dummy callback does little else than:
I'm not seeing any relevant changes in the examples for v2.4 either yet, but I'll keep looking. |
...and...
https://github.com/warmcat/libwebsockets/blob/master/lib/context.c#L301 |
Correct, I checked that as well but this callback is never occurring as we're not using the file serve built-in functionality. The http_callback is the following:
Note that the POLL_FD part is irrelevant in this unit test case and can be commented out. All the functions for the HTTP callbacks (except HTTP_BODY) return -1 on lws_http_transaction_completed() after writing the response. Except that since v2.4, lws_http_transaction_completed() no longer closes the connection. For what it's worth, here is a diff between the output of a single test testing the [GET, POST, PUT, PATCH, DELETE, OPTIONS] methods in order, comparing v2.3-stable and v2.4-stable: But I'll try to work on a minimal, self-contained code example that can be used to debug this. |
If it works on the test apps, it sounds like it requires something specifically your code is doing different. |
I'll be happy to look at a sample showing the problem. |
Sure, please see below. Hopefully you can point me in the right direction, because I'm at a loss for ideas. Here is a minimal HTTP server that I put together by looking at test-server-http.c and the newer minimal-http-server examples.
|
@lws-team have you had a chance to look at the test code I posted above? |
Sorry these things eat up a lot of time to understand. So your test app is a bit unusual... I guess you are trying to confirm the ah wait queue processing which is a Good Thing. But your test procedure doesn't do what you think it does.
With Chrome 67, hitting
Then a request comes on wsi 0xfc60c0 (source port localhost.50096)
We service it (this is all on wsi 0xfc60c0)
We see there is no pipelined request right now for wsi 0xfc60c0, so we give up the ah and have wsi 0xfc60c0 wait a bit in case something new comes. The ah detach processing passes the (only) ah to the last (first requesting) guy in the ah wait list, wsi 0xfc7960
wsi 0xfc7960 has nothing waiting for him at all; he never gets sent any headers by the browser. So he just sits there holding the only ah. Meanwhile the browser processed the JS and decides to send the XHR on our old friend wsi 0xfc60c0, as a pipelined transaction.
Since we gave the only ah up to this other speculative browser connection, we can't do anything with it but put it on the ah list.
So then we are stuck until the unused speculative connection with our only ah times out... we sit there for 6s.
The browser retries on a different connection (either fresh or localhost.50100 was one of the speculative connections created at the start, tcpdump didn't catch it being established either way)
... and we service it finally etc...
So this test with an ah limit of 4 is unreasonable if the client is going to open 5 connections speculatively and leave some of them idle. You can test the pipelining meaningfully like this (the /dev/zero is just to keep nc alive until the server finishes sending everything)
or using the new minimal example that knows how to keep itself open long enough
On master, the ah limit defaults to the number of fds the process has... for http/2, restricting ah is much more difficult and we don't support it yet (if ever). For h2, all of the simultaneous streams share one tcp connection and it is assumed you can always fire up a new stream and send headers on it, up to the general limit of streams-per-connection. So the ability to manage ah is not really there like it is for h1.... you need to allow for enough ah for as many streams as you expect. |
Thank you for this in-depth analysis, I appreciate that you're taking time to look into this. My example may indeed have been slightly too complex; I just wanted to see if issuing more http requests (6) than available ah (4) would work without blocking the server, which indeed works for both v2.3 and v2.4. I hoped this was a good test of TCP connection reuse in http1.1. Doing as you suggest, I have now restricted everything to just 1 extra http request for 1 image and 1 available ah. With both Firefox 59 and Chrome 63 (which are the versions I have at hand on Ubuntu 16.04), after pressing Ctrl-Shift-R once I get the same output for v2.4-stable:
So you're right, I wonder how you got to see 5 TCP connections... :-) Anyway, if I press Ctrl-Shift-R twice in Firefox 59 the server blocks with v2.4-stable (matching your observations):
Whereas I could reload an infinite number of times with v2.3-stable. In conclusion, if I understand correctly, you suggest fixing the problem by using a larger ah pool with v2.4. Since there is no way of determining in advance how many clients will connect and how many requests each one of them will be making, that means setting an arbitrarily high limit, right? I logically assumed it was a regression in the http1 support... but seeing your last comment, you seem to present it as a necessary evolution to support http2, and the ah limit has been removed by default (max_http_header_pool=0) in the latest master branch (v3.0.0 :-)). Is that right? |
No doubt what the browsers choose to do is dependent on the exact version; chrome 67 that I have acts different than your chrome 63. But if there's even one speculative connection that is idle, and we only have one ah to play with, it will block until the connection times out and the browser retries. That isn't an "lws thing" it's just what happens under the test conditions.
I don't have time to look at it but I think I will find v2.3 using more ah-es or otherwise acting differently. v2.4 + master are doing what was asked AFAICS.
No... I suggest testing it with the exact test I pasted in my previous reply, where you know exactly what stimulus you have given it, how many connections it uses, and how many ah it uses, so you are testing what you think you are. For normal use testing, holding down Ctrl-R is a fine test on $BROWSER. But under the constraint that there are only one or four ah to go around, as I showed in my previous post, that is not workable with every browser, which can just choose to broadside 5 connections in the Chrome 67 case. If you want to confirm it, install Chrome 67 :-) For those tests, there is no point restricting the ah like that with a browser.
No... there are two separate issues. 1a) Is the ah limit high enough that you can serve ANYTHING to Chrome 67? It must be above 5 by the look of it. 1b) If you are running http/2, where the stream limit is say 64 per connection, can you handle the number of streams it will broadside to get assets? If you have a lot of pictures in your html, the browser may reasonably decide it wants to load 16 streams in parallel, each with headers and each requiring an ah, so 16 ah. It might go up to 64 theoretically, or whatever you set as the stream limit. So you must cut your ah limit to that cloth even to service one browser client with your busy page. (The ah are malloc'd up on demand and deleted after use if not needed.) We can't play deferment games with the ah on h2, because there is one serialized connection with all the streams muxed onto it, and it is blocked by data we don't read. If we just stop reading the connection until another ah appears, we cannot get, e.g., tx credit updates that allow us to send more, and we just stall. So we must always go forward with h2.
By default the ah limit == the fd limit (set by ulimit -n) now. It is not "unlimited". Like I say the memory load is perhaps not as burdensome as you are thinking... |
OK I see, then I will increase the ah limit to 1024 when compiling against lws 2.4, which should solve my problem in practice. The default limit was simply no longer sufficient with the new behavior. Thanks again for your help. For the record, here is the difference between 2.3 and 2.4, calling lws-minimal-raw-netcat twice in a row. First call with 2.4:
Second call with 2.4:
With v2.3-stable:
Possibly the key difference is that it looks like the images are not served... matching server output:
|
Trying it against master... which you don't mention... there is no problem running the 1 x ah test as much as you like with the minimal netcat... ie, it's good. If you want a fix for 2.4 I think you understood what that is about when you said
If you want provide a patch (just paste in a comment is OK) aligning 2.4 to what I did here on master if it fixes it as I assume it'll be super welcome. |
Oh wow, indeed it works! That's great news 👍 |
OK, this patch fixes everything for me: #1265 Also, I think there is a typo here: wsi->ws->rx_draining_ext || wsi->ws->rx_draining_ext |
Increase the pool size to 1024 for lws < 3.0 as discussed here: warmcat/libwebsockets#1249 There was no point in restricting the pool size, memory is only allocated as-needed, and should reach max 5MB. Since lws 3.0 the default ah limit == fd limit (typically 1024 as well).
Oh good find... I modified it accordingly on master. |
We are facing an issue with lws 2.4 in unit tests that validate the HTTP client-server communication of our C++ wrapper library. Our tests create a lws client-server pair to make a succession of simple HTTP requests (no SSL or http/2) from the client to the server.
Until lws 2.3, the test cases execute extremely fast (<< 1 sec). But since precisely this commit introducing http2 support: 904a9c0 , execution is very slow (~1 min) and the observed behavior of the unit tests is something like the following:
This keeps happening with v2.4.1, v2.4.2 and v2.4-stable.
Enabling libwebsocket's log output gave a useful indication:
Indeed, it turns out that the ah pool of the lws server becomes exhausted after max_http_header_pool (default=5) previous HTTP requests have been made by the client. Execution continues after some internal timeout (?) frees the resources of the previous connections.
More debugging allowed us to locate the exact change causing the connections to no longer be closed. It is the following 4 lines (2610-2613) in lib/server.c: 904a9c0#diff-9621bd8daa5d0a52d5af514af312fbaaR2610
Reverting back to the previous behavior of "goto fail;" here fixes the problem, although we don't know enough about libwebsockets internals to suggest this as a proper fix.
As a workaround, we observed that forcing a return -1; from LWS_CALLBACK_HTTP_WRITEABLE in the server code also fixes the problem. However, bypassing the lws_http_transaction_completed() check as is done in test-server-http.c is unlikely to be the correct solution. We were otherwise unable to identify an appropriate change in the client code based on the existing examples and documentation. Even recreating the client context with lws_context_destroy() + lws_create_context() before each lws_client_connect_via_info() does not make a difference.
So the question is:
a) Could the change in behavior described above be caused by a regression in libwebsockets, or:
b) Is there anything specific to do in either the client or server code to properly close the HTTP connection(s) starting from v2.4?