Connection does not timeout after sending data on broken connection when using TCP keepalive #1223
What platform is this? The unix plat code looks like this
i.e., if you are not on a Linux platform, I actually don't know how to turn it on and off per-socket. Is it Linux? |
Yes, it's Linux. The keep alive messages are not an issue, but TCP keep alive messages are only sent on an idle connection. This is from RFC 1122, page 100:
This works fine as you can see in the following Wireshark log. However when data has been sent to the disconnected target, the TCP Keep-Alive packet will not be sent but TCP Retransmission packets will be sent as shown in the next Wireshark log. This thread on stackoverflow might give some more perspective: https://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux |
I kind of understood, but although I am happy you have a lot of faith in me, what are you asking me to do about it... fix the kernel network stack? Or there is something at the userland layer it seems lws is getting wrong to cause this outcome? |
So are you telling me that there is no way to detect a connection loss in less than 15 min, other than implementing a heart beat on top of the WebSocket protocol? |
No, I asked you:
It calls a few apis and the kernel does the rest. The options for lws to get something wrong are very small (although, that has never stopped us before). So either, a) the problem is somewhere else, or b) the code we are talking about is the stuff I pasted above. Would you perhaps like to try debugging those 20 lines or so? Maybe check optlen is sane since that is a bit strange in the api. As far as I know there aren't any other buttons to press when it comes to tcp keepalive, those lines are more or less what you're supposed to do to enable it and that's it. If they get executed then it's supposed to be operational. |
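For readers following along, here is a minimal sketch of what "enable TCP keepalive per socket" usually looks like on Linux. It is not the lws code referred to above; the function names `enable_keepalive` and `read_int_opt` are made up for illustration. The read-back helper shows the in/out `optlen` handling of `getsockopt()`, which is the part of the API that is easy to get wrong.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of lws): enable TCP keepalive on fd using
 * the Linux-specific option names.
 *   idle_s  - idle seconds before the first probe is sent
 *   intvl_s - seconds between probes
 *   probes  - unanswered probes before the kernel drops the connection
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
		return -1;

	return 0;
}

/*
 * Hypothetical helper: read an int-valued socket option back.
 * optlen is in/out for getsockopt(): it must hold the buffer size on entry
 * and holds the size actually written on return.
 * Returns 0 on success, -1 on failure or unexpected option size.
 */
int read_int_opt(int fd, int level, int name, int *out)
{
	socklen_t optlen = sizeof(*out);

	if (getsockopt(fd, level, name, out, &optlen) < 0)
		return -1;

	return optlen == sizeof(*out) ? 0 : -1;
}
```

Calling `enable_keepalive(fd, 2, 2, 2)` on a connected socket and then reading `TCP_KEEPIDLE` back with `read_int_opt()` is a quick way to confirm the options were actually applied.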
As I said, there is no issue with the TCP keep alive, this is working as expected on idle connections. My issue is that the TCP keep alive messages are not used in case of data retransmissions (which is totally according to the RFC) and therefore the connection is only disconnected after 15 min. As far as I can tell, there are two ways to control this:
The second option seems to be the preferred option, however I cannot find how this option can be set using libwebsockets. |
The entire code around keepalive in lws is what I pasted earlier. If you want to test and send a patch or PR on lws that solves your problem I'll be happy to see it. |
Sorry, I don't have time to create a decent patch at this moment. |
I don't need a 'decent patch'... if you have a solution just paste the relevant part here and I will integrate it. |
I found out this problem also occurs when using "timeout_secs" and "ws_ping_pong_interval" instead of TCP keep alive. When sending data on a physically disconnected connection, libwebsockets will execute lws_restart_ws_ping_pong_timer where time_next_ping_check is set. As a result there will be no ping pong timeout although no data can be sent to the remote device and no data is received. I don't have a solution or patch for this issue (at least I believe this is an issue). In case anyone wants to reproduce:
|
... hmmm all the sockets in lws are nonblocking. So there shouldn't be any way to block to the point that the event loop stops. I'll try your recipe and see what happens here. |
I tried it like this:
a) On master, apply this diff to follow your steps and to add some debug so I can see what goes on
b) build lws and then the test app
c) run the test app
d) connect by Android tablet on the same wlan segment using Firefox
e) disable wifi on the tablet
...I disable the wlan around here...
He does not stop going around the event loop. Timeout reason 16 that it closes on is This is acting the same for you, or something else? Because this seems OK to me. |
In the ~1Hz timeout processing stuff, if he decides it's time to send the ping, he sets the related timeout at that time. lib/service.c:1382 (on master)
Even if the socket never becomes writeable due to whatever, the timeout should still fire and close it. Does this act any different for you? Do you maybe have something countermanding the timeout? |
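To make the two behaviours under discussion concrete, here is a small sketch of a ping/pong liveness check. It is not the lws implementation: `struct hb_state`, `hb_check` and `hb_on_rx` are hypothetical names. The sketch assumes the design where the pong timeout is armed at the moment the ping is decided on, and where only received data (never sent data) counts as proof that the peer is alive.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-connection heartbeat state; not the lws internals. */
struct hb_state {
	time_t last_rx;       /* last time any data arrived from the peer */
	time_t ping_sent_at;  /* valid only while awaiting_pong is true */
	bool awaiting_pong;   /* a ping is outstanding */
};

enum hb_action { HB_NONE, HB_SEND_PING, HB_CLOSE };

/*
 * Called roughly once per second from the event loop.
 * The pong timeout is armed when the ping is decided on, so it still fires
 * even if the socket never becomes writeable again.
 */
enum hb_action hb_check(struct hb_state *hb, time_t now,
			int ping_interval_s, int pong_timeout_s)
{
	if (hb->awaiting_pong) {
		if (now - hb->ping_sent_at >= pong_timeout_s)
			return HB_CLOSE;
		return HB_NONE;
	}

	if (now - hb->last_rx >= ping_interval_s) {
		hb->ping_sent_at = now;
		hb->awaiting_pong = true;
		return HB_SEND_PING;
	}

	return HB_NONE;
}

/*
 * Call when data (including a pong) is received from the peer.
 * There is deliberately no equivalent hook for sent data: queuing a write
 * on a dead link says nothing about whether the peer is still there.
 */
void hb_on_rx(struct hb_state *hb, time_t now)
{
	hb->last_rx = now;
	hb->awaiting_pong = false;
}
```

With a 5 s ping interval and a 3 s pong timeout, a peer that goes silent at t=100 gets a ping at t=105 and is closed at t=108, regardless of how much data the application tries to send in between.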
The issue only happens for me when sending data to the client after it disconnected. I'll try to make time for some sample code tomorrow. |
My test from earlier isn't doing that... it's also disconnecting the client if that makes any difference. I'll also try this kind of test tomorrow. |
Hmm so now I have this diff for testing on master
With this, once you make a connection from a client (I used a laptop on the same subnet) he sends a message "ping" every 3s using a wsi timer. And after a few seconds, I pull out the only ethernet cable from the server side.
This timeout / close is the now idle http connection timing out and closing normally, the ws connection continues...
Again he times out on
So this is now following what was supposed to reproduce it AFAIK and it acts well. Can you try the same thing? |
Well, if you want to try the same tests I'm happy to continue looking at it if there is something to look at. |
Hi, I have created a sample client and server program to demonstrate the issue. You can find the diff file here: I used the following command to run cmake:
$ mkdir build && cd build
$ cmake .. -DLWS_WITHOUT_EXTENSIONS=1 -DLWS_WITH_SSL=0 -DLWS_WITH_ZLIB=0 -DLWS_WITHOUT_TESTAPPS=1 -DLWS_WITH_SHARED=0 -DLWS_WITH_ZIP_FOPS=0
You can set the IP address of the server in test-apps/keep-alive/main. Flow of the program:
When you disable sending data on the server (test-apps/keep-alive/Server.c line 71), the connection will timeout as expected. |
I also ran into the same issue:
|
Well I appreciate Matthias has sent some kind of reproducer. However Xmanmax's description makes me feel the same way as the last time he [1] mentioned it (I attempted to recreate it and failed)... lws is just a userland app in the end. Pointing at the kernel behaviours - these retries and such are coming from the kernel networking stack - and expecting me to "fix" them is not going to lead anywhere unless there is something that can be / should be done to solve it from userland side. If the userland Posix apis for sockets don't get told anything for some period then lws can't react... what else are you expecting? Anyway it is the end of my day where I am. I'll try Matthias' stuff tomorrow. Edit: [1] Sorry it was matthias... confused by basically the same screenshot. |
Hmmm... lws is just a userland app. These things are platform-specific. He suggests a workaround which boils down to something like this on v2.4-stable (the 5000 is in ms and can be computed from the existing KA params I think)
This makes the server understand the client has gone away for me... is this solving what everyone is talking about the same, at least for Linux? Also notice that lws supports |
I agree with this one. I guess optval should be (ka_time + (ka_probes * ka_interval)) in this case. |
Some additional checking should be added because I am building libwebsockets for an older linux kernel and this one does not support TCP_USER_TIMEOUT yet, therefore it cannot be compiled with these changes. |
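As a rough illustration of the two points above (the suggested formula for the timeout value and the need for a build-time check), the following sketch computes the value in milliseconds from the keepalive parameters and only touches `TCP_USER_TIMEOUT` when the headers define it. This is not the patch that went into lws; `ka_budget_ms` and `set_user_timeout` are hypothetical names, and the sketch assumes the keepalive parameters are given in seconds.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ka_time + (ka_probes * ka_interval), with the
 * keepalive parameters in seconds and the result in milliseconds, which is
 * the unit TCP_USER_TIMEOUT expects.
 */
unsigned int ka_budget_ms(unsigned int ka_time_s, unsigned int ka_probes,
			  unsigned int ka_interval_s)
{
	return (ka_time_s + ka_probes * ka_interval_s) * 1000u;
}

/*
 * Hypothetical helper: limit how long transmitted data may stay
 * unacknowledged before the kernel closes the connection.
 * Returns 0 on success, -1 if setsockopt() fails, and 1 if
 * TCP_USER_TIMEOUT is not available in the build environment.
 */
int set_user_timeout(int fd, unsigned int timeout_ms)
{
#if defined(TCP_USER_TIMEOUT)
	if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
		       &timeout_ms, sizeof(timeout_ms)) < 0)
		return -1;
	return 0;
#else
	/* older kernel headers: compile cleanly and report "not supported" */
	(void)fd;
	(void)timeout_ms;
	return 1;
#endif
}
```

For example, keepalive settings of 2 s idle, 2 probes and a 2 s interval give `ka_budget_ms(2, 2, 2) == 6000`, which matches a detection target of about 6 seconds.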
Yeah I know it... I will protect it via cmake... but I want to see if this is also the same thing @xmanmax is talking about first. This is only a hack on linux new enough to have it, it doesn't do anything for any other platform or "fix anything in lws". It remains the case you're better off using the lws timeout stuff as described above. |
I pushed a cleaned-up version of the patch on both v2.4-stable and master. |
I need to be able to detect a connection loss in about 6 seconds.
The TCP keep alive settings work great when no data is being sent at the moment the connection gets lost.
However when some data is sent on the WebSocket after the physical connection is broken, the TCP socket will not send TCP keep alive packets but TCP retransmission packets. The timeout between every retransmit will increase exponentially and a timeout will only occur after about 15 min in my case.
Is there any way to control this in libwebsockets so this can also timeout after about 6 seconds?
I have searched for this in the documentation and online but couldn't find a clear answer.
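The "about 15 min" figure can be reproduced with some simple arithmetic. The sketch below sums an exponentially doubling retransmission timeout; the concrete numbers used in the example (200 ms initial timeout, 120 s cap, 15 retransmissions) are assumptions based on common Linux defaults and are not taken from this report. The function name is hypothetical.

```c
/*
 * Hypothetical helper: total time spent waiting when every retransmission
 * goes unanswered. The timeout starts at initial_rto_ms, doubles after each
 * attempt and is capped at max_rto_ms. There are retries + 1 waits: one
 * before each retransmission and one more before the connection is dropped.
 */
unsigned long total_retransmit_wait_ms(unsigned long initial_rto_ms,
				       unsigned long max_rto_ms,
				       unsigned int retries)
{
	unsigned long rto = initial_rto_ms;
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i <= retries; i++) {
		total += rto;
		/* exponential backoff, capped at the maximum timeout */
		rto = (rto * 2 > max_rto_ms) ? max_rto_ms : rto * 2;
	}

	return total;
}
```

Under those assumed defaults, `total_retransmit_wait_ms(200, 120000, 15)` comes to 924600 ms, i.e. roughly 15.4 minutes, which is in line with the delay described above.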