Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection does not timeout after sending data on broken connection when using TCP keepalive #1223

Closed
mathiasdevos opened this issue Apr 3, 2018 · 26 comments

Comments

@mathiasdevos
Copy link

I need to be able to detect a connection loss in about 6 seconds.

The TCP keep alive settings works great when no data is being sent on the moment when the connection gets lost.

    lws_context_creation_info info;
    info.ka_time = 3;
    info.ka_probes = 3;
    info.ka_interval = 1; 

However when some data is sent on the WebSocket after the physical connection is broken, the TCP socket will not send TCP keep alive packets but TCP retransmission packets. The timeout between every retransmit will increase exponentially and a timeout will only occur after about 15 min in my case.

Is there any way to control this in libwebsockets so this can also timeout after about 6 seconds?
I have searched for this in the documentation and online but couldn't find a clear answer.

@lws-team
Copy link
Member

lws-team commented Apr 3, 2018

What platform is this? The unix plat code looks like this

	if (vhost->ka_time) {
		/* enable keepalive on this socket */
		optval = 1;
		if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
			       (const void *)&optval, optlen) < 0)
			return 1;

#if defined(__APPLE__) || \
    defined(__FreeBSD__) || defined(__FreeBSD_kernel__) || \
    defined(__NetBSD__) || \
    defined(__CYGWIN__) || defined(__OpenBSD__) || defined (__sun) || \
    defined(__HAIKU__)

		/*
		 * didn't find a way to set these per-socket, need to
		 * tune kernel systemwide values
		 */
#else
		/* set the keepalive conditions we want on it too */
		optval = vhost->ka_time;
		if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
			       (const void *)&optval, optlen) < 0)
			return 1;

		optval = vhost->ka_interval;
		if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
			       (const void *)&optval, optlen) < 0)
			return 1;

		optval = vhost->ka_probes;
		if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
			       (const void *)&optval, optlen) < 0)
			return 1;
#endif

ie, if you are not on Linux platform, actually I don't know how to turn it on and off per-socket.

It's Linux?

@mathiasdevos
Copy link
Author

Yes, it's Linux.

The keep alive messages are not an issue, but TCP keep alive messages are only sent on an idle connection.

This is from RFC 1122, page 100:

        Keep-alive packets MUST only be sent when no data or
        acknowledgement packets have been received for the
        connection within an interval.  This interval MUST be
        configurable and MUST default to no less than two hours.

This works fine as you can see in the following Wireshark log.
keep_alive

However when data has been sent to the disconnected target, the TCP Keep-Alive packet will not be sent but TCP Retransmission packets will be sent as shown in the next Wireshark log.
retransmission

This thread on stackoverflow might give some more perspective: https://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux

@lws-team
Copy link
Member

lws-team commented Apr 3, 2018

I kind of understood, but although I am happy you have a lot of faith in me, what are you asking me to do about it... fix the kernel network stack? Or there is something at the userland layer it seems lws is getting wrong to cause this outcome?

@mathiasdevos
Copy link
Author

So are you telling me that there is no way to detect a connection loss in less than 15 min, other than implementing a heart beat on top of the WebSocket protocol?

@lws-team
Copy link
Member

lws-team commented Apr 3, 2018

No, I asked you:

is something at the userland layer it seems lws is getting wrong to cause this outcome?

It calls a few apis and the kernel does the rest. The options for lws to get something wrong are very small (although, that has never stopped us before). So either, a) the problem is somewhere else, or b) the code we are talking about is the stuff I pasted above. Would you perhaps like to try debugging those 20 lines or so? Maybe check optlen is sane since that is a bit strange in the api.

As far as I know there aren't any other buttons to press when it comes to tcp keepalive, those lines are more or less what you're supposed to do to enable it and that's it. If they get executed then it's supposed to be operational.

@mathiasdevos
Copy link
Author

As I said, there is no issue with the TCP keep alive, this is working as expected on idle connections.

My issue is that the TCP keep alive messages are not used in case of data retransmissions (which is totally according to the RFC) and therefore the connection is only disconnected after 15 min.

As far as I can tell, there are two ways to control this:

  1. set tcp_retries2 to smaller number
  2. set TCP_USER_TIMEOUT socket option

The second option seems to be the preferred option, however I cannot find how this option can be set using libwebsockets.

@lws-team
Copy link
Member

lws-team commented Apr 3, 2018

As far as I can tell, there are two ways to control this:

set tcp_retries2 to smaller number
set TCP_USER_TIMEOUT socket option

The second option seems to be the preferred option, however I cannot find how this option can be set using libwebsockets.

The entire code around keepalive in lws is what I pasted earlier.

If you want to test and send a patch or PR on lws that solves your problem I'll be happy to see it.

@mathiasdevos
Copy link
Author

Sorry, I don't have time to create a decent patch at this moment.
Thanks for your quick response.

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

I don't need a 'decent patch'... if you have a solution just paste the relevant part here and I will integrate it.

@mathiasdevos
Copy link
Author

I found out this problem also occurs when using "timeout_secs" and "ws_ping_pong_interval" in stead of TCP keep alive.

When sending data on a physically disconnected connection, libwebsockets will execute lws_restart_ws_ping_pong_timer where time_next_ping_check is set.
However there will be no poll timeouts so lws_service_fd_tsi will not be executed.

As a result there will be no ping pong timeout although no data can be sent to the remote device and no data is received.

I don't have a solution or patch for this issue (at least I believe this is an issue).
I am afraid this is all the input I can provide.

In case anyone wants to reproduce:

  • Set the timeout_secs and ws_ping_pong_interval to some useful value in the server
  • Connect a client to the server
  • Physically interrupt the connection between the client and server
  • Send data from the server to the client on the open WebSocket before the ping pong timer timed out

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

However there will be no poll timeouts

... hmmm all the sockets in lws are nonblocking. So there shouldn't be any way to block to the point that the event loop stops. I'll try your recipe and see what happens here.

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

I tried it like this

a) On master, apply this diff to follow your steps and to add some debug so I can see what goes on

diff --git a/lib/service.c b/lib/service.c
index 4fe9b9e7..f8e2f921 100644
--- a/lib/service.c
+++ b/lib/service.c
@@ -1375,7 +1375,7 @@ lws_service_fd_tsi(struct lws_context *context, struct lws_pollfd *pollfd,
                                            lws_compare_time_t(context, now,
                                                wsi->ws->time_next_ping_check) >
                                               context->ws_ping_pong_interval) {
-
+                                               lwsl_notice("xxx\n");
                                                lwsl_info("req pp on wsi %p\n",
                                                          wsi);
                                                wsi->ws->send_check_ping = 1;
diff --git a/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c b/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
index d22a7d7c..0fd97bc1 100644
--- a/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
+++ b/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
@@ -66,14 +66,16 @@ int main(int argc, char **argv)
        info.port = 7681;
        info.mounts = &mount;
        info.protocols = protocols;
+       info.timeout_secs = 30;
+       info.ws_ping_pong_interval = 5;
 
        lws_set_log_level(LLL_USER | LLL_ERR | LLL_WARN | LLL_NOTICE
                        /* for LLL_ verbosity above NOTICE to be built into lws,
                         * lws must have been configured and built with
                         * -DCMAKE_BUILD_TYPE=DEBUG instead of =RELEASE */
-                       /* | LLL_INFO */ /* | LLL_PARSER */ /* | LLL_HEADER */
+                        | LLL_INFO  /* | LLL_PARSER */ /* | LLL_HEADER */
                        /* | LLL_EXT */ /* | LLL_CLIENT */ /* | LLL_LATENCY */
-                       /* | LLL_DEBUG */, NULL);
+                       | LLL_DEBUG, NULL);
 
        lwsl_user("LWS minimal ws server | visit http://localhost:7681\n");
 
@@ -83,8 +85,10 @@ int main(int argc, char **argv)
                return 1;
        }
 
-       while (n >= 0 && !interrupted)
+       while (n >= 0 && !interrupted) {
                n = lws_service(context, 1000);
+               lwsl_notice("evloop\n");
+       }
 
        lws_context_destroy(context);
 

b) build lws and then the test app minimal-examples/ws-server/minimal-ws-server

c) run the test app

d) connect by Android tablet on the same wlan segment using Firefox

e) disable wifi on the tablet

[2018/04/04 17:23:47:5921] INFO: Method: 'GET' (0), request for '/'
[2018/04/04 17:23:47:5921] DEBUG: lws_ensure_user_space: 0xb681a0 protocol pss 0, user_space=(nil)
[2018/04/04 17:23:47:5921] DEBUG: __lws_set_timeout: 0xb681a0: 30 secs
[2018/04/04 17:23:47:5921] INFO: client indicates GZIP is acceptable
[2018/04/04 17:23:47:5922] INFO: lws_add_http_header_content_length: wsi 0xb681a0: tx_content_length/remain 1734
[2018/04/04 17:23:47:5922] INFO: lws_callback_on_writable: 0xb681a0 (mode 1)
[2018/04/04 17:23:47:5922] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 5
[2018/04/04 17:23:47:5922] DEBUG: lws_server_socket_service: wsi 0xb681a0: ah read rxpos 338, rxlen 338
[2018/04/04 17:23:47:5922] NOTICE: evloop
[2018/04/04 17:23:47:5922] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 5 -> 1
[2018/04/04 17:23:47:5922] DEBUG: wsi->http2_substream 0
[2018/04/04 17:23:47:5922] DEBUG: _lws_plat_file_read: read 1734 of req 1734, pos 1734, len 1734
[2018/04/04 17:23:47:5922] DEBUG: lws_serve_http_file_fragment: sending 1734
[2018/04/04 17:23:47:5922] DEBUG: __lws_set_timeout: 0xb681a0: 30 secs
[2018/04/04 17:23:47:5923] DEBUG: file completed
[2018/04/04 17:23:47:5923] INFO: lws_http_transaction_completed: wsi 0xb681a0
[2018/04/04 17:23:47:5923] INFO: ah attached, pos 338, len 338
[2018/04/04 17:23:47:5923] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb681a0
[2018/04/04 17:23:47:5923] DEBUG: lws_ensure_user_space: 0xb681a0 protocol pss 0, user_space=(nil)
[2018/04/04 17:23:47:5923] DEBUG: __lws_set_timeout: 0xb681a0: 5 secs
[2018/04/04 17:23:47:5923] INFO: __lws_header_table_detach: wsi 0xb681a0: ah 0xb684b0 (tsi=0, count = 1)
[2018/04/04 17:23:47:5923] INFO: __lws_header_table_detach: nobody usable waiting
[2018/04/04 17:23:47:5923] INFO: _lws_destroy_ah: freed ah 0xb684b0 : pool length 0
[2018/04/04 17:23:47:5923] INFO: __lws_header_table_detach: wsi 0xb681a0: ah 0xb684b0 (tsi=0, count = 0)
[2018/04/04 17:23:47:5923] INFO: lws_http_transaction_completed: 0xb681a0: keep-alive await new transaction
[2018/04/04 17:23:47:5923] INFO: lws_callback_on_writable: 0xb681a0 (mode 0)
[2018/04/04 17:23:47:5923] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 5
[2018/04/04 17:23:47:5923] NOTICE: evloop
[2018/04/04 17:23:47:5923] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 5 -> 1
[2018/04/04 17:23:47:5923] DEBUG: lws_server_socket_service: LWSS_HTTP_DEFERRING_ACTION now writable
[2018/04/04 17:23:47:5923] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 1
[2018/04/04 17:23:47:5923] NOTICE: evloop
[2018/04/04 17:23:47:7976] DEBUG: accepted new conn port 46498 on fd=8
[2018/04/04 17:23:47:7977] DEBUG: _realloc: size 776: new server wsi
[2018/04/04 17:23:47:7977] DEBUG: new wsi 0xb684b0 joining vhost default, tsi 0
[2018/04/04 17:23:47:7977] DEBUG: lws_adopt_descriptor_vhost: new wsi 0xb684b0, sockfd 8
[2018/04/04 17:23:47:7977] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:23:47:7977] DEBUG: __insert_wsi_socket_into_fds: 0xb684b0: tsi=0, sock=8, pos-in-fds=3
[2018/04/04 17:23:47:7977] INFO: lws_header_table_attach: wsi 0xb684b0: ah (nil) (tsi 0, count = 0) in
[2018/04/04 17:23:47:7977] DEBUG: _realloc: size 3000: ah struct
[2018/04/04 17:23:47:7977] DEBUG: _realloc: size 4096: ah data
[2018/04/04 17:23:47:7977] INFO: _lws_create_ah: created ah 0xb687c0 (size 4096): pool length 1
[2018/04/04 17:23:47:7977] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 1
[2018/04/04 17:23:47:7977] INFO: lws_header_table_attach: did attach wsi 0xb684b0: ah 0xb687c0: count 1 (on exit)
[2018/04/04 17:23:47:7977] DEBUG: __lws_set_timeout: 0xb684b0: 10 secs
[2018/04/04 17:23:47:7977] DEBUG: Attached ah immediately
[2018/04/04 17:23:47:7978] DEBUG: lws_plat_pipe_signal: fd 6 1
[2018/04/04 17:23:47:7978] NOTICE: evloop
[2018/04/04 17:23:47:7978] NOTICE: evloop
[2018/04/04 17:23:47:7997] INFO: vhost match to default based on port 7681
[2018/04/04 17:23:47:7997] DEBUG: __lws_set_timeout: 0xb684b0: 0 secs
[2018/04/04 17:23:47:7997] INFO: Upgrade to ws
[2018/04/04 17:23:47:7997] DEBUG: checking lws-minimal
[2018/04/04 17:23:47:7997] DEBUG: try http
[2018/04/04 17:23:47:7997] DEBUG: try lws-minimal
[2018/04/04 17:23:47:7997] DEBUG: _realloc: size 216: ws struct
[2018/04/04 17:23:47:7998] DEBUG: _realloc: size 24: user space
[2018/04/04 17:23:47:7998] DEBUG: lws_ensure_user_space: 0xb684b0 protocol pss 24, user_space=0xb624a0
[2018/04/04 17:23:47:7998] DEBUG: 0xb684b0 new partial sent 148 from 166 total
[2018/04/04 17:23:47:7998] DEBUG: _realloc: size 18: truncated send alloc
[2018/04/04 17:23:47:7998] INFO: lws_callback_on_writable: 0xb684b0 (mode 2)
[2018/04/04 17:23:47:7998] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 5
[2018/04/04 17:23:47:7998] DEBUG: lws_process_ws_upgrade: 0xb684b0: inheriting ws ah (rxpos:0, rxlen:544)
[2018/04/04 17:23:47:7998] DEBUG: lws_union_transition: 0xb684b0: mode 3
[2018/04/04 17:23:47:7998] DEBUG: _realloc: size 148: rx_ubuf
[2018/04/04 17:23:47:7998] DEBUG: Allocating RX buffer 144
[2018/04/04 17:23:47:7998] DEBUG: ws established
[2018/04/04 17:23:47:7998] DEBUG: lws_server_socket_service: wsi 0xb684b0: ah read rxpos 544, rxlen 544
[2018/04/04 17:23:47:7998] INFO: __lws_header_table_detach: wsi 0xb684b0: ah 0xb687c0 (tsi=0, count = 1)
[2018/04/04 17:23:47:7998] INFO: __lws_header_table_detach: nobody usable waiting
[2018/04/04 17:23:47:7999] INFO: _lws_destroy_ah: freed ah 0xb687c0 : pool length 0
[2018/04/04 17:23:47:7999] INFO: __lws_header_table_detach: wsi 0xb684b0: ah 0xb687c0 (tsi=0, count = 0)
[2018/04/04 17:23:47:7999] NOTICE: evloop
[2018/04/04 17:23:47:7999] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:47:7999] INFO: 0xb684b0 partial adv 18 (vs 18)
[2018/04/04 17:23:47:7999] INFO: ** 0xb684b0 partial send completed
[2018/04/04 17:23:47:7999] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:23:47:7999] DEBUG: _lws_change_pollfd: using leave_pollout_active
[2018/04/04 17:23:47:7999] NOTICE: evloop
[2018/04/04 17:23:47:8026] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:47:8026] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 5 -> 1
[2018/04/04 17:23:47:8026] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/04 17:23:47:8026] NOTICE: evloop
[2018/04/04 17:23:47:8126] INFO: lws_header_table_attach: wsi 0xb681a0: ah (nil) (tsi 0, count = 0) in
[2018/04/04 17:23:47:8126] DEBUG: _realloc: size 3000: ah struct
[2018/04/04 17:23:47:8126] DEBUG: _realloc: size 4096: ah data
[2018/04/04 17:23:47:8126] INFO: _lws_create_ah: created ah 0xb687c0 (size 4096): pool length 1
[2018/04/04 17:23:47:8126] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 1
[2018/04/04 17:23:47:8126] INFO: lws_header_table_attach: did attach wsi 0xb681a0: ah 0xb687c0: count 1 (on exit)
[2018/04/04 17:23:47:8126] DEBUG: __lws_set_timeout: 0xb681a0: 10 secs
[2018/04/04 17:23:47:8126] INFO: vhost match to default based on port 7681
[2018/04/04 17:23:47:8126] DEBUG: __lws_set_timeout: 0xb681a0: 0 secs
[2018/04/04 17:23:47:8126] INFO: No upgrade
[2018/04/04 17:23:47:8126] DEBUG: lws_union_transition: 0xb681a0: mode 1
[2018/04/04 17:23:47:8126] DEBUG: lws_handshake_server: wsi 0xb681a0: ah 0xb687c0
[2018/04/04 17:23:47:8126] INFO: Method: 'GET' (0), request for '/libwebsockets.org-logo.png'
[2018/04/04 17:23:47:8126] DEBUG: lws_ensure_user_space: 0xb681a0 protocol pss 0, user_space=(nil)
[2018/04/04 17:23:47:8127] DEBUG: __lws_set_timeout: 0xb681a0: 30 secs
[2018/04/04 17:23:47:8127] INFO: client indicates GZIP is acceptable
[2018/04/04 17:23:47:8127] INFO: lws_add_http_header_content_length: wsi 0xb681a0: tx_content_length/remain 7029
[2018/04/04 17:23:47:8127] INFO: lws_callback_on_writable: 0xb681a0 (mode 1)
[2018/04/04 17:23:47:8127] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 5
[2018/04/04 17:23:47:8127] DEBUG: lws_server_socket_service: wsi 0xb681a0: ah read rxpos 311, rxlen 311
[2018/04/04 17:23:47:8127] NOTICE: evloop
[2018/04/04 17:23:47:8127] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 5 -> 1
[2018/04/04 17:23:47:8127] DEBUG: wsi->http2_substream 0
[2018/04/04 17:23:47:8128] DEBUG: _lws_plat_file_read: read 4087 of req 4087, pos 4087, len 7029
[2018/04/04 17:23:47:8128] DEBUG: lws_serve_http_file_fragment: sending 4087
[2018/04/04 17:23:47:8128] DEBUG: __lws_set_timeout: 0xb681a0: 30 secs
[2018/04/04 17:23:47:8128] INFO: lws_callback_on_writable: 0xb681a0 (mode 1)
[2018/04/04 17:23:47:8128] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 5
[2018/04/04 17:23:47:8128] NOTICE: evloop
[2018/04/04 17:23:47:8128] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 5 -> 1
[2018/04/04 17:23:47:8128] DEBUG: wsi->http2_substream 0
[2018/04/04 17:23:47:8128] DEBUG: _lws_plat_file_read: read 2942 of req 4087, pos 7029, len 7029
[2018/04/04 17:23:47:8128] DEBUG: lws_serve_http_file_fragment: sending 2942
[2018/04/04 17:23:47:8128] DEBUG: __lws_set_timeout: 0xb681a0: 30 secs
[2018/04/04 17:23:47:8128] DEBUG: file completed
[2018/04/04 17:23:47:8128] INFO: lws_http_transaction_completed: wsi 0xb681a0
[2018/04/04 17:23:47:8129] INFO: ah attached, pos 311, len 311
[2018/04/04 17:23:47:8129] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb681a0
[2018/04/04 17:23:47:8129] INFO: have prev 0xb62380, setting him to our next (nil)
[2018/04/04 17:23:47:8129] DEBUG: lws_ensure_user_space: 0xb681a0 protocol pss 0, user_space=(nil)
[2018/04/04 17:23:47:8129] DEBUG: __lws_set_timeout: 0xb681a0: 5 secs
[2018/04/04 17:23:47:8129] INFO: __lws_header_table_detach: wsi 0xb681a0: ah 0xb687c0 (tsi=0, count = 1)
[2018/04/04 17:23:47:8129] INFO: __lws_header_table_detach: nobody usable waiting
[2018/04/04 17:23:47:8129] INFO: _lws_destroy_ah: freed ah 0xb687c0 : pool length 0
[2018/04/04 17:23:47:8129] INFO: __lws_header_table_detach: wsi 0xb681a0: ah 0xb687c0 (tsi=0, count = 0)
[2018/04/04 17:23:47:8129] INFO: lws_http_transaction_completed: 0xb681a0: keep-alive await new transaction
[2018/04/04 17:23:47:8129] INFO: lws_callback_on_writable: 0xb681a0 (mode 0)
[2018/04/04 17:23:47:8129] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 5
[2018/04/04 17:23:47:8129] NOTICE: evloop
[2018/04/04 17:23:47:8129] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 5 -> 1
[2018/04/04 17:23:47:8129] DEBUG: lws_server_socket_service: LWSS_HTTP_DEFERRING_ACTION now writable
[2018/04/04 17:23:47:8129] DEBUG: _lws_change_pollfd: wsi 0xb681a0: fd 7 events 1 -> 1
[2018/04/04 17:23:47:8130] NOTICE: evloop
[2018/04/04 17:23:48:8140] NOTICE: evloop
[2018/04/04 17:23:49:8151] NOTICE: evloop
[2018/04/04 17:23:50:8161] NOTICE: evloop
[2018/04/04 17:23:51:8172] NOTICE: evloop
[2018/04/04 17:23:52:8182] NOTICE: evloop
[2018/04/04 17:23:53:8193] INFO: __lws_close_free_wsi: 0xb681a0: caller: timeout
[2018/04/04 17:23:53:8193] DEBUG: __lws_close_free_wsi: real just_kill_connection: 0xb681a0 (sockfd 7)
[2018/04/04 17:23:53:8193] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb681a0
[2018/04/04 17:23:53:8193] INFO: have prev 0xb62380, setting him to our next (nil)
[2018/04/04 17:23:53:8193] DEBUG: __remove_wsi_socket_from_fds: wsi=0xb681a0, sock=7, fds pos=2, end guy pos=4, endfd=0
[2018/04/04 17:23:53:8193] DEBUG: _lws_change_pollfd: wsi 0xb67b80: fd 4 events 1 -> 1
[2018/04/04 17:23:53:8194] INFO: ah det due to close
[2018/04/04 17:23:53:8194] DEBUG: __lws_free_wsi: 0xb681a0, remaining wsi 3
[2018/04/04 17:23:53:8194] NOTICE: xxx
[2018/04/04 17:23:53:8194] INFO: req pp on wsi 0xb684b0
[2018/04/04 17:23:53:8194] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:23:53:8194] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:23:53:8194] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 5
[2018/04/04 17:23:53:8194] NOTICE: evloop
[2018/04/04 17:23:53:8194] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:53:8194] INFO: issuing ping on wsi 0xb684b0
[2018/04/04 17:23:53:8194] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:23:53:8194] NOTICE: evloop
[2018/04/04 17:23:53:8195] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:53:8195] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 5 -> 1
[2018/04/04 17:23:53:8195] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/04 17:23:53:8195] NOTICE: evloop
[2018/04/04 17:23:53:8245] INFO: received pong
[2018/04/04 17:23:53:8245] DEBUG: 
[2018/04/04 17:23:53:8245] DEBUG: 
[2018/04/04 17:23:53:8245] INFO: received expected PONG on wsi 0xb684b0
[2018/04/04 17:23:53:8245] DEBUG: __lws_set_timeout: 0xb684b0: 0 secs
[2018/04/04 17:23:53:8245] NOTICE: evloop
[2018/04/04 17:23:54:8256] NOTICE: evloop
[2018/04/04 17:23:55:8267] NOTICE: evloop
[2018/04/04 17:23:56:8277] NOTICE: evloop
[2018/04/04 17:23:57:8288] NOTICE: evloop
[2018/04/04 17:23:58:8299] NOTICE: evloop
[2018/04/04 17:23:59:8309] NOTICE: xxx
[2018/04/04 17:23:59:8309] INFO: req pp on wsi 0xb684b0
[2018/04/04 17:23:59:8309] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:23:59:8309] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:23:59:8309] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 5
[2018/04/04 17:23:59:8310] NOTICE: evloop
[2018/04/04 17:23:59:8310] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:59:8310] INFO: issuing ping on wsi 0xb684b0
[2018/04/04 17:23:59:8310] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:23:59:8310] NOTICE: evloop
[2018/04/04 17:23:59:8310] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:23:59:8310] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 5 -> 1
[2018/04/04 17:23:59:8310] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/04 17:23:59:8310] NOTICE: evloop
[2018/04/04 17:23:59:8502] INFO: received pong
[2018/04/04 17:23:59:8502] DEBUG: 
[2018/04/04 17:23:59:8502] DEBUG: 
[2018/04/04 17:23:59:8502] INFO: received expected PONG on wsi 0xb684b0
[2018/04/04 17:23:59:8502] DEBUG: __lws_set_timeout: 0xb684b0: 0 secs
[2018/04/04 17:23:59:8502] NOTICE: evloop
[2018/04/04 17:24:00:8513] NOTICE: evloop
[2018/04/04 17:24:01:8523] NOTICE: evloop
[2018/04/04 17:24:02:8534] NOTICE: evloop
[2018/04/04 17:24:03:8544] NOTICE: evloop
[2018/04/04 17:24:04:8555] NOTICE: evloop
[2018/04/04 17:24:05:8565] NOTICE: xxx
[2018/04/04 17:24:05:8566] INFO: req pp on wsi 0xb684b0
[2018/04/04 17:24:05:8566] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:24:05:8566] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:24:05:8566] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 5
[2018/04/04 17:24:05:8566] NOTICE: evloop
[2018/04/04 17:24:05:8566] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:24:05:8566] INFO: issuing ping on wsi 0xb684b0
[2018/04/04 17:24:05:8566] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:24:05:8566] NOTICE: evloop
[2018/04/04 17:24:05:8566] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:24:05:8566] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 5 -> 1
[2018/04/04 17:24:05:8566] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/04 17:24:05:8566] NOTICE: evloop
[2018/04/04 17:24:06:8577] NOTICE: evloop
[2018/04/04 17:24:07:8588] NOTICE: evloop
[2018/04/04 17:24:08:8598] NOTICE: evloop
[2018/04/04 17:24:09:8609] NOTICE: evloop
[2018/04/04 17:24:10:8619] NOTICE: evloop
[2018/04/04 17:24:11:8631] NOTICE: xxx
[2018/04/04 17:24:11:8631] INFO: req pp on wsi 0xb684b0
[2018/04/04 17:24:11:8631] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:24:11:8631] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:24:11:8631] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 1 -> 5
[2018/04/04 17:24:11:8631] NOTICE: evloop
[2018/04/04 17:24:11:8631] INFO: lws_handle_POLLOUT_event: 0xb684b0
[2018/04/04 17:24:11:8632] INFO: issuing ping on wsi 0xb684b0
[2018/04/04 17:24:11:8632] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:24:11:8632] NOTICE: evloop
[2018/04/04 17:24:12:8642] NOTICE: evloop
[2018/04/04 17:24:13:8653] NOTICE: evloop
[2018/04/04 17:24:14:8663] NOTICE: evloop
[2018/04/04 17:24:15:8674] NOTICE: evloop
[2018/04/04 17:24:16:8684] NOTICE: evloop
[2018/04/04 17:24:17:8695] NOTICE: xxx
[2018/04/04 17:24:17:8695] INFO: req pp on wsi 0xb684b0
[2018/04/04 17:24:17:8695] DEBUG: __lws_set_timeout: 0xb684b0: 30 secs
[2018/04/04 17:24:17:8695] INFO: lws_callback_on_writable: 0xb684b0 (mode 3)
[2018/04/04 17:24:17:8695] DEBUG: _lws_change_pollfd: wsi 0xb684b0: fd 8 events 5 -> 5
[2018/04/04 17:24:17:8695] NOTICE: evloop
[2018/04/04 17:24:18:8706] NOTICE: evloop
[2018/04/04 17:24:19:8717] NOTICE: evloop
[2018/04/04 17:24:20:8727] NOTICE: evloop

...I disable the wlan around here...

[2018/04/04 17:24:21:8738] NOTICE: evloop
[2018/04/04 17:24:22:8748] NOTICE: evloop
[2018/04/04 17:24:23:8759] NOTICE: evloop
[2018/04/04 17:24:24:8770] NOTICE: evloop
[2018/04/04 17:24:25:8780] NOTICE: evloop
[2018/04/04 17:24:26:8791] NOTICE: evloop
[2018/04/04 17:24:27:8801] NOTICE: evloop
[2018/04/04 17:24:28:8812] NOTICE: evloop
[2018/04/04 17:24:29:8823] NOTICE: evloop
[2018/04/04 17:24:30:8833] NOTICE: evloop
[2018/04/04 17:24:31:8844] NOTICE: evloop
[2018/04/04 17:24:32:8854] NOTICE: evloop
[2018/04/04 17:24:33:8865] NOTICE: evloop
[2018/04/04 17:24:34:8875] NOTICE: evloop
[2018/04/04 17:24:35:8885] NOTICE: evloop
[2018/04/04 17:24:36:8896] NOTICE: evloop
[2018/04/04 17:24:37:8906] NOTICE: evloop
[2018/04/04 17:24:38:8917] NOTICE: evloop
[2018/04/04 17:24:39:8928] NOTICE: evloop
[2018/04/04 17:24:40:8938] NOTICE: evloop
[2018/04/04 17:24:41:8949] NOTICE: evloop
[2018/04/04 17:24:42:8959] NOTICE: evloop
[2018/04/04 17:24:43:8970] NOTICE: evloop
[2018/04/04 17:24:44:8980] NOTICE: evloop
[2018/04/04 17:24:45:8991] NOTICE: evloop
[2018/04/04 17:24:46:9002] NOTICE: evloop
[2018/04/04 17:24:47:9013] NOTICE: evloop
[2018/04/04 17:24:48:9023] INFO: wsi 0xb684b0: TIMEDOUT WAITING on 16 (did hdr 1, ah (nil), wl 0, pfd events 5) 1522833888 vs 30
[2018/04/04 17:24:48:9023] INFO: __lws_close_free_wsi: 0xb684b0: caller: timeout
[2018/04/04 17:24:48:9023] DEBUG: __lws_close_free_wsi: real just_kill_connection: 0xb684b0 (sockfd 8)
[2018/04/04 17:24:48:9024] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb684b0
[2018/04/04 17:24:48:9024] INFO: have prev 0xb62388, setting him to our next (nil)
[2018/04/04 17:24:48:9024] DEBUG: __remove_wsi_socket_from_fds: wsi=0xb684b0, sock=8, fds pos=2, end guy pos=3, endfd=8
[2018/04/04 17:24:48:9024] DEBUG: _lws_change_pollfd: wsi 0xb67b80: fd 4 events 1 -> 1
[2018/04/04 17:24:48:9024] INFO: ah det due to close
[2018/04/04 17:24:48:9024] DEBUG: __lws_free_wsi: 0xb684b0, remaining wsi 2
[2018/04/04 17:24:48:9024] NOTICE: evloop
[2018/04/04 17:24:49:9035] NOTICE: evloop
[2018/04/04 17:24:50:9045] NOTICE: evloop
[2018/04/04 17:24:51:9056] NOTICE: evloop
[2018/04/04 17:24:52:9066] NOTICE: evloop
^C[2018/04/04 17:24:53:3201] NOTICE: evloop
[2018/04/04 17:24:53:3201] INFO: lws_context_destroy: ctx 0xb61530
[2018/04/04 17:24:53:3201] INFO: __lws_close_free_wsi: 0xb67b80: caller: ctx destroy
[2018/04/04 17:24:53:3201] DEBUG: __lws_close_free_wsi: real just_kill_connection: 0xb67b80 (sockfd 4)
[2018/04/04 17:24:53:3201] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb67b80
[2018/04/04 17:24:53:3201] DEBUG: __remove_wsi_socket_from_fds: wsi=0xb67b80, sock=4, fds pos=0, end guy pos=2, endfd=8
[2018/04/04 17:24:53:3201] DEBUG: not calling back closed mode=11 state=0
[2018/04/04 17:24:53:3202] INFO: ah det due to close
[2018/04/04 17:24:53:3202] DEBUG: __lws_free_wsi: 0xb67b80, remaining wsi 1
[2018/04/04 17:24:53:3202] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0xb67e90
[2018/04/04 17:24:53:3202] DEBUG: __remove_wsi_socket_from_fds: wsi=0xb67e90, sock=5, fds pos=0, end guy pos=1, endfd=5
[2018/04/04 17:24:53:3202] INFO: lws_vhost_destroy1
[2018/04/04 17:24:53:3202] INFO: lws_libuv_destroyloop: 0
[2018/04/04 17:24:53:3202] INFO: lws_context_destroy2: ctx 0xb61530
[2018/04/04 17:24:53:3202] INFO: lws_vhost_destroy2: 0xb678f0
[2018/04/04 17:24:53:3202] INFO:   lws_vhost_destroy2: Freeing vhost 0xb678f0

He does not stop going around the event loop. Timeout reason 16 that it closes on is PENDING_TIMEOUT_WS_PONG_CHECK_SEND_PING.

This is acting the same for you, or something else? Because this seems OK to me.

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

In the ~1Hz timeout processing stuff, if he decides it's time to send the ping. he sets the related timeout at that time.

lib/service.c:1382 (on master)

						lws_set_timeout(wsi,
					PENDING_TIMEOUT_WS_PONG_CHECK_SEND_PING,
							context->timeout_secs);
						lws_callback_on_writable(wsi);

Even if the socket never becomes writeable due to whatever, the timeout should still fire and close it. Does this act any different for you? Do you maybe have something countermanding the timeout?

@mathiasdevos
Copy link
Author

The issue only happens for me when sending data to the client after it disconnected.

I'll try to make time for some sample code tomorrow.

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

My test from earlier isn't doing that... it's also disconnecting the client if that makes any difference.

I'll also try this kind of test tomorrow.

@lws-team
Copy link
Member

lws-team commented Apr 4, 2018

Hmm so now I have this diff for testing on master

diff --git a/lib/service.c b/lib/service.c
index 4fe9b9e7..f8e2f921 100644
--- a/lib/service.c
+++ b/lib/service.c
@@ -1375,7 +1375,7 @@ lws_service_fd_tsi(struct lws_context *context, struct lws_pollfd *pollfd,
                                            lws_compare_time_t(context, now,
                                                wsi->ws->time_next_ping_check) >
                                               context->ws_ping_pong_interval) {
-
+                                               lwsl_notice("xxx\n");
                                                lwsl_info("req pp on wsi %p\n",
                                                          wsi);
                                                wsi->ws->send_check_ping = 1;
diff --git a/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c b/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
index d22a7d7c..0fd97bc1 100644
--- a/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
+++ b/minimal-examples/ws-server/minimal-ws-server/minimal-ws-server.c
@@ -66,14 +66,16 @@ int main(int argc, char **argv)
        info.port = 7681;
        info.mounts = &mount;
        info.protocols = protocols;
+       info.timeout_secs = 30;
+       info.ws_ping_pong_interval = 5;
 
        lws_set_log_level(LLL_USER | LLL_ERR | LLL_WARN | LLL_NOTICE
                        /* for LLL_ verbosity above NOTICE to be built into lws,
                         * lws must have been configured and built with
                         * -DCMAKE_BUILD_TYPE=DEBUG instead of =RELEASE */
-                       /* | LLL_INFO */ /* | LLL_PARSER */ /* | LLL_HEADER */
+                        | LLL_INFO  /* | LLL_PARSER */ /* | LLL_HEADER */
                        /* | LLL_EXT */ /* | LLL_CLIENT */ /* | LLL_LATENCY */
-                       /* | LLL_DEBUG */, NULL);
+                       | LLL_DEBUG, NULL);
 
        lwsl_user("LWS minimal ws server | visit http://localhost:7681\n");
 
@@ -83,8 +85,10 @@ int main(int argc, char **argv)
                return 1;
        }
 
-       while (n >= 0 && !interrupted)
+       while (n >= 0 && !interrupted) {
                n = lws_service(context, 1000);
+               lwsl_notice("evloop\n");
+       }
 
        lws_context_destroy(context);
 
diff --git a/minimal-examples/ws-server/minimal-ws-server/protocol_lws_minimal.c b/minimal-examples/ws-server/minimal-ws-server/protocol_lws_minimal.c
index 26f20c2f..e1fde2fa 100644
--- a/minimal-examples/ws-server/minimal-ws-server/protocol_lws_minimal.c
+++ b/minimal-examples/ws-server/minimal-ws-server/protocol_lws_minimal.c
@@ -32,6 +32,7 @@ struct per_session_data__minimal {
        struct per_session_data__minimal *pss_list;
        struct lws *wsi;
        int last; /* the last message number we sent */
+       int timer_pending;
 };
 
 /* one of these is created for each vhost our protocol is used with */
@@ -86,6 +87,7 @@ callback_minimal(struct lws *wsi, enum lws_callback_reasons reason,
                lws_ll_fwd_insert(pss, pss_list, vhd->pss_list);
                pss->wsi = wsi;
                pss->last = vhd->current;
+               lws_set_timer_usecs(wsi, 3000000);
                break;
 
        case LWS_CALLBACK_CLOSED:
@@ -95,6 +97,25 @@ callback_minimal(struct lws *wsi, enum lws_callback_reasons reason,
                break;
 
        case LWS_CALLBACK_SERVER_WRITEABLE:
+
+               if (pss->timer_pending) {
+                       uint8_t buf[LWS_PRE + 32];
+                       int n;
+
+                       pss->timer_pending--;
+                       n = lws_snprintf((char *)buf + LWS_PRE, sizeof(buf) - LWS_PRE,
+                                        "ping");
+                       lwsl_user("writing timer ping\n");
+                       m = lws_write(wsi, buf + LWS_PRE, n, LWS_WRITE_TEXT);
+                       if (m < n) {
+                               lwsl_err("ERROR %d writing to ws\n", m);
+                               return -1;
+                       }
+                       if (pss->timer_pending)
+                               lws_callback_on_writable(wsi);
+                       break;
+               }
+
                if (!vhd->amsg.payload)
                        break;
 
@@ -137,6 +158,13 @@ callback_minimal(struct lws *wsi, enum lws_callback_reasons reason,
                } lws_end_foreach_llp(ppss, pss_list);
                break;
 
+       case LWS_CALLBACK_TIMER:
+               lwsl_user("LWS_CALLBACK_TIMER\n");
+               pss->timer_pending++;
+               lws_callback_on_writable(wsi);
+               lws_set_timer_usecs(wsi, 3000000);
+               break;
+
        default:
                break;
        }

With this, once you make a connection from a client (I used a laptop on the same subnet) he sends a message "ping" every 3s using a wsi timer.

And after a few seconds, I pull out the only ethernet cable from the server side.

...
[2018/04/05 07:20:40:9283] NOTICE: evloop
[2018/04/05 07:20:41:8563] NOTICE: evloop
[2018/04/05 07:20:41:8563] NOTICE: evloop
[2018/04/05 07:20:41:8563] NOTICE: evloop
[2018/04/05 07:20:41:8563] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:41:8564] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:41:8564] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:41:8564] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:41:8564] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:41:8564] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:41:8564] USER: writing timer ping
[2018/04/05 07:20:41:8564] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:41:8564] NOTICE: evloop
[2018/04/05 07:20:42:8575] NOTICE: evloop
[2018/04/05 07:20:43:8585] NOTICE: evloop
[2018/04/05 07:20:44:8566] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:44:8566] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:44:8566] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5

This timeout / close is the now idle http connection timing out and closing normally, the ws connection continues...

[2018/04/05 07:20:44:8566] INFO: __lws_close_free_wsi: 0x11461a0: caller: timeout
[2018/04/05 07:20:44:8566] DEBUG: __lws_close_free_wsi: real just_kill_connection: 0x11461a0 (sockfd 7)
[2018/04/05 07:20:44:8566] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0x11461a0
[2018/04/05 07:20:44:8566] INFO: have prev 0x1140380, setting him to our next (nil)
[2018/04/05 07:20:44:8566] DEBUG: __remove_wsi_socket_from_fds: wsi=0x11461a0, sock=7, fds pos=2, end guy pos=4, endfd=0
[2018/04/05 07:20:44:8566] DEBUG: _lws_change_pollfd: wsi 0x1145b80: fd 4 events 1 -> 1
[2018/04/05 07:20:44:8566] INFO: ah det due to close
[2018/04/05 07:20:44:8567] DEBUG: __lws_free_wsi: 0x11461a0, remaining wsi 3
[2018/04/05 07:20:44:8567] NOTICE: evloop
[2018/04/05 07:20:44:8567] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:44:8567] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:44:8567] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:44:8567] USER: writing timer ping
[2018/04/05 07:20:44:8567] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:44:8567] NOTICE: evloop
[2018/04/05 07:20:45:8578] NOTICE: evloop
[2018/04/05 07:20:46:8588] NOTICE: evloop
[2018/04/05 07:20:47:8569] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:47:8569] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:47:8569] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:47:8569] NOTICE: evloop
[2018/04/05 07:20:47:8569] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:47:8569] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:47:8569] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:47:8569] USER: writing timer ping
[2018/04/05 07:20:47:8569] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:47:8570] NOTICE: evloop
[2018/04/05 07:20:48:8580] NOTICE: evloop
[2018/04/05 07:20:49:8591] NOTICE: evloop
[2018/04/05 07:20:50:8571] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:50:8571] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:50:8571] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:50:8571] NOTICE: evloop
[2018/04/05 07:20:50:8571] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:50:8572] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:50:8572] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:50:8572] USER: writing timer ping
[2018/04/05 07:20:50:8572] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:50:8572] NOTICE: evloop
[2018/04/05 07:20:51:8583] NOTICE: evloop
[2018/04/05 07:20:52:8593] NOTICE: evloop
[2018/04/05 07:20:53:8574] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:53:8574] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:53:8574] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:53:8574] NOTICE: evloop
[2018/04/05 07:20:53:8574] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:53:8574] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:53:8574] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:53:8574] USER: writing timer ping
[2018/04/05 07:20:53:8574] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:53:8575] NOTICE: evloop
[2018/04/05 07:20:54:8585] NOTICE: evloop
[2018/04/05 07:20:55:8596] NOTICE: evloop
[2018/04/05 07:20:56:8576] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:56:8577] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:56:8577] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:56:8577] NOTICE: evloop
[2018/04/05 07:20:56:8577] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:56:8577] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:56:8577] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:56:8577] USER: writing timer ping
[2018/04/05 07:20:56:8577] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:56:8577] NOTICE: evloop
[2018/04/05 07:20:57:8588] NOTICE: evloop
[2018/04/05 07:20:58:8599] NOTICE: evloop
[2018/04/05 07:20:59:8579] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:20:59:8579] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:20:59:8579] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:20:59:8579] NOTICE: evloop
[2018/04/05 07:20:59:8580] INFO: lws_handle_POLLOUT_event: 0x11464b0
[2018/04/05 07:20:59:8580] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 1
[2018/04/05 07:20:59:8580] INFO: lws_handle_POLLOUT_event: non http2
[2018/04/05 07:20:59:8580] USER: writing timer ping
[2018/04/05 07:20:59:8580] DEBUG: LWS_EXT_CB_PAYLOAD_TX
[2018/04/05 07:20:59:8580] NOTICE: evloop
[2018/04/05 07:21:00:8591] NOTICE: evloop
[2018/04/05 07:21:01:8601] NOTICE: evloop
[2018/04/05 07:21:02:8582] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:02:8582] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:02:8582] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 1 -> 5
[2018/04/05 07:21:02:8582] NOTICE: evloop
[2018/04/05 07:21:03:8592] NOTICE: evloop
[2018/04/05 07:21:04:8603] NOTICE: evloop
[2018/04/05 07:21:05:8584] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:05:8584] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:05:8584] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:05:8584] NOTICE: xxx
[2018/04/05 07:21:05:8584] INFO: req pp on wsi 0x11464b0
[2018/04/05 07:21:05:8584] DEBUG: __lws_set_timeout: 0x11464b0: 30 secs
[2018/04/05 07:21:05:8584] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:05:8584] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:05:8584] NOTICE: evloop
[2018/04/05 07:21:06:8595] NOTICE: evloop
[2018/04/05 07:21:07:8605] NOTICE: evloop
[2018/04/05 07:21:08:8586] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:08:8586] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:08:8586] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:08:8586] NOTICE: evloop
[2018/04/05 07:21:09:8596] NOTICE: evloop
[2018/04/05 07:21:10:8607] NOTICE: evloop
[2018/04/05 07:21:11:8587] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:11:8588] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:11:8588] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:11:8588] NOTICE: evloop
[2018/04/05 07:21:12:8598] NOTICE: evloop
[2018/04/05 07:21:13:8609] NOTICE: evloop
[2018/04/05 07:21:14:8589] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:14:8589] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:14:8589] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:14:8589] NOTICE: evloop
[2018/04/05 07:21:15:8600] NOTICE: evloop
[2018/04/05 07:21:16:8610] NOTICE: evloop
[2018/04/05 07:21:17:8591] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:17:8591] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:17:8591] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:17:8591] NOTICE: evloop
[2018/04/05 07:21:18:8601] NOTICE: evloop
[2018/04/05 07:21:19:8612] NOTICE: evloop
[2018/04/05 07:21:20:8592] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:20:8593] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:20:8593] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:20:8593] NOTICE: evloop
[2018/04/05 07:21:21:8603] NOTICE: evloop
[2018/04/05 07:21:22:8614] NOTICE: evloop
[2018/04/05 07:21:23:8594] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:23:8594] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:23:8594] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:23:8594] NOTICE: evloop
[2018/04/05 07:21:24:8605] NOTICE: evloop
[2018/04/05 07:21:25:8615] NOTICE: evloop
[2018/04/05 07:21:26:8596] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:26:8596] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:26:8596] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:26:8596] NOTICE: evloop
[2018/04/05 07:21:27:8606] NOTICE: evloop
[2018/04/05 07:21:28:8617] NOTICE: evloop
[2018/04/05 07:21:29:8597] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:29:8598] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:29:8598] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:29:8598] NOTICE: evloop
[2018/04/05 07:21:30:8608] NOTICE: evloop
[2018/04/05 07:21:31:8619] NOTICE: evloop
[2018/04/05 07:21:32:8599] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:32:8599] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:32:8599] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:32:8600] NOTICE: evloop
[2018/04/05 07:21:33:8610] NOTICE: evloop
[2018/04/05 07:21:34:8620] NOTICE: evloop
[2018/04/05 07:21:35:8601] USER: LWS_CALLBACK_TIMER
[2018/04/05 07:21:35:8601] INFO: lws_callback_on_writable: 0x11464b0 (mode 3)
[2018/04/05 07:21:35:8601] DEBUG: _lws_change_pollfd: wsi 0x11464b0: fd 8 events 5 -> 5
[2018/04/05 07:21:35:8601] NOTICE: evloop
[2018/04/05 07:21:36:8612] INFO: wsi 0x11464b0: TIMEDOUT WAITING on 16 (did hdr 1, ah (nil), wl 0, pfd events 5) 1522884096 vs 30
[2018/04/05 07:21:36:8612] INFO: __lws_close_free_wsi: 0x11464b0: caller: timeout
[2018/04/05 07:21:36:8612] DEBUG: __lws_close_free_wsi: real just_kill_connection: 0x11464b0 (sockfd 8)
[2018/04/05 07:21:36:8612] INFO: lws_same_vh_protocol_remove: removing same prot wsi 0x11464b0
[2018/04/05 07:21:36:8612] INFO: have prev 0x1140388, setting him to our next (nil)
[2018/04/05 07:21:36:8612] DEBUG: __remove_wsi_socket_from_fds: wsi=0x11464b0, sock=8, fds pos=2, end guy pos=3, endfd=8
[2018/04/05 07:21:36:8612] DEBUG: _lws_change_pollfd: wsi 0x1145b80: fd 4 events 1 -> 1
[2018/04/05 07:21:36:8612] INFO: ah det due to close
[2018/04/05 07:21:36:8612] DEBUG: __lws_free_wsi: 0x11464b0, remaining wsi 2
[2018/04/05 07:21:36:8612] NOTICE: evloop
[2018/04/05 07:21:37:8623] NOTICE: evloop
[2018/04/05 07:21:38:8634] NOTICE: evloop

Again he times out on PENDING_TIMEOUT_WS_PONG_CHECK_SEND_PING (reason 16).

[2018/04/05 07:21:36:8612] INFO: wsi 0x11464b0: TIMEDOUT WAITING on 16 (did hdr 1, ah (nil), wl 0, pfd events 5) 1522884096 vs 30

So this is now following what was supposed to reproduce it AFAIK and it acts well. Can you try the same thing?

@lws-team
Copy link
Member

lws-team commented Apr 6, 2018

Well, if you want to try the same tests I'm happy to continue looking at it if something to look at.

@lws-team lws-team closed this as completed Apr 6, 2018
@mathiasdevos
Copy link
Author

Hi,

I have created a sample client and server program to demonstrate the issue.
I used the v2.4-stable branch but the issue also happens on the master branch.

You can find the diff file here:
keep-alive.txt

I used the following command to run cmake:

$ mkdir build && cd build
$ cmake .. -DLWS_WITHOUT_EXTENSIONS=1 -DLWS_WITH_SSL=0 -DLWS_WITH_ZLIB=0 -DLWS_WITHOUT_TESTAPPS=1 -DLWS_WITH_SHARED=0 -DLWS_WITH_ZIP_FOPS=0

You can set the IP address of the server in test-apps/keep-alive/main.

Flow of the program:

  • Client connects to server
  • Server sends a message to the client every 0.5 seconds
  • When removing the physical connection between the client and the server, the server will not timeout

When you disable sending data on the server (test-apps/keep-alive/Server.c line 71), the connection will timeout as expected.
Keep alive configuration can be found in test-apps/keep-alive/Server.c line 36 - 38.

@xmanmax
Copy link

xmanmax commented Apr 16, 2018

I also met a same issue:
under ubuntu platform, used v2.4-stable branch ,and options as below:
info->ws_ping_pong_interval = 40;
info->timeout_secs = 5;

  1. I removed the physical connection between client and server
  2. after about one minute, a PING message from server send to device
  3. tcp retransmission tried 30 times, continued for 3minutes, then the sockets closed. I received LWS_CALLBACK_CLOSED callback.

tcpdump result as below:
531523872592_ pic_hd

541523872708_ pic_hd

@lws-team
Copy link
Member

lws-team commented Apr 16, 2018

Well I appreciate Matthias has sent some kind of reproducer.

However Xmanmax's description makes me feel the same way as they last time he [1] mentioned it (I attempted to recreate it and failed)... lws is just a userland app in the end. Pointing at the kernel behaviours - these retries and such are coming from the kernel networking stack - and expecting me to "fix" them is not going to lead anywhere unless there is something that can be / should be done to solve it from userland side. If the userland Posix apis for sockets don't get told anything for some some period then lws can't react... what else are you expecting?

Anyway it is the end of my day where I am. I'll try Matthias' stuff tomorrow.

Edit: [1] Sorry it was matthias... confused by basically the same screenshot.

@lws-team
Copy link
Member

Hmmm... lws is just a userland app. These things are platform-specific.

https://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket?lq=1

He suggests a workaround which boils down to something like this on v2.4-stable (the 5000 is in ms and can be computed from the existing KA params I think)

diff --git a/lib/plat/lws-plat-unix.c b/lib/plat/lws-plat-unix.c
index a51e67bb..d3a8093e 100644
--- a/lib/plat/lws-plat-unix.c
+++ b/lib/plat/lws-plat-unix.c
@@ -261,6 +261,12 @@ lws_plat_set_socket_options(struct lws_vhost *vhost, int fd)
                 * tune kernel systemwide values
                 */
 #else
+               /* linux-specific hack*/
+               optval = 5000;
+               if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
+                              (const void *)&optval, optlen) < 0)
+                       return 1;
+
                /* set the keepalive conditions we want on it too */
                optval = vhost->ka_time;
                if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,

This makes the server understand the client has gone away for me... is this solving what everyone is talking about the same, at least for Linux?

Also notice that lws supports info.ws_ping_pong_interval which applies to ws connections... it tracks and confirms idle connections in lws itself by using the ws PING, PONG and timeout support, independent of tcp keepalive.

@mathiasdevos
Copy link
Author

I agree with this one.

I guess optval should be (ka_time + (ka_probes * ka_interval)) in this case.

@mathiasdevos mathiasdevos changed the title Cannot detect connection loss within a short period after sending data Connection does not timeout after sending data on broken connection when using TCP keepalive Apr 17, 2018
@mathiasdevos
Copy link
Author

Some additional checking should be added because I am building libwebsockets for an older linux kernel and this one does not support TCP_USER_TIMEOUT yet, therefore it cannot be compiled with these changes.

@lws-team
Copy link
Member

Yeah I know it... I will protect it via cmake... but I want to see if this is also the same thing @xmanmax is talking about first. This is only a hack on linux new enough to have it, it doesn't do anything for any other platform or "fix anything in lws".

It remains the case you're better off using the lws timeout stuff as described above.

@xmanmax
Copy link

xmanmax commented Apr 17, 2018

Thank you.
That has a little different with my case, I can not use ka_time/ka_interval/ka_probes for some reason, so I tried ws_ping_pong_interval and timeout_sec to cover offline check, as you said ,I tried TCP_USER_TIMEOUT, that`s work for my case.
551523967409_ pic_hd

@lws-team
Copy link
Member

I pushed a cleaned-up version of the patch on both v2.4-stable and master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants