Reconnect client if no pong answer #64

surik · 2015-08-25T12:11:43Z

If no pong answer on ping client connection will be reestablished

Relate to #62

liveforeverx · 2015-08-25T14:19:43Z

src/hello_client.erl

@@ -165,12 +166,17 @@ handle_call(terminate, _From, State) ->
 %% @hidden
 handle_info({?INCOMING_MSG, Message}, State) ->
    incoming_message(Message, State);
+handle_info(?PING, State = #client_state{ping_counter = PingCounter}) when PingCounter >= 3 ->
+    ?LOG_ERROR("There is no PONG answer on PING for a long time. Client will be restarted.", []),


@ContentSafe Please, check, what additional information(meta fields) or reword needed here.

"for a long time" - seriously? can we please be more specific?
it seems you are just pretending time-measurement but instead you are counting missed ping/pong.
how about being precise and honest in one go? keep track of the timestamp of the last successful ping/pong and compare that to now(). you will have a precise time-period to expose for the log-reader and the tolerance can be configured directly as a time-span (instead of some combination of PingCounter and PingInterval).

@thz in this log message I wanted to be more clearer: we sent PONG 3 times and didn't receive any answers. For me, 3 * keep_alive_interval is a long time for one PONG answer.
Of course, we could more specific and have timestamp for last PONG message (if there was).
And of course, we could get rid of PingCounter and PIngInterval combination and have something like PingNoAnswerInterval option, which parametrizes time-span in seconds to shutdown if there no PONG answer within. But, in my opinion, PingCounter is more OTP way (like supervisor restart politic)
I maybe don't understand you on 100%, can you give me more examples?

@surik at least the time hello_client was waiting for a $PONG should be logged when the PingCounter is >=3. Assuming the $PING is sent approximately right away from the client the resulting time measurement 3 * keep_alive_interval should be fine. Something like "Ping request to 'server IP' timed out after 600s. Client will be restarted." is imho ok. Additional meta fields like server IP or the name of the PID (if named) would be very nice but this stuff is atm also not part of the other logging (see hello_log.hrl). We should fix that in the future.

i do not see any added value by doing 3x ping instead of doing one. a reconnection should be triggered when a single ping/pong gets a delay greater then the threshold.

surik · 2015-08-27T11:52:27Z

I've reworked PR

thz · 2015-08-27T12:04:08Z

src/hello_client.erl

@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) ->
 %% @hidden
 handle_info({?INCOMING_MSG, Message}, State) ->
    incoming_message(Message, State);
+handle_info(?PING, State = #client_state{ping_sent = true, keep_alive_interval = KeepAliveInterval}) ->
+    ?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]),


@skattelmann, please!

thz · 2015-08-27T17:13:21Z

src/hello_client.erl

@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) ->
 %% @hidden
 handle_info({?INCOMING_MSG, Message}, State) ->
    incoming_message(Message, State);
+handle_info(?PING, State = #client_state{waiting_for_pong = true, keep_alive_interval = KeepAliveInterval}) ->
+    ?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]),


"Client will be restarted"? Isn't it more the connection that will be reestablished?
For ~p msec, seems wrong too, because the argument is only the "KeepAliveInterval" and not elapsed time since last PING (now - last_ping). So when there was no PONG since 7 seconds, but the Interval is 5 seconds, it will incorrectly say "5 seconds", right?

@thz 1) we are restarting client process. if would be more accurate, we are stopping client process and supervisor will restart it if it possible. Maybe we should write "Client will be stopped" and have log message when client starts? 2) Why we should wait for 7 seconds if KeepAliveInterval is 5? When KeepAliveInterval is expired we stop process (actually, we try to send new PING message and realize that there is no PONG and we stop process) see hello_client.erl. Of course, these values more approximately but not like 5 and 7 seconds.

There are two things which are kind of irritating here:

The client process is restarted and not only the transport handler which is in our case the zmq or http client. The problem which should be fixed here is transport specific.

It would be nice to get to whole time we are waiting for a $PONG since the first $PING where we got no $PONG answer. Currently we only know that a $PING wasn't answered in the KeepAliveInterval. Of course this time can include several restarts of the transport client.

@skattelmann

Transport specific thing will happen in transport specific modules. When we are stopping client with error reason we call transport_terminate/2 and in zmq case it will close socket.

Do you mean that we need to save timestamp between client restarts just for logging it?

Maybe we should think about some metrics for avg/min/max time for ping-pong?

to 1) yes i know, but the whole client state is also dismissed =)
to 2) yes that would be nice. I know this sounds a bit over the top but knowing the time of a connection loss can be crucial in troubleshooting. If you just restart the client transport you could save such a timestamp in the client state and restart the transport without loosing it.

What do you exactly want to do with these metrics? Just log them? For the timeout we should define a certain 'KeepAliveTimeout' Interval or whatever to be clear about when the client reestablishes the connection.

@skattelman, I disagree little bit, that it is a job of hello to know, what time the connection was dropped. The connection should write some metric, like alive or not to exometer and some other aggregation system can do graph it and take the time, where it happens first time. There are ready and simple tools to do it, to not make from hello complex, where it possible to reuse existing tools for the same reasons.

"The connection should write some metric, like alive or not"
I thought this is atm not possible with ezmq? If it is possible I'm perfectly fine with your proposal.

so, a user of a tiny tiny tiny little hello-service would need the hello-service and then grafana (or the like) to get the very very tiny information for how long the other service did not answer?

really? is it such a hard problem to remember the last time the hello service has received an answer for a pong? we are talking about one 64bit value, at most.

@mgumz this is not about hard or easy to remember value. This is about what we need to do in this unexpected case: crash and restart or just reconnect socket. I see that second way is more popular :)

But I still think metrics for ping/pong are good although it out of this PR.

currently we are talking about what to write into the log-message...

surik · 2015-08-28T18:51:06Z

Add time of last received pong to log message and now we are reestablishing connection instead of restart client process

Reconnect client if no pong answer

liveforeverx reviewed Aug 25, 2015
View reviewed changes

surik force-pushed the restart_client_by_no_ping branch 3 times, most recently from f346149 to 0ad3b7b Compare August 27, 2015 11:48

thz reviewed Aug 27, 2015
View reviewed changes

surik force-pushed the restart_client_by_no_ping branch from 0ad3b7b to 1f3d21a Compare August 27, 2015 13:09

thz reviewed Aug 27, 2015
View reviewed changes

surik force-pushed the restart_client_by_no_ping branch from 1f3d21a to 909003e Compare August 28, 2015 17:38

surik changed the title ~~Restart client if no pong answer for a long time~~ Reconnect client if no pong answer Aug 28, 2015

Reconnect client if no pong answer

879c75a

surik force-pushed the restart_client_by_no_ping branch from 909003e to 879c75a Compare August 28, 2015 17:46

liveforeverx added a commit that referenced this pull request Aug 30, 2015

Merge pull request #64 from surik/restart_client_by_no_ping

cc52856

Reconnect client if no pong answer

liveforeverx merged commit cc52856 into travelping:master Aug 30, 2015

surik deleted the restart_client_by_no_ping branch December 17, 2015 16:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconnect client if no pong answer #64

Reconnect client if no pong answer #64

surik commented Aug 25, 2015

liveforeverx Aug 25, 2015

thz Aug 25, 2015

surik Aug 26, 2015

skattelmann Aug 26, 2015

thz Aug 26, 2015

surik commented Aug 27, 2015

thz Aug 27, 2015

thz Aug 27, 2015

surik Aug 28, 2015

skattelmann Aug 28, 2015

surik Aug 28, 2015

skattelmann Aug 28, 2015

liveforeverx Aug 28, 2015

skattelmann Aug 28, 2015

mgumz Aug 28, 2015

surik Aug 28, 2015

mgumz Aug 28, 2015

surik commented Aug 28, 2015

Reconnect client if no pong answer #64

Reconnect client if no pong answer #64

Conversation

surik commented Aug 25, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

surik commented Aug 27, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

surik commented Aug 28, 2015