
Reconnect client if no pong answer #64

Merged (1 commit, Aug 30, 2015)

Conversation

surik
Contributor

@surik surik commented Aug 25, 2015

If there is no pong answer to a ping, the client connection will be reestablished.

Related to #62

@@ -165,12 +166,17 @@ handle_call(terminate, _From, State) ->
%% @hidden
handle_info({?INCOMING_MSG, Message}, State) ->
incoming_message(Message, State);
handle_info(?PING, State = #client_state{ping_counter = PingCounter}) when PingCounter >= 3 ->
?LOG_ERROR("There is no PONG answer on PING for a long time. Client will be restarted.", []),
Contributor

@ContentSafe Please check what additional information (meta fields) or rewording is needed here.

Contributor

"for a long time" - seriously? can we please be more specific?
it seems you are just pretending time-measurement but instead you are counting missed ping/pong.
how about being precise and honest in one go? keep track of the timestamp of the last successful ping/pong and compare that to now(). you will have a precise time-period to expose for the log-reader and the tolerance can be configured directly as a time-span (instead of some combination of PingCounter and PingInterval).
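A minimal sketch of this timestamp-based approach (the last_pong and pong_timeout fields and the reconnect/send_ping helpers are hypothetical names for illustration, not the actual hello_client API):

```erlang
%% Sketch: on every ping tick, compare now() against the timestamp of the
%% last received PONG instead of counting missed pings. last_pong,
%% pong_timeout, reconnect/1 and send_ping/1 are illustrative names.
handle_info(?PING, State = #client_state{last_pong = LastPong,
                                         pong_timeout = PongTimeoutMs}) ->
    %% timer:now_diff/2 returns microseconds; convert to milliseconds.
    ElapsedMs = timer:now_diff(os:timestamp(), LastPong) div 1000,
    case ElapsedMs >= PongTimeoutMs of
        true ->
            ?LOG_ERROR("No PONG answer for ~p ms. Reconnecting.", [ElapsedMs]),
            reconnect(State);
        false ->
            send_ping(State),
            {noreply, State}
    end;
```

This exposes the actual elapsed time in the log message, and the tolerance becomes a single configurable time span.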

Contributor Author

@thz In this log message I wanted to be clearer: we sent PING 3 times and didn't receive any answer. For me, 3 * keep_alive_interval is a long time to wait for one PONG answer.
Of course, we could be more specific and keep a timestamp for the last PONG message (if there was one).
And of course, we could get rid of the PingCounter and PingInterval combination and have something like a PingNoAnswerInterval option, which parametrizes the time span in seconds after which to shut down if there is no PONG answer. But, in my opinion, PingCounter is more the OTP way (like a supervisor restart policy).
Maybe I don't understand you 100%; can you give me more examples?


@surik At least the time hello_client was waiting for a $PONG should be logged when the PingCounter is >= 3. Assuming the $PING is sent approximately right away by the client, the resulting time measurement 3 * keep_alive_interval should be fine. Something like "Ping request to 'server IP' timed out after 600s. Client will be restarted." is imho ok. Additional meta fields like the server IP or the name of the PID (if named) would be very nice, but this stuff is atm also not part of the other logging (see hello_log.hrl). We should fix that in the future.

Contributor

I do not see any added value in doing 3x ping instead of doing one. A reconnection should be triggered when a single ping/pong gets a delay greater than the threshold.

@surik surik force-pushed the restart_client_by_no_ping branch 3 times, most recently from f346149 to 0ad3b7b on August 27, 2015 11:48
@surik
Contributor Author

surik commented Aug 27, 2015

I've reworked the PR.

@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) ->
%% @hidden
handle_info({?INCOMING_MSG, Message}, State) ->
incoming_message(Message, State);
handle_info(?PING, State = #client_state{ping_sent = true, keep_alive_interval = KeepAliveInterval}) ->
?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]),
Contributor

@skattelmann, please!

@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) ->
%% @hidden
handle_info({?INCOMING_MSG, Message}, State) ->
incoming_message(Message, State);
handle_info(?PING, State = #client_state{waiting_for_pong = true, keep_alive_interval = KeepAliveInterval}) ->
?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]),
Contributor

"Client will be restarted"? Isn't it more the connection that will be reestablished?
For ~p msec, seems wrong too, because the argument is only the "KeepAliveInterval" and not elapsed time since last PING (now - last_ping). So when there was no PONG since 7 seconds, but the Interval is 5 seconds, it will incorrectly say "5 seconds", right?

Contributor Author

@thz 1) We are restarting the client process. To be more accurate, we are stopping the client process and the supervisor will restart it if possible. Maybe we should write "Client will be stopped" and add a log message when the client starts? 2) Why should we wait for 7 seconds if KeepAliveInterval is 5? When KeepAliveInterval expires we stop the process (actually, we try to send a new PING message, realize that there was no PONG, and stop the process); see hello_client.erl. Of course, these values are approximate, but not as far apart as 5 and 7 seconds.


There are two things which are kind of irritating here:

  1. The client process is restarted, and not only the transport handler, which in our case is the zmq or http client. The problem which should be fixed here is transport specific.
  2. It would be nice to get the whole time we have been waiting for a $PONG since the first $PING that got no $PONG answer. Currently we only know that a $PING wasn't answered within the KeepAliveInterval. Of course this time can include several restarts of the transport client.

Contributor Author

@skattelmann

  1. Transport-specific things happen in the transport-specific modules. When we stop the client with an error reason, we call transport_terminate/2, and in the zmq case it will close the socket.
  2. Do you mean that we need to save a timestamp between client restarts just for logging it?

Maybe we should think about some metrics for avg/min/max ping-pong time?


To 1) Yes, I know, but the whole client state is also dismissed =)
To 2) Yes, that would be nice. I know this sounds a bit over the top, but knowing the time of a connection loss can be crucial in troubleshooting. If you just restart the client transport, you could save such a timestamp in the client state and restart the transport without losing it.

What exactly do you want to do with these metrics? Just log them? For the timeout we should define a certain KeepAliveTimeout interval, or whatever, to be clear about when the client reestablishes the connection.
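The transport-only restart described above could look roughly like this (transport_terminate/2 is mentioned in this thread; transport_connect/1 and the record field names are hypothetical):

```erlang
%% Sketch: tear down only the transport and reconnect, keeping the rest
%% of the client state (including a last_pong timestamp) intact.
%% transport_connect/1 and the field names are illustrative, not the
%% actual hello callback API.
reconnect(State = #client_state{transport_mod = Mod,
                                transport_state = TransportState}) ->
    ok = Mod:transport_terminate(reconnect, TransportState), %% closes the socket
    {ok, NewTransportState} = Mod:transport_connect(State),
    %% last_pong survives because only the transport fields are replaced
    {noreply, State#client_state{transport_state = NewTransportState}}.
```

Because the client process stays alive, the timestamp of the connection loss is preserved across transport restarts without any external tooling.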

Contributor

@skattelman I disagree a little bit that it is hello's job to know at what time the connection was dropped. The connection should write some metric, like alive or not, to exometer, and some other aggregation system can graph it and determine when it first happened. There are ready-made, simple tools to do this, so we should not make hello complex where it is possible to reuse existing tools for the same purpose.


"The connection should write some metric, like alive or not"
I thought this is atm not possible with ezmq? If it is possible I'm perfectly fine with your proposal.


So, a user of a tiny tiny tiny little hello-service would need the hello-service and then grafana (or the like) to get the very very tiny piece of information of how long the other service did not answer?

Really? Is it such a hard problem to remember the last time the hello service received a pong? We are talking about one 64-bit value, at most.

Contributor Author

@mgumz This is not about whether the value is hard or easy to remember. This is about what we need to do in this unexpected case: crash and restart, or just reconnect the socket. I see that the second way is more popular :)

But I still think metrics for ping/pong are a good idea, although they are out of scope for this PR.


currently we are talking about what to write into the log-message...

@surik surik changed the title Restart client if no pong answer for a long time Reconnect client if no pong answer Aug 28, 2015
@surik
Copy link
Contributor Author

surik commented Aug 28, 2015

Added the time of the last received pong to the log message, and we now reestablish the connection instead of restarting the client process.

liveforeverx added a commit that referenced this pull request Aug 30, 2015
@liveforeverx liveforeverx merged commit cc52856 into travelping:master Aug 30, 2015
@surik surik deleted the restart_client_by_no_ping branch December 17, 2015 16:45
5 participants