-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconnect client if no pong answer #64
Reconnect client if no pong answer #64
Conversation
@@ -165,12 +166,17 @@ handle_call(terminate, _From, State) -> | |||
%% @hidden | |||
handle_info({?INCOMING_MSG, Message}, State) -> | |||
incoming_message(Message, State); | |||
handle_info(?PING, State = #client_state{ping_counter = PingCounter}) when PingCounter >= 3 -> | |||
?LOG_ERROR("There is no PONG answer on PING for a long time. Client will be restarted.", []), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ContentSafe Please, check, what additional information(meta fields) or reword needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"for a long time" - seriously? can we please be more specific?
it seems you are just pretending time-measurement but instead you are counting missed ping/pong.
how about being precise and honest in one go? keep track of the timestamp of the last successful ping/pong and compare that to now(). you will have a precise time-period to expose for the log-reader and the tolerance can be configured directly as a time-span (instead of some combination of PingCounter and PingInterval).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thz in this log message I wanted to be more clearer: we sent PONG 3 times and didn't receive any answers. For me, 3 * keep_alive_interval
is a long time for one PONG answer.
Of course, we could more specific and have timestamp for last PONG message (if there was).
And of course, we could get rid of PingCounter and PIngInterval combination and have something like PingNoAnswerInterval option, which parametrizes time-span in seconds to shutdown if there no PONG answer within. But, in my opinion, PingCounter is more OTP way (like supervisor restart politic)
I maybe don't understand you on 100%, can you give me more examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@surik at least the time hello_client was waiting for a $PONG should be logged when the PingCounter is >=3. Assuming the $PING is sent approximately right away from the client the resulting time measurement 3 * keep_alive_interval should be fine. Something like "Ping request to 'server IP' timed out after 600s. Client will be restarted." is imho ok. Additional meta fields like server IP or the name of the PID (if named) would be very nice but this stuff is atm also not part of the other logging (see hello_log.hrl). We should fix that in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do not see any added value by doing 3x ping instead of doing one. a reconnection should be triggered when a single ping/pong gets a delay greater then the threshold.
f346149
to
0ad3b7b
Compare
I've reworked PR |
@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) -> | |||
%% @hidden | |||
handle_info({?INCOMING_MSG, Message}, State) -> | |||
incoming_message(Message, State); | |||
handle_info(?PING, State = #client_state{ping_sent = true, keep_alive_interval = KeepAliveInterval}) -> | |||
?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skattelmann, please!
0ad3b7b
to
1f3d21a
Compare
@@ -165,12 +166,16 @@ handle_call(terminate, _From, State) -> | |||
%% @hidden | |||
handle_info({?INCOMING_MSG, Message}, State) -> | |||
incoming_message(Message, State); | |||
handle_info(?PING, State = #client_state{waiting_for_pong = true, keep_alive_interval = KeepAliveInterval}) -> | |||
?LOG_ERROR("There is no PONG answer on PING for ~p msec. Client will be restarted.", [KeepAliveInterval]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Client will be restarted"? Isn't it more the connection that will be reestablished?
For ~p msec, seems wrong too, because the argument is only the "KeepAliveInterval" and not elapsed time since last PING (now - last_ping). So when there was no PONG since 7 seconds, but the Interval is 5 seconds, it will incorrectly say "5 seconds", right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thz 1) we are restarting client process. if would be more accurate, we are stopping client process and supervisor will restart it if it possible. Maybe we should write "Client will be stopped" and have log message when client starts? 2) Why we should wait for 7 seconds if KeepAliveInterval is 5? When KeepAliveInterval is expired we stop process (actually, we try to send new PING message and realize that there is no PONG and we stop process) see hello_client.erl. Of course, these values more approximately but not like 5 and 7 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two things which are kind of irritating here:
- The client process is restarted and not only the transport handler which is in our case the zmq or http client. The problem which should be fixed here is transport specific.
- It would be nice to get to whole time we are waiting for a $PONG since the first $PING where we got no $PONG answer. Currently we only know that a $PING wasn't answered in the KeepAliveInterval. Of course this time can include several restarts of the transport client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Transport specific thing will happen in transport specific modules. When we are stopping client with error reason we call transport_terminate/2 and in zmq case it will close socket.
- Do you mean that we need to save timestamp between client restarts just for logging it?
Maybe we should think about some metrics for avg/min/max time for ping-pong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to 1) yes i know, but the whole client state is also dismissed =)
to 2) yes that would be nice. I know this sounds a bit over the top but knowing the time of a connection loss can be crucial in troubleshooting. If you just restart the client transport you could save such a timestamp in the client state and restart the transport without loosing it.
What do you exactly want to do with these metrics? Just log them? For the timeout we should define a certain 'KeepAliveTimeout' Interval or whatever to be clear about when the client reestablishes the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skattelman, I disagree little bit, that it is a job of hello to know, what time the connection was dropped. The connection should write some metric, like alive or not to exometer and some other aggregation system can do graph it and take the time, where it happens first time. There are ready and simple tools to do it, to not make from hello complex, where it possible to reuse existing tools for the same reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The connection should write some metric, like alive or not"
I thought this is atm not possible with ezmq? If it is possible I'm perfectly fine with your proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, a user of a tiny tiny tiny little hello-service would need the hello-service and then grafana (or the like) to get the very very tiny information for how long the other service did not answer?
really? is it such a hard problem to remember the last time the hello service has received an answer for a pong? we are talking about one 64bit value, at most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgumz this is not about hard or easy to remember value. This is about what we need to do in this unexpected case: crash and restart or just reconnect socket. I see that second way is more popular :)
But I still think metrics for ping/pong are good although it out of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently we are talking about what to write into the log-message...
1f3d21a
to
909003e
Compare
909003e
to
879c75a
Compare
Add time of last received pong to log message and now we are reestablishing connection instead of restart client process |
Reconnect client if no pong answer
If no pong answer on ping client connection will be reestablished
Relate to #62