Changing the way timers work #48

Merged 5 commits into yrashk:master on Jun 17, 2011

2 participants


I'm opening this as a pull request so we can discuss whatever happens in my fork for this one.

Following OJ's problems, I've rewritten all our heartbeat timeouts and polling limits to be based on the Erlang timer BIFs rather than the 'Timeout' value returned in a gen_server tuple. This should generally be safer: any standard call to the transport process (including one to fetch the event manager and whatnot) would reset the gen_server timeout, causing heartbeats to be missed on the server side and prompting the client to reconnect for no reason.
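To illustrate the difference, here's a minimal sketch (module, record, and field names are hypothetical, not the actual transport code) of a gen_server heartbeat driven by erlang:start_timer/3 instead of the Timeout element of the return tuple:

```erlang
-module(heartbeat_sketch).
-export([init/1, handle_call/3, handle_info/2]).

-record(state, {heartbeat_interval = 10000, heartbeat_ref}).

%% Old approach: returning {ok, State, 10000} and waiting for the bare
%% 'timeout' message -- ANY incoming message resets that countdown.
%% New approach: an explicit timer reference that only we control.
init(_Args) ->
    State = #state{},
    Ref = erlang:start_timer(State#state.heartbeat_interval, self(), heartbeat),
    {ok, State#state{heartbeat_ref = Ref}}.

%% Unrelated calls (e.g. fetching the event manager) no longer
%% disturb the heartbeat timer.
handle_call(get_event_manager, _From, State) ->
    {reply, some_pid, State}.

%% The timer delivers {timeout, Ref, heartbeat}; re-arm it afterwards.
handle_info({timeout, Ref, heartbeat}, #state{heartbeat_ref = Ref} = State) ->
    NewRef = erlang:start_timer(State#state.heartbeat_interval, self(), heartbeat),
    {noreply, State#state{heartbeat_ref = NewRef}}.
```

With the explicit timer, a flood of ordinary calls leaves the heartbeat schedule untouched, which is exactly what the gen_server Timeout could not guarantee.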

I've also added a few default values (changed from infinity to 10s, to be in sync with the websocket transport), removed a few useless process spawns (they were only needed for the old timers anyway), and changed many apply/3 calls with a known arity to the Mod:Fun(Args) form, which should be faster.
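For reference, the apply/3-to-direct-call change looks like this (a sketch with a hypothetical module name; math:sqrt/1 just stands in for any callback with a known arity):

```erlang
-module(apply_sketch).
-export([via_apply/1, direct/1]).

%% apply/3 even though the arity is statically known:
via_apply(Mod) -> apply(Mod, sqrt, [4.0]).

%% The equivalent direct-call syntax; with the arity known at compile
%% time, the compiler can emit a plain external call instead of going
%% through the runtime apply machinery.
direct(Mod) -> Mod:sqrt(4.0).
```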

We still need people to check that I didn't break anything with the timers, and note that the close_timeouts used by the polling transports weren't changed because I have no idea which events should cancel them.

ferd added some commits Jun 15, 2011
@ferd ferd Adding required fields to the app file
Also fixing an error message when the value can't be retrieved.
@ferd ferd Fixing timers in websockets
The current way of doing timers in websockets is fragile and bound to fail because the timeout is reset by every message the process receives. Just by sending the process random messages, the timers get reset all the time.

This fix adds a different way of handling timers, using the erlang module's start_timer/3 and cancel_timer/1 to get safer timeout/heartbeat behaviour.
@ferd ferd Reworking the timers in transport_polling
The current way of handling timeouts is risky: user calls can mess with the timeouts and get things out of sync with the client side.

This fix replaces the timeouts coming from the gen_server behaviour with the erlang timer BIFs, which should be safer and independent of the messages received.

The timeouts having to do with a connection being closed are still
handled the old way.
@ferd ferd Fixing more timers
More timers and heartbeats fixed, including htmlfile and xhr multipart.
Nothing in the tests seemed to have an influence on the htmlfile
transport. Further tests ought to be added in time.

Note that, as with the other transports, the disconnection timeout still doesn't use a BIF timer, unlike the heartbeats, which now use the Erlang timer functions.

This completely stops xhr-multipart disconnect events from working.


Hm, interesting. I must have combined something wrong in there. I'll try to look into it and fix it. I don't think we had test cases for these disconnections, and I had no good use case myself. It's probably some wrong pattern matching on the events.

@jannschu jannschu commented on the diff Jun 16, 2011
@@ -237,3 +243,8 @@ send_message(Message, ServerModule, Req) ->
Message0 = binary_to_list(jsx:term_to_json(list_to_binary(Message), [{strict, false}])),
M = "<script>parent.s._(" ++ Message0 ++ ", document);</script>",
apply(ServerModule, chunk, [Req, M]).
+reset_heartbeat({TimerRef, Time}) ->

Is this function duplicated with a small change in the transports? (reset_duration, reset_interval)

ferd added a note Jun 16, 2011

Most of them, yes. I could likely generalize it by making the message to send a third element in a triple. I guess that would make more sense, although I'm not sure where the function would sit.

Maybe in a new socketio_transport module, but it would be a bit empty. I haven't looked closer; there may be more to generalize eventually.
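A possible shape for that generalization (hypothetical module and function names; the triple's third element carries the message to send, as suggested above):

```erlang
-module(timer_reset_sketch).
-export([reset/1]).

%% One helper instead of per-transport reset_heartbeat/1,
%% reset_duration/1 and reset_interval/1: cancel the running timer and
%% start a fresh one carrying whatever message this transport uses.
reset({TimerRef, Time, Message}) ->
    erlang:cancel_timer(TimerRef),
    NewRef = erlang:start_timer(Time, self(), Message),
    {NewRef, Time, Message}.
```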

ferd added a note Jun 16, 2011

Agreed. A lot of the handling of 'heartbeat vs. message' is surprisingly similar and could likely be abstracted away too. I'll look into that once we get it to work (I've seen your fix, I'll do it in a few minutes after lunch)

@jannschu jannschu and 1 other commented on an outdated diff Jun 16, 2011
@@ -189,10 +195,10 @@ handle_info({'EXIT',_Port,_Reason}, #state{ close_timeout = ServerTimeout} = Sta
handle_info(timeout, #state{ server_module = ServerModule,

Change timeout to {timeout, _Ref, heartbeat}?

ferd added a note Jun 16, 2011

Not sure. This is a different timer: that one stands in for the disconnection timeout, which shouldn't have changed. It should be coming from the handle_info clause that has {'EXIT', ..., ...} in it. I'll try it anyway just to see.

EDIT: No, you're right, this fixes it. There's some weird voodoo timer going on there that I'll try to clarify, but for now I'll make a new commit for that.

As I understand it, it's clear: previously the gen_server timeout feature was used, which sent the bare timeout message to handle_info. With erlang:start_timer, the {timeout, Ref, Msg} tuple is sent instead, and it's also handled by handle_info since it's not a regular gen_server call.

Isn't the disconnection timeout the same as the heartbeat? The only difference is that the decision whether it's a disconnect or a heartbeat is based on the connection status set by {initializaion, ...} and {'EXIT', ...}?

EDIT: Just found the line {noreply, State#state { connection_reference = {TransportType, none}}, CloseTimeout};. That's strange, indeed.

EDIT2: Another idea: the start_timer-generated message 'destroys' the gen_server timeout, which is used only once, in an after clause. So to correctly fix the behaviour, the timer should be set to close_timeout in {'EXIT', ...} and back to heartbeat for a new connection.
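The difference in message shape that EDIT2 relies on can be demonstrated in isolation (hypothetical module name):

```erlang
-module(timeout_shape_demo).
-export([run/0]).

%% erlang:start_timer/3 delivers {timeout, Ref, Msg} -- not the bare
%% 'timeout' atom that the gen_server Timeout feature delivers -- so a
%% handle_info/2 clause matching only 'timeout' will never see it.
run() ->
    Ref = erlang:start_timer(10, self(), heartbeat),
    receive
        {timeout, Ref, heartbeat} -> start_timer_message;
        timeout                   -> gen_server_style_message
    after 1000 ->
        no_message
    end.
```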

ferd added a note Jun 16, 2011

Ah, I think it was kind of a timer conflict. You've got two of them going on: one for the heartbeat, and one for the close timeout.

Because they previously used the same message, either one could interchangeably stand in for the heartbeat or the close timeout. However, when I made the change, I kept the close timeout the way it was while the heartbeat became a new message. So when the session was about to close and the close timeout came in (at maybe 8s), the heartbeat timer (at 10s) would pretty much always come in afterwards and cancel the close timeout, which couldn't happen before. With the fix I've done, I've just made it so I only listen to the heartbeat timeout. I need to fix it so the close timeout is a timer of its own and I only listen to that one.

The problem with that is that I have no idea what kinds of events can cancel the close timeout -- it just did so by itself before. The current behaviour is ugly, but likely less buggy than if I were handling the close timeout without knowing what the hell it should do.

Then again, I could use the same pattern twice, so that both the close timeout and the heartbeat are considered, whichever comes first. That way the close timeout can keep working the way it does right now, and the heartbeat keeps working the new way. This is probably the safest option, as it replicates the old behaviour.
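A sketch of that "whichever comes first" approach (hypothetical state record; the real transports carry more fields): the same heartbeat timeout message is matched twice, and the connection status decides whether it acts as a close timeout or as a regular heartbeat.

```erlang
-module(close_vs_heartbeat_sketch).
-export([handle_info/2]).

-record(state, {connection_reference}).

%% No live connection: a heartbeat timeout doubles as the close
%% timeout, replicating the old interchangeable behaviour.
handle_info({timeout, _Ref, heartbeat},
            #state{connection_reference = {_Type, none}} = State) ->
    {stop, normal, State};
%% Live connection: a heartbeat timeout just means "send a heartbeat".
handle_info({timeout, _Ref, heartbeat}, State) ->
    {noreply, State}.
```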

ferd added a note Jun 17, 2011

There, just amended. The behaviour I described seemed to be right. I've tested it on my own app, and before the amend, if you disconnected right after a heartbeat was sent, you could trigger the behaviour I described. So now I handle the two possibilities and everything looks fine. I've also added documentation in comments so people know what the hell is going on.


If it's pulled, I think I can improve the implementation (make it less hackish). But that's not urgent right now.
So: please pull :)

ferd added a note Jun 17, 2011

Yeah, once it's pulled I'd like to abstract a few things a bit myself (all the reset functions and the fake make_ref() and whatnot) and then find where the hell the disconnect timeout can be canceled, so that we can get clear semantics there rather than this messy thing we're doing right now. At least it should work now.

Did you read the edited notice above? 'EDIT2' is interesting here.

The gen_server timeout feature is also the reason for issue #45, I think. When a {send, ....} message is sent to the transport while it waits for a disconnect timeout, the gen_server timeout gets 'destroyed'.

I have a fairly clear idea of an implementation that is cleaner and correct (I hope). By correct I mean with respect to the close_timeout, which isn't used here. Once it's pulled, it would be easier for me to give it a try.

ferd added a note Jun 17, 2011

Yeah, it sounds like a smarter implementation of that one, so we do give a good logical time to reset things. I can try to fit it in; I'll think of some approach that keeps things clean enough (I tried to keep things opaque for each counter, but doing this means manually juggling a few values here and there). If you want to do it yourself, I can leave it to you too. I won't spit on someone else fixing things for me ;)


If you change line 212 in socketio_xhr_transport.erl, line 196 in htmlfile, and line 219 in polling, this will work and also solve issue #45, I think.

I'll try to find time over the weekend to write a test for this.

@ferd ferd Fixing the timeout events on transports.
This fixes the timer issues caused by the Erlang BIF timers conflicting with the connection timeouts that occur after a closed socket.

Any client whose connection is closed at the point where the 'dead
connection timeout' is longer than the polling/heartbeat timer's time
left will have conflicting timers where the polling/heartbeat resets the
dead connection timeout.

Given that the 'dead connection timeout' and the regular polling/heartbeat timeout can be very close (and are set that way by default), this conflict happens fairly often.

Because we are not exactly sure what makes it valid to cancel the dead connection timer, this fix handles both timers the same way, replicating for these connections the behaviour present before the recent timer fixes.
@ferd ferd merged commit 4bff2ef into yrashk:master Jun 17, 2011