As far as I understand, connection state recovery works like this, at a high level, when used with the redis-streams-adapter:
Events are stored in a stream, which is a list where every position has a certain offset (index). When an event is delivered to a client, the client keeps track of that event's offset. When a client reconnects, it sends its last processed offset to the server. The server checks whether the offset still exists in the stream. If it does, it reads all events from that offset to the end of the stream into memory. It then loops over the events and, for each one, decides whether it should be sent to this particular client (by checking the event against the rooms the client is a member of).
And this is the key part: the stream is global, meaning it contains all events emitted, not segmented by room or anything else. So if you are a client with a last processed offset X, the server will read and inspect every message emitted between X and the end of the stream, even if none of them will actually be sent to you. And it does this for each client in isolation; the events read are not cached for other reconnections to reuse, so each recovery performs its own `XRANGE {offset} +` command.
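To make that concrete, here is a rough sketch (TypeScript with ioredis) of what one recovery amounts to. This is only an illustration of the behaviour described above, not the adapter's actual code, and the field layout of the stream entries is an assumption:

```ts
import { Redis } from "ioredis";

const redis = new Redis();

// What one reconnecting client roughly costs: a full XRANGE from its last
// offset to the end of the stream, loaded into process memory before any
// room filtering happens. "socket.io" is the adapter's default stream name.
async function restoreForClient(lastOffset: string, clientRooms: Set<string>) {
  const entries = await redis.xrange("socket.io", lastOffset, "+");

  for (const [, fields] of entries) {
    // Stream entries come back as flat [field, value, ...] arrays.
    const event: Record<string, string> = {};
    for (let i = 0; i < fields.length; i += 2) event[fields[i]] = fields[i + 1];

    // Only at this point is the event checked against the client's rooms
    // (the "rooms" field name is an assumption for illustration).
    const rooms: string[] = JSON.parse(event.rooms ?? "[]");
    if (rooms.length === 0 || rooms.some((r) => clientRooms.has(r))) {
      // ...deliver the missed packet to this client
    }
  }
}
```

The important point is that the XRANGE and the in-memory loop above happen once per reconnecting client.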
Let's say we use a `maxDisconnectionDuration` setting of 30 seconds in the connection state recovery feature. This means that in the worst case, every client could force the server to read the last 30 seconds' worth of events into memory.
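For context, this is roughly how the setup in question looks; the option names are the documented ones, and the 30-second value is just the example figure used here:

```ts
import { Server } from "socket.io";
import { createClient } from "redis";
import { createAdapter } from "@socket.io/redis-streams-adapter";

const redisClient = createClient({ url: "redis://localhost:6379" });
await redisClient.connect();

const io = new Server(3000, {
  connectionStateRecovery: {
    // Clients reconnecting within this window get their missed events replayed.
    maxDisconnectionDuration: 30_000, // 30 seconds, as in the example above
  },
  adapter: createAdapter(redisClient),
});
```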
In our scenario, during peaks we usually see up to ~3000 events per 30 seconds. You could argue this is not a particularly huge number.
Based on observations from our production environment, each event occupies ~360 bytes of Redis memory on average. So in the worst case, the Socket.io server could read 3000 × 360 B ≈ 1 MB of events into memory for each client (assuming an event takes up roughly the same amount of memory in the application process, which might not be exactly true but seems like a reasonable simplification).
It is not hard to see how this becomes a problem if you have 1000+ active connections, as they will all reconnect at the same time when you perform a deployment, reload the ingress configuration, restart a host machine, or similar.
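Spelling out the worst-case arithmetic with the figures above (estimates from our environment, so treat the numbers as rough):

```ts
const eventsPerWindow = 3000;     // events seen per 30 s window at peak
const bytesPerEvent = 360;        // average Redis footprint per event (our measurement)
const reconnectingClients = 1000; // e.g. everyone reconnecting after a deployment

const perClientBytes = eventsPerWindow * bytesPerEvent;  // ~1.08 MB per recovery
const totalBytes = perClientBytes * reconnectingClients; // ~1.08 GB read at once

console.log((perClientBytes / 1e6).toFixed(2), "MB per client");
console.log((totalBytes / 1e9).toFixed(2), "GB across all recoveries");
```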
We would really like to use connection state recovery, but when we have tried it - even for a small portion of clients - the memory spikes are huge and easily cause the Socket.io server to run out of memory.
So what could be done about it? Perhaps the Socket.io server could stream events from the adapter rather than reading them all into memory at once? Alternatively - and this might be easier to implement - the adapter could cache events in memory for a certain amount of time, so that simultaneous connection recoveries share the same event objects rather than each holding their own copy of the stream segment?
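As a sketch of the second idea (purely illustrative; `fetchEventsSince` is a hypothetical helper, not part of the adapter's API), concurrent recoveries that start from the same offset could share a single read:

```ts
import { Redis } from "ioredis";

const redis = new Redis();

// Hypothetical deduplication: recoveries that arrive with the same offset
// within a short window reuse one XRANGE result instead of each issuing
// their own command and holding their own copy of the events.
const inFlight = new Map<string, Promise<[string, string[]][]>>();
const SHARE_WINDOW_MS = 5_000;

function fetchEventsSince(offset: string) {
  let result = inFlight.get(offset);
  if (!result) {
    result = redis.xrange("socket.io", offset, "+");
    inFlight.set(offset, result);
    // Keep the shared result around briefly for the burst of reconnections.
    setTimeout(() => inFlight.delete(offset), SHARE_WINDOW_MS);
  }
  return result;
}
```

In practice reconnecting clients will rarely share the exact same offset, so a real implementation would more likely cache the stream tail once and slice it per offset, but the idea of sharing the read and the event objects is the same.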