Socket.io connects and disconnects only for some clients resulting in package lost / unreliability #4298

Open
aaronkchsu opened this issue Feb 27, 2022 · 11 comments
Labels
question Further information is requested

Comments

@aaronkchsu

aaronkchsu commented Feb 27, 2022

Describe the bug
We have a Node.js WebSocket server with 3k+ concurrent connections. A small segment of clients disconnects and reconnects every few seconds or minutes.

We trigger the connection by joining to a room.

        roomsOnline.add(socket.handshake.query.roomId);
        socket.join(socket.handshake.query.roomId);

We use AWS ELB and their support says the load balancer has no problem.

image

On my machine I stay connected with the exact same roomId.

However, his connection on the server end looks like this, where it disconnects and reconnects. When we try to send the client a message using io.to(roomId).emit(), it receives only a few of the messages instead of every message emitted.

image

We tested one of the clients that saw this behavior with a different browser source from another WebSocket app, and the messages reached his computer every time. The same client also had high-speed Spectrum internet.

To Reproduce

Our server settings - version 4.4.1

    allowEIO3: true, // false by default
    pingInterval: 9000,
    pingTimeout: 15000,

Server

const server = process.env.USE_HTTPS
    ? require("https").createServer(
          {
              cert: fs.readFileSync(process.env.TLS_CERT),
              key: fs.readFileSync(process.env.TLS_KEY),
          },
          app,
      )
    : require("http").createServer(app);

const { Server } = require("socket.io");

// https://stackoverflow.com/questions/48648555/understanding-socket-io-ping-interval-timeout-settings
const io = new Server(server, {
    // transports: ["polling", "websocket"],
    allowEIO3: true, // false by default
    pingInterval: 9000,
    pingTimeout: 15000,

    cors: {
        origin: "*",
        methods: [
            "GET",
            "POST",
            "OPTIONS",
            "HEAD",
            "PUT",
            "PATCH",
            "DELETE",
        ],
    },
});

Socket.IO client version: x.y.z

Client

    <script src="/socket.io/socket.io.js"></script>

    let socket = io({
          query: {
            roomId: last_part
          }
        });

Expected behavior
Expect all clients to stay connected

Platform:

  • Device:
  • CPU Name: AMD Ryzen 5 5600X 6-Core Processor
    CPU Speed: 3700MHz
  • OS: Windows, Chrome/OBS

Additional context
I added a disconnect protocol, which helped some international clients, but a few clients are still seeing this behavior.

Server
let disconnectCheck = setInterval(() => {
            socket
                .timeout(16000)
                .emit(
                    "heart_beat_check_server",
                    "beating",
                    { running: new Date() },
                    (err, payload) => {
                        console.log(
                            "HEART_BEAT_PAYLOAD",
                            payload,
                            socket.handshake.query.roomId,
                        );
                        if (
                            err &&
                            payload &&
                            payload.version &&
                            payload.version >= 3
                        ) {
                            // the client did not acknowledge the event in the given delay
                            console.error(
                                "HEARTBEAT_ERROR_TIMEOUT",
                                socket.handshake.query.roomId,
                            );
                            socket.disconnect(true);
                        } else {
                        }
                    },
                );
        }, 17000);

Client
socket.on("heart_beat_check_server", (_, __, callback) => {
    console.error("heart_beat_check_server_client");
    callback({ roomdId: last_part, version: browserVersion });
});

socket.on("disconnect", (reason) => {
    console.error("DISCONNECTED ", reason);
    if (reason === "io server disconnect" || reason === "io client disconnect" || reason === "ping timeout") {
        // the disconnection was initiated by the server, you need to reconnect manually
        socket.connect();
    }
    // else the socket will automatically try to reconnect
});
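
One thing worth double-checking in the server heartbeat above: as far as I understand the Socket.IO documentation, when `socket.timeout(...)` expires, the callback receives an error as its first argument and no payload. If that holds, a condition that requires both `err` and `payload.version` would never be true on a timeout. Below is a minimal sketch that handles the two cases separately. `handleHeartbeatAck` and the `supportsHeartbeat` flag are hypothetical names; the flag assumes the client version is known from somewhere other than the ack payload (for example, sent at connect time).

```javascript
// Hypothetical helper (not part of Socket.IO): decide what to do with the
// result of socket.timeout(ms).emit(event, ..., (err, payload) => { ... }).
// `socket` only needs a disconnect(close) method.
// `supportsHeartbeat` is an assumption: true when the client is known to
// implement the heartbeat handler (e.g. version >= 3 reported at connect time).
function handleHeartbeatAck(socket, err, supportsHeartbeat) {
    if (err && supportsHeartbeat) {
        // The client did not acknowledge in time; no payload is expected here.
        socket.disconnect(true);
        return "disconnected";
    }
    // Either the ack arrived, or the client is too old to answer heartbeats.
    return err ? "ignored-timeout" : "alive";
}
```

Inside the interval, the callback could then be `(err) => handleHeartbeatAck(socket, err, clientSupportsHeartbeat)`, with `clientSupportsHeartbeat` computed once per connection.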
@darrachequesne
Member

That sounds weird indeed.

Do you know the reason for the disconnection on the client side? Is any error thrown?

Is this specific to a given browser?

@aaronkchsu
Author

Not a specific browser; it happens on all their browsers, which leads me to believe it's a network issue?

We get a ping timeout, but it disappears after a few times, which makes me think the client might believe it's connected when it's not?

For separate clients we have seen a transport close error after a few hours, with no attempt to reconnect. Is that normal, and is there a way to fix or debug it?

Thanks so much @darrachequesne

@matiaslopezd

matiaslopezd commented Apr 5, 2022

+1! We're dealing with this too, but we don't know whether the cause is socket.io and/or the tab throttling feature in browsers. (Maybe there is a bug in Chromium-based browsers that drops WebSocket connections during long sessions.)

@aaronkchsu ping me, maybe we can gather info around this and share with @darrachequesne.

The most common error we detect is transport close, and the socket.io-client does not reconnect automatically.
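
For comparison, the Socket.IO client documentation describes automatic reconnection for every disconnect reason except `io server disconnect` and `io client disconnect`, so a `transport close` that is not followed by a reconnection attempt would be worth logging. A small sketch that encodes the documented expectation; the helper names `expectsAutoReconnect` and `describeDisconnect` are hypothetical and only meant for logging inside a `disconnect` handler.

```javascript
// Reasons after which, per the Socket.IO client docs, the client does not
// reconnect on its own and socket.connect() has to be called manually.
const MANUAL_RECONNECT_REASONS = new Set([
    "io server disconnect",
    "io client disconnect",
]);

// Hypothetical helper: true when automatic reconnection is the documented behavior.
function expectsAutoReconnect(reason) {
    return !MANUAL_RECONNECT_REASONS.has(reason);
}

// Hypothetical helper: build a log line for a "disconnect" event.
function describeDisconnect(reason) {
    return expectsAutoReconnect(reason)
        ? `disconnect (${reason}): automatic reconnection expected`
        : `disconnect (${reason}): manual socket.connect() required`;
}
```

Logging `describeDisconnect(reason)` on every `disconnect` event makes it easier to spot sessions where reconnection was expected but never happened.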

image

@hmeerlo

hmeerlo commented May 16, 2022

I'm also experiencing a similar issue where clients occasionally report a 'ping timeout' and the server side reports the corresponding transport close error. I debugged this by running tcpdump on the client side. What I observed is that the server very neatly sends the PING packet every 30s (my pingInterval; pingTimeout is 10s). But when it fails, the server has sent the PING packet too late (after 42s, which is more than pingInterval + pingTimeout). So it smells like a server-side issue to me.
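
The timing rule in that observation can be written as a tiny check, useful when scanning packet timestamps from a capture. This is only a sketch of the arithmetic; `pingArrivedTooLate` is a hypothetical helper name, and it assumes the client gives up once no PING has arrived within pingInterval + pingTimeout.

```javascript
// Hypothetical helper: given the gap between two consecutive PING packets
// (all values in milliseconds), report whether the client would already
// have declared a ping timeout.
function pingArrivedTooLate(gapMs, pingIntervalMs, pingTimeoutMs) {
    return gapMs > pingIntervalMs + pingTimeoutMs;
}

// Values from the capture described above: 30s interval, 10s timeout.
const normalGap = pingArrivedTooLate(30000, 30000, 10000); // within the 40s budget
const failingGap = pingArrivedTooLate(42000, 30000, 10000); // exceeds the 40s budget
```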

@johnfilo-kmart

johnfilo-kmart commented Aug 5, 2022

I have a similar problem with a Flask-SocketIO server app hosted behind an AWS ALB: the client socket session consistently disconnects every 26 seconds.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/eventlet/wsgi.py", line 573, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/flask_socketio/__init__.py", line 45, in __call__
    return super(_SocketIOMiddleware, self).__call__(environ,
  File "/usr/local/lib/python3.8/site-packages/engineio/middleware.py", line 60, in __call__
    return self.engineio_app.handle_request(environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/socketio/server.py", line 560, in handle_request
    return self.eio.handle_request(environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/engineio/server.py", line 374, in handle_request
    socket = self._get_socket(sid)
  File "/usr/local/lib/python3.8/site-packages/engineio/server.py", line 565, in _get_socket
    raise KeyError('Session is disconnected')
KeyError: 'Session is disconnected'

Client requests use /socket.io/?EIO=3&transport=polling&t=O9hzOSn&sid=xxxx

Server version of SocketIO is 4.6.0

@matiaslopezd

matiaslopezd commented Aug 5, 2022

This could be related to the timeout of the TCP connections in the proxy servers or load balancers in front of the app. We increased the TCP timeout and the stability of the connection improved, but not to 100%.

In fact, the ping/pong heartbeat is designed not only to check the connection between server and client but also to keep the connection alive and avoid proxy timeouts, which default to 60 seconds on proxies. So, check that your ping interval is lower than that value.

Nginx timeout, AWS Load Balancer.

Also, check this post: https://blog.martinfjordvald.com/websockets-in-nginx.
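
As a rough sanity check of that advice, the comparison between the heartbeat interval and the proxy idle timeout can be sketched as below. `heartbeatOutpacesProxy` is a hypothetical helper name, and the 60-second figure is just the default mentioned above; use whatever your proxy or load balancer is actually configured with.

```javascript
// Hypothetical helper: true when a heartbeat is sent more often than the
// proxy's idle timeout, so the connection never looks idle to the proxy.
function heartbeatOutpacesProxy(pingIntervalMs, proxyIdleTimeoutMs) {
    return pingIntervalMs < proxyIdleTimeoutMs;
}

// A 9s pingInterval against a 60s proxy idle timeout.
const shortInterval = heartbeatOutpacesProxy(9000, 60000);
// A 30s pingInterval against a 30s idle timeout leaves no margin.
const noMargin = heartbeatOutpacesProxy(30000, 30000);
```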

@johnfilo-kmart

The idle timeout was set to 30 seconds on the load balancer in AWS. I increased it to 90 seconds this morning to see whether it played a role, and the consistent ~26-second session disconnects remained unchanged.

@johnfilo-kmart

As far as how SocketIO is configured in this server app, here is the code (it's quite vanilla):

#!/usr/bin/python
# coding: utf-8

from flask import Flask
from flask_socketio import SocketIO
import os

from Config import ServerConfig

from main.Utils import ServerLogger

socketio = SocketIO()

gLogger = ServerLogger()

from dbinterface.DatabaseInterface import DatabaseInterface
gDatabaseInterface = DatabaseInterface()

from settings.Settings import SettingsManager
gSettingsManager = SettingsManager()

from robots.RobotsManager import RobotsManager
gRobotsManager = RobotsManager()

from users.UserManager import UserManager
gUserManager = UserManager()

def create_app(config_class=ServerConfig):
    app = Flask(__name__)
    app.config.from_object(config_class)
    app.config.from_envvar("SERVER_CONFIG_FILE")
    if not os.path.exists(app.config["LOG_PATH"]):
        os.makedirs(app.config["LOG_PATH"])
    gLogger.setLogPath(app.config["LOG_PATH"])
    gLogger.log("Creating App")
    socketio.init_app(app)
    registerBlueprints(app)
    gDatabaseInterface.init_interface(app)
    gRobotsManager.loadRobots()
    gUserManager.loadUsers()
    gUserManager.setSecretKey(app.config["SECRET_KEY"])
    gSettingsManager.loadSettings()
    gLogger.log("Init done")
    return app

@johnfilo-kmart

And for the client:

let ServiceModule = angular.module('ServiceModule', []);
ServiceModule.service('Socket', function($rootScope) {
    var socket = io.connect();
    return {
        on: function(eventName, callback) {
            socket.on(eventName, function() {
                var args = arguments;
                $rootScope.$apply(function() {
                    callback.apply(socket, args);
                });
            });
        },
        emit: function(eventName, data, callback) {
            if (typeof data == 'function') {
                callback = data;
                data = {};
            }
            socket.emit(eventName, data, function() {
                var args = arguments;
                $rootScope.$apply(function() {
                    if (callback) {
                        callback.apply(socket, args);
                    }
                });
            });
        },
        emitAndListen: function(eventName, data, callback) {
            this.emit(eventName, data, callback);
            this.on(eventName, callback);
        }
    };
});

@matiaslopezd

@johnfilo-kmart Have you considered the wake-up throttling in Chrome? In fact, it is an active bug; see: https://bugs.chromium.org/p/chromium/issues/detail?id=1224672&q=websocket&can=2

You need to apply a technique to avoid this.

@johnfilo-kmart

Ooh, no I didn't know about that. I'll look into it.
