
"Session ID unknown" after handshake on high server load [Socket.io 1.0.6] #1739

Closed
Rundfunk opened this issue Aug 21, 2014 · 30 comments

@Rundfunk

I am running a multi-node server (16 workers running Socket.io 1.0.6, accessed via Nginx configured as a reverse proxy with sticky sessions) for ~5k users. While the load on the server is low (a load of 23 on a 20-core server, ~2k users), everyone is able to connect instantly. When the load gets higher (56 with ~5k users), new users are no longer able to connect and receive data instantly; in that case it takes 2~4 handshakes before a user connects successfully.

This is what happens (high load):

  • User opens the website; receives HTML and JS
  • User's browser attempts to initialize a socket.io connection to the server (io.connect(...))
  • A handshake request is sent to the server, the server responds with a SID and other information ({"sid":"f-re6ABU3Si4pmyWADCx","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000})
  • The client initiates a polling-request, including this SID: GET .../socket.io/?EIO=2&transport=polling&t=1408648886249-1&sid=f-re6ABU3Si4pmyWADCx
  • Instead of sending data, the server responds with 400 Bad Request: {"code":1,"message":"Session ID unknown"}
  • The client performs a new handshake (GET .../socket.io/?EIO=2&transport=polling&t=1408648888050-3, notice the previously received SID is omitted)
  • The server responds with new connection data, including a new SID: ({"sid":"DdRxn2gv6vrtZOBiAEAS","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":60000})
  • The client performs a new polling request, including the new SID: GET .../socket.io/?EIO=2&transport=polling&t=1408648888097-4&sid=DdRxn2gv6vrtZOBiAEAS
  • The server responds with the data that is emitted in the worker source code.

Depending on the server load, the server may respond with "Session ID unknown" 1~3 times, and the client has to perform a new handshake each time before data is actually received.
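
For readers who want a concrete picture of the setup described above, here is a minimal, hypothetical sketch of an Nginx reverse proxy using ip_hash sticky sessions in front of several Socket.IO workers (upstream name, ports, and worker count are placeholders, not the reporter's actual configuration). With long-polling, every request that carries a given sid must reach the worker that issued it, which is exactly what the stickiness is supposed to guarantee:

    # Hypothetical sticky-session sketch; names and ports are illustrative only.
    upstream socketio_workers {
        ip_hash;                    # pin each client IP to one worker so its SID stays valid
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
        # ... one entry per worker (16 in the setup described above)
    }

    server {
        listen 80;

        location /socket.io/ {
            proxy_pass http://socketio_workers;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;    # allow the WebSocket upgrade
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
        }
    }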

@danielcha

I have the same problem... Any solution?

@MTK-mcs

MTK-mcs commented Sep 10, 2014

Same problem; see #1503.

@rauchg
Contributor

rauchg commented Sep 25, 2014

How's the stickiness configured?

@runningskull

@Rundfunk @danielcha @guille I'm seeing the same thing: intermittent 502s on a high-load multi-node server. My sticky-session configuration is very similar to what's in balancerbattle.

Still investigating - will update this ticket if I find anything helpful.

@rauchg
Contributor

rauchg commented Nov 23, 2014

Keep us posted. Please try it with 1.2.1, as we've pushed some important fixes.

@runningskull

Killer - thanks for the quick response. I'll let you know.

@leroix

leroix commented Nov 23, 2014

Upgrading to v1.2.1 didn't fix it.

@rauchg
Contributor

rauchg commented Nov 24, 2014

Do you guys have any further debugging information? Could it be a problem in the stickiness logic?
The only way for {"code":1,"message":"Session ID unknown"} to be returned is if the SID is simply not in the in-memory data structure.

@runningskull

Nothing of note yet; we're still investigating. We can't replicate it in test environments, so debugging is tricky. I'll post here when/if we get something good.

@rauchg
Contributor

rauchg commented Nov 24, 2014

Btw, in your example this is interesting:

GET .../socket.io/?EIO=2&transport=polling&t=1408648888050-3

Notice the -3; does that mean 3 requests are being started in the exact same millisecond?

@rauchg
Contributor

rauchg commented Nov 24, 2014

(We append a counter to the Date.now() timestamp for higher precision.)

@runningskull

We finally figured this out. The root cause in our case:

  • When it receives a 5xx error, the nginx proxy will by default take the errant upstream out of rotation for 10 seconds
  • When upstream-A is unavailable, ip_hash routes all of A's requests to upstream-B instead
  • Unfortunately, when upstream-B gets the new requests, it (correctly) spits out 5xx errors because the SID is not found in this.clients
  • That takes upstream-B out of rotation as well, and its requests get routed to upstream-C
  • Recurse...

We solved it by changing the nginx max_fails to something more reasonable (and by upping the open file-descriptor limit for our app's user, which was a secondary failure point, exacerbated by the constant reconnects).
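
A hedged sketch of that mitigation, with illustrative numbers only (nginx defaults are max_fails=1 and fail_timeout=10s, which is why a single counted failure could sideline a worker for 10 seconds). Note that what counts as a failure here is governed by proxy_next_upstream, which ties in with the follow-up comment below:

    # Illustrative values; raise max_fails so a burst of bad responses does not
    # immediately take an otherwise healthy upstream out of rotation.
    upstream socketio_workers {
        ip_hash;
        server 127.0.0.1:3000 max_fails=50 fail_timeout=10s;
        server 127.0.0.1:3001 max_fails=50 fail_timeout=10s;
        # ...
    }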

@rauchg
Contributor

rauchg commented Nov 25, 2014

Wow, this is extremely useful feedback for others. Thanks a lot @runningskull.
What was the cause of the original 5xx error?

@rauchg rauchg closed this as completed Nov 25, 2014
@runningskull

I think the very original 5xx error that triggered the chain was just some code that responded 500 as a run-of-the-mill error instead of a more useful error code.

We'll probably do some more research soon into something like using hash instead of ip_hash, or other measures to increase robustness. If we discover anything generally useful about production setup, I'll be sure to report back. socket.io is great - happy to help in whatever tiny way I can ;)

EDIT: worth mentioning that we had set proxy_next_upstream to trigger the behavior above on 5xx errors. By default nginx only does this on connection/header errors. However, it seems fairly common (best practice?) to use proxy_next_upstream on 5xx errors; we've had similar issues in the past with PaaS vendors doing the same thing.
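
For context, a sketch of the directive being discussed; the commented-out line is the nginx default, and the active line is the 5xx-sensitive variant that enables the cascade described in the previous comment:

    location /socket.io/ {
        proxy_pass http://socketio_workers;   # hypothetical upstream name

        # nginx default: only connection errors and timeouts mark an upstream as failed
        # proxy_next_upstream error timeout;

        # the variant described above: 5xx responses are also retried on the next
        # upstream and counted as failures
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }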

@Sesshomurai

Hi, I am having this same problem with nginx, Node, and socket.io. There is a way for nginx to use 'sticky' session IDs passed along in the HTTP cookie that would solve it, but it's part of their commercial offering. I was hoping socket.io-redis would address this by storing the session ID in Redis and using it from another socket.io-redis-enabled node, but it doesn't work. Maybe this is something that could be made to work using the Redis adapter?

@aPoCoMiLogin

@runningskull +1, that explains a lot.

@p3x-robot

For me, it was with nginx ssl http2 and it was using polling, so the working config is:

 const ioSocket = io('', {
      // Send auth token on connection, you will need to DI the Auth service above
      // 'query': 'token=' + Auth.getToken()
      path: '/socket.io',
      transports: ['websocket'],
      secure: true,
    });
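
For completeness, a hypothetical sketch of the matching nginx ssl/http2 server block (certificate paths and upstream name are placeholders); the Upgrade/Connection headers are what let the websocket-only client above connect through the proxy:

    server {
        listen 443 ssl http2;
        ssl_certificate     /etc/nginx/certs/example.crt;   # placeholder paths
        ssl_certificate_key /etc/nginx/certs/example.key;

        location /socket.io/ {
            proxy_pass http://socketio_workers;              # hypothetical upstream name
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;          # WebSocket upgrade headers
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
        }
    }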

@flaviolivolsi

I had this problem hosting my project on Heroku when I switched to multiple dynos; I solved it by enabling sticky sessions with heroku features:enable http-session-affinity.

@ashish101184

Our application is hosted on AWS and we have sticky sessions, but we still have the "Session ID unknown" issue.

Please let me know if there is any update on this.

Socket.IO: 1.7.3
socket.io-redis: 5.2.0
Node: 7.10.0

@p3x-robot

socket.io is around 2.2 and the Node.js stable is v10 (we are using v12), and for socket.io we only use WebSockets.
See: #1739 (comment)

@ashish101184

ashish101184 commented Jul 5, 2019

So you think that upgrading Socket.IO and Node.js will resolve the problem?

@p3x-robot

You should explicitly use only WebSockets, as the comment says.

@p3x-robot

You are using polling. You can't have a sticky session with polling (or it gets overly complex). With a WebSocket connection, it opens once and stays open; with polling in a cluster, the socket does not know which worker it is on, so it either creates a new connection every time or fails.

@p3x-robot

As you can see, other people solved it this way as well; check the thumbs-up and hooray reactions.

@ashish101184

I am getting a failure with the Python socket client library, where Python always uses a polling request first, so the client never gets connected from Python.

Is there any solution for that?

@jaykumarthaker

jaykumarthaker commented Jan 15, 2020

DON'T FORGET TO CONFIGURE THE CLIENT AS WELL

Making just the Node.js backend use the websocket transport won't do much; socket.io clients also need to be set with the same configuration. So, in my opinion, the below should work:

in nodejs:

  const ioSocket = io('',  {
      transports: ['websocket',  'polling']
    });

and in js client:

  socket = io.connect(SocketServerRootURL, {
        transports:['websocket', 'polling']
    });

['websocket', 'polling'] will force socket.io to try websocket as the first protocol to connect with, and otherwise fall back to polling (in case some browsers/clients do not support WebSockets). For a cluster environment, it is better to use ['websocket'] only.

@mirkadev

Thanks @jaykumarthaker, it works for my cluster.

@over2000

over2000 commented Feb 8, 2023

For me, it was with nginx ssl http2 and it was using polling, so the working config is:

 const ioSocket = io('', {
      // Send auth token on connection, you will need to DI the Auth service above
      // 'query': 'token=' + Auth.getToken()
      path: '/socket.io',
      transports: ['websocket'],
      secure: true,
    });

This fixed the problem; with 'polling' my socket made too many requests.

@darrachequesne
Member

For future readers:

Please note that using transports: ['websocket'] disables HTTP long-polling, so there's no fallback if the WebSocket connection cannot be achieved (which might be acceptable or not, depending on your use case).

Reference: https://socket.io/docs/v4/client-options/#transports

@over2000 if HTTP long-polling makes too many requests, then that surely means something is wrong with the setup, like CORS. Please check our troubleshooting guide: https://socket.io/docs/v4/troubleshooting-connection-issues/

@mzahidriaz

This issue is mentioned in the Socket.IO documentation, which explains how to handle it in a load-balanced environment.
