Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gets disconnected when recieving HIGH number of msgs per second #786

Closed
pssolanki111 opened this issue Feb 4, 2022 · 15 comments
Closed

Comments

@pssolanki111
Copy link

Hey There,

Appreciate all the work creating this amazing client implementation.

I used this to write the callback based streaming client in my polygon API package and it has been doing very good work.

A few days back, one of the users of my lib reported disconection issues when the number of msgs being recieved per second was high. I made sure they weren't doing any heavy lifting in the handler itself. In fact they were just pushing messages to a redis queue from threads

Now on to some pointers,

  • I'm aware of the FAQ on why the lib is slow. about that though, we are not sending more than a couple of msgs. All the volume is being received. Further, I keep utf-8 validation disabled.
  • another accompanying issue is the fact that implementing a reliable reconection mechanism for this client hasn't been the easiest thing to do. Every approach has some quirks (unclosed clients, recursive calls from close handler, messed up threading to name a few)

What could be the reasons behind why it's not able to process the messages quickly enough and what could be done to resolve it?

further have we found a reliable and robust way to reconnect in case it does disconnect? @engn33r

Thanks : )

@engn33r
Copy link
Collaborator

engn33r commented Feb 7, 2022

I don't see any error messages related to your issue which limits how much help I can provide. To handle large numbers of connections on Linux, which is probably what is used if redis is involved, you should make sure the value of ulimit -n is large, which is described in #607. That same issue mentions creating a custom dispatcher, which is what I would suggest for handling the reconnection process. An easier way to handle the reconnections is using rel, which is in the README and was the best solution I found for #580. The threading doc mentions other ideas.

@pssolanki111
Copy link
Author

To handle large numbers of connections on Linux, which is probably what is used if redis is involved,

No it happens on windows as well (yes redis can run on windows) as *nix OS. Windows doesn't impose a file dscriptor limit on processes.

I don't see any error messages related to your issue which limits how much help I can provide

Oh yeah My bad. I had the error messages copied, forgot to append to msg lol. here it is (i only happen to have a picture at the moment, it's a ping/pong time out)
Screen_Shot_2022-01-26_at_2 46 39_PM

An easier way to handle the reconnections is using rel

Yeah I have followed the thread about multiple reconnection mechanisms and found rel at the end a month ago.

@engn33r
Copy link
Collaborator

engn33r commented Feb 7, 2022

Are you setting a ping interval and ping timeout in your code? Can you share the lines of code where you set these values?

The error looks like the message from the server is delayed and the ping/pong timeout ends the connection. The delay might be happening on your local system if it cannot keep up with the high amount of server traffic sent to it. Modifying the ping/pong timeout to allow for a greater delay should help.

@pssolanki111
Copy link
Author

Are you setting a ping interval and ping timeout in your code

yes 21 and 20 seconds (delay and timeout respectively)

Can you share the lines of code where you set these values?

These are set in the library. HERE

self.WS there is just a WebSocketApp instance, created HERE

The delay might be happening on your local system if it cannot keep up with the high amount of server traffic sent to it

Yes. that's what I mentioned in the title itself. For a test, I just printed whatever message came in inside the message handler. I don't think there is any more lightweight thing to do with the msg.

so I tried with nothing but a print statement, and it still disconnects with the same error.

Modifying the ping/pong timeout to allow for a greater delay should help.

What values do ya recommend instead of 21 and 20 seconds?

@engn33r

@pssolanki111
Copy link
Author

follow up

I tried the rel approach reusing the example linked above and it appeared to timeout after a while when no network was available

Traceback (most recent call last):
  File "/home/pssolanki/projects/tests/lib_test/test.py", line 35, in <module>
    rel.dispatch()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/rel.py", line 205, in dispatch
    registrar.dispatch()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/registrar.py", line 72, in dispatch
    if not self.loop():
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/registrar.py", line 81, in loop
    e = self.check_events()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/registrar.py", line 236, in check_events
    self.handle_error(fd)
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/registrar.py", line 131, in handle_error
    self.callback('read', fd)
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/registrar.py", line 125, in callback
    self.events[etype][fd].callback()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/rel/listener.py", line 108, in callback
    if not self.cb(*self.args) and not self.persist and self.active:
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_app.py", line 335, in read
    op_code, frame = self.sock.recv_data_frame(True)
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_core.py", line 396, in recv_data_frame
    frame = self.recv_frame()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_core.py", line 435, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_abnf.py", line 337, in recv_frame
    self.recv_header()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_abnf.py", line 293, in recv_header
    header = self.recv_strict(2)
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_abnf.py", line 372, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_core.py", line 519, in _recv
    return recv(self.sock, bufsize)
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_socket.py", line 113, in recv
    bytes_ = _recv()
  File "/home/pssolanki/projects/openSource/polygon/venv/lib/python3.8/site-packages/websocket/_socket.py", line 90, in _recv
    return sock.recv(bufsize)
  File "/usr/lib/python3.8/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/usr/lib/python3.8/ssl.py", line 1101, in read
    return self._sslobj.read(len)
TimeoutError: [Errno 110] Connection timed out

this is not very critical to me at the moment though.

@engn33r
Copy link
Collaborator

engn33r commented Feb 11, 2022

@bubbleboy14 can probably tell us whether this error is coming from websocket-client or rel. I can't tell at first glance.

On a different topic, a suggestion for bubbleboy14: if you are willing to migrate or clone the rel repository to GitHub, you probably would 1. get more feedback on the project 2. it could be easier to improve documentation and take PRs. I'd like to add more documentation around using rel for websocket-client but google code isn't used much anymore.

Are you setting a ping interval and ping timeout in your code

yes 21 and 20 seconds (delay and timeout respectively)

I don't know what the server timeout is for your APIs, but one idea is to set the ping timeout values closer to that. For example, ping_interval=50 and ping_timeout=49 if the server timeout is 60 seconds. If the server timeout is 5 minutes, try ping_interval=290 and ping_timeout=280, etc.

If you're dealing with large streams of network data, your network connection could be the bottleneck. In that case, reducing the amount of websocket data or improving your system's ability to handle large volumes of network data are solutions that are outside the scope of this library.

@bubbleboy14
Copy link
Collaborator

Hey @engn33r - ah yes, indeed! Yeah, I think there's a rel Google Code project from waaaay back in the day -- the github repo just isn't called rel for some reason -- don't remember why, but I called it registeredeventlistener, here:

https://github.com/bubbleboy14/registeredeventlistener

Anyway, @pssolanki111 - the traceback points to sock.recv(bufsize) on line 90 of _socket.py:

https://github.com/websocket-client/websocket-client/blob/master/websocket/_socket.py

So I think the TimeoutError (hit on line 1101 of ssl.py) is being caught on line 93 of _socket.py (in _recv(), called on line 113), and then raised anyway due to line 97 - line 97 seems a little off right? Assuming errno.EAGAIN and errno.EWOULDBLOCK are two different values, line 97 would evaluate to true (and therefore raise the exception) no matter what - maybe that should be an "and" instead of "or"? I mean, I think it's correct to raise it in this case, but I still think the logic of line 97 should be revisited.

Anyway, it looks like the error raised on line 98 is not caught by lines 114 or 117, possibly because line 98 simply reads "raise"? Not sure, but this may need to be replaced with "raise exc", possibly.

Also, it should be noted that the error in the screenshot correctly reported a WebSocketTimeoutException, which in that (non-rel) case is raised on line 366 of _app.py. I must confess that the check() function that contains line 366 is not utilized by the rel dispatcher.

So I can see three ways to correct this:

option 1) _socket.py

  • adjust line 97 (possibly replacing "or" with "and")
  • change line 98 (possibly to "raise exc"), which would probably get lines 114 or 117 (or other?) to raise proper WebSocketTimeoutException

option 2) your app -> lib_test/test.py

  • wrap rel.dispatch() in try/except for TimeoutError

option 3) rel integration

  • do something with check() timeout function
  • this actually shouldn't be necessary if we change lines 97/98 in _socket.py (which is already trying to detect these timeouts - the logic is just slightly off)

So I like option 1. Thoughts?

@engn33r
Copy link
Collaborator

engn33r commented Feb 11, 2022

Thank you bubbleboy14! I forgot there was a github repo and the search engine showed me the old google code link. I added a link in this project's README with commit e49e0e8 so it is easier to find in the future.

I will respond to the other details you shared later.

@bubbleboy14
Copy link
Collaborator

bubbleboy14 commented Feb 11, 2022

Awesome, thanks @engn33r!

I should note that although it's possible that the rel dispatcher facilitates a websocket-client reconnect mechanism, rel itself provides no such mechanism - it looked like you were hitting the TimeoutError in Python's core ssl library, a socket level timeout that would understandably fire at some point (we should probs handle this case by raising the correct [WebSocketTimeoutException] error as outlined in option 1, above).

The big selling point of rel (in my book / with regard to websocket-client) is that you can use it to run as many WebSocketApps as you want w/o resorting to (complications such as) threads. There's a pretty decent example of what this looks like here:

#389

Anyway, with regard to the auto-reconnect, the last time I tested this (by turning my wifi off and on), it worked fine. I'd be interested (@pssolanki111) in hearing how well things work on your end - LMK if you think any of the options mentioned above are worth pursuing!

Also, I would further note that a socket-level TimeoutError isn't necessarily something we'd want to "automatically" handle, beyond notifying the application (via a WebSocketTimeoutException, or possibly supporting a user-defined callback for it). I can imagine wanting to handle it in a variety of different ways, depending on the particulars of the app: perhaps an admin should be notified; perhaps reconnects should be attempted (with exponential backoff) up to a certain point; and after that point, should some other logic be triggered? Seems to me it largely depends on the app. But of course, I'm for considering all options.

Thoughts?

@pssolanki111
Copy link
Author

hey @bubbleboy14

highly appreciate the inputs. Didn't know you maintained rel. That's amazing.

Anyway, with regard to the auto-reconnect, the last time I tested this (by turning my wifi off and on), it worked fine

so i tested this in a few ways.

  • start stream -> turn off system wifi - messages stop -> turn wifi back on after a short time -> messages start coming in again.
  • start stream -> turn off system wifi - messages stop -> keep wifi off for a while (like a minute or more) - the above timeout is hit
  • start stream -> keep system wifi on but turn of network connectivity (from router or mobile hotspot) - messages stop -> turn connectivity back on -> messages do not start streaming again

@bubbleboy14
Copy link
Collaborator

bubbleboy14 commented Feb 13, 2022

Ah, I see. I made a couple tweaks to address these issues - check it out:

#795

I adjusted lines 95-97 in _socket.py, which actually may not be related but is nevertheless (I think) a logical improvement (correct me if I'm wrong @engn33r). Also, _socket.recv() now catches TimeoutErrors, which are now expressed (for consistency) as WebSocketTimeoutExceptions.

On the purely custom dispatcher side of things, I also added a WrappedDispatcher to _app.py, which incorporates support for ping_interval/ping_timeout (by way of check() function in _app), and also restores pyevent compatibility (and simplifies rel integration - rel.safe_read() is no longer required).

Also, I noticed that 4 checks failed on the pull request -- do you know what that's about @engn33r?

I tested in the first two (of the three) ways you mentioned @pssolanki111:

  1. same result
  2. same result, but now WebSocketTimeoutException
  3. didn't wanna mess with router (housemates) - lemme know if it works for you!

Anyway, I think that in the second and third cases, you'll now get a WebSocketTimeoutException, which your app can use to trigger whatever logic is appropriate -- switch to a different websocket feed, notify admins, attempt to resume after some period, etc.

LMK what yall think!

@pssolanki111
Copy link
Author

I think the changes sound good. some checks were unsuccessful on the pull but the changes seem good.

@engn33r
Copy link
Collaborator

engn33r commented Feb 25, 2022

Thanks for the detailed analysis @bubbleboy14! The git history shows that logic bug you found on line 97 of _socket.py has been there since 2011, so very nice catch. My understanding of the dispatcher code isn't as thorough as it should be, so thanks for the PR introducing the WrappedDispatcher.

The failing checks were mostly due to some public WebSocket endpoint URLs used in tests going offline, so I replaced those. The only failing check remaining is the code coverage check, which can be handled later.

One point you bring up which should get fixed later is the bare raise, which is worse than even raising a generic exception with raise Exception. I should go through the code and add specific exceptions for every bare raise that exists. That could help a lot with future troubleshooting and may expose other bugs that are hiding in the code.

@bubbleboy14
Copy link
Collaborator

Right on! Hey @engn33r, I should mention that the docs should be updated in a couple places:

https://github.com/websocket-client/websocket-client#long-lived-connection

and https://websocket-client.readthedocs.io/en/latest/examples.html#dispatching-multiple-websocketapps

Basically, "rel.safe_read()" is no longer necessary - now that we have the WrappedDispatcher, rel and pyevent will work fine without it.

@engn33r
Copy link
Collaborator

engn33r commented Mar 11, 2022

Good catch again, I removed the rel.safe_read() lines in commit f0bf03d and confirmed the examples work without that line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants