Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

websocket_recv_nb fails under heavy load but websocket_recv is fine #1716

Open
kylemacfarlane opened this issue Jan 12, 2018 · 9 comments
Open

Comments

@kylemacfarlane
Copy link

kylemacfarlane commented Jan 12, 2018

If you put a uWSGI websocket server using websocket_recv_nb as shown in tests/websockets_chat.py under heavy load then a couple errors start appearing. On the other hand using websocket_recv inside gevent.spawn and applying gevent monkeypatching as shown in tests/websockets_chat_2.py appears to work fine.

First of all I don't think it's caused by load per se. I think the problem is due to messages arriving at the same time which just happens more frequently under heavy load. To recreate the issue I hit the server with two messages at the same time every 100ms.

First Symptom

You will encounter "delayed" messages as reported in #1241. To solve this you can simply iterate over websocket_recv_nb until it returns None.

Second Symptom

You will start getting this error when reading from the websocket:

no PONG received in 3 seconds !!!

This was reported in #1533. How can there be no PONG response when the client and server are talking at 20 messages per second? Even aggressively reading from the websocket as much as possible doesn't solve the problem.

Third Symptom

You will start getting this error when reading from Redis:

OSError: unable to fd 12 to the event queue
epoll_ctl(): Bad file descriptor [core/event.c line 638]

This was reported on StackOverflow and the uWSGI maillist:

Other Errors

If you start playing around there are other errors that can happen. For example if you suppress the OSError from Redis then next time you try to read the websocket you will get an fd error from that as well. But the above are the three errors you will most commonly encounter if you copy the examples in the tests directory: https://github.com/unbit/uwsgi/tree/master/tests

Testing

I experimented with various settings and the following are the outcomes. I didn't try asyncio because the docs say it's experimental.

  • uWSGI 2.0.15
  • uwsgi.wait_fd_read based on tests/websockets_chat_async.py
  • Run with --async 100 --ugreen
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 0-2 minutes

.

  • uWSGI 2.0.15
  • uwsgi.async_sleep instead of waiting for any fds
  • Run with --async 100 --ugreen
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 0-2 minutes

.

  • uWSGI 2.0.15
  • gevent.select based on tests/websockets_chat.py
  • Run with --gevent 100 --gevent-monkey-patch
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 5-15 minutes

.

  • uWSGI master (f74553d)
  • uwsgi.wait_fd_read based on tests/websockets_chat_async.py
  • Run with --async 100 --ugreen
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 0-2 minutes

.

  • uWSGI master (f74553d)
  • gevent.select based on tests/websockets_chat.py
  • Run with --gevent 100 --gevent-monkey-patch
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 0-2 minutes
  • Worse than gevent on 2.0.15 because it seems to have other bugs and will encounter other random errors such as:
    • KeyError: 'env'
    • SystemError: <built-in function uwsgi_gevent_request> returned a result with an error set

.

  • uWSGI master (f74553d)
  • gevent.spawn and request_context loosely based on tests/websockets_chat_2.py
  • Use non-blocking methods uwsgi.websocket_recv_nb and redis.get_message combined with gevent.sleep
  • Run with --gevent 100 --gevent-monkey-patch
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ❌Connection usually fails within 0-2 minutes

.

  • uWSGI master (f74553d)
  • gevent.spawn and request_context loosely based on tests/websockets_chat_2.py
  • Use blocking methods uwsgi.websocket_recv and redis.listen and hope that gevent monkeypatching prevents any actual blocks
  • Run with --gevent 100 --gevent-monkey-patch
  • Hit it with 2 requests as close as possible every 100ms (equivalent to 20rps)
  • ✔ Connection seems to survive but is it truly async?

.

  • Combine the two previous tests that used gevent.spawn and request_context
  • "Blocking" uwsgi.websocket_recv + gevent monkeypatch with non-blocking redis.get_message + gevent.sleep
  • ✔ Connection seems to survive but is it truly async?
  • Non-blocking uwsgi.websocket_recv_nb + gevent.sleep with "blocking" redis.listen + gevent monkeypatch
  • ❌Connection usually fails within 5-15 minutes
  • Something seems to be wrong with websocket_recv_nb.
@kylemacfarlane
Copy link
Author

Bonus: this seems to have also fixed #1602 (which may be the same as #1623).

#1602 is about sending single large responses over websocket_send. At first glance that seems like the exact opposite problem of receiving multiple small requests but it's easy to see how it can become the same.

A large response that takes a long time to send is more likely to overlap with other requests or responses. Thus again it's probably not heavy load per se but multiple messages being received or sent at almost the same time which causes all these bugs.

@kylemacfarlane
Copy link
Author

I noticed in #1602 I said it only happens behind nginx. That's not true for this bug (#1716) which happens with and without nginx.

@marc1n
Copy link

marc1n commented Nov 15, 2018

In above (failed) tests did you run uWSGI with --http, --http-socket or --socket?

@kylemacfarlane
Copy link
Author

In the tests I was hitting uWSGI directly so it would most likely have been --http.

In production it's behind nginx and uses --socket.

The problem appears in both places when using websocket_recv_nb.

@marc1n
Copy link

marc1n commented Nov 16, 2018

How did you run the tests? Could you share a testing script?

@kylemacfarlane
Copy link
Author

The problem is at least partially repeatable with the demos in the tests folder.

Using this demo: https://github.com/unbit/uwsgi/blob/master/tests/websockets_chat_async.py

First you need to update it for current versions of uwsgi (change websocket_handshake) and then add a setInterval to repeatedly send the message: https://gist.github.com/kylemacfarlane/841efeec92d63d2396dc0cab0d2a4e85

Note that for this demo I had to set the interval to 5ms / 200rps whereas in my real project 50ms / 20rps is enough. Probably something to do with message sizes and/or network latency.

Run it with:

uwsgi --shared-socket 0.0.0.0:443, --https =0,cert,key --http-websockets --wsgi-file websockets_chat_async.py --async 100 --ugreen

Access the page and type something in the message box so that it starts sending. Within a few minutes (perhaps even seconds) the websocket connection will drop with Bad file descriptor either from wait_fd_read or uwsgi_response_write_body_do.

@marc1n
Copy link

marc1n commented Nov 19, 2018

I am interested in gevent version so I run server app:
uwsgi --http :12345 --http-websockets --wsgi-file websockets_chat.py --gevent-monkey-patch --gevent 100
and in Chrome browser in console I run:
setInterval(invia, 5)

I don't observe any errors after some minutes of running this test (over 10000 websocket messages so far)...

@marc1n
Copy link

marc1n commented Nov 19, 2018

I run this version of uWSGI:

*** Starting uWSGI 2.0.17.1 (64bit) on [Mon Nov 19 10:16:02 2018] ***
compiled with version: 5.4.0 20160609 on 06 September 2018 13:40:59
os: Linux-4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018

@kylemacfarlane
Copy link
Author

gevent.select on stable is the hardest method to make fail. In the tests above I noted that it takes 5-15 minutes at 20rps over the real network so in a simple test on the same computer you'd probably have to leave it running for ages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants