nginx + sanic problems #2531

Closed
oren-nonamesecurity opened this issue Aug 18, 2022 · 12 comments

@oren-nonamesecurity

Describe the bug

Hi,
I'm using nginx as a reverse proxy to handle TLS and route to multiple applications on the same server.
My Sanic app is the main route that handles most of the traffic.
The traffic is pretty high (~5000 requests per second, on 16 workers) and I get random errors in the nginx log every few minutes. The errors fall into three types:

2022/08/18 12:05:20 [error] 42#42: *34037 writev() failed (32: Broken pipe) while sending request to upstream, client: 10.100.13.56, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:05:40 [error] 42#42: *35312 upstream prematurely closed connection while reading response header from upstream, client: 10.100.13.82, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:10:52 [error] 40#40: *55659 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.100.13.66, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"

On the Sanic side there's nothing in the logs.
access_log=True and debug=True didn't show any errors or anything suspicious, and verbosity=5 still showed nothing relevant.

Going deeper, I ran tshark to see what actually happens, and saw that the Sanic side sent a TCP RST in the middle of a POST request being sent to it.

I tried setting SANIC_NOISY_EXCEPTIONS and increasing SANIC_KEEP_ALIVE_TIMEOUT and REQUEST_BUFFER_SIZE, but nothing helps.

I saw #991, which is quite old and has similar errors, but I'm on the latest version and still see them.

Any idea what could cause this, or how to fix it?

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Sanic Version: 22.6.2 (Running inside docker using FROM python:3.8-slim-buster)
  • Docker version: 20.10.16, running containers with host network
@ahopkins (Member)

My guess is a timeout issue. What are your timeouts?
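
For reference, a minimal sketch of where these knobs live, assuming the standard Sanic config keys (the values are illustrative, not recommendations):

from sanic import Sanic

app = Sanic("engine")

# Each of these can also be set via an environment variable with a
# SANIC_ prefix, e.g. SANIC_KEEP_ALIVE_TIMEOUT=15.
app.config.KEEP_ALIVE_TIMEOUT = 15  # idle keep-alive window in seconds (default 5)
app.config.REQUEST_TIMEOUT = 60     # time allowed to receive a full request (default 60)
app.config.RESPONSE_TIMEOUT = 60    # time allowed to send a full response (default 60)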

@Tronic (Member) commented Aug 18, 2022

@oren-nonamesecurity Can you see from the packet capture at which point of the connection this occurs? Has Sanic already handled some request(s) on that connection, or does it reset before responding to anything? If packet timing allows, determine more precisely when this occurs (was the connection idle before, did a request go without a response, or did Sanic send the RST right after sending a response, etc.). That should narrow down the bug hunt.

I assume you are not using HTTPS on the internal connection to Sanic; that could cause TLS handshake errors that Sanic does not log (a known problem).

@oren-nonamesecurity (Author)

I started with the default timeouts, then increased the keep-alive timeout to 15 seconds.

From the PCAP it looks like the connection was active for a while (at least 5 previous request-response pairs), and the problem happens during a new POST request (a bit bigger than 1 MB). The RST is the first response from the Sanic side; there is no other response before it.
The client sent 7 TCP payloads of 65,551 bytes before Sanic responded with RST, which made me think of buffer sizes or some system limits.

The app itself always responds 200 OK with "OK" as the body; nothing complicated in the response.

And you're right, the nginx <=> Sanic connection is plain HTTP.

@oren-nonamesecurity (Author)

@Tronic @ahopkins

I captured another PCAP that shows a more concrete flow:
[screenshot: packet capture showing the FIN from Sanic and the subsequent RSTs]

The client (nginx) sends a long POST and gets 200 OK with "Connection: keep-alive".
About 5 seconds later, the server (Sanic) sends FIN+ACK while the client (nginx) is sending a new POST request (packets #31781 and #31955).
Then the client keeps sending more data and the server responds with RST multiple times.

This happens even when I strip the entire Sanic app down to a single endpoint that does nothing but return a response (a minimal sketch of such a reproduction is below).

Any idea what happens here?
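
For context, a minimal reproduction along the lines described might look like this (the app name and endpoint path are illustrative):

from sanic import Sanic, response

app = Sanic("repro")

@app.post("/engine")
async def engine(request):
    # Always answer 200 OK with "OK", as in the original report
    return response.text("OK")

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080, workers=16)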

@Tronic (Member) commented Aug 23, 2022

@oren-nonamesecurity It certainly looks like the idle connection is timing out after five seconds and Sanic is closing it at 31781 (no logging, because this is normal operation). The default SANIC_KEEP_ALIVE_TIMEOUT is five seconds, so double check that your setting is actually reaching the Sanic server if you tried another value.

After that, Nginx is expected to disconnect, but why it instead sends another request 0.6 seconds later, I don't know. Technically it can, because its side of the socket is still open, but since Sanic is no longer reading, it only leads to the resets you are seeing.

@ahopkins This is a subtle detail, but we should probably make Sanic explicitly shut down the socket in both directions when it no longer intends to read or write anything:

sock.shutdown(socket.SHUT_RDWR)  # signal EOF to the peer in both directions
sock.close()

@oren-nonamesecurity (Author)

> It certainly looks like the idle connection is timing out after five seconds and Sanic is closing it at 31781 [...] we should probably make Sanic explicitly shut down the socket in both directions when it no longer intends to read or write anything.

The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms difference).
Maybe the shutdown is what's missing here.

For now I increased the Sanic keep-alive timeout to be larger than the nginx keep-alive timeout, and that prevents these errors, because nginx closes the connection from its side first.
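
On the Sanic side that workaround is a one-line config change; a sketch, assuming nginx's upstream keepalive_timeout is at its 60-second default:

# Make Sanic's keep-alive outlive nginx's upstream keepalive idle timeout
# (60 s by default), so nginx always closes the idle connection first and
# never writes into a half-closed socket. The value 75 is illustrative.
app.config.KEEP_ALIVE_TIMEOUT = 75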

@Tronic (Member) commented Aug 23, 2022

> The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms difference).
> Maybe the shutdown is what's missing here.

Oops, I misread the capture, although ~7 ms on localhost is still plenty of time. It'd be interesting to see whether the shutdown makes a difference, if you can easily reproduce this issue. It might be a one-liner addition to https://github.com/sanic-org/sanic/blob/main/sanic/server/protocols/base_protocol.py#L90, directly accessing the asyncio transport socket. The actual PR would need more careful thought and review even if that works.
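
A rough sketch of that idea, assuming the general shape of BaseProtocol.close(); untested, and (as the next comment shows) this exact call fails under uvloop:

import socket

# inside BaseProtocol:
def close(self):
    if self.transport is not None:
        sock = self.transport.get_extra_info("socket")
        if sock is not None:
            sock.shutdown(socket.SHUT_RDWR)  # explicitly stop both directions
        self.transport.close()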

@ahopkins (Member) commented Aug 23, 2022

I have not looked too deeply into the above (sorry), but I did a quick experiment:

# Inside BaseProtocol.close(), with: from socket import socket, SHUT_RDWR
sock: Optional[socket] = self.transport.get_extra_info("socket")
if sock is not None:
    sock.shutdown(SHUT_RDWR)
Traceback (most recent call last):
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/http_protocol.py", line 169, in connection_task
    self.close()
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/base_protocol.py", line 94, in close
    sock.shutdown(SHUT_RDWR)
  File "uvloop/pseudosock.pyx", line 155, in uvloop.loop.PseudoSocket.shutdown
  File "uvloop/pseudosock.pyx", line 19, in uvloop.loop.PseudoSocket._na
TypeError: transport sockets do not support shutdown() method

With uvloop, this is not the right approach.

@Tronic (Member) commented Aug 23, 2022

There is an open uvloop issue about this (with a hacky workaround, too): MagicStack/uvloop#241
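
One possible workaround in the spirit of that issue is to wrap the transport's file descriptor in a real socket object just long enough to call shutdown(). A hedged sketch, assuming the pseudo-socket exposes a valid fileno():

import socket

def shutdown_both_ways(transport):
    pseudo = transport.get_extra_info("socket")  # uvloop returns a PseudoSocket
    if pseudo is None:
        return
    raw = socket.socket(fileno=pseudo.fileno())  # wraps the same fd, no dup
    try:
        raw.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # peer may already be gone
    finally:
        raw.detach()  # release without closing the fd owned by the transport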

@ahopkins (Member)

Nice find.

@nuxion (Contributor) commented Jul 13, 2023

Hi @Tronic, can this be closed with the release of Sanic 23.6, given #2670?

@Tronic (Member) commented Jul 14, 2023

This should solve the issue in practice, even though the root problem cannot be fixed. The uvloop issue is still open with no activity...
