nginx + sanic problems #2531

Closed
oren-nonamesecurity opened this issue Aug 18, 2022 · 12 comments

@oren-nonamesecurity

Describe the bug

Hi,
I'm using nginx as a reverse proxy to handle TLS and route to multiple applications on the same server.
My Sanic app is the main route that handles most of the traffic.
The traffic is pretty high (~5000 requests per second, on 16 workers) and I get random errors in the nginx log every few minutes. The errors fall into three types:

2022/08/18 12:05:20 [error] 42#42: *34037 writev() failed (32: Broken pipe) while sending request to upstream, client: 10.100.13.56, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:05:40 [error] 42#42: *35312 upstream prematurely closed connection while reading response header from upstream, client: 10.100.13.82, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"
2022/08/18 12:10:52 [error] 40#40: *55659 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.100.13.66, server: , request: "POST /engine HTTP/1.1", upstream: "http://127.0.0.1:8080/", host: "10.100.13.154"

On the Sanic side there's nothing in the logs.
access_log=True and debug=True didn't show any errors or anything suspicious, and verbosity=5 still showed nothing relevant.

Going deeper, I ran tshark to see what actually happens, and saw that the Sanic side sent a TCP RST in the middle of a POST request being sent to it.

I tried setting SANIC_NOISY_EXCEPTIONS and increasing SANIC_KEEP_ALIVE_TIMEOUT and REQUEST_BUFFER_SIZE, but nothing helps.

I saw #991, which is quite old and has similar errors, but I'm on the latest version and still see them.

Any idea what could cause this, or how to fix it?

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Sanic Version: 22.6.2 (Running inside docker using FROM python:3.8-slim-buster)
  • Docker version: 20.10.16, running containers with host network
@ahopkins (Member)

My guess is a timeout issue. What are your timeouts?
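
For reference, a minimal sketch of where these knobs live, assuming the standard Sanic config keys (the values are illustrative, not recommendations):

from sanic import Sanic

app = Sanic("engine")

# Each of these can also be set via an environment variable with a
# SANIC_ prefix, e.g. SANIC_KEEP_ALIVE_TIMEOUT=15.
app.config.KEEP_ALIVE_TIMEOUT = 15  # idle keep-alive window in seconds (default 5)
app.config.REQUEST_TIMEOUT = 60     # time allowed to receive a full request (default 60)
app.config.RESPONSE_TIMEOUT = 60    # time allowed to send a full response (default 60)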

@Tronic (Member) commented Aug 18, 2022

@oren-nonamesecurity Can you see from the packet capture at which point of the connection this occurs? Has Sanic already handled some request(s) on that connection, or does it reset before responding to anything? If packet timing allows, determine more precisely when this occurs (was the connection idle before, did a request go without a response, or did Sanic send the RST right after sending a response, etc.). That should narrow down the bug hunt.

I assume you are not using HTTPS on the internal connection to Sanic; that could cause TLS handshake errors that Sanic does not log (a known problem).

@oren-nonamesecurity (Author)

I started with the default timeouts, then increased the keep-alive timeout to 15 seconds.

From the PCAP it looks like the connection was active for a while (at least 5 previous request-response pairs), and the problem happens during a new POST request (a bit bigger than 1 MB). The RST is the first response from the Sanic side; there is no other response before it.
The client sent 7 TCP payloads of 65,551 bytes before Sanic responded with RST, which made me think of buffer sizes or some system limits.

The app itself always responds 200 OK with "OK" as the body; nothing complicated in the response.

And you're right, the nginx <=> Sanic connection is plain HTTP.

@oren-nonamesecurity (Author)

@Tronic @ahopkins

I captured another PCAP that shows a more concrete flow:
[screenshot: packet capture showing the FIN from Sanic and the subsequent RSTs]

The client (nginx) sends a long POST and gets 200 OK with "Connection: keep-alive".
About 5 seconds later, the server (Sanic) sends FIN+ACK while the client (nginx) is sending a new POST request (packets #31781 and #31955).
Then the client keeps sending more data and the server responds with RST multiple times.

This happens even when I strip the entire Sanic app down to a single endpoint that does nothing but return a response (a minimal sketch of such a reproduction is below).

Any idea what happens here?
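
For context, a minimal reproduction along the lines described might look like this (the app name and endpoint path are illustrative):

from sanic import Sanic, response

app = Sanic("repro")

@app.post("/engine")
async def engine(request):
    # Always answer 200 OK with "OK", as in the original report
    return response.text("OK")

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080, workers=16)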

@Tronic (Member) commented Aug 23, 2022

@oren-nonamesecurity It certainly looks like the idle connection is timing out after five seconds and Sanic is closing it at 31781 (no logging, because this is normal operation). The default SANIC_KEEP_ALIVE_TIMEOUT is five seconds, so double check that your setting is actually reaching the Sanic server if you tried another value.

After that, Nginx is expected to disconnect, but why it instead sends another request 0.6 seconds later, I don't know. Technically it can, because its side of the socket is still open, but since Sanic is no longer reading, it only leads to the resets you are seeing.

@ahopkins This is a subtle detail, but we should probably make Sanic explicitly shut down the socket in both directions when it no longer intends to read or write anything:

sock.shutdown(socket.SHUT_RDWR)  # signal EOF to the peer in both directions
sock.close()

@oren-nonamesecurity (Author)

> It certainly looks like the idle connection is timing out after five seconds and Sanic is closing it at 31781 [...] we should probably make Sanic explicitly shut down the socket in both directions when it no longer intends to read or write anything.

The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms difference).
Maybe the shutdown is what's missing here.

For now I increased the Sanic keep-alive timeout to be larger than the nginx keep-alive timeout, and that prevents these errors, because nginx closes the connection from its side first.
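
On the Sanic side that workaround is a one-line config change; a sketch, assuming nginx's upstream keepalive_timeout is at its 60-second default:

# Make Sanic's keep-alive outlive nginx's upstream keepalive idle timeout
# (60 s by default), so nginx always closes the idle connection first and
# never writes into a half-closed socket. The value 75 is illustrative.
app.config.KEEP_ALIVE_TIMEOUT = 75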

@Tronic (Member) commented Aug 23, 2022

> The FIN from Sanic and the extra POST from nginx happen almost simultaneously (6 ms difference).
> Maybe the shutdown is what's missing here.

Oops, I misread the capture, although ~7 ms on localhost is still plenty of time. It'd be interesting to see whether the shutdown makes a difference, if you can easily reproduce this issue. It might be a one-liner addition to https://github.com/sanic-org/sanic/blob/main/sanic/server/protocols/base_protocol.py#L90, directly accessing the asyncio transport socket. The actual PR would need more careful thought and review even if that works.
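
A rough sketch of that idea, assuming the general shape of BaseProtocol.close(); untested, and (as the next comment shows) this exact call fails under uvloop:

import socket

# inside BaseProtocol:
def close(self):
    if self.transport is not None:
        sock = self.transport.get_extra_info("socket")
        if sock is not None:
            sock.shutdown(socket.SHUT_RDWR)  # explicitly stop both directions
        self.transport.close()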

@ahopkins (Member) commented Aug 23, 2022

I have not looked too deeply into the above (sorry), but I did a quick experiment:

# Inside BaseProtocol.close(), with: from socket import socket, SHUT_RDWR
sock: Optional[socket] = self.transport.get_extra_info("socket")
if sock is not None:
    sock.shutdown(SHUT_RDWR)
Traceback (most recent call last):
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/http_protocol.py", line 169, in connection_task
    self.close()
  File "/home/adam/Projects/Sanic/sanic/sanic/server/protocols/base_protocol.py", line 94, in close
    sock.shutdown(SHUT_RDWR)
  File "uvloop/pseudosock.pyx", line 155, in uvloop.loop.PseudoSocket.shutdown
  File "uvloop/pseudosock.pyx", line 19, in uvloop.loop.PseudoSocket._na
TypeError: transport sockets do not support shutdown() method

With uvloop, this is not the right approach.

@Tronic (Member) commented Aug 23, 2022

There is an open uvloop issue about this (with a hacky workaround, too): MagicStack/uvloop#241
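
One possible workaround in the spirit of that issue is to wrap the transport's file descriptor in a real socket object just long enough to call shutdown(). A hedged sketch, assuming the pseudo-socket exposes a valid fileno():

import socket

def shutdown_both_ways(transport):
    pseudo = transport.get_extra_info("socket")  # uvloop returns a PseudoSocket
    if pseudo is None:
        return
    raw = socket.socket(fileno=pseudo.fileno())  # wraps the same fd, no dup
    try:
        raw.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # peer may already be gone
    finally:
        raw.detach()  # release without closing the fd owned by the transport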

@ahopkins (Member)

Nice find.

@nuxion (Contributor) commented Jul 13, 2023

Hi @Tronic, can this be closed with the release of Sanic 23.6, given #2670?

@Tronic (Member) commented Jul 14, 2023

This should solve the issue in practice, even though the root problem cannot be fixed. The uvloop issue is still open with no activity...
