Sanic drops part of HTTP response data #2921

Closed
xbeastx opened this issue Feb 25, 2024 · 16 comments · Fixed by #2973

@xbeastx

xbeastx commented Feb 25, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The bug is that Sanic closes the connection before all of the response data has been transferred (see the example below).

In my view this is a very critical bug. For reasons unknown to me, it reproduces only when the Sanic server is started inside a Docker container (perhaps it is somehow related to the network mode or to delays introduced by Docker).

After bisecting the commits, we found that the bug was introduced in 1310684 and does not reproduce on d1fc867.

Code snippet

So we have a really simple server returning some JSON, demo.py:

from sanic import Sanic
from sanic.response import json

app = Sanic("Demo")

@app.route("/")
async def handle(request):
    return json({'some_val': 'A'*743000})

if __name__ == "__main__":
    app.run(host="0.0.0.0")

Running in docker container:

FROM python:3.11

COPY demo.py demo.py
RUN pip install sanic==23.12.1
ENTRYPOINT python3 demo.py
$ docker build -t demo .
$ docker run -p 127.0.0.1:8000:8000 --rm -it demo

and the client.py:

import socket
import sys


def getsock(ip, port, req):
    # AF_INET (2), SOCK_STREAM (1), IPPROTO_TCP (6)
    s = socket.socket(2, 1, 6)
    s.connect((ip, port))
    s.send(req)
    return s


REQ = (
    b'GET / HTTP/1.1\r\n'
    b'User-Agent: Mozilla/5.0\r\n'
    b'Connection: close\r\n'
    b'\r\n'
)

s = getsock('127.0.0.1', 8000, REQ)

headers = s.recv(94).decode()

print(f'Headers: {headers!r}')

total = 0
by_len = int(sys.argv[1])
while (data := s.recv(by_len)):
    total += len(data)

print(f'Total length: {total}')

s.close()

I was not able to reproduce it with curl; maybe it reads too fast. But in a real deployment it reproduces with an Nginx proxy and Sanic as the upstream.

Now, if you run it a hundred times, you will get something like this:

$ for i in {1..100}; do python3 client.py 4096; done
...
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 652954
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 650058
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 652954
Headers: 'HTTP/1.1 200 OK\r\ncontent-length: 743015\r\nconnection: close\r\ncontent-type: application/json\r\n\r\n'
Total length: 586346
...

So the length should be 743015, but Sanic returns only 586346-652954 bytes.

client.py must be run outside the Docker container, e.g. on the host. If you run it inside Docker, the issue will not reproduce.

Expected Behavior

Return all the data from response.

How do you run Sanic?

Sanic CLI

Operating System

Linux

Sanic Version

v23.12.1

Additional context

No response

@xbeastx xbeastx added the bug label Feb 25, 2024
@makeev

makeev commented Feb 26, 2024

I can confirm this problem and this is a very critical bug.

@makeev

makeev commented Feb 27, 2024

I believe the issue lies within the close() method. Instead of directly calling self.abort(), it should be replaced with:

                timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
                self.loop.call_later(timeout, self.abort)

By making this adjustment, the connection will function correctly. Otherwise, nginx encounters an error stating upstream prematurely closed connection while reading upstream.
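
In context, the proposed change would look roughly like this (an illustrative sketch only, not the actual Sanic source; the attribute names follow those discussed in this thread):

def close(self, timeout: Optional[float] = None):
    # Hypothetical illustration of the suggestion above, not real code.
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    if self.transport:
        self.transport.close()
        # Instead of calling self.abort() immediately, schedule it as a
        # delayed fallback so the transport can flush pending data first.
        timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
        self.loop.call_later(timeout, self.abort)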

@jhonsonlaid

I encountered the same issue. When utilizing the Python packages requests or aiohttp to send requests to a Sanic server, everything works fine. However, when using Nginx as a reverse proxy, Nginx reports an error: 'upstream prematurely closed connection while reading upstream.' Downgrading to sanic 23.6.0 resolves this error.

@robd003
Contributor

robd003 commented Mar 26, 2024

@ahopkins any chance you can confirm that the fix from @makeev is all that is needed?

If so can we please do a point release for the 23.12 LTS branch?

@Tronic
Member

Tronic commented Mar 30, 2024

I believe the issue lies within the close() method. Instead of directly calling self.abort(), it should be replaced with:

                timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
                self.loop.call_later(timeout, self.abort)

By making this adjustment, the connection will function correctly. Otherwise, nginx encounters an error stating upstream prematurely closed connection while reading upstream.

This code in sanic/server/protocols/http_protocol.py:242 appears to have been introduced in #2831. Ping @gluhar2006, can you have a look? This asyncio transport code is very complicated and easily broken by small details; also, the docstring of that function is not easy to understand.

The graceful shutdown timeout normally has a different meaning in Sanic (roughly speaking: how long to wait for a handler to finish), not how long to wait before closing a TCP connection. This would need a deeper look at what exactly is being fixed here and what the proper approach is, plus tests for those cases.

Possibly related also to #2531 (Nginx failures).

@gluhar2006
Contributor

I believe the issue lies within the close() method. Instead of directly calling self.abort(), it should be replaced with:

                timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
                self.loop.call_later(timeout, self.abort)

By making this adjustment, the connection will function correctly. Otherwise, nginx encounters an error stating upstream prematurely closed connection while reading upstream.

This code in sanic/server/protocols/http_protocol.py:242 appears to have been introduced in #2831. Ping @gluhar2006, can you have a look? This asyncio transport code is very complicated and easily broken by small details; also, the docstring of that function is not easy to understand.

The graceful shutdown timeout normally has a different meaning in Sanic (roughly speaking: how long to wait for a handler to finish), not how long to wait before closing a TCP connection. This would need a deeper look at what exactly is being fixed here and what the proper approach is, plus tests for those cases.

Possibly related also to #2531 (Nginx failures).

It appears that the incorrect behavior of my changes was due to my misunderstanding of the meaning of the graceful shutdown timeout. I tried to look deeper, but so far without success. The main problem here is that I use Keep Alive 99% of the time, so I haven't encountered this problem when testing my changes.
I'll keep researching and come back if I find something.

For now, I still cannot reproduce such problems when using keep-alive.

@ahopkins
Member

ahopkins commented Apr 1, 2024

The graceful shutdown timeout normally has a different meaning in Sanic (roughly speaking: how long to wait for handler to finish), not that of just closing a TCP connection. Would need a bit deeper look at what exactly is being fixed here and what is to proper approach + write tests for those cases.

@Tronic I think they are just using the shutdown timer here as a further delay of the response timeout.


@ahopkins any chance you can confirm that the fix from @makeev is all that is needed?

If so can we please do a point release for the 23.12 LTS branch?

@robd003 No. That is an incorrect use of graceful timeout. But also, I am not sure why we would want to further delay abort. Needs investigation.


@xbeastx Shouldn't this just be easily solvable by increasing the response timeout?
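
For reference, raising the response timeout is a one-line config change (a minimal sketch; RESPONSE_TIMEOUT is Sanic's standard setting and defaults to 60 seconds, the 120 here is arbitrary):

from sanic import Sanic

app = Sanic("Demo")
# How long Sanic waits for a handler to produce a response before timing
# out; the default is 60 seconds.
app.config.RESPONSE_TIMEOUT = 120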

lrnselfreliance added a commit to lrnselfreliance/wrolpi that referenced this issue Apr 30, 2024
…ged how multiprocessing works.

Python 3.11 is now required, old versions were causing unpredictability in tests.  (Sanic does not yet
support 3.12)

Sanic has been upgraded to 23.6.0, which is the latest version that avoids this bug:
 sanic-org/sanic#2921

New strategy for multiprocessing is to create all multiprocessing tools in one process, then fork
to other processes.  The previous strategy was to declare multiprocessing tools at the top of every
file, or wherever they were needed at import/creation.  Now all multiprocessing tools are attached to the
app.shared_ctx.  This means `api_app` is imported in many, many places.

This forced a change in how the DownloadManager works.  Previously, it would continually run
download workers which would pull downloads from a multiprocessing.Queue.  Now, a single worker checks
for new downloads and sends a Sanic signal.

Flags have been reworked to use the `api_app`.  I removed the `which` flag functionality because the
`which` are called at import and needed their own multiprocessing.Event.
@gregflynn

@xbeastx Shouldn't this just be easily solvable by increasing the response timeout?

@ahopkins I just came across this issue upgrading from 23.6.0 to 23.12.1. My response timeout is configured to 60 seconds, which is not reached before the response data is truncated (in my case always at 109 kB), so to your question about increasing response_timeout: I don't think so.

There's also a thread on Discord that seems to be the same issue: https://discord.com/channels/812221182594121728/1209575840203939880

@robd003
Contributor

robd003 commented May 30, 2024

Any workarounds for this problem?

@robd003 robd003 mentioned this issue May 30, 2024
@gluhar2006
Contributor

Unfortunately, I haven't yet found an explanation that fits my understanding of this case. Maybe it's time to revert the changes.

@ahopkins ahopkins self-assigned this Jun 23, 2024
@ahopkins
Member

Working on resolving this in the upcoming release and will port it back to the LTS.

@ahopkins
Member

Does anyone have some good minimal reproducible cases to add to the test suite?
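
One possible shape for such a case, sketched here against the demo server from the original report (this assumes the server is already running on 127.0.0.1:8000 and is not an actual entry in Sanic's test suite):

# Rough sketch of a reproduction-style check. It verifies that the number
# of body bytes received matches the Content-Length header.
import re
import socket

REQ = (
    b"GET / HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

def read_all(sock: socket.socket) -> bytes:
    chunks = []
    while chunk := sock.recv(4096):
        chunks.append(chunk)
    return b"".join(chunks)

def test_full_body_is_delivered():
    with socket.create_connection(("127.0.0.1", 8000)) as sock:
        sock.sendall(REQ)
        raw = read_all(sock)
    head, _, body = raw.partition(b"\r\n\r\n")
    match = re.search(rb"content-length:\s*(\d+)", head, re.IGNORECASE)
    assert match, "no Content-Length header in response"
    assert len(body) == int(match.group(1))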

@ahopkins
Member

ahopkins commented Jun 23, 2024

So, to recap, I was able to reproduce as follows:

# server.py
from sanic import Sanic, json

app = Sanic("TEST")


@app.route("")
async def handle(_):
    return json({"A": "A" * 7430000})
# nginx.conf
events {}

http {
    server {
        listen 80;

        location / {
            proxy_pass http://<MY LOCAL IP>:9999;
            proxy_set_header Host $host;

            proxy_read_timeout 1s;
            proxy_connect_timeout 1s;
            proxy_send_timeout 1s;
        }
    }
}
# docker-compose.yml
services:
  nginx:
    image: nginx:latest
    container_name: nginx-proxy
    ports:
      - "8888:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - proxy-network

networks:
  proxy-network:
    driver: bridge
# client.py
import socket
import sys


def getsock(ip, port, req):
    s = socket.socket(2, 1, 6)
    s.connect((ip, port))
    s.send(req)
    return s


REQ = (
    b"GET / HTTP/1.1\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

s = getsock("127.0.0.1", 9999, REQ)

headers = s.recv(94).decode()


total = 0
by_len = int(sys.argv[1])
while data := s.recv(by_len):
    total += len(data)

expected = 7430020
if total != expected:
    print(f"Expected {expected}, got {total}")

s.close()
for i in {1..100}; do python3 client.py 4096; done

I saw a few misses. After making the change in #2966, I saw none, even when bumping the workload much higher. I will continue to verify the solution, but @makeev's approach (excluding the graceful shutdown) does appear to solve the proxy problem. I am not convinced this is the complete story, though; it may just be masking it.

@ashleysommer
Member

ashleysommer commented Jun 26, 2024

@ahopkins @Tronic It's been a while since I've played with the Sanic codebase, but this bug caught my interest.

From what I can see, the issue is relatively straightforward.

Before the release of v23.12, when an HttpProtocol instance was closed, it used the close() implementation from the parent class SanicProtocol, shown here:

def close(self, timeout: Optional[float] = None):
    """
    Attempt close the connection.
    """
    # Cause a call to connection_lost where further cleanup occurs
    if self.transport:
        self.transport.close()
        if timeout is None:
            timeout = self.app.config.GRACEFUL_SHUTDOWN_TIMEOUT
        self.loop.call_later(timeout, self.abort)

As you can see, it calls abort() via call_later, using the GRACEFUL_SHUTDOWN timeout. (I'm not sure that is the correct use of GRACEFUL_SHUTDOWN, but it has worked like that for a long time without issue.)

In PR #2831, a new close() implementation was added to HttpProtocol that behaves differently from the one in SanicProtocol. It was released in Sanic v23.12, and the abort() issues started popping up.
https://github.com/gluhar2006/sanic/blob/6771fac8375f98fb2d26aa7f6730cd5219de7c73/sanic/server/protocols/http_protocol.py#L231-L242
I can see three bugs in that replacement implementation:

  1. If the close() fn is called with a manually-specified timeout argument, it short-circuits to calling the super().close() function and skips cancelling the instance's scheduled tasks, defeating the point of the PR (to reduce memory usage).
  2. If self._callback_check_timeouts is None for any reason (e.g. the connection is made but the connect_task setup fails, so the connection is live but _callback_check_timeouts is not set), the close() routine completely skips the transport close() and abort() steps. That leaves dangling connections open, which is bad.
  3. Unlike the parent protocol's close() implementation, there is no call_later on abort(). That causes the transport to be killed before it has finished closing, i.e. before the remaining data is flushed. This is what is directly causing the issues described in this thread.

That is why @makeev suggested using GRACEFUL_SHUTDOWN in the comment #2921 (comment): it is exactly what the parent protocol (and therefore the pre-v23.12 default behavior) does.

I think the issue here is a misunderstanding of exactly what transport.abort() is for.
It force-kills the TCP stream and is supposed to be a fallback only. transport.close() is supposed to gracefully close the TCP connection, but that can take time: it needs to flush all the pending data first (and in Sanic, when using streaming responses, there can often be a lot of pending data) and to call the connection_lost() callbacks that do cleanup operations, all before finally shutting down the TCP connection cleanly.
That is why calling transport.abort() immediately after calling transport.close() can cut off the rest of the response.
See the implementation of transport.close() and transport.abort() in the UVLoop codebase here.

Calling abort() with loop.call_later allows the transport to cleanly close itself before it is force-killed. However, that is not ideal either: abort() should only be called if close() has not completed after a certain period of time.

I don't know if the use of GRACEFUL_SHUTDOWN_TIMEOUT in the superclass SanicProtocol close() implementation is correct either, but it works. (EDIT: You can blame me!) I suppose the ideal solution would be a new GRACEFUL_TCP_CLOSE_TIMEOUT value to use instead (default around 1 second?), and if the transport has already closed successfully before that timeout expires, there is no need to call abort() at all.

@ahopkins Can you please try your reproduction with this version of HttpProtocol.close()? I haven't tested it, but it fixes the three issues outlined above and always calls the superclass close(), which ensures it adheres to the pre-v23.12 behavior.

def close(self, timeout: Optional[float] = None):
    """
    Requires to prevent checking timeouts for closed connections
    """
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    super().close(timeout=timeout)
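
And, to illustrate the "abort only if close() has not completed" idea, a purely hypothetical sketch (GRACEFUL_TCP_CLOSE_TIMEOUT and the _connection_closed flag do not exist in Sanic; they are assumptions for illustration only):

def close(self, timeout: Optional[float] = None):
    # Hypothetical sketch, not Sanic source.
    if self._callback_check_timeouts:
        self._callback_check_timeouts.cancel()
    if self.transport:
        self.transport.close()  # graceful close; lets pending data flush
        if timeout is None:
            timeout = getattr(
                self.app.config, "GRACEFUL_TCP_CLOSE_TIMEOUT", 1.0
            )
        self.loop.call_later(timeout, self._abort_if_still_open)

def _abort_if_still_open(self):
    # connection_lost() would set self._connection_closed once the
    # transport has actually finished closing; only force-kill if it
    # never did.
    if not self._connection_closed:
        self.abort()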

@ashleysommer
Member

ashleysommer commented Jun 26, 2024

Looking further at the implementation of transport.close() in UVLoop
https://github.com/MagicStack/uvloop/blob/6c770dc3fbdd281d15c2ad46588c139696f9269c/uvloop/handles/basetransport.pyx#L252-L262

I think I see another contributor to this issue. Notice that transport.close() closes the read stream, but if there is still data in the transport's write buffer, it does not close the connection. That is why Sanic has always relied on a fallback that subsequently calls transport.abort() to kill the transport if close() left it open.

Perhaps a better solution would be to check get_write_buffer_size() on close and, if there is still data in the write buffer, not attempt transport.close() but instead force a flush somehow.
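
As a rough illustration of that idea (a hedged sketch only, not tested against Sanic; get_write_buffer_size() is the standard asyncio transport API, while close_when_flushed and the 0.05 s poll interval are made up for illustration):

# Sketch: defer shutdown while the transport still holds unsent data.
def close_when_flushed(protocol, abort_timeout: float = 1.0):
    transport = protocol.transport
    if transport is None:
        return
    if transport.get_write_buffer_size() > 0:
        # Data is still buffered; let the event loop keep flushing and
        # check again shortly.
        protocol.loop.call_later(
            0.05, close_when_flushed, protocol, abort_timeout
        )
    else:
        transport.close()
        # Force-kill only as a fallback if the graceful close drags on.
        protocol.loop.call_later(abort_timeout, protocol.abort)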

Just looking through the flow-control code in the UVLoop transport stream code, I can see how there may be situations where there is still data in the transport send buffer after the final write() operation.
E.g., if a very large chunk of data is passed to write(), the transport tries to send it all but may not manage the whole chunk, so it keeps some in its buffer. If the buffer is above the high-water mark, it pauses the protocol, so Sanic cannot send more data. It runs an async send with the rest of its buffer and calls transport._on_write() when the async transfer is done, which unpauses the protocol.
But two situations arise:

  1. Sanic still has data sitting in send() that it cannot send because write-buffer flow control is paused, and the transport is killed while it is waiting, so Sanic cannot send the rest of the payload.
  2. Sanic thinks the initial write was successful and closes the transport, then subsequently aborts the transport, so uv_write() never completes its asynchronous write of the rest of the buffer.

I notice that in the transport._on_write() callback in UVLoop, it checks whether the transport is in a closing state; if it is, it completes the second half of the close() routine that was skipped above, namely it schedules the call to the protocol's connection_lost() callback and then actually closes the connection.

I suppose the issue here boils down to two false assumptions:

  1. Writing to the TCP transport with transport.write() seems like a synchronous operation that sends the whole buffer. In reality it can send only part of the buffer, in which case it becomes an asynchronous operation. The only way to tell is to check get_write_buffer_size(), which can even tell you if there is still data in the async libuv stream buffer.
  2. Closing the TCP transport with transport.close() seems like a synchronous operation, and it usually is, but if there is a pending async send then close becomes an asynchronous operation, and the only way to tell is by observing when the connection_lost() callback is triggered.

I still don't know why this only occurs with NGINX. There must be something special about the NGINX TCP implementation that prevents UVLoop from sending the whole buffer on each write, which then triggers UVLoop's async write path. It's probably NGINX's flow control fighting with UVLoop's flow control, or perhaps NGINX does it deliberately to force the sender into async writes to reduce blocking writes (and blocking reads on the NGINX side).

@ashleysommer
Member

ashleysommer commented Jun 26, 2024

@ahopkins I'm using your docker-compose, nginx, client, and server snippets above to try to reproduce this for the purposes of creating a real fix.
When pointing the client at the nginx server, I'm getting the error:

request without "Host" header while reading client request headers

The client is not sending a request Host header, so nginx cannot deal with it.

You did not see this in your testing because of another error: in your client connector you are connecting to port 9999, which is your local Sanic server, not the nginx proxy. So I suspect you were not actually running the reproduction test through nginx.

Edit: After changing the reproduction code to hit the NGINX proxy, I'm seeing this issue reproduce 100% of the time (no need to hammer it with a for-loop; it's every 1 out of 1 attempts). I'm seeing the same as @xbeastx: when the data size is above 700 kB, Sanic always returns 586346-652954 bytes.
