Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP client gives incorrect Content-Length for automatically decompressed responses #2743

Open
andersk opened this issue Sep 24, 2019 · 3 comments

Comments

@andersk
Copy link
Contributor

andersk commented Sep 24, 2019

The application below has two handlers, the first of which serves a gzipped tarball at /hello.tar.gz, and the second of which tries to proxy /hello2.tar.gz to the first. But the second actually raises tornado.httputil.HTTPOutputError: Tried to write more data than Content-Length.

The reason is that the Tornado HTTP client has decompressed the Content-Encoding: gzip without adjusting the Content-Length field, resulting in an invalid response. A workaround is setting decompress_response=False.

When decompress_response is on, Tornado should adjust or remove the Content-Length field of decompressed responses, similarly to how it already removes the Content-Encoding field.

I would also argue that decompress_response is unexpected behavior that should really be off by default, even if it’s fixed, as it changes the semantic meaning of the response: Content-Encoding is not Transfer-Encoding.

import tarfile
import io
import tornado.ioloop
import tornado.web
import tornado.httpclient

class TgzHandler(tornado.web.RequestHandler):
    def get(self):
        b = io.BytesIO()
        with tarfile.open(fileobj=b, mode="w:gz") as tar:
            contents = b"Hello, world!\n"
            info = tarfile.TarInfo("hello.txt")
            info.size = len(contents)
            tar.addfile(info, io.BytesIO(contents))
        self.set_header("Content-Type", "application/x-tar")
        self.set_header("Content-Encoding", "gzip")
        self.write(b.getvalue())

class ProxyHandler(tornado.web.RequestHandler):
    async def get(self):
        resp = await tornado.httpclient.AsyncHTTPClient().fetch(
            "http://localhost:8888/hello.tar.gz"
        )
        for k, v in resp.headers.get_all():
            self.add_header(k, v)
        self.write(resp.body)

app = tornado.web.Application(
    [(r"/hello.tar.gz", TgzHandler), (r"/hello2.tar.gz", ProxyHandler)]
)
app.listen(8888)
tornado.ioloop.IOLoop.current().start()
andersk added a commit to andersk/zulip that referenced this issue Sep 24, 2019
Apparently Tornado decompresses gzip responses by default.  Worse, it
fails to adjust the Content-Length header when it does.

tornadoweb/tornado#2743

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
timabbott pushed a commit to zulip/zulip that referenced this issue Sep 24, 2019
Apparently Tornado decompresses gzip responses by default.  Worse, it
fails to adjust the Content-Length header when it does.

tornadoweb/tornado#2743

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
@bdarnell
Copy link
Member

Content-Encoding is not Transfer-Encoding.

That's true in principle, but that ship has sailed. Content-Encoding: gzip is used everywhere as if it's a transfer-encoding, and processing it transparently is more in line with (most) users' expectations.

We probably should rename the received Content-Length header (if any) the same way we rename the Content-Encoding header. But note that this is already implementation-dependent (curl_httpclient doesn't do it) and we don't remove Transfer-Encoding: chunked from the headers when processing it. If your TgzHandler used chunked encoding, the ProxyHandler would pass on the transfer-encoding header but write a non-chunked response, breaking everything (I think; haven't tried it). Proxy-style usage will always have to be careful about exactly which headers it passes through from the upstream request.

@andersk
Copy link
Contributor Author

andersk commented Sep 27, 2019

If your TgzHandler used chunked encoding, the ProxyHandler would pass on the transfer-encoding header but write a non-chunked response, breaking everything (I think; haven't tried it).

That’s true. Perhaps RequestHandler should automatically clear Transfer-Encoding when writing a non-chunked response, for the same reason that it already automatically sets Transfer-Encoding: chunked when writing a chunked response.

But Content-Length shouldn’t, in general, just be ignored. If you imagine a streaming version of ProxyHandler, you certainly want it to pass on Content-Length if it’s available, and that is indeed what happens naturally:

class ProxyHandler(tornado.web.RequestHandler):
    def initialize(self):
        self.headers = None
        self.headers_added = False

    def header_callback(self, line):
        if self.headers is None:
            start_line = tornado.httputil.parse_response_start_line(line)
            self.set_status(start_line.code, start_line.reason)
            self.headers = tornado.httputil.HTTPHeaders()
        else:
            self.headers.parse_line(line)

    def add_headers(self):
        if not self.headers_added:
            for name, value in self.headers.get_all():
                self.clear_header(name)
            for name, value in self.headers.get_all():
                if name != "Transfer-Encoding":
                    self.add_header(name, value)
            self.headers_added = True

    def streaming_callback(self, chunk):
        self.add_headers()
        self.write(chunk)
        self.flush()

    async def get(self):
        await tornado.httpclient.AsyncHTTPClient().fetch(
            "http://localhost:8888/hello.tar.gz",
            header_callback=self.header_callback,
            streaming_callback=self.streaming_callback,
            # decompress_response=False,
        )
        self.add_headers()

This correctly passes on Content-Length if available and Transfer-Encoding: chunked if not—unless we were given an incorrect Content-Length because of decompress_response.

matheuscmelo pushed a commit to matheuscmelo/zulip that referenced this issue Sep 28, 2019
Apparently Tornado decompresses gzip responses by default.  Worse, it
fails to adjust the Content-Length header when it does.

tornadoweb/tornado#2743

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
@bdarnell
Copy link
Member

Perhaps RequestHandler should automatically clear Transfer-Encoding when writing a non-chunked response, for the same reason that it already automatically sets Transfer-Encoding: chunked when writing a chunked response.

The idea is that applications should be free to implement transfer-encodings themselves, without Tornado's knowledge or interference. If you pass a Transfer-Encoding header (or a Content-Encoding or Content-Type for that matter), you're expected to produce a response that conforms to that encoding/type.

It seems a similar argument applies to Transfer-Encoding: we're not giving the application a body representation that includes the chunked framing, so when we decode chunked encoding we should remove that from the headers we deliver to the application (perhaps storing it in a renamed header so there's some record of the transformation happening).

evo42 pushed a commit to evo42/zulip that referenced this issue Oct 30, 2019
Apparently Tornado decompresses gzip responses by default.  Worse, it
fails to adjust the Content-Length header when it does.

tornadoweb/tornado#2743

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
YashRE42 pushed a commit to YashRE42/zulip that referenced this issue Dec 12, 2019
Apparently Tornado decompresses gzip responses by default.  Worse, it
fails to adjust the Content-Length header when it does.

tornadoweb/tornado#2743

Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants