Content-Length header missing in response headers #5009

@elacuesta

Description

The Content-Length header is missing from the response headers. I stumbled upon this while working on #4897.

Steps to Reproduce

$ scrapy shell https://example.org
(...)
>>> response.headers["Content-Length"]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/.../scrapy/scrapy/http/headers.py", line 40, in __getitem__
    return super().__getitem__(key)[-1]
  File "/.../scrapy/scrapy/utils/datatypes.py", line 23, in __getitem__
    return dict.__getitem__(self, self.normkey(key))
KeyError: b'Content-Length'

or, with a spider:

import scrapy

class ContentLengthSpider(scrapy.Spider):
    name = "foo"
    start_urls = ["https://example.org"]

    def parse(self, response):
        print(response.headers["Content-Length"])

Versions

Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.2 (default, Apr 18 2020, 17:39:30) - [GCC 7.5.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.4.5
Platform     : Linux-4.15.0-128-generic-x86_64-with-glibc2.27

Additional context

The server does return the header when queried with cURL:

$ curl -s --http1.1 -D - https://example.org -o /dev/null | grep Content-Length
Content-Length: 1256

and python-requests:

>>> import requests
>>> requests.get("https://example.org", headers={"Accept-Encoding": "identity"}).headers["Content-Length"]
'1256'
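
For a check that does not depend on any third-party client, here is a minimal sketch using only the standard library. It is not part of the original report: `fetch_content_length`, `FixedLengthHandler` and the local test server are hypothetical names introduced for illustration, and the local server only stands in for a host that declares Content-Length. The helper returns the header value next to the number of body bytes actually read, so the two can be compared.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Hypothetical payload served by the local test server below.
BODY = b"<html><body>hello</body></html>"


class FixedLengthHandler(BaseHTTPRequestHandler):
    """Hypothetical test server that always declares Content-Length."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def fetch_content_length(url):
    """Return (Content-Length header or None, number of body bytes read)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        # Ask for an uncompressed body, as in the requests example above.
        conn.request("GET", parts.path or "/", headers={"Accept-Encoding": "identity"})
        response = conn.getresponse()
        body = response.read()
        return response.getheader("Content-Length"), len(body)
    finally:
        conn.close()


def run_demo():
    """Start the local server on a free port, fetch once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), FixedLengthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        return fetch_content_length(url)
    finally:
        server.shutdown()
        server.server_close()


header, size = run_demo()
print(header, size)
```

Calling `fetch_content_length("https://example.org")` instead of `run_demo()` would run the same comparison against the real site; the local server is only there to keep the sketch runnable offline.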

It seems to me like Twisted itself is dropping the header:

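from twisted.internet import reactor
from twisted.web.client import Agent

# Plain Twisted HTTP client, with no Scrapy code involved
agent = Agent(reactor)
d = agent.request(b"GET", b"http://example.org")

def print_response(response):
    # Show the protocol version and the headers as Twisted exposes them
    print(response.version)
    print(response.headers)

d.addCallback(print_response)
# Stop the reactor once the response has been printed
d.addCallback(lambda _: reactor.stop())

reactor.run()

which prints:

(b'HTTP', 1, 1)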
Headers({b'age': [b'164125'], b'cache-control': [b'max-age=604800'], b'content-type': [b'text/html; charset=UTF-8'], b'date': [b'Wed, 24 Feb 2021 14:38:05 GMT'], b'etag': [b'"3147526947+ident"'], b'expires': [b'Wed, 03 Mar 2021 14:38:05 GMT'], b'last-modified': [b'Thu, 17 Oct 2019 07:18:26 GMT'], b'server': [b'ECS (mic/9ABB)'], b'vary': [b'Accept-Encoding'], b'x-cache': [b'HIT']})

I wanted to ask here before opening an issue in Twisted, because this seems like a rather odd thing to do and I'm wondering if I'm missing something 🤔
