
Content-Length header missing in response headers #5009

Closed

elacuesta opened this issue Feb 24, 2021 · 3 comments · Fixed by #5057

Comments

@elacuesta
Member

Description

The Content-Length header is missing from the response headers. I stumbled upon this while working on #4897.

Steps to Reproduce

$ scrapy shell https://example.org
(...)
>>> response.headers["Content-Length"]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/.../scrapy/scrapy/http/headers.py", line 40, in __getitem__
    return super().__getitem__(key)[-1]
  File "/.../scrapy/scrapy/utils/datatypes.py", line 23, in __getitem__
    return dict.__getitem__(self, self.normkey(key))
KeyError: b'Content-Length'

or

import scrapy

class ContentLengthSpider(scrapy.Spider):
    name = "foo"
    start_urls = ["https://example.org"]

    def parse(self, response):
        print(response.headers["Content-Length"])

Versions

Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.2 (default, Apr 18 2020, 17:39:30) - [GCC 7.5.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.4.5
Platform     : Linux-4.15.0-128-generic-x86_64-with-glibc2.27

Additional context

I can see the server does return the header with cURL:

$ curl -s --http1.1 -D - https://example.org -o /dev/null | grep Content-Length
Content-Length: 1256

and python-requests:

>>> import requests
>>> requests.get("https://example.org", headers={"Accept-Encoding": "identity"}).headers["Content-Length"]
'1256'

It seems to me like Twisted itself is dropping the header:

from twisted.internet import reactor
from twisted.web.client import Agent

agent = Agent(reactor)
d = agent.request(b"GET", b"http://example.org")

def print_response(response):
    print(response.version)
    print(response.headers)

d.addCallback(print_response)
d.addCallback(lambda _: reactor.stop())

reactor.run()
(b'HTTP', 1, 1)
Headers({b'age': [b'164125'], b'cache-control': [b'max-age=604800'], b'content-type': [b'text/html; charset=UTF-8'], b'date': [b'Wed, 24 Feb 2021 14:38:05 GMT'], b'etag': [b'"3147526947+ident"'], b'expires': [b'Wed, 03 Mar 2021 14:38:05 GMT'], b'last-modified': [b'Thu, 17 Oct 2019 07:18:26 GMT'], b'server': [b'ECS (mic/9ABB)'], b'vary': [b'Accept-Encoding'], b'x-cache': [b'HIT']})
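
That said, if I read Twisted's IResponse interface right, the size itself is not completely lost: it is exposed through the response.length attribute, just not as a header. Adding one line to the callback above shows it:

def print_response(response):
    print(response.version)
    print(response.headers)
    # IResponse.length: expected body size in bytes, or
    # twisted.web.iweb.UNKNOWN_LENGTH if the server did not send Content-Length
    print(response.length)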

I wanted to ask here before opening an issue in Twisted, because this seems like a rather odd thing to do and I'm wondering if I'm missing something 🤔

@wRAR
Member

wRAR commented Feb 24, 2021

twisted.web._newclient.HTTPClientParser has both headers and connHeaders, and it looks like only the former is copied into the response object.

CONNECTION_CONTROL_HEADERS = set([
    b'content-length', b'connection', b'keep-alive', b'te',
    b'trailers', b'transfer-encoding', b'upgrade',
    b'proxy-connection'])
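
If I'm reading headerReceived right, the routing looks roughly like this (a simplified sketch, not the exact Twisted source):

# Simplified sketch of HTTPParser.headerReceived: connection-control
# headers go to connHeaders, everything else goes to headers, and only
# `headers` is later attached to the Response object.
def headerReceived(self, name, value):
    name = name.lower()
    if self.isConnectionControlHeader(name):
        headers = self.connHeaders   # Content-Length ends up here
    else:
        headers = self.headers       # these are the ones the Response gets
    headers.addRawHeader(name, value)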

@GeorgeA92
Contributor

GeorgeA92 commented Feb 24, 2021

@elacuesta
I can confirm @wRAR's statement: headers from CONNECTION_CONTROL_HEADERS (including the Content-Length header) are not copied to the headers of the response object.

I solved it with this monkey patch:

Spider code:
import scrapy
from scrapy.crawler import CrawlerProcess
from twisted.web._newclient import HTTPParser

# HTTPParser.CONNECTION_CONTROL_HEADERS.clear()   # <- works (not as expected)
# initially I tried to use the same approach as for _caseMappings
# mentioned in this comment
# https://github.com/scrapy/scrapy/issues/2711#issuecomment-367342284


class HTTPParser_H(HTTPParser):
    def headerReceived(self, name, value):
        name = name.lower()

        if self.isConnectionControlHeader(name):
            # copy connection-control headers (e.g. Content-Length) to the
            # regular headers as well, so they reach the response object
            self.connHeaders.addRawHeader(name, value)
            self.headers.addRawHeader(name, value)
        else:
            self.headers.addRawHeader(name, value)

HTTPParser.headerReceived = HTTPParser_H.headerReceived

class ContentLengthSpider(scrapy.Spider):
    name = "foo"
    start_urls = ["https://example.org"]

    def parse(self, response):
        print(response.headers)

process = CrawlerProcess()
process.crawl(ContentLengthSpider)
process.start()

Log output:

...
2021-02-24 20:43:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2021-02-24 20:43:33 [scrapy.core.engine] INFO: Closing spider (finished)
{b'Accept-Ranges': [b'bytes'], b'Age': [b'466759'], b'Cache-Control': [b'max-age=604800'],
b'Content-Type': [b'text/html; charset=UTF-8'],
b'Date': [b'Wed, 24 Feb 2021 18:43:30 GMT'], b'Etag': [b'"3147526947"'],
b'Expires': [b'Wed, 03 Mar 2021 18:43:30 GMT'], b'Last-Modified': [b'Thu, 17 Oct 2019 07:18:26 GMT'],
b'Server': [b'ECS (bsa/EB15)'],
b'Vary': [b'Accept-Encoding'],
b'X-Cache': [b'HIT'], b'Content-Length': [b'648']}
2021-02-24 20:43:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,...

Content-Length now reaches the Scrapy Response object and the spider's parse method.

@elacuesta
Member Author

Interesting, many thanks for the research @wRAR and @GeorgeA92 😄
Looks like a design decision rather than a bug; do you know the reason behind it?
Should we consider doing this in Scrapy itself? I was able to work around this in 84e91b6, so I have no further need for this specifically.
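
For anyone else hitting this before a proper fix lands, a stopgap on the Scrapy side could look something like the downloader middleware below. This is just a rough sketch (hypothetical name, not necessarily what the linked fix ends up doing), and note that len(response.body) is the decoded body size, so it can differ from the value the server actually sent (e.g. for gzipped responses):

# Hypothetical stopgap: fill in a missing Content-Length from the size of
# the already-downloaded (decoded) body.
class FillContentLengthMiddleware:
    def process_response(self, request, response, spider):
        if b"Content-Length" not in response.headers:
            response.headers[b"Content-Length"] = str(len(response.body)).encode()
        return response

(enabled through DOWNLOADER_MIDDLEWARES in the project settings)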
