
Fix “not enough values to unpack” when parsing headers #5911

Draft
Gallaecio wants to merge 6 commits into master

Conversation

@Gallaecio (Member) commented Apr 25, 2023

Fixes #355.

To do:

  • Figure out the issue affecting vapestore.co.uk
  • Reproduce the issue about a header that is too long, and solve it as well, maybe by increasing the maximum to a reasonable, higher value.
  • Do not rely on private APIs (too messy)
  • Add a test.

@Gallaecio (Member, Author)

I have been trying to fix #355 (comment), which currently affects responses from vapestore.co.uk.

With the current changes, which print each header line as it comes in (to find out which one Scrapy is choking on), the response content suddenly starts looking binary:

Line: b'HTTP/1.1 200 OK'
Line: b'Server: nginx'
Line: b'Date: Tue, 25 Apr 2023 08:13:55 GMT'
Line: b'Content-Type: text/html; charset=UTF-8'
Line: b'Transfer-Encoding: chunked'
Line: b'Connection: keep-alive'
Line: b'X-Cache-Lifetime: 15552000'
Line: b'Report-To: {"group":"report-endpoint","max_age":10886400,"endpoints":[{"url":"\\/csp_reporter.php"}]}'
Line: b'{&0\x08\xa6\x946z\xd4\xd4\xec<7\xe5\xcf\r\x10[\xbe \x0b1\xe4\x7f\xd6\x9fX\x03\x11,<\xa6\x10\xe9\xe0\xe0\xaa\xf9\xb1y|q\xf5\xe1\xf0\xfc\xa2y\xdc<\x93\xaeOo\xce/\xae\xce\x9a\xe7\'\x97g\xfbM\t\x05L\xba8<\x92Y\x86\xacy| \xd9\x86\xb7\xcd\xbd\x8b\xcb\xb3\xe6\xd5\xf1\xdeQS\xc2\xee\xed\xf1\x95\xfa\xde\xc3\x97\xf3\x8b\xbd\xb3\x0b\xc9A\x9c\x9c\x1d\xbe\x13\xadJ7\xad\xd3\xcb\xf3\xf7\x98\xe3B\x94\x1e\xb0\\\xcbq\xfa\x8a\xcd\x1f\x1e\xbf\xbbzs\xf9\xf6m\xf3\xec\xea\xed\xe5\x87\x0frb\xb2\xeau;\x11:5N\xd9\xdd\x0c^\xd2\xc6\xed\x02-\xd5qj"\x1e\x87\x93\xa8\xcb\x85\x85\x9d8\x88\xe9\x03{K\x13\xd5\x89\x933\xf9\x9d\x104J_\xba\xaeY\x89\xd1\x04\xcfd\x13x\xe4\x01\xb0\\}\x98\xf6\xc0\\\xed\xb2\x9ex\x98\xb0!U\x81\\\x000\xab\x03\xd7D\xa5"\x11,\x93]\xbb\x81\xf3\xeb\xc4i~dc\xd7,*\xbf\x01-^l=AM\xf4i|sr\xf0\xab\xf49\xdb\x7f\x93\x02\xfe\x1e_2\xc8L\xd2\x19k\x96\xce\x18\xf1\x0f\xf8|%\x9e\xaf\xd2ZG2\xe1\xe0\xe4\xb8)\x1d\xeaDBV\xf9X\x9f\xf1n~\xc6cvx|\xd1<\xdb\xdb\xbf8<9\x96\xc8\xa3\xa5\\\xed\x9d\x1eJ\xcc\xd0S\tA\xce\xe5\xfc\xa2\x88p\x9a\xd6\xdf\x91\xef\xc7\'\x07M\x89B?\x9e\x13n\xd2\xdb\x05;\xda\xfbE\xa0*\xa0\xca\xc1\xbb\xe6\x85\xc4\xe0\xb3\xe6\xd1\xde\xe11`\x90\xc4\xcb\xf3\xd3\xbd\xac\x8e\x81\x86Dw\xd2p\x12d\xe0\x94\xd5\x13~\x83s\xd8\x95\xe7ysX6\xf6\x00\xb9Zf\x17X\xcf\x1b\x93\x99\xf1\xa4\x03\x02\x0c<\x00\xff4\x06,\x8a\xc5#J\xc8\xe2i2\x86_\xb9W\xa2\xa1\xc5\xcb\x97/\x11\xe1\xfa\x81\xc2\xb2.\xbd\x10\x9e\x01\xc2u;*\x1d\x90\xce\x1b\xfb\x15\xff>\xa8\x98\x88x\x11\x1fy$$\x9b\x88}$0\t\xe3VB@\xe8\xd71\xc8\xd2\x88~\xe6\xe78\x0c\xc6\xe2m\x8cGy\xc07\xaaJo\xd5{/\x0c\xe0\xf3H\xbdv\xc2\xde\x14\x9a\xe9\xc8\xc2\xa2;S\x81\xc9\xe7\x17\xec.[\x14M\xb5(\xf6\xe0\x01\x8b\xa1\x994t\xdb\\m\xb2\x0b,\x8e\x905\xd9\x95\xac\x19\x10\xbe\xb6U}\xce*\xec\xe79\xe9C\xd8\x86\xf8\xa9]\x18Z\x8elmn(\xcb\x11\xb4+Y\xe4L\x89\x0c1\x19\xbb\x15l\xfe\\\xced\xf2u\xc4I]\x83\x07\xd4\x98\xc29\xf0\xcb\xca\xa6\x89\x11\xb3\x83\xc6\x1ae\x95\xcc\xd9o\x08O>G\x1e\x98\xa2\x87\x99\x1b\x89\xa4\xce0\xec\xde\xa0aIM\xbcw\x07\xbc{\x93s=\x01\xd6\xa7,\x91\xcc\x1c\xa8\x17\xc3\xb7\x0bz\xa1XMd\x1d\xc9\xb8\xe4\xfa\x1d\x9a\xda#\x03\xbc\xeb\xe8\x0es3:\'\x15\x199\x13\xb9\x03g\x7fkA\xbd\x85\xc2\x9a\x08N\x07\xff\x9f\xcb\x8b\xb1|7\x18v\xa3^\xd2\xb5\xbc+\x1f\x9f\xcd\x84\xd9 \xce\xeb3\x08\xf2Y\xac\xe0\x9c}\xe7\x8e\x10O\x9d\x00\xad:B\xb5Fs\x96E\xf2\xb0\x1c\xd6\x98\xb2;\x9a\xa4\x18\xd5Oe\xdc\x1eZ\x1an\xac\xbf\xb4aaA\xda\xf6V\xd5\x86%\xa5NM\xae\x951),&\xa8\x19\xb8,\xd1\x89\xdb\x0c\x0f\x95\xa7J\x9c\x04\xf5VX\xd8\x8b\xda(\x80\x88\xe4\x94{\x0b3^\xaa\xcd\xd0\xb3\x8c>g,M\x9b\xfc\xa8DUc\xaf=\x93M\x14\xbdT\xecT\xd1\x03\x80\xbe\x15\x80n\xb4L\x10[\x89)|/l\x02\x80\xfe\xa0v-\x1e\xa0\xf6\x00w\xa4\x8b\x90:f\xa2\x7f\xdf\xd0\x0f\xf8{"`\xf0\xe1L\x18\x8d\x8b\x97\xd3\xb47H\xe6x\xb2?\x89"\x98\xf03\xa0\x9a\x84\xe6"\xf5Tr)Y\xca>\x19\xdf\xa7\xf3L\xadh\xb4\xca\x0c\xc2\xc4\xefr\xc1\xb5\xb63\xcb\xb7\x04]uQS\xac\x89\x13\x99\xa3DQ\x86a\x01\xe9\xb6\xc4\x91_\xa4\x9c%"2\x92A\x17\tM\xae\xf1SI&\xb3\x8e\xfb\xfa\xaa_\xd7v\xa3J\xad^\xc5&jz\x13\x95Z\xa1\x11\xa1\xf9m\xb5\x0br\xf5\xad8\xa5\xe4\xf3\xf2q\xc1\x8e\x97;\xfc~\x1c\xc6\xbc\xb7\xbc\xcc\x1d\xa0\xfc\x00\x8b\xe5\xe5P\x1cW\xaa\x04\x0b\xd6T@6\x8c\x80BJ\x9a\xdb\x15\r\xc6\x1f\xbda\\\x0f[\xd5\xf6\x0cV\x0ce\x9c\xe5\xacm\xee\xbeC$\xcc\x9b\x9b>\xcc\x008\xf3\x06\x01\xbb\xfa\xc1\xbfr@\xb8\x9b\xcf\xb8\xa1e\xdcH3> 
`\xeaMA4\xf6XN4\xae\xdf3i\xee^\xbfp\x1f4w\x0c`\xfa%\xb4\xeaW\xeeRu\xe6&r\x89\x9c\xebKa\xef\xf1\xd1j\xba\xe7\x04z\xd8\xaf\xce\x05\x11\xba\x87\x87\\+\xb0\xda\x9d\xdbmt\xfd\xc9Q?\xa2\x99\xf9$\xe6\xc7B\x97\x8e\xde\x95d\x9b\xef9\xfbd\xb8\xb6\'\xdf\x0e\xae\xf1\xad\x89r\xbeLy\xd7\xc4\x94{-%\xbe\xc4\x94\x0b\x9bMSE\xc1\xe7\x0c\xedE8\x00\xb2\xb4\xea9g\x98\x13\xf9\x003\x13\x11\x80\x06\x08\xfa3\xc9\xecpa\'\xbf\x15VK\xe4\xa9\x19\x996\xee\xe5\xc8?\xe06\xdeY5\x89\x8dhh\xb8 Hz\x17m\xb9\xfe\x00\x8a\xd1.Z\xe4\x06\xb8\x94\xf7\xac\x0e\x10\xf5\xa5\xaa\x1c\x01\xa2m\xe0\xe4H\x0be)R\x9b%\xb21\xd3\xc8!\x16+\xa1@\xa2p\x94\x11$Y\x10H\xa6,\xa1S\xa7\x0c\xc2\xea\xf4\xcc\x14\xeai\xcd\xa6Q*%\xc4\xb1j\x92;MM\xe8\x105\x91S\x88\xec\x01\x1a\xf5\x15\x1d\x8d\\+z|\xa4\x83\xdf\xfa\xda\x9a\xf8\xeah\xde\xe8\xa6\xbd'

Curl reads the headers as follows:

$ curl --http1.1 -s -o /dev/null -D - https://www.vapestore.co.uk/
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 25 Apr 2023 08:17:49 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
X-Cache-Lifetime: 15552000
Report-To: {"group":"report-endpoint","max_age":10886400,"endpoints":[{"url":"\/csp_reporter.php"}]}
Content-Security-Policy: font-src *.sagepay.com fonts.googleapis.com …

At the moment I am puzzled as to what is causing this.

@GeorgeA92 (Contributor)

@Gallaecio
From my understanding, the source of the twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>] error is the call line, self._buffer = self._buffer.split(self.delimiter, 1), in these lines from the Twisted source:

            try:
                line, self._buffer = self._buffer.split(self.delimiter, 1)
            except ValueError:
                if len(self._buffer) >= (self.MAX_LENGTH + len(self.delimiter)):
                    line, self._buffer = self._buffer, b""
                    return self.lineLengthExceeded(line)
                return
            else:
                lineLength = len(line)
                if lineLength > self.MAX_LENGTH:
                    exceeded = line + self.delimiter + self._buffer
                    self._buffer = b""
                    return self.lineLengthExceeded(exceeded)
                why = self.lineReceived(line)
                if why or self.transport and self.transport.disconnecting:
                    return why

It happens when the buffer contains no self.delimiter (\r\n) at the time of the split call, because the received chunk does not include the complete value of the header.
On a later iteration, once _buffer has grown to roughly two chunks, it calls self.lineLengthExceeded, which leads to self.transport.loseConnection().
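A minimal, self-contained sketch of that failure mode (made-up buffer contents, not Scrapy or Twisted code; 16384 is LineReceiver's default MAX_LENGTH):

# bytes.split() raises ValueError when the delimiter is absent, which is
# how LineReceiver detects an incomplete line in its buffer.
MAX_LENGTH = 16384  # default of twisted.protocols.basic.LineReceiver
delimiter = b"\r\n"

# A header line longer than MAX_LENGTH arrives, its delimiter not yet seen:
buffer = b"Content-Security-Policy: " + b"x" * 20000

try:
    line, buffer = buffer.split(delimiter, 1)
except ValueError:
    # No delimiter yet: normally LineReceiver just keeps buffering.
    # But once the buffer is already longer than any valid line could be,
    # it gives up, which ends in transport.loseConnection().
    if len(buffer) >= MAX_LENGTH + len(delimiter):
        print("lineLengthExceeded -> connection dropped mid-headers")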

The value of twisted.protocols.basic.LineReceiver.MAX_LENGTH that causes this is relatively easy to update/monkey-patch:

from twisted.protocols.basic import LineReceiver

# Raise Twisted's per-line limit before Scrapy opens any connection.
LineReceiver.MAX_LENGTH = 2 ** 17

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://www.vapestore.co.uk/"]
    custom_settings = {"RETRY_TIMES": 0}

    def parse(self, response):
        pass


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

# log output (fragment):
'''
2023-04-28 21:06:32 [scrapy.core.engine] INFO: Spider opened
2023-04-28 21:06:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-28 21:06:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-04-28 21:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vapestore.co.uk/> (referer: None)
2023-04-28 21:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-28 21:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
'''

@Gallaecio (Member, Author)

Indeed it was as simple as that 🤦

codecov bot commented May 4, 2023

Codecov Report

Merging #5911 (e517f64) into master (cddb8c1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head e517f64 differs from pull request most recent head 5f45494. Consider uploading reports for the commit 5f45494 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5911   +/-   ##
=======================================
  Coverage   88.21%   88.21%           
=======================================
  Files         163      163           
  Lines       11533    11535    +2     
  Branches     1877     1877           
=======================================
+ Hits        10174    10176    +2     
  Misses       1037     1037           
  Partials      322      322           
Files Changed Coverage Δ
scrapy/core/downloader/handlers/http11.py 92.35% <100.00%> (+0.05%) ⬆️

@Gallaecio Gallaecio marked this pull request as ready for review May 4, 2023 15:46
Comment on lines 15 to 26
- from twisted.web.client import (
-     URI,
-     Agent,
-     HTTPConnectionPool,
-     ResponseDone,
-     ResponseFailed,
- )
+ from twisted.web._newclient import HTTP11ClientProtocol as TxHTTP11ClientProtocol
+ from twisted.web._newclient import HTTPClientParser as TxHTTPClientParser
+ from twisted.web._newclient import (
+     RequestGenerationFailed,
+     RequestNotSent,
+     TransportProxyProducer,
+ )
+ from twisted.web.client import URI, Agent
+ from twisted.web.client import HTTPConnectionPool as TxHTTPConnectionPool
+ from twisted.web.client import ResponseDone, ResponseFailed
+ from twisted.web.client import _HTTP11ClientFactory as TxHTTP11ClientFactory
@Gallaecio (Member, Author)

I was planning on avoiding importing private stuff from Twisted, but things got quite complicated quickly. I think it may be easier to deal with breakages as they come.
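One way to soften that trade-off, sketched here purely as an illustration (this is not what the PR does): guard the private imports so that a future Twisted release that moves or removes these names fails with an actionable message.

# Illustrative pattern only (not part of this PR): import Twisted's private
# names defensively, so that a Twisted upgrade that relocates them produces
# a clear error instead of an obscure traceback at crawl time.
try:
    from twisted.web._newclient import HTTPClientParser as TxHTTPClientParser
except ImportError as exc:
    raise ImportError(
        "This download handler depends on twisted.web._newclient internals "
        "that are missing from the installed Twisted version."
    ) from exc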

@wRAR (Member) commented Jun 21, 2023

Is it possible to just monkey-patch HTTPClientParser?
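For reference, such a patch could be as small as overriding one class attribute, assuming HTTPClientParser still inherits MAX_LENGTH from twisted.protocols.basic.LineReceiver (a sketch, not code from this PR):

# HTTPClientParser is private Twisted API; in current Twisted releases it
# inherits MAX_LENGTH from LineReceiver, so overriding it on the subclass
# raises the per-header-line limit for HTTP response parsing only.
from twisted.web._newclient import HTTPClientParser

HTTPClientParser.MAX_LENGTH = 2 ** 17  # example value; the default is 16384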

@Gallaecio (Member, Author)

I was trying to avoid that at all costs as a matter of principle, but for this specific case, that may indeed be best.

@wRAR (Member) commented Jun 21, 2023

Alternatively, can you please add a comment to the line(s) changed in the copied class (i.e. the lines that use _HTTPClientParser, as far as I understand)? I think the top-level comment is not enough to quickly understand what was changed.

@Gallaecio Gallaecio marked this pull request as draft June 21, 2023 18:24
@Gallaecio Gallaecio marked this pull request as ready for review August 31, 2023 07:23
@wRAR wRAR requested a review from kmike August 31, 2023 09:19
@Gallaecio Gallaecio marked this pull request as draft August 31, 2023 09:19
@Gallaecio (Member, Author)

@kmike suggested that, before we consider merging this approach, we should bring the topic up with the Twisted folks, in case they are open to either incrementing the default value upstream or providing an easier way to change it without the need to monkey-patch.

Successfully merging this pull request may close these issues.

Default downloader fails to get page