
Fix “not enough values to unpack” when parsing headers #5911

Draft
Gallaecio wants to merge 6 commits into master

Conversation

@Gallaecio (Member) commented Apr 25, 2023

Fixes #355.

To do:

  • Figure out the issue affecting vapestore.co.uk
  • Reproduce the issue about a header that is too long, and solve it as well, maybe by increasing the maximum to a reasonable, higher value.
  • Do not rely on private APIs (too messy)
  • Add a test.

@Gallaecio (Member, Author)

I have been trying to fix #355 (comment), which currently affects responses from vapestore.co.uk.

With the current changes, which print each header line as it comes in (to find out which one Scrapy is choking on), the response content suddenly starts looking binary:

Line: b'HTTP/1.1 200 OK'
Line: b'Server: nginx'
Line: b'Date: Tue, 25 Apr 2023 08:13:55 GMT'
Line: b'Content-Type: text/html; charset=UTF-8'
Line: b'Transfer-Encoding: chunked'
Line: b'Connection: keep-alive'
Line: b'X-Cache-Lifetime: 15552000'
Line: b'Report-To: {"group":"report-endpoint","max_age":10886400,"endpoints":[{"url":"\\/csp_reporter.php"}]}'
Line: b'{&0\x08\xa6\x946z\xd4\xd4\xec<7\xe5\xcf\r\x10[\xbe \x0b1\xe4\x7f\xd6\x9fX\x03\x11,<\xa6\x10\xe9\xe0\xe0\xaa\xf9\xb1y|q\xf5\xe1\xf0\xfc\xa2y\xdc<\x93\xaeOo\xce/\xae\xce\x9a\xe7\'\x97g\xfbM\t\x05L\xba8<\x92Y\x86\xacy| \xd9\x86\xb7\xcd\xbd\x8b\xcb\xb3\xe6\xd5\xf1\xdeQS\xc2\xee\xed\xf1\x95\xfa\xde\xc3\x97\xf3\x8b\xbd\xb3\x0b\xc9A\x9c\x9c\x1d\xbe\x13\xadJ7\xad\xd3\xcb\xf3\xf7\x98\xe3B\x94\x1e\xb0\\\xcbq\xfa\x8a\xcd\x1f\x1e\xbf\xbbzs\xf9\xf6m\xf3\xec\xea\xed\xe5\x87\x0frb\xb2\xeau;\x11:5N\xd9\xdd\x0c^\xd2\xc6\xed\x02-\xd5qj"\x1e\x87\x93\xa8\xcb\x85\x85\x9d8\x88\xe9\x03{K\x13\xd5\x89\x933\xf9\x9d\x104J_\xba\xaeY\x89\xd1\x04\xcfd\x13x\xe4\x01\xb0\\}\x98\xf6\xc0\\\xed\xb2\x9ex\x98\xb0!U\x81\\\x000\xab\x03\xd7D\xa5"\x11,\x93]\xbb\x81\xf3\xeb\xc4i~dc\xd7,*\xbf\x01-^l=AM\xf4i|sr\xf0\xab\xf49\xdb\x7f\x93\x02\xfe\x1e_2\xc8L\xd2\x19k\x96\xce\x18\xf1\x0f\xf8|%\x9e\xaf\xd2ZG2\xe1\xe0\xe4\xb8)\x1d\xeaDBV\xf9X\x9f\xf1n~\xc6cvx|\xd1<\xdb\xdb\xbf8<9\x96\xc8\xa3\xa5\\\xed\x9d\x1eJ\xcc\xd0S\tA\xce\xe5\xfc\xa2\x88p\x9a\xd6\xdf\x91\xef\xc7\'\x07M\x89B?\x9e\x13n\xd2\xdb\x05;\xda\xfbE\xa0*\xa0\xca\xc1\xbb\xe6\x85\xc4\xe0\xb3\xe6\xd1\xde\xe11`\x90\xc4\xcb\xf3\xd3\xbd\xac\x8e\x81\x86Dw\xd2p\x12d\xe0\x94\xd5\x13~\x83s\xd8\x95\xe7ysX6\xf6\x00\xb9Zf\x17X\xcf\x1b\x93\x99\xf1\xa4\x03\x02\x0c<\x00\xff4\x06,\x8a\xc5#J\xc8\xe2i2\x86_\xb9W\xa2\xa1\xc5\xcb\x97/\x11\xe1\xfa\x81\xc2\xb2.\xbd\x10\x9e\x01\xc2u;*\x1d\x90\xce\x1b\xfb\x15\xff>\xa8\x98\x88x\x11\x1fy$$\x9b\x88}$0\t\xe3VB@\xe8\xd71\xc8\xd2\x88~\xe6\xe78\x0c\xc6\xe2m\x8cGy\xc07\xaaJo\xd5{/\x0c\xe0\xf3H\xbdv\xc2\xde\x14\x9a\xe9\xc8\xc2\xa2;S\x81\xc9\xe7\x17\xec.[\x14M\xb5(\xf6\xe0\x01\x8b\xa1\x994t\xdb\\m\xb2\x0b,\x8e\x905\xd9\x95\xac\x19\x10\xbe\xb6U}\xce*\xec\xe79\xe9C\xd8\x86\xf8\xa9]\x18Z\x8elmn(\xcb\x11\xb4+Y\xe4L\x89\x0c1\x19\xbb\x15l\xfe\\\xced\xf2u\xc4I]\x83\x07\xd4\x98\xc29\xf0\xcb\xca\xa6\x89\x11\xb3\x83\xc6\x1ae\x95\xcc\xd9o\x08O>G\x1e\x98\xa2\x87\x99\x1b\x89\xa4\xce0\xec\xde\xa0aIM\xbcw\x07\xbc{\x93s=\x01\xd6\xa7,\x91\xcc\x1c\xa8\x17\xc3\xb7\x0bz\xa1XMd\x1d\xc9\xb8\xe4\xfa\x1d\x9a\xda#\x03\xbc\xeb\xe8\x0es3:\'\x15\x199\x13\xb9\x03g\x7fkA\xbd\x85\xc2\x9a\x08N\x07\xff\x9f\xcb\x8b\xb1|7\x18v\xa3^\xd2\xb5\xbc+\x1f\x9f\xcd\x84\xd9 \xce\xeb3\x08\xf2Y\xac\xe0\x9c}\xe7\x8e\x10O\x9d\x00\xad:B\xb5Fs\x96E\xf2\xb0\x1c\xd6\x98\xb2;\x9a\xa4\x18\xd5Oe\xdc\x1eZ\x1an\xac\xbf\xb4aaA\xda\xf6V\xd5\x86%\xa5NM\xae\x951),&\xa8\x19\xb8,\xd1\x89\xdb\x0c\x0f\x95\xa7J\x9c\x04\xf5VX\xd8\x8b\xda(\x80\x88\xe4\x94{\x0b3^\xaa\xcd\xd0\xb3\x8c>g,M\x9b\xfc\xa8DUc\xaf=\x93M\x14\xbdT\xecT\xd1\x03\x80\xbe\x15\x80n\xb4L\x10[\x89)|/l\x02\x80\xfe\xa0v-\x1e\xa0\xf6\x00w\xa4\x8b\x90:f\xa2\x7f\xdf\xd0\x0f\xf8{"`\xf0\xe1L\x18\x8d\x8b\x97\xd3\xb47H\xe6x\xb2?\x89"\x98\xf03\xa0\x9a\x84\xe6"\xf5Tr)Y\xca>\x19\xdf\xa7\xf3L\xadh\xb4\xca\x0c\xc2\xc4\xefr\xc1\xb5\xb63\xcb\xb7\x04]uQS\xac\x89\x13\x99\xa3DQ\x86a\x01\xe9\xb6\xc4\x91_\xa4\x9c%"2\x92A\x17\tM\xae\xf1SI&\xb3\x8e\xfb\xfa\xaa_\xd7v\xa3J\xad^\xc5&jz\x13\x95Z\xa1\x11\xa1\xf9m\xb5\x0br\xf5\xad8\xa5\xe4\xf3\xf2q\xc1\x8e\x97;\xfc~\x1c\xc6\xbc\xb7\xbc\xcc\x1d\xa0\xfc\x00\x8b\xe5\xe5P\x1cW\xaa\x04\x0b\xd6T@6\x8c\x80BJ\x9a\xdb\x15\r\xc6\x1f\xbda\\\x0f[\xd5\xf6\x0cV\x0ce\x9c\xe5\xacm\xee\xbeC$\xcc\x9b\x9b>\xcc\x008\xf3\x06\x01\xbb\xfa\xc1\xbfr@\xb8\x9b\xcf\xb8\xa1e\xdcH3> 
`\xeaMA4\xf6XN4\xae\xdf3i\xee^\xbfp\x1f4w\x0c`\xfa%\xb4\xeaW\xeeRu\xe6&r\x89\x9c\xebKa\xef\xf1\xd1j\xba\xe7\x04z\xd8\xaf\xce\x05\x11\xba\x87\x87\\+\xb0\xda\x9d\xdbmt\xfd\xc9Q?\xa2\x99\xf9$\xe6\xc7B\x97\x8e\xde\x95d\x9b\xef9\xfbd\xb8\xb6\'\xdf\x0e\xae\xf1\xad\x89r\xbeLy\xd7\xc4\x94{-%\xbe\xc4\x94\x0b\x9bMSE\xc1\xe7\x0c\xedE8\x00\xb2\xb4\xea9g\x98\x13\xf9\x003\x13\x11\x80\x06\x08\xfa3\xc9\xecpa\'\xbf\x15VK\xe4\xa9\x19\x996\xee\xe5\xc8?\xe06\xdeY5\x89\x8dhh\xb8 Hz\x17m\xb9\xfe\x00\x8a\xd1.Z\xe4\x06\xb8\x94\xf7\xac\x0e\x10\xf5\xa5\xaa\x1c\x01\xa2m\xe0\xe4H\x0be)R\x9b%\xb21\xd3\xc8!\x16+\xa1@\xa2p\x94\x11$Y\x10H\xa6,\xa1S\xa7\x0c\xc2\xea\xf4\xcc\x14\xeai\xcd\xa6Q*%\xc4\xb1j\x92;MM\xe8\x105\x91S\x88\xec\x01\x1a\xf5\x15\x1d\x8d\\+z|\xa4\x83\xdf\xfa\xda\x9a\xf8\xeah\xde\xe8\xa6\xbd'

Curl reads the headers as follows:

$ curl --http1.1 -s -o /dev/null -D - https://www.vapestore.co.uk/
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 25 Apr 2023 08:17:49 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
X-Cache-Lifetime: 15552000
Report-To: {"group":"report-endpoint","max_age":10886400,"endpoints":[{"url":"\/csp_reporter.php"}]}
Content-Security-Policy: font-src *.sagepay.com fonts.googleapis.com …

At the moment I am puzzled as to what is causing this.

@GeorgeA92 (Contributor)

@Gallaecio
From my understanding, the source of the twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>] error is the call line, self._buffer = self._buffer.split(self.delimiter, 1), in these lines from the Twisted source:

            try:
                line, self._buffer = self._buffer.split(self.delimiter, 1)
            except ValueError:
                if len(self._buffer) >= (self.MAX_LENGTH + len(self.delimiter)):
                    line, self._buffer = self._buffer, b""
                    return self.lineLengthExceeded(line)
                return
            else:
                lineLength = len(line)
                if lineLength > self.MAX_LENGTH:
                    exceeded = line + self.delimiter + self._buffer
                    self._buffer = b""
                    return self.lineLengthExceeded(exceeded)
                why = self.lineReceived(line)
                if why or self.transport and self.transport.disconnecting:
                    return why

It happens when the buffer contains no self.delimiter (\r\n) at the time of the split call, because the received chunk does not include the complete value of the header.
On a later iteration, once _buffer has grown to roughly two chunks, it calls self.lineLengthExceeded, which leads to self.transport.loseConnection().
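A minimal, self-contained sketch of that failure mode (made-up buffer contents, not Scrapy or Twisted code; 16384 is LineReceiver's default MAX_LENGTH):

# bytes.split() raises ValueError when the delimiter is absent, which is
# how LineReceiver detects an incomplete line in its buffer.
MAX_LENGTH = 16384  # default of twisted.protocols.basic.LineReceiver
delimiter = b"\r\n"

# A header line longer than MAX_LENGTH arrives, its delimiter not yet seen:
buffer = b"Content-Security-Policy: " + b"x" * 20000

try:
    line, buffer = buffer.split(delimiter, 1)
except ValueError:
    # No delimiter yet: normally LineReceiver just keeps buffering.
    # But once the buffer is already longer than any valid line could be,
    # it gives up, which ends in transport.loseConnection().
    if len(buffer) >= MAX_LENGTH + len(delimiter):
        print("lineLengthExceeded -> connection dropped mid-headers")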

The value of twisted.protocols.basic.LineReceiver.MAX_LENGTH that causes this is relatively easy to update/monkey-patch:

from twisted.protocols.basic import LineReceiver

# Raise Twisted's per-line limit before Scrapy opens any connection.
LineReceiver.MAX_LENGTH = 2 ** 17

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://www.vapestore.co.uk/"]
    custom_settings = {"RETRY_TIMES": 0}

    def parse(self, response):
        pass


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

# log output (fragment):
'''
2023-04-28 21:06:32 [scrapy.core.engine] INFO: Spider opened
2023-04-28 21:06:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-28 21:06:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-04-28 21:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vapestore.co.uk/> (referer: None)
2023-04-28 21:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-28 21:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
'''

@Gallaecio (Member, Author)

Indeed it was as simple as that 🤦

codecov bot commented May 4, 2023

Codecov Report

Merging #5911 (e517f64) into master (cddb8c1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head e517f64 differs from pull request most recent head 5f45494. Consider uploading reports for the commit 5f45494 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5911   +/-   ##
=======================================
  Coverage   88.21%   88.21%           
=======================================
  Files         163      163           
  Lines       11533    11535    +2     
  Branches     1877     1877           
=======================================
+ Hits        10174    10176    +2     
  Misses       1037     1037           
  Partials      322      322           
Files Changed Coverage Δ
scrapy/core/downloader/handlers/http11.py 92.35% <100.00%> (+0.05%) ⬆️

@Gallaecio Gallaecio marked this pull request as ready for review May 4, 2023 15:46
Comment on lines 15 to 26
- from twisted.web.client import (
-     URI,
-     Agent,
-     HTTPConnectionPool,
-     ResponseDone,
-     ResponseFailed,
- )
+ from twisted.web._newclient import HTTP11ClientProtocol as TxHTTP11ClientProtocol
+ from twisted.web._newclient import HTTPClientParser as TxHTTPClientParser
+ from twisted.web._newclient import (
+     RequestGenerationFailed,
+     RequestNotSent,
+     TransportProxyProducer,
+ )
+ from twisted.web.client import URI, Agent
+ from twisted.web.client import HTTPConnectionPool as TxHTTPConnectionPool
+ from twisted.web.client import ResponseDone, ResponseFailed
+ from twisted.web.client import _HTTP11ClientFactory as TxHTTP11ClientFactory
@Gallaecio (Member, Author)

I was planning on avoiding importing private stuff from Twisted, but things got quite complicated quickly. I think it may be easier to deal with breakages as they come.
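One way to soften that trade-off, sketched here purely as an illustration (this is not what the PR does): guard the private imports so that a future Twisted release that moves or removes these names fails with an actionable message.

# Illustrative pattern only (not part of this PR): import Twisted's private
# names defensively, so that a Twisted upgrade that relocates them produces
# a clear error instead of an obscure traceback at crawl time.
try:
    from twisted.web._newclient import HTTPClientParser as TxHTTPClientParser
except ImportError as exc:
    raise ImportError(
        "This download handler depends on twisted.web._newclient internals "
        "that are missing from the installed Twisted version."
    ) from exc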

@wRAR (Member) commented Jun 21, 2023

Is it possible to just monkey-patch HTTPClientParser?
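For reference, such a patch could be as small as overriding one class attribute, assuming HTTPClientParser still inherits MAX_LENGTH from twisted.protocols.basic.LineReceiver (a sketch, not code from this PR):

# HTTPClientParser is private Twisted API; in current Twisted releases it
# inherits MAX_LENGTH from LineReceiver, so overriding it on the subclass
# raises the per-header-line limit for HTTP response parsing only.
from twisted.web._newclient import HTTPClientParser

HTTPClientParser.MAX_LENGTH = 2 ** 17  # example value; the default is 16384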

@Gallaecio (Member, Author)

I was trying to avoid that at all costs as a matter of principle, but for this specific case, that may indeed be best.

@wRAR (Member) commented Jun 21, 2023

Alternatively, can you please add a comment to the line(s) changed in the copied class (i.e. the lines that use _HTTPClientParser, as far as I understand)? I think the top-level comment is not enough to quickly understand what was changed.

@Gallaecio Gallaecio marked this pull request as draft June 21, 2023 18:24
@Gallaecio Gallaecio marked this pull request as ready for review August 31, 2023 07:23
@wRAR wRAR requested a review from kmike August 31, 2023 09:19
@Gallaecio Gallaecio marked this pull request as draft August 31, 2023 09:19
@Gallaecio (Member, Author)

@kmike suggested that, before we consider merging this approach, we should bring the topic up with the Twisted folks, in case they are open to either incrementing the default value upstream or providing an easier way to change it without the need to monkey-patch.

Successfully merging this pull request may close these issues.

Default downloader fails to get page