Fix “not enough values to unpack” when parsing headers #5911
base: master
Conversation
I have been trying to fix #355 (comment), which currently affects responses from vapestore.co.uk. With the current changes, where I print the header lines as they arrive to find out why Scrapy is choking on one of them, the response content suddenly starts looking binary:
Curl reads the headers as follows:
At the moment I am puzzled as to what is causing this.
@Gallaecio

```python
try:
    line, self._buffer = self._buffer.split(self.delimiter, 1)
except ValueError:
    if len(self._buffer) >= (self.MAX_LENGTH + len(self.delimiter)):
        line, self._buffer = self._buffer, b""
        return self.lineLengthExceeded(line)
    return
else:
    lineLength = len(line)
    if lineLength > self.MAX_LENGTH:
        exceeded = line + self.delimiter + self._buffer
        self._buffer = b""
        return self.lineLengthExceeded(exceeded)
    why = self.lineReceived(line)
    if why or self.transport and self.transport.disconnecting:
```

It happens when the buffer contains no delimiter, so `split` returns a single value. Raising `MAX_LENGTH` of `twisted.protocols.basic.LineReceiver` makes the crawl succeed:

```python
from twisted.protocols.basic import LineReceiver

LineReceiver.MAX_LENGTH = 2 ** 17

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://www.vapestore.co.uk/"]
    custom_settings = {"RETRY_TIMES": 0}

    def parse(self, response):
        pass


if __name__ == "__main__":
    p = CrawlerProcess()
    p.crawl(TestSpider)
    p.start()
```

Log output (fragment):

```
2023-04-28 21:06:32 [scrapy.core.engine] INFO: Spider opened
2023-04-28 21:06:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-04-28 21:06:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-04-28 21:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vapestore.co.uk/> (referer: None)
2023-04-28 21:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2023-04-28 21:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
```
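The unpacking failure in the quoted `LineReceiver` code can be reproduced in isolation. This is a minimal sketch (plain Python, not Scrapy or Twisted code) of why a buffer that does not yet contain the delimiter takes the `except ValueError` path:

```python
# Minimal stand-in for the split-and-unpack step in LineReceiver.dataReceived.
delimiter = b"\r\n"
buffer = b"X-Big-Header: " + b"a" * 100  # delimiter not received yet

# bytes.split(delimiter, 1) returns a single-element list when the
# delimiter is absent, so two-name unpacking raises ValueError.
try:
    line, buffer = buffer.split(delimiter, 1)
except ValueError:
    print("not enough values to unpack")  # the error in this PR's title

# With the delimiter present, unpacking succeeds:
buffer = b"Host: example.com\r\nrest"
line, buffer = buffer.split(delimiter, 1)
print(line)    # b'Host: example.com'
print(buffer)  # b'rest'
```

In `dataReceived`, the `except ValueError` branch then decides between waiting for more data and calling `lineLengthExceeded`, which is why an oversized header line aborts parsing instead of raising the unpacking error directly.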
Indeed it was as simple as that 🤦
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #5911 +/- ##
=======================================
Coverage 88.21% 88.21%
=======================================
Files 163 163
Lines 11533 11535 +2
Branches 1877 1877
=======================================
+ Hits 10174 10176 +2
Misses 1037 1037
Partials 322 322
```diff
-from twisted.web.client import (
-    URI,
-    Agent,
-    HTTPConnectionPool,
-    ResponseDone,
-    ResponseFailed,
-)
+from twisted.web._newclient import HTTP11ClientProtocol as TxHTTP11ClientProtocol
+from twisted.web._newclient import HTTPClientParser as TxHTTPClientParser
+from twisted.web._newclient import (
+    RequestGenerationFailed,
+    RequestNotSent,
+    TransportProxyProducer,
+)
+from twisted.web.client import URI, Agent
+from twisted.web.client import HTTPConnectionPool as TxHTTPConnectionPool
+from twisted.web.client import ResponseDone, ResponseFailed
+from twisted.web.client import _HTTP11ClientFactory as TxHTTP11ClientFactory
```
I was planning on avoiding imports of private Twisted APIs, but things got complicated quickly. I think it may be easier to deal with breakages as they come.
Is it possible to just monkey-patch …?
I was trying to avoid that at all costs as a matter of principle, but for this specific case, that may indeed be best.
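To illustrate the monkey-patching approach under discussion: the limit in question is a class attribute, so patching it on the class once, at import time, changes it for every instance through normal attribute lookup. This is a minimal sketch using a hypothetical stand-in class `Parser` rather than Twisted's actual parser:

```python
# Stand-in for a library class whose limit is a class attribute,
# analogous to twisted.protocols.basic.LineReceiver.MAX_LENGTH.
class Parser:
    MAX_LENGTH = 16384  # hypothetical library default

    def fits(self, line: bytes) -> bool:
        return len(line) <= self.MAX_LENGTH


# Monkey-patch the class attribute once, before the library uses it;
# both existing and future instances that do not shadow the attribute
# see the new value through normal attribute lookup.
Parser.MAX_LENGTH = 2 ** 17  # 131072

p = Parser()
print(p.fits(b"a" * 100_000))  # True after the patch
```

The trade-off raised in this thread still applies: a patch like this mutates library state globally and silently breaks if the library renames the attribute, which is why copying the class (this PR's approach) was considered first.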
Alternatively, can you please add a comment to the line(s) changed in the copied class (i.e. the lines that use …)?
@kmike suggested that, before we consider merging this approach, we should bring the topic up with the Twisted folks, in case they are open to either increasing the default value upstream or providing an easier way to change it without monkey-patching.
Fixes #355.
To do:
Do not rely on private APIs (too messy)