Default downloader fails to get page #355

Open
mfyang opened this issue Jul 23, 2013 · 8 comments

Comments

@mfyang mfyang commented Jul 23, 2013

'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

It looks like the default downloader, which is implemented with Twisted, can't fetch the above URL. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749' and got the following output.

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.

But both urllib2's urlopen and requests.get download the page without problems.

Contributor

@stav stav commented Aug 3, 2013

The initial cause of the error is that there is a cookie header line that is too long:

stav@maia:~$ curl -I http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 195855
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ResearchBackUrl=/research/userreviews/reviewlist.aspx?ModelID=14749; path=/
Set-Cookie: vReview=rid=441,1165,1248,1269,1272,1284,1417,1434,1455,1723,1800,1857,1875,2379,2396,2406,2439,2456,2734,2901,2944,2991,3046,3059,3157,3313,3576,3613,3634,3672,3986,4106,4227,4367,4461,4739,4857,4984,5073,5106,5275,5388,5406,5559,5592,5764,5771,5808,5838,5893,5962,6055,6198,6229,6332,6543,6546,6549,6826,6835,6839,6855,6881,6919,7021,7065,7112,7124,7196,7223,7329,7398,7411,7577,7579,7696,7698,7757,7759,7787,7973,7989,8136,8188,8189,8201,8231,8271,8285,8298,8346,8465,8482,8510,8521,8579,8613,8642,8744,8754,8812,8858,8875,8948,9000,9048,9116,9208,9223,9428,9468,9494,9561,9753,9844,10021,10063,10071,10091,10093,10120,10169,10193,10199,10212,10267,10317,10336,10361,10376,10446,10452,10481,10494,10500,10528,10535,10547,10556,10590,10607,10609,10619,10624,10625,10629,10662,10690,10706,10734,10753,10762,10772,10776,10819,10840,10861,10873,10902,10922,10932,11020,11031,11044,11046,11102,11132,11159,11173,11218,11227,11244,11336,11356,11434,11446,11453,11484,11531,11536,11545,11553,11559,11566,11577,11589,11595,11598,11636,11668,11706,11764,11784,11785,11792,11797,11799,11818,11829,11855,11857,11885,11943,11946,11955,11957,11963,11990,11997,12017,12059,12062,12105,12146,12163,... >>> longer than 63923
Set-Cookie: MC1=V=3&GUID=56202f9931a94d0e928050b01980dfe6; domain=.msn.com; expires=Mon, 04-Oct-2021 16:00:00 GMT; path=/
X-Powered-By: ASP.NET
Date: Sat, 03 Aug 2013 15:36:52 GMT

This is caught by twisted/protocols/basic.py:

if len(self.__buffer) > self.MAX_LENGTH:  # 16384
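
To see which header lines of a captured response would trip a limit like this, the check can be approximated outside Twisted. This is a minimal sketch, assuming the 16384-byte limit quoted above; `find_oversized_lines` and the synthetic response head are hypothetical and not part of Twisted or Scrapy.

```python
# Limit quoted in the comment above (assumed to apply per header line).
MAX_LENGTH = 16384


def find_oversized_lines(raw_head: bytes, limit: int = MAX_LENGTH):
    """Return (header name, length) for each line longer than `limit` bytes."""
    oversized = []
    for line in raw_head.split(b"\r\n"):
        if len(line) > limit:
            # Everything before the first colon is the header name.
            name = line.split(b":", 1)[0].decode("latin-1")
            oversized.append((name, len(line)))
    return oversized


# Synthetic response head shaped like the curl output above (made-up IDs).
head = b"\r\n".join([
    b"HTTP/1.1 200 OK",
    b"Content-Type: text/html; charset=utf-8",
    b"Set-Cookie: vReview=rid=" + b",".join(str(n).encode() for n in range(12000)),
    b"Set-Cookie: MC1=V=3; path=/",
])

for name, length in find_oversized_lines(head):
    print(f"{name}: {length} bytes exceeds {MAX_LENGTH}")
```

Only the long vReview cookie line is reported; the short headers pass the check.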

But the Scrapy implementation of the transport does not have a loseConnection method, ergo the Exception:

Traceback (most recent call last):
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 1431, in dataReceived
    self._parser.dataReceived(bytes)
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 556, in dataReceived
    return self.lineLengthExceeded(line)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 638, in lineLengthExceeded
    return self.transport.loseConnection()
AttributeError: 'TransportProxyProducer' object has no attribute 'loseConnection'

Which is caught here:

> /srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py(1432)dataReceived()->None
-> self._parser.dataReceived(bytes)

By the infamous catch-all except: obfuscator:

def dataReceived(self, bytes):
    """
    Handle some stuff from some place.
    """
    try:
        self._parser.dataReceived(bytes)
    except:
        self._giveUp(Failure())
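
The chain described above can be reproduced in miniature without Twisted. The sketch below is a simplified stand-in, not the real classes: `LineParser`, `ProxyWithoutLoseConnection` and `Protocol` are hypothetical names, and the only point is to show how the catch-all handler ends up recording an AttributeError instead of the real cause (an over-long line).

```python
class LineParser:
    """Stand-in for a line-based parser with a maximum line length."""

    MAX_LENGTH = 16384

    def __init__(self, transport):
        self.transport = transport

    def dataReceived(self, data):
        if len(data) > self.MAX_LENGTH:
            # Mirrors lineLengthExceeded(): ask the transport to hang up.
            return self.transport.loseConnection()


class ProxyWithoutLoseConnection:
    """Stand-in for a transport wrapper that lacks loseConnection()."""


class Protocol:
    def __init__(self):
        self._parser = LineParser(ProxyWithoutLoseConnection())
        self.failure = None

    def dataReceived(self, data):
        try:
            self._parser.dataReceived(data)
        except Exception as exc:  # the quoted code uses a bare `except:`
            # The over-long line is never mentioned; only this error survives.
            self.failure = exc


proto = Protocol()
proto.dataReceived(b"Set-Cookie: " + b"x" * 70000)
print(type(proto.failure).__name__, "-", proto.failure)
```

The recorded failure names the missing method, so anyone reading it learns nothing about the header length that triggered it.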

@boyce-ywr boyce-ywr commented Nov 14, 2015

Is it fixed?

Contributor

@nyov nyov commented Mar 29, 2016

The code in xlib/tx/_newclient.py hasn't changed from what @stav wrote down, so there is no fix there. But if the issue persists with Twisted > 13, then it's (still) a bug in the Twisted project, as the bundled tx code isn't used with newer Twisted versions.

If there has been a fix for this upstream, it may still be too much trouble to backport it to the old pre-13 xlib/tx code. So I would propose closing this (and reporting it to Twisted if the issue persists).

Contributor

@redapple redapple commented Sep 13, 2016

ftr, this still fails with Twisted 16.4

@0xbf00 0xbf00 commented Sep 11, 2018

I'm running into this issue again with

scrapy shell https://macupdate.com

This command produces

2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
    result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]

I've had trouble debugging the actual underlying issue, but the server is also sending an overly large header field and so I suspect the issue is the same.
How should one go about fixing this (locally)? Since Twisted is likely not going to fix this (see here), I've tried setting larger MAX_LENGTH constants in /twisted/protocols/basic.py. However, that seems to have no effect for me...
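
For what it's worth, the ValueError text in the log is what Python raises when a two-name unpacking receives a single item. One plausible reading, which is an assumption and not verified against Twisted's source, is that a header fragment without a colon reaches a "name: value" split. A minimal sketch of that failure mode, with `split_header` as a hypothetical helper:

```python
def split_header(line: bytes):
    """Split b'Name: value' into (name, value); hypothetical helper."""
    name, value = line.split(b":", 1)
    return name, value.strip()


# A well-formed header splits into exactly two parts.
print(split_header(b"Content-Type: text/html"))

try:
    # A fragment with no colon, e.g. the tail of a header that was cut off.
    split_header(b"12163,12170,12201")
except ValueError as exc:
    print("ValueError:", exc)
```

If this reading is right, the message points at a truncated or mis-split header line and not at the page content itself.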

@0xbf00 0xbf00 commented Sep 18, 2018

I've written up a workaround here.

Contributor

@nyov nyov commented Sep 19, 2018

@0xbf00 Thanks for providing a working workaround.
That does seem like an obscene amount of overkill for a solution (putting a TLS MITM proxy beneath Scrapy) 🤣
(I wouldn't even mind much, if mitmproxy were not so obsessive about up-to-date dependency requirements that I can't easily use an up-to-date version.)

I tried to build you a more internal solution, assuming the only problem seems to be the LINE LENGTH:

# myproject/settings.py

### Force HTTP1.0 Handler
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
}

#TODO?# MAX_HTTP_LINE_LENGTH = 65536
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.downloader.ScrapyClientContextFactory'

# myproject/downloader.py
from OpenSSL import SSL

from scrapy.core.downloader.webclient import (
    ScrapyHTTPPageGetter as HTTPPageGetter,
    ScrapyHTTPClientFactory as HTTPClientFactory,
)
from scrapy.core.downloader.contextfactory import \
    ScrapyClientContextFactory as ClientContextFactory


class ScrapyBadHTTPPageGetter(HTTPPageGetter):

    delimiter = b'\n'
    # Maximum Line Length of LineReceiverProtocol
    MAX_LENGTH = 65536

    # no idea how to get at settings here, so scratch that
    #def __init__(self, *a, **kw):
    #    self.MAX_LENGTH = settings.getint('MAX_HTTP_LINE_LENGTH', 16384)


class ScrapyHTTPClientFactory(HTTPClientFactory):

    protocol = ScrapyBadHTTPPageGetter


class ScrapyClientContextFactory(ClientContextFactory):

    def __init__(self):
        # default method is SSLv23_METHOD
        self.method = SSL.SSLv23_METHOD

However, this still doesn't seem to work on your domain "https://www.macupdate.com/" (YMMV):
The error with this now is an SSL handshake failure: Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]. That's not an issue of Scrapy IMO, but the server trying to negotiate something stupid.

/edit: I figured out this is because of the missing SNI support in the HTTP10Downloader/OpenSSL combo.

But perhaps you can manage to make that work by changing the ClientContextFactory, which is why I provided an override of ScrapyClientContextFactory here as well? (No idea actually)

(The better solution would be to fix it in the HTTP1.1 downloader instead, but that class is a lot more involved, so I couldn't manage to fix it there so far. And HTTP1.0 is usually still good enough for many sites.)
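
On the SNI point above: SNI is the TLS extension in which the client names the host it wants to reach, and some servers refuse the handshake without it. The sketch below uses only Python's standard-library ssl module (not the pyOpenSSL/Twisted stack discussed here) to show, without any network I/O, that the hostname only appears in the client's first handshake message when a server hostname is supplied. `client_hello` is a hypothetical helper.

```python
import ssl


def client_hello(server_hostname=None) -> bytes:
    """Return the first handshake bytes a client would send (no network I/O)."""
    ctx = ssl.create_default_context()
    if server_hostname is None:
        # Without a hostname there is nothing to check the certificate against.
        ctx.check_hostname = False
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client is now waiting for the server's reply
    return outgoing.read()


with_sni = client_hello("www.macupdate.com")
without_sni = client_hello()

print(b"www.macupdate.com" in with_sni)
print(b"www.macupdate.com" in without_sni)
```

A client context factory that never passes the hostname down to the TLS layer would behave like the second call, which fits the handshake failure described above.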

@0xbf00 0xbf00 commented Sep 19, 2018

@nyov Thanks for your input! I know that my workaround is not ideal, but it works for me and it involves no fiddling with scrapy and twisted internals. Ideally, this could be fixed upstream, but I am not the person to do this.
