Default downloader fails to get page #355

Open
mfyang opened this Issue Jul 23, 2013 · 4 comments

mfyang commented Jul 23, 2013

'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

It looks like the default downloader, which is implemented on top of the Twisted library, can't fetch the URL above. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749' and got the following output:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.

Both urllib2.urlopen and requests.get, however, can download the page without any problem.
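
For comparison, a quick check along these lines (my own sketch, not code from the original report) succeeds where the Scrapy downloader fails:

import urllib2
import requests

url = 'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

# The stdlib client fetches the page fine...
body = urllib2.urlopen(url).read()
print len(body)

# ...and so does requests, which suggests the problem is in the
# Twisted-based HTTP client rather than on the server side.
resp = requests.get(url)
print resp.status_code, len(resp.text)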

stav (Contributor) commented Aug 3, 2013

The initial cause of the error is a Set-Cookie header line that is too long:

stav@maia:~$ curl -I http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 195855
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ResearchBackUrl=/research/userreviews/reviewlist.aspx?ModelID=14749; path=/
Set-Cookie: vReview=rid=441,1165,1248,1269,1272,1284,1417,1434,1455,1723,1800,1857,1875,2379,2396,2406,2439,2456,2734,2901,2944,2991,3046,3059,3157,3313,3576,3613,3634,3672,3986,4106,4227,4367,4461,4739,4857,4984,5073,5106,5275,5388,5406,5559,5592,5764,5771,5808,5838,5893,5962,6055,6198,6229,6332,6543,6546,6549,6826,6835,6839,6855,6881,6919,7021,7065,7112,7124,7196,7223,7329,7398,7411,7577,7579,7696,7698,7757,7759,7787,7973,7989,8136,8188,8189,8201,8231,8271,8285,8298,8346,8465,8482,8510,8521,8579,8613,8642,8744,8754,8812,8858,8875,8948,9000,9048,9116,9208,9223,9428,9468,9494,9561,9753,9844,10021,10063,10071,10091,10093,10120,10169,10193,10199,10212,10267,10317,10336,10361,10376,10446,10452,10481,10494,10500,10528,10535,10547,10556,10590,10607,10609,10619,10624,10625,10629,10662,10690,10706,10734,10753,10762,10772,10776,10819,10840,10861,10873,10902,10922,10932,11020,11031,11044,11046,11102,11132,11159,11173,11218,11227,11244,11336,11356,11434,11446,11453,11484,11531,11536,11545,11553,11559,11566,11577,11589,11595,11598,11636,11668,11706,11764,11784,11785,11792,11797,11799,11818,11829,11855,11857,11885,11943,11946,11955,11957,11963,11990,11997,12017,12059,12062,12105,12146,12163,... >>> longer than 63923
Set-Cookie: MC1=V=3&GUID=56202f9931a94d0e928050b01980dfe6; domain=.msn.com; expires=Mon, 04-Oct-2021 16:00:00 GMT; path=/
X-Powered-By: ASP.NET
Date: Sat, 03 Aug 2013 15:36:52 GMT

This is caught by twisted/protocols/basic.py:

if len(self.__buffer) > self.MAX_LENGTH:  # 16384
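
When that limit is exceeded, LineReceiver hands the oversized line to lineLengthExceeded, whose default implementation (paraphrased here from the Twisted source) simply drops the connection:

# twisted/protocols/basic.py (paraphrased sketch, Twisted 13.x era)
def lineLengthExceeded(self, line):
    # Default policy: close the connection of a peer that sends a
    # line longer than MAX_LENGTH.
    return self.transport.loseConnection()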

But the transport object used by Scrapy's bundled client (the TransportProxyProducer below) has no loseConnection method, hence the exception:

Traceback (most recent call last):
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 1431, in dataReceived
    self._parser.dataReceived(bytes)
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 556, in dataReceived
    return self.lineLengthExceeded(line)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 638, in lineLengthExceeded
    return self.transport.loseConnection()
AttributeError: 'TransportProxyProducer' object has no attribute 'loseConnection'

That AttributeError is then caught here:

> /srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py(1432)dataReceived()->None
-> self._parser.dataReceived(bytes)

By the infamous catch-all except clause, which obscures the real error:

def dataReceived(self, bytes):
    """
    Handle some stuff from some place.
    """
    try:
        self._parser.dataReceived(bytes)
    except:
        self._giveUp(Failure())
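
A minimal sketch of the kind of fix this points at (my assumption, not an actual patch from this thread): give the proxy object a loseConnection that delegates to the wrapped transport.

# Hypothetical sketch against scrapy/xlib/tx/_newclient.py; the real
# TransportProxyProducer stores the wrapped transport in self._producer.
class TransportProxyProducer(object):

    def __init__(self, producer):
        self._producer = producer

    # ... existing pauseProducing/resumeProducing/stopProducing ...

    def loseConnection(self):
        # Delegate to the wrapped transport if it is still attached,
        # so lineLengthExceeded() can actually close the connection.
        if self._producer is not None:
            self._producer.loseConnection()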
boyce-ywr commented Nov 14, 2015

Is it fixed?

nyov (Contributor) commented Mar 29, 2016

The code in xlib/tx/_newclient.py hasn't changed from what @stav wrote down, so there is no fix there. But if the issue persists with Twisted > 13, then it is (still) a bug in the Twisted project itself, as the bundled tx code isn't used with newer Twisted versions.

If there has been a fix for this upstream, it may still be too much trouble to backport it to the old pre-13 xlib/tx code. So I would propose closing this (and reporting it to Twisted if the issue persists).
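
In the meantime, a possible user-side stopgap (an untested sketch, not something proposed in this thread) is to raise Twisted's shared line-length limit before the crawl starts, since the HTTP parser inherits it from LineReceiver:

# Untested sketch: bump the shared line-length limit, e.g. near the top
# of a project's settings.py. twisted.web._newclient.HTTPParser inherits
# MAX_LENGTH (16384 by default) from LineReceiver, so raising the class
# attribute would let the ~64 KB Set-Cookie line through instead of
# closing the connection.
from twisted.protocols import basic
basic.LineReceiver.MAX_LENGTH = 2 ** 18  # 262144 bytes

Note that this widens the limit for every LineReceiver-based protocol in the process, so it is a blunt instrument rather than a proper fix.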


redapple (Contributor) commented Sep 13, 2016

For the record, this still fails with Twisted 16.4.

