Unhandled error caused by HTTP non-compliant headers #210

Open
yeechen opened this Issue Dec 20, 2012 · 1 comment


yeechen commented Dec 20, 2012

Environment:

Scrapy 0.16.2
Twisted-12.2.0
python 2.7
macosx-10.6

Use Case 1:

Run:

scrapy shell http://aaa.17domn.com/bt9/file.php/MERH77V.html

Error:

[ScrapyHTTPPageGetter,client] Unhandled Error
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
    why = getattr(selectable, method)()

  ...

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 406, in extractHeader
    key, val = header.split(':',1)
exceptions.ValueError: need more than 1 value to unpack

Solution:
According to https://groups.google.com/forum/#!msg/scrapy-users/xFKo8ggzPxs/VXDl3CZ4V4cJ, this is caused by Twisted. I patched the function extractHeader in /twisted/web/http.py using the patch from http://twistedmatrix.com/trac/ticket/2842, and it works.
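
For illustration, the failure in this traceback comes from unpacking the result of header.split(':',1) when the header line contains no colon. The sketch below shows one tolerant way to split such a line. parse_header_line is a hypothetical helper, not the patch from the Twisted ticket, and it assumes a malformed line can simply be skipped.

```python
def parse_header_line(line):
    """Split a raw header line into (key, value), or return None if malformed.

    Hypothetical helper for illustration only; this is not Twisted's
    extractHeader and not the patch attached to the Twisted ticket.
    """
    # str.partition never raises: sep is '' when the line has no colon.
    key, sep, val = line.partition(':')
    if not sep or not key.strip():
        # Assumption: a line without a usable "key:" part is ignored.
        return None
    return key.strip().lower(), val.strip()


print(parse_header_line('Content-Type: text/html'))
print(parse_header_line('this line has no colon'))
```

Returning None leaves the decision to the caller, which can drop the line or log it instead of letting a ValueError reach the reactor.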

Use Case 2:

Run:

scrapy shell http://www1.wkdown.info/fs3/file.php/M994ATR.html

Error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/scrapy", line 5, in
  ...

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 122, in _build_response
    status = int(self.status)

ValueError: invalid literal for int() with base 10: 'html'

Not fixed yet!
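
The second traceback shows int(self.status) receiving the string 'html', which suggests the server's status line was not a valid HTTP status line. As a sketch only, the conversion could fall back to a default instead of raising. parse_status and its default of 200 are assumptions for illustration, not Scrapy's actual fix.

```python
def parse_status(raw_status, default=200):
    """Convert a status token to an int, falling back to a default.

    Hypothetical helper; the default of 200 is an assumption made for
    illustration and is not taken from Scrapy's code.
    """
    try:
        return int(raw_status)
    except (TypeError, ValueError):
        # Non-numeric token, such as 'html' from a malformed status line.
        return default


print(parse_status('404'))
print(parse_status('html'))
```

Whether silently assuming success is acceptable depends on the crawler; a stricter variant could return a sentinel so the response is flagged as malformed.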

dangra Jan 8, 2013

Member

Thanks, the problem is clear and should be fixed.

I tried Chrome, and it renders the page fine, ignoring the bad headers and assuming a 200 status.

For later debugging, this is the curl output including headers for both URLs:
