Unhandled error caused by HTTP non-compliant headers #210

Open
yeechen opened this issue Dec 20, 2012 · 1 comment


@yeechen yeechen commented Dec 20, 2012

Environment:

Scrapy 0.16.2
Twisted-12.2.0
python 2.7
macosx-10.6

Use Case 1:

Run:

scrapy shell http://aaa.17domn.com/bt9/file.php/MERH77V.html

Error:

[ScrapyHTTPPageGetter,client] Unhandled Error
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
why = getattr(selectable, method)()

 ...

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 406, in extractHeader
key, val = header.split(':',1)
exceptions.ValueError: need more than 1 value to unpack

Solution:
According to https://groups.google.com/forum/#!msg/scrapy-users/xFKo8ggzPxs/VXDl3CZ4V4cJ, this is caused by Twisted. I patched the extractHeader function in /twisted/web/http.py with the fix from http://twistedmatrix.com/trac/ticket/2842, and it works.
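The traceback shows `header.split(':',1)` failing because the header line contains no colon. As a rough illustration of the kind of tolerance needed (this is not the actual patch from the Twisted ticket, and `split_header` is a hypothetical helper name, not a Twisted function), a colon-less line can be treated as a key with an empty value:

```python
def split_header(line):
    """Split a raw header line into (key, value).

    Hypothetical helper for illustration only; not part of Twisted.
    A line without a colon is returned as (line, '') instead of raising
    ValueError the way `key, val = line.split(':', 1)` does.
    """
    # str.partition always returns three parts, even when ':' is absent
    key, _sep, val = line.partition(':')
    return key.strip(), val.strip()


print(split_header("Content-Type: text/html"))
print(split_header("this line has no colon"))
```

Whether such a header should be kept with an empty value or dropped entirely is a design choice; the sketch keeps it so that nothing is silently lost.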

Use Case 2:

Run:

scrapy shell http://www1.wkdown.info/fs3/file.php/M994ATR.html

Error:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/scrapy", line 5, in
...

File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 122, in _build_response
status = int(self.status)

ValueError: invalid literal for int() with base 10: 'html'

Not fixed yet!


@dangra dangra commented Jan 8, 2013

Thanks, the problem is clear and should be fixed.

I tried Chrome and it renders the page fine, ignoring the bad headers and assuming a 200 status.
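A minimal sketch of that browser-like leniency applied to the `int(self.status)` failure from Use Case 2, assuming a fallback of 200 is acceptable. `parse_status` and its `default` parameter are hypothetical names for illustration, not Scrapy's API or its eventual fix:

```python
def parse_status(raw_status, default=200):
    """Convert a raw status token to an int, falling back to `default`.

    Hypothetical helper for illustration only; not part of Scrapy.
    A malformed status such as 'html' yields the default instead of
    raising ValueError.
    """
    try:
        return int(raw_status)
    except (TypeError, ValueError):
        # Malformed or missing status: assume success, as browsers appear to do
        return default


print(parse_status("404"))
print(parse_status("html"))
```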

For later debugging, this is the curl output including headers for both URLs:
