Scrapy chokes on HTTP response status lines without a Reason phrase #345
The HTTP parser doesn't like a reasonless status line in a response.
As a scraping framework, we should be able to download the page and ignore the status-line bug.
How should this error be handled?
Extend or fix Twisted's HTTPClientParser so it doesn't discard the response.
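For context, this is roughly the check that trips on a reason-less status line: the parser splits the line into exactly three parts (version, code, reason phrase) and raises ParseError('wrong number of parts') when the phrase is missing. The snippet below is only a sketch of that behaviour, not Twisted's actual code:

    # Sketch only: mimics the strict status-line split that rejects
    # "HTTP/1.1 200" while accepting "HTTP/1.1 200 OK".
    def parse_status_line(status):
        parts = status.split(' ', 2)
        if len(parts) != 3:
            raise ValueError('wrong number of parts: %r' % status)
        version, code, phrase = parts
        return version, int(code), phrase

    print(parse_status_line('HTTP/1.1 200 OK'))  # ('HTTP/1.1', 200, 'OK')
    try:
        parse_status_line('HTTP/1.1 200')         # no reason phrase
    except ValueError as exc:
        print(exc)                                # wrong number of parts: 'HTTP/1.1 200'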
I'm a Scrapy user, and I got 'scrapy.xlib.tx._newclient.ResponseFailed'.
Are we gonna fix this one, @dangra?
@Tony36051: your problem is different; it was fixed in the Scrapy development branch and in the Scrapy 0.18.2 stable release. If not, create a new issue with a URL to reproduce it easily. Thanks. @pablohoffman: yes, it happens that an extended HTTP parser can't easily be hooked into the Twisted HTTP11 client. Want to take a look and discuss a better approach? I think the long-term option is to report the bug upstream and propose two things:
To access the parser from the Scrapy download handler we would have to go through:
Everything is easy except telling ... While writing about this I realized a non-monkeypatch solution, extending ...
I was so stupid: I was using a global proxy and it ran out of quota.
Hey there. We hit this problem recently -- is there a fix in the works?
@kbourgoin: the furthest we got is the description of a possible solution in #345 (comment)
I may not be able to offer any help, but I found my problem: I had set a global proxy and its traffic ran out (sorry for my poor English). After setting the network up correctly again, my Scrapy program worked well too. To find out what was wrong with Twisted, I used urlopen (a function in Python) to test downloading something in a Python framework, and what I got was just like the error page from my proxy. In a word, my problem resulted from a wrong global proxy config. Best wishes, Tony. -- Sent from Android NetEase Mail, in reply to Keith Bourgoin's comment above.
I recently solved this problem by using Twisted 11.0.0 with Scrapy 0.20. Thanks for the tip from @Tony36051.
Is there a way to reproduce this? I've tried different Twisted versions (13.2.0, 13.1.0, 10.2.0) and different Scrapy versions (0.18.4, 0.22.2, scrapy master), and scrapy fetch works fine. Maybe the website changed. I'm not sure I've understood @dangra's comment about the reasonless status line. Here is the current curl output:
The first line of the response was "HTTP/1.1 200"; it lacked the "OK" string.
Ah, I see.
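Since the live examples tend to change or disappear, a reason-less response is also easy to reproduce locally. The helper below is not from this thread; it is a hypothetical raw TCP server that answers every request with a bare "HTTP/1.1 200" status line:

    # Hypothetical reproduction helper: serves a response whose status line
    # has no reason phrase ("HTTP/1.1 200" instead of "HTTP/1.1 200 OK").
    import socket

    def serve_reasonless(host='127.0.0.1', port=8000):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        print('serving on http://%s:%d/' % (host, port))
        while True:
            conn, _ = srv.accept()
            conn.recv(4096)  # read and ignore the request
            body = b'hello'
            headers = ('HTTP/1.1 200\r\n'  # <-- no reason phrase here
                       'Content-Type: text/plain\r\n'
                       'Content-Length: %d\r\n'
                       'Connection: close\r\n'
                       '\r\n' % len(body))
            conn.sendall(headers.encode('ascii') + body)
            conn.close()

    if __name__ == '__main__':
        serve_reasonless()

Running this and then "scrapy fetch http://127.0.0.1:8000/" against an affected Scrapy/Twisted combination should raise the ResponseFailed/ParseError described above.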
My monkey patch as a workaround:

    def _monkey_patching_HTTPClientParser_statusReceived():
        """
        Monkey patch for scrapy.xlib.tx._newclient.HTTPClientParser.statusReceived
        to work around the error when the status line comes without "OK" at the end.
        """
        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived

        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError as e:
                if e.args[0] == 'wrong number of parts':
                    # Retry with a dummy reason phrase appended.
                    return old_sr(self, status + ' OK')
                raise

        statusReceived.__doc__ = old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived
Where do we put the monkey patch, @tonal?
Call the monkey patch before starting the first request.
Thank you very much @tonal, it worked like a charm 👍
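For anyone else wondering where to call it: a minimal sketch, assuming the patch function above is saved in a module of your project (the module, spider, and project names below are illustrative, not from this thread), is to run it once at process start, before the first request is scheduled:

    # Hypothetical runner script: apply the workaround once, then crawl.
    from scrapy.crawler import CrawlerProcess

    from myproject.patches import _monkey_patching_HTTPClientParser_statusReceived
    from myproject.spiders.example import ExampleSpider

    _monkey_patching_HTTPClientParser_statusReceived()  # patch before any request

    process = CrawlerProcess()
    process.crawl(ExampleSpider)
    process.start()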
I got this error message when using a VPN proxy. I captured it with Wireshark and found there is no response; it is fine when I stop the VPN proxy.
@lbsweek what do you mean by "no response"? An empty reply without even a first line?
After the failed attempt to fix this issue in #1140, I think the only viable approach is a monkeypatch similar to what @tonal proposes in #345 (comment)
On the latest Scrapy Ubuntu package (0.25.0-454-gfa1039f+1429829085) I receive similar errors:
My monkeypatch works only for scrapy.xlib.tx._newclient.ParseError, but I receive twisted.web._newclient.ParseError. What is the correct path to patch?
Monkey patch both. For more info on when Scrapy uses one or the other, see https://github.com/scrapy/scrapy/blob/master/scrapy/xlib/tx/__init__.py.
I added tonal's monkey patch, but I still receive the same error.
(Added 12-08) I have already solved it. I found that I was not using xlib.tx._newclient.py but twisted.web._newclient.py, so I changed the patch's import from scrapy.xlib.tx._newclient
to twisted.web._newclient,
and it is OK now.
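Putting the two comments above together, here is a sketch that patches whichever of the two module paths is importable (the helper names are illustrative; on old setups only the bundled scrapy.xlib.tx copy exists, on newer ones only twisted.web._newclient matters):

    # Illustrative: apply tonal's workaround to every HTTPClientParser that
    # Scrapy might use, whether from the bundled scrapy.xlib.tx copy or from
    # twisted.web._newclient directly.
    import importlib

    def _make_patched_statusReceived(old_sr, parse_error):
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except parse_error as e:
                if e.args[0] == 'wrong number of parts':
                    # On Python 3, where the status line is bytes, append b' OK'.
                    return old_sr(self, status + ' OK')
                raise
        return statusReceived

    def patch_reasonless_status_lines():
        for modname in ('scrapy.xlib.tx._newclient', 'twisted.web._newclient'):
            try:
                mod = importlib.import_module(modname)
            except ImportError:
                continue
            mod.HTTPClientParser.statusReceived = _make_patched_statusReceived(
                mod.HTTPClientParser.statusReceived, mod.ParseError)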
@sunhaowen Great, thanks, it works!
Surprisingly, there is a whole lot of Chinese in here. From: liuwwei3
I just found out about https://twistedmatrix.com/trac/ticket/7673
@redapple it could also be a bad proxy, not a bad server.
True.
Here is a live example at this time:
Another example: 404 and 302 responses from okcupid (200 pages have "OK"):
It seems that this case is common with, for example, custom nginx modules that only set the response status and no reason phrase.
Twisted has a patch ready to fix this issue: https://twistedmatrix.com/trac/ticket/7673#comment:5 PR twisted/twisted#723 🎉
Fixed in Twisted 17.5.0. Example websites from this ticket work for me with Scrapy 1.4.0 and Twisted 17.5.0, so I'm closing it. Thanks everyone!
Basically, Scrapy ignores 404 errors by default; this is defined in the HttpError middleware. So, add HTTPERROR_ALLOW_ALL = True to your settings file. After this you can access response.status in your parse function.
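A small sketch of what that looks like (the spider name and URL are illustrative, and the setting is enabled here via custom_settings instead of settings.py):

    # With HTTPERROR_ALLOW_ALL enabled, parse() is called even for 404/302
    # responses, so response.status can be inspected directly.
    import scrapy

    class StatusSpider(scrapy.Spider):
        name = 'status'
        start_urls = ['http://example.com/missing-page']
        custom_settings = {'HTTPERROR_ALLOW_ALL': True}

        def parse(self, response):
            self.logger.info('Got status %s for %s', response.status, response.url)
            yield {'url': response.url, 'status': response.status}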
Hi everyone, I have now also met this problem. My goal is to download PDF files from websites (such as "http://114.251.10.201/pdf/month?reportId=462837&isPublic=true"), but I cannot download these PDF files completely with a Scrapy downloader middleware (using that method, I found the size of many downloaded PDF files was 1 KB), so I turned to the stream method of the requests.get function (see https://github.com/scrapy/scrapy/issues/3880). But now when I run it, Scrapy often gets choked and says "[urllib3.connectionpool] DEBUG: http://114.251.10.201:80 'GET /pdf/month?reportId=128520&isPublic=true HTTP/1.1' 200 None". It looks like there is no failure, but Scrapy just chokes for several hours. Any suggestions?
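Not an answer from this thread, but a common way to keep a streamed requests download from hanging indefinitely is to pass an explicit timeout and write the body in chunks. A minimal sketch, with an illustrative filename:

    # Streamed download with an explicit (connect, read) timeout, so a
    # stalled connection raises an exception instead of blocking for hours.
    import requests

    def download_pdf(url, path, timeout=(10, 60)):
        with requests.get(url, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()
            with open(path, 'wb') as fh:
                for chunk in resp.iter_content(chunk_size=64 * 1024):
                    if chunk:
                        fh.write(chunk)

    download_pdf('http://114.251.10.201/pdf/month?reportId=462837&isPublic=true',
                 'report.pdf')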
Try fetching the page:
Output: