Scrapy chokes on HTTP response status lines without a Reason phrase #345

Closed
tonal opened this Issue Jul 11, 2013 · 35 comments


tonal commented Jul 11, 2013

Try to fetch this page:

$ scrapy fetch 'http://www.gidroprofmontag.ru/bassein/sbornue_basseynu'

output:

2013-07-11 09:15:37+0400 [scrapy] INFO: Scrapy 0.17.0-304-g3fe2a32 started (bot: amon)
/home/tonal/amon/amon/amon/downloadermiddleware/blocked.py:6: ScrapyDeprecationWarning: Module `scrapy.stats` is deprecated, use `crawler.stats` attribute instead
  from scrapy.stats import stats
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider opened
2013-07-11 09:15:37+0400 [amon_ra] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 09:15:37+0400 [amon_ra] ERROR: Error downloading <GET http://www.gidroprofmontag.ru/bassein/sbornue_basseynu>: [<twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'>>]
2013-07-11 09:15:37+0400 [amon_ra] INFO: Closing spider (finished)
2013-07-11 09:15:37+0400 [amon_ra] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 1,
         'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseFailed': 1,
         'downloader/request_bytes': 256,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 512010),
         'log_count/ERROR': 1,
         'log_count/INFO': 4,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 257898)}
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider closed (finished)

Member

dangra commented Jul 11, 2013

The HTTP parser doesn't like the reasonless status line in the response:

$ curl -sv http://www.gidroprofmontag.ru/bassein/sbornue_basseynu
> GET /bassein/sbornue_basseynu HTTP/1.1
> User-Agent: curl/7.27.0
> Host: www.gidroprofmontag.ru
> Accept: */*
> 
< HTTP/1.1 200
< Server: nginx
< Date: Thu, 11 Jul 2013 21:11:04 GMT
< Content-Type: text/html; charset=windows-1251
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=5
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Pragma: no-cache
< Set-Cookie: PHPSESSID=a95f6bd4dd61c03de33e11c049e3e970; path=/
< Set-Cookie: Apache=190.135.189.59.933341373577064378; path=/; expires=Fri, 11-Jul-14 21:11:04 GMT
< 
* Closing connection #0

Member

dangra commented Jul 11, 2013

As a scraping framework, we should be able to download the page and ignore the status-line bug.

tonal commented Jul 15, 2013

How can I handle this error?

Member

dangra commented Jul 15, 2013

Extend or fix Twisted's HTTPClientParser so it doesn't discard the response.

Tony36051 commented Aug 18, 2013

I'm a Scrapy user and I got scrapy.xlib.tx._newclient.ResponseFailed. In the Scrapy shell, parsing any URL gave the same error, which seemed to come from Twisted. So I guessed Twisted might be the culprit. I should "extend or fix twisted HTTPClientParser so it doesn't discard the response" as dangra said, but that may be too hard for me, so I changed my Twisted from 13.1.0 to 11.0.0.
It works.

Member

pablohoffman commented Sep 4, 2013

Are we going to fix this one, @dangra?

Member

dangra commented Sep 6, 2013

@Tony36051: your problem is different; it was fixed in the Scrapy development branch and in the Scrapy 0.18.2 stable release. If it still happens, create a new issue with a URL that reproduces it easily. Thanks.

@pablohoffman: yes; it turns out an extended HTTP parser can't be easily hooked into the Twisted HTTP/1.1 client. Want to take a look and discuss a better approach?

I think the long-term option is to report the bug upstream and propose two things:

  • a fix for this specific problem in the current Twisted HTTPClientParser class
  • a way to easily extend the HTTPClientParser used by HTTP11ClientProtocol
    (currently even monkey patching is hard)

To access the parser from the Scrapy download handler we have to go through:

  • HTTPConnectionPool has a private attribute named _factory that is set to _HTTP11ClientFactory
  • _HTTP11ClientFactory has a simple method named buildProtocol that instantiates an HTTP11ClientProtocol
  • HTTP11ClientProtocol instantiates an HTTPClientParser and multiple other things inside its request() method

Everything is easy except telling HTTP11ClientProtocol to use a different HTTPClientParser.

While writing this up I realized a non-monkeypatch solution: extend HTTP11ClientProtocol and use a property getter and setter for the HTTP11ClientProtocol._parser attribute; the setter converts the Twisted HTTPClientParser instance into our extended version. It's not pretty, but I can't see any better option. :)
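
A minimal sketch of that property trick, assuming Twisted's private _newclient internals (request() assigns self._parser; exact names vary across Twisted versions), could look like this:

from twisted.web._newclient import (HTTP11ClientProtocol,
                                    HTTPClientParser, ParseError)

class LenientHTTPClientParser(HTTPClientParser):
    """Tolerates a status line without a reason phrase, e.g. "HTTP/1.1 200"."""
    def statusReceived(self, status):
        try:
            return HTTPClientParser.statusReceived(self, status)
        except ParseError as e:
            if e.args[0] == 'wrong number of parts':
                # Pretend the server sent "HTTP/1.1 200 OK".
                return HTTPClientParser.statusReceived(self, status + ' OK')
            raise

class LenientHTTP11ClientProtocol(HTTP11ClientProtocol):
    """Intercepts the _parser assignment that request() performs."""
    @property
    def _parser(self):
        return self.__dict__.get('_parser')

    @_parser.setter
    def _parser(self, parser):
        if type(parser) is HTTPClientParser:
            # Convert the stock parser instance into the extended version.
            parser.__class__ = LenientHTTPClientParser
        self.__dict__['_parser'] = parser

The factory chain above (_factory -> buildProtocol) would still have to be pointed at LenientHTTP11ClientProtocol for this to take effect.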

Tony36051 commented Sep 6, 2013

I was being careless: I was using a global proxy and had run out of quota.
In the end everything is OK.
Thank you again.
Tony

kbourgoin commented Dec 2, 2013

Hey there. We hit this problem recently -- is there a fix in the works?

Member

dangra commented Dec 2, 2013

@kbourgoin: the furthest we got is the description of a possible solution in #345 (comment).

Tony36051 commented Dec 2, 2013

I may not be able to offer any help, but I found my problem: I had set a global proxy and had run out of traffic (so sorry for my poor English). After getting the network working again, my Scrapy program works well too. To find out what was wrong with Twisted, I used Python's urlopen function to test downloading something, and what I got was just the error page from my proxy. In a word, my problem resulted from a wrong global proxy config.

Best wishes,
Tony

ZhiweiWang commented Dec 7, 2013

I recently solved this problem by using Twisted 11.0.0 with Scrapy 0.20. Thanks for the tip, @Tony36051.

Member

kmike commented Apr 24, 2014

Is there a way to reproduce this? I've tried different Twisted versions (13.2.0, 13.1.0, 10.2.0) and different Scrapy versions (0.18.4, 0.22.2, Scrapy master), and scrapy fetch works fine. Maybe the website changed. I'm not sure I've understood @dangra's comment about the reasonless status line. Here is the current curl output:

(scraping)kmike ~/scrap > curl -sv http://www.gidroprofmontag.ru/bassein/sbornue_basseynu | head
* Adding handle: conn: 0x7fd56c004000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fd56c004000) send_pipe: 1, recv_pipe: 0
* About to connect() to www.gidroprofmontag.ru port 80 (#0)
*   Trying 89.111.176.172...
* Connected to www.gidroprofmontag.ru (89.111.176.172) port 80 (#0)
> GET /bassein/sbornue_basseynu HTTP/1.1
> User-Agent: curl/7.30.0
> Host: www.gidroprofmontag.ru
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server nginx is not blacklisted
< Server: nginx
< Date: Thu, 24 Apr 2014 17:42:15 GMT
< Content-Type: text/html; charset=windows-1251
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=5
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Pragma: no-cache
< Set-Cookie: PHPSESSID=4e95cb26606029b00725f7d4c631f974; path=/
< Set-Cookie: Apache=176.215.38.50.1398361334922150; path=/; expires=Fri, 24-Apr-15 17:42:14 GMT
< 
{ [data not shown]
<html>
<head>

Member

dangra commented Apr 24, 2014

The first line of the response was "HTTP/1.1 200"; it lacked the "OK" reason phrase.
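
Since the original site apparently no longer misbehaves, a tiny local server can reproduce such a reasonless status line for testing. This is a hypothetical helper, not part of Scrapy:

import socket

def serve_reasonless(port=8080):
    """Serve every connection a response whose status line has no reason phrase."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', port))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        conn.recv(4096)  # read (and ignore) the request
        body = b'<html>hello</html>'
        headers = ('HTTP/1.1 200\r\n'  # no "OK" reason phrase here
                   'Content-Type: text/html\r\n'
                   'Content-Length: %d\r\n'
                   'Connection: close\r\n'
                   '\r\n' % len(body)).encode('ascii')
        conn.sendall(headers + body)
        conn.close()

if __name__ == '__main__':
    serve_reasonless()  # then try: scrapy fetch http://127.0.0.1:8080/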

Member

kmike commented Apr 24, 2014

ah, I see

tonal commented Apr 29, 2014

My monkey patch as a workaround:

def _monkey_patching_HTTPClientParser_statusReceived():
  """
  Monkey patch for scrapy.xlib.tx._newclient.HTTPClientParser.statusReceived
  to work around the error when the status line comes without "OK" at the end.
  """
  from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
  old_sr = HTTPClientParser.statusReceived
  def statusReceived(self, status):
    try:
      return old_sr(self, status)
    except ParseError as e:
      if e.args[0] == 'wrong number of parts':
        # Append a dummy reason phrase and retry.
        return old_sr(self, status + ' OK')
      raise
  statusReceived.__doc__ = old_sr.__doc__
  HTTPClientParser.statusReceived = statusReceived

onbjerg commented Oct 1, 2014

Where do we put the monkey patch? @tonal

tonal commented Oct 1, 2014

Call the monkey patch before the first request starts,
for example in the __init__ method of your spider, or in your project's __init__.py.
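
For example, in a hypothetical spider (this assumes the patch function from the earlier comment is importable; myproject.patches is a made-up module path):

import scrapy

# made-up location for the patch function shown earlier in this thread
from myproject.patches import _monkey_patching_HTTPClientParser_statusReceived

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://www.gidroprofmontag.ru/bassein/sbornue_basseynu']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Apply the patch before any request is scheduled.
        _monkey_patching_HTTPClientParser_statusReceived()

    def parse(self, response):
        self.log('downloaded %d bytes' % len(response.body))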

onbjerg commented Oct 1, 2014

Thank you very much @tonal, it worked like a charm 👍

lbsweek commented May 15, 2015

I got this error message when using a VPN proxy. I captured the traffic with Wireshark and found there was no response at all. It is fine when I stop the VPN proxy.

Member

dangra commented May 15, 2015

@lbsweek, what do you mean by "no response"? An empty reply without even a first line?

Member

dangra commented May 15, 2015

After the failed attempt to fix this issue in #1140, I think the only viable approach is a monkeypatch similar to what @tonal proposes in #345 (comment)

tonal commented Jun 5, 2015

With the latest Scrapy Ubuntu package (0.25.0-454-gfa1039f+1429829085) I receive similar errors:

$ scrapy fetch http://only.ru/catalog/electro_oven/hiddenheater/
...
2015-06-05 12:39:39.5292+0600 [amon_ra] INFO: Spider opened
2015-06-05 12:39:39.7123+0600 [amon_ra] ERROR: Error downloading <GET http://only.ru/catalog/electro_oven/hiddenheater/>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-06-05 12:39:39.7137+0600 [amon_ra] INFO: Closing spider (finished)

My monkeypatch works only for scrapy.xlib.tx._newclient.ParseError, but I receive twisted.web._newclient.ParseError.

How do I correct the patch?

Member

dangra commented Jun 5, 2015

My monkeypatch works only for scrapy.xlib.tx._newclient.ParseError, but I receive twisted.web._newclient.ParseError.
How do I correct the patch?

Monkey patch both.

For more info on when Scrapy uses one or the other see https://github.com/scrapy/scrapy/blob/master/scrapy/xlib/tx/__init__.py.
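
A sketch of patching both homes of the parser (which one is actually used depends on the installed Twisted version, per the xlib/tx shim linked above; an ad-hoc helper, not official API):

def _patch_statusReceived(newclient):
    """Apply the reason-phrase workaround to the given _newclient module."""
    HTTPClientParser = newclient.HTTPClientParser
    ParseError = newclient.ParseError
    old_sr = HTTPClientParser.statusReceived
    def statusReceived(self, status):
        try:
            return old_sr(self, status)
        except ParseError as e:
            if e.args[0] == 'wrong number of parts':
                return old_sr(self, status + ' OK')
            raise
    HTTPClientParser.statusReceived = statusReceived

import twisted.web._newclient
_patch_statusReceived(twisted.web._newclient)
try:
    import scrapy.xlib.tx._newclient
    _patch_statusReceived(scrapy.xlib.tx._newclient)
except ImportError:
    pass  # no bundled copy when Scrapy uses twisted.web directly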

sunhaowen commented Dec 7, 2015

I added tonal's monkey patch, but I still receive the same error.

2015-12-07 23:47:19 [scrapy] DEBUG: Retrying GET https://api.octinn.com/partner/baidu3600/strategy_list/1 (failed 1 times): [twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 200')]

-------------------------- added by 12-08 --------------------------------

I have already solved it. I found that I was not using scrapy.xlib.tx._newclient, but twisted.web._newclient. So I changed

from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError

to

from twisted.web._newclient import HTTPClientParser, ParseError

and it is OK now.

liuwwei3 commented Dec 25, 2015

@sunhaowen Great, thanks, it works!

P.S. Are you the rescuer sent by the monkey?

leearic commented Dec 25, 2015

Surprisingly, there are so many Chinese users here.

@redapple redapple changed the title from Error download page: twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'> to Scrapy chokes on HTTP response status lines without a Reason phrase Sep 15, 2016

@redapple redapple added the bug label Sep 15, 2016

Contributor

redapple commented Sep 15, 2016

I just found out about https://twistedmatrix.com/trac/ticket/7673
The Twisted team is not ready to fix it unless someone has a real web server in the wild that does this.

Member

kmike commented Sep 15, 2016

@redapple it could also be a bad proxy, not a bad server

Contributor

redapple commented Sep 15, 2016

True.

Contributor

rmax commented Feb 22, 2017

Here is a live example at this time:

$ curl -v "http://www.jindai.com.tw/"
> GET / HTTP/1.1
> Host: www.jindai.com.tw
> User-Agent: Mozilla/5.1 (MSIE; YB/9.5.1 MEGAUPLOAD 1.0)
> Accept: */*
> Referer:
>
< HTTP/1.1 200
< Status: 200
< Connection: close

Contributor

lopuhin commented Feb 28, 2017

Another example is the 404 and 302 responses from okcupid (200 pages do have "OK"):

$ curl -v https://www.okcupid.com/interests
> GET /interests HTTP/1.1
> Host: www.okcupid.com
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404

@njsmith referenced this issue in python-hyper/h11 Feb 28, 2017 (merged): Tolerate missing reason phrases #32

Contributor

rmax commented Feb 28, 2017

It seems that this case is common with, for example, custom nginx modules which only set the response status and no reason.

Contributor

rmax commented Feb 28, 2017

Twisted has a patch ready to fix this issue: https://twistedmatrix.com/trac/ticket/7673#comment:5 PR twisted/twisted#723 🎉

Member

kmike commented Jun 20, 2017

Fixed in Twisted 17.5.0.

Example websites from this ticket work for me with Scrapy 1.4.0 and Twisted 17.5.0, so I'm closing it. Thanks everyone!
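
A quick sanity check (not from this thread) to confirm an environment has the fix:

import scrapy
import twisted

# The reason-phrase fix shipped in Twisted 17.5.0; older versions still
# raise ParseError('wrong number of parts', ...) on such status lines.
print('Twisted:', twisted.version.short())
print('Scrapy:', scrapy.__version__)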

@kmike kmike closed this Jun 20, 2017
