Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scrapy.http.Response has no body_as_unicode method #462

Closed
wants to merge 1 commit into from

Conversation

@llonchj
Copy link
Contributor

@llonchj llonchj commented Nov 17, 2013

No description provided.

@@ -18,8 +18,13 @@
def body_or_str(obj, unicode=True):
assert isinstance(obj, (Response, basestring)), \
"obj must be Response or basestring, not %s" % type(obj).__name__
if isinstance(obj, Response):
if isinstance(obj, TextResponse):

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

This is correct.

body = obj.body
if isinstance(body, str):
return body.decode('utf-8') if unicode else body
return body if unicode else body.encode('utf-8')

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

response.body is always an encoded string (strictly speaking a bytearray) but I am not sure default decoding from UTF-8 is a good way to deal with this case, as most of the times when the response is not TextResponse it can't be parsed as html/xml, and in case it is xml/html parseable there is a bug in the code that detects response types.

My vote is to raise ValueError instead of forcing UTF-8 decoding.

This comment has been minimized.

@llonchj

llonchj Nov 19, 2013
Author Contributor

I have a use case when the response is a Response object and XML is valid. @dangra, What's your suggestion?

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

Debug why scrapy.responsetypes.ResponseTypes doesn't detect it as text/xml.

can you provide more details, an url would be ideal, but headers should be enough.

This comment has been minimized.

@llonchj

llonchj Nov 19, 2013
Author Contributor

Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/octet-stream
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-SiteConHost: P508
X-Powered-By: ASP.NET
X-Powered-By: ARR/2.5
X-SiteConHost: P508
X-Powered-By: ASP.NET
Date: Tue, 19 Nov 2013 17:40:30 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: tj_recruiters_ssl#sc_wede=1; path=/
Set-Cookie:
AnonymousUser=MemberId=2a3bc8aa-64dc-4c1f-b382-22b2122cd891&IsAnonymous=True;
expires=Thu, 19-Nov-2043 00:00:00 GMT; path=/

2013/11/20 Daniel Graña notifications@github.com

In scrapy/utils/response.py:

     return obj.body_as_unicode() if unicode else obj.body
  • elif isinstance(obj, Response):
  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    

Triage why scrapy.responsetypes.ResponseTypes doesn't detect it as
text/xml.

can you provide more details, an url would be ideal, but headers should be
enough.


Reply to this email directly or view it on GitHubhttps://github.com//pull/462/files#r7764579
.

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

I must admit there is nothing useful on those headers :)

scrapy.utils.response.body_or_str is used only by scrapy.utils.iterators, I am temped to move it there as a private function.

And in that case may be assume UTF-8 in case of unknown encoding for Response.

Another option is to retry detection based on body content with responsetypes.from_args(body=response.body), I am not sure why this is not done in http handlers.

This comment has been minimized.

@llonchj

llonchj Nov 19, 2013
Author Contributor

Will then be an option to :

  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    
  •    return body obj.body
    

Does this make any sense?

2013/11/20 Daniel Graña notifications@github.com

In scrapy/utils/response.py:

     return obj.body_as_unicode() if unicode else obj.body
  • elif isinstance(obj, Response):
  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    

I must admit there is nothing useful on those headers :)

body_as_str is only used in scrapy.utils.iterators, I am temped to move
it there as a private function.

And in that case may be assume UTF-8 in case of unknown encoding for
Response.

Another option is to retry detection based on body content with
responsetypes.from_args(body=response.body), I am not sure why this is
not done in http handlers.


Reply to this email directly or view it on GitHubhttps://github.com//pull/462/files#r7767231
.

This comment has been minimized.

@llonchj

llonchj Nov 19, 2013
Author Contributor

Sorry:

  •    return obj.body
    

2013/11/20 Jordi Llonch llonchj@gmail.com

Will then be an option to :

  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    
  •    return body obj.body
    

Does this make any sense?

2013/11/20 Daniel Graña notifications@github.com

In scrapy/utils/response.py:

     return obj.body_as_unicode() if unicode else obj.body
  • elif isinstance(obj, Response):
  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    

I must admit there is nothing useful on those headers :)

body_as_str is only used in scrapy.utils.iterators, I am temped to move
it there as a private function.

And in that case may be assume UTF-8 in case of unknown encoding for
Response.

Another option is to retry detection based on body content with
responsetypes.from_args(body=response.body), I am not sure why this is
not done in http handlers.


Reply to this email directly or view it on GitHubhttps://github.com//pull/462/files#r7767231
.

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

It should be reduced to:

    if isinstance(obj, Response):
        return obj.body.decode('utf-8') if unicode else obj.body

This comment has been minimized.

@dangra

dangra Nov 19, 2013
Member

@llonchj : Do you want to update this PR or should I take care of it?

it should move body_or_str as private function to scrapy.utils.iterators and try to decode "binary" responses using utf-8 as best effort.

This comment has been minimized.

@llonchj

llonchj Nov 19, 2013
Author Contributor

@dangra, not sure about what's your idea. Kindly update PR yourself.

2013/11/20 Daniel Graña notifications@github.com

In scrapy/utils/response.py:

     return obj.body_as_unicode() if unicode else obj.body
  • elif isinstance(obj, Response):
  •    body = obj.body
    
  •    if isinstance(body, str):
    
  •        return body.decode('utf-8') if unicode else body
    
  •    return body if unicode else body.encode('utf-8')
    

@llonchj https://github.com/llonchj : Do you want to update this PR or
should I take care of it?

it should move body_or_str as private function to scrapy.utils.iteratorsand try to decode "binary" responses using
utf-8 as best effort.


Reply to this email directly or view it on GitHubhttps://github.com//pull/462/files#r7770963
.

@dangra dangra closed this in 3f156ad Nov 19, 2013
@dangra
Copy link
Member

@dangra dangra commented Nov 20, 2013

@llonchj hopes above fix works for you. let me know if it is not enough. thanks

@llonchj
Copy link
Contributor Author

@llonchj llonchj commented Nov 20, 2013

@dangra, it works. Thanks a lot!

2013/11/20 Daniel Graña notifications@github.com

@llonchj https://github.com/llonchj hopes above fix works for you. let
me know if it is not enough. thanks


Reply to this email directly or view it on GitHubhttps://github.com//pull/462#issuecomment-28857681
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants