Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling RedirectMiddleware results in HttpCompressionMiddleware errors #2145

Closed
barraponto opened this issue Jul 26, 2016 · 4 comments
Closed
Assignees
Labels
Milestone

Comments

@barraponto
Copy link
Contributor

@barraponto barraponto commented Jul 26, 2016

I wanted not to redirect 303 responses, but instead retry them.
From the docs, I thought I could achieve it through two settings:

REDIRECT_ENABLED = False
RETRY_HTTP_CODES = [301, 302, 307, 308, 500, 502, 503, 504, 408]

It ended up giving me errors on HttpCompressionMiddleware:

Traceback (most recent call last):
  File "twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "scrapy/core/downloader/middleware.py", line 53, in process_response
    spider=spider)
  File "scrapy/downloadermiddlewares/httpcompression.py", line 38, in process_response
    response = response.replace(**kwargs)
  File "scrapy/http/response/text.py", line 50, in replace
    return Response.replace(self, *args, **kwargs)
  File "scrapy/http/response/__init__.py", line 77, in replace
    return cls(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
@kmike
Copy link
Member

@kmike kmike commented Jul 26, 2016

The error is cryptic. I wonder if #1895 fixes it.

@barraponto
Copy link
Contributor Author

@barraponto barraponto commented Jul 26, 2016

I just tried it, @kmike. The error persists.
From what I examined, I guess it stems from HttpCompressionMiddleware trying to decode a response with no body. The decoding itself goes fine, but there's something in the logic for guessing the response type that breaks. I added a if response.status == 303: return response in there and it worked...

@barraponto
Copy link
Contributor Author

@barraponto barraponto commented Jul 26, 2016

Here's a gist with my modified HttpCompressionMiddleware.process_response: https://gist.github.com/295df3142597fd1e2aa594b8f7bed75a

I'm not proud of it, since it's an ugly hack. But i guess it points to responsetypes.from_args as the bug source.

@redapple
Copy link
Contributor

@redapple redapple commented Nov 10, 2016

I found one possible case when this can happen.
The test below may not be super realistic but it shows the problem:

    def test_process_response_no_content_type_header(self):
        # no content-type
        headers = {
            'Content-Encoding': 'identity',
        }
        plainbody = b"""<html><head><title>Some page</title><meta http-equiv="Content-Type" content="text/html; charset=gb2312">"""
        response = HtmlResponse("http://www.example.com/index", headers=headers, body=plainbody)
        request = Request("http://www.example.com/index")

        newresponse = self.mw.process_response(request, response, self.spider)
        assert isinstance(newresponse, HtmlResponse)
        self.assertEqual(newresponse.body, plainbody)
        self.assertEqual(newresponse.encoding, resolve_encoding('gb2312'))

You get this when running the test:

_________________________________________________________ HttpCompressionTest.test_process_response_no_content_type_header _________________________________________________________

self = <tests.test_downloadermiddleware_httpcompression.HttpCompressionTest testMethod=test_process_response_no_content_type_header>

    def test_process_response_no_content_type_header(self):
        headers = {
            'Content-Encoding': 'identity',
        }
        plainbody = b"""<html><head><title>Some page</title><meta http-equiv="Content-Type" content="text/html; charset=gb2312">"""
        response = HtmlResponse("http;//www.example.com/index", headers=headers, body=plainbody)
        request = Request("http://www.example.com/index")

>       newresponse = self.mw.process_response(request, response, self.spider)

/home/paul/src/scrapy/tests/test_downloadermiddleware_httpcompression.py:146: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/paul/src/scrapy/scrapy/downloadermiddlewares/httpcompression.py:39: in process_response
    response = response.replace(**kwargs)
/home/paul/src/scrapy/scrapy/http/response/text.py:50: in replace
    return Response.replace(self, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <200 http;//www.example.com/index>, args = ()
kwargs = {'body': '<html><head><title>Some page</title><meta http-equiv="Content-Type" content="text/html; charset=gb2312">', 'encoding': 'gb18030', 'flags': [], 'headers': {'Content-Encoding': []}, ...}
x = 'flags', cls = <class 'scrapy.http.response.Response'>

    def replace(self, *args, **kwargs):
        """Create a new Response with the same attributes except for those
            given new values.
            """
        for x in ['url', 'status', 'headers', 'body', 'request', 'flags']:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
>       return cls(*args, **kwargs)
E       TypeError: __init__() got an unexpected keyword argument 'encoding'

/home/paul/src/scrapy/scrapy/http/response/__init__.py:79: TypeError

What I think is happening is that HttpCompression is not using the decoded body to determine the response type to use for the replacement of the original response.
Thus, if the URL and response headers are not enough to determine a TextResponse, it can determine the basic Response as the new response class, which does not accept "encoding" argument.

Using the decoded body with responsetypes.from_args() fixes the test (and hopefully the issue):

$ git diff
diff --git a/scrapy/downloadermiddlewares/httpcompression.py b/scrapy/downloadermiddlewares/httpcompression.py
index bcf20f1..f847ef6 100644
--- a/scrapy/downloadermiddlewares/httpcompression.py
+++ b/scrapy/downloadermiddlewares/httpcompression.py
@@ -29,7 +29,7 @@ class HttpCompressionMiddleware(object):
                 encoding = content_encoding.pop()
                 decoded_body = self._decode(response.body, encoding.lower())
                 respcls = responsetypes.from_args(headers=response.headers, \
-                    url=response.url)
+                    url=response.url, body=decoded_body)
                 kwargs = dict(cls=respcls, body=decoded_body)
                 if issubclass(respcls, TextResponse):
                     # force recalculating the encoding until we make sure the
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants