GZip decoding regression #206

Closed
maxcountryman opened this issue Jul 19, 2013 · 24 comments

@maxcountryman

See this issue on Requests: https://github.com/kennethreitz/requests/issues/1472

This is the truncated traceback:

...
/project/virtualenv/lib/python2.6/requests/packages/urllib3/response.py", line 188, in read
    "failed to decode it." % content_encoding)
DecodeError: Received response with content-encoding: gzip, but failed to decode it.

cc @schlamar

@shazow
Member

shazow commented Jul 19, 2013

@schlamar Looks like this is related to d00d305, could you look into this please?

@maxcountryman If you'd like to participate, a urllib3 test showing this failure (and making sure it doesn't regress in the future) would be helpful. :)

@maxcountryman
Author

@shazow Photobucket's API will yield the error. I'm assuming any gzipped response would trigger it. Perhaps httpbin.org could assist with this?

@shazow
Member

shazow commented Jul 19, 2013

@maxcountryman For urllib3, we write tests that can be run without an internet connection. Take a look at some of the tests here for examples: https://github.com/shazow/urllib3/tree/master/test/with_dummyserver

We already have several GZip-related tests which are passing.

@maxcountryman
Author

@shazow I probably don't have time for that, sorry.

@shazow
Member

shazow commented Jul 19, 2013

Nps. If anyone else wants to do this, help is welcome. :)
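
Roughly, the kind of test I have in mind would look like the sketch below: a throwaway socket server standing in for our dummyserver helpers, asserting that a plain body labelled Content-Encoding: gzip surfaces as a DecodeError. It only reproduces the reported behaviour (the helper and test names are made up), so treat it as a starting point rather than a finished test.

import socket
import threading
import unittest

import urllib3
from urllib3.exceptions import DecodeError


def serve_mislabelled_gzip_once(sock, body=b'{"not": "gzipped"}'):
    # Accept a single connection and answer with an uncompressed body that
    # nevertheless claims to be gzip-encoded.
    conn, _ = sock.accept()
    conn.recv(65536)  # read and discard the request
    conn.sendall(
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"Connection: close\r\n"
        b"\r\n" + body
    )
    conn.close()


class TestMislabelledGzip(unittest.TestCase):
    def test_plain_body_with_gzip_header(self):
        sock = socket.socket()
        sock.bind(("127.0.0.1", 0))
        sock.listen(1)
        port = sock.getsockname()[1]
        threading.Thread(target=serve_mislabelled_gzip_once, args=(sock,)).start()

        http = urllib3.PoolManager()
        with self.assertRaises(DecodeError):
            http.request("GET", "http://127.0.0.1:%d/" % port)
        sock.close()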

@schlamar
Contributor

@maxcountryman gzip on httpbin works fine for me with the latest urllib3 master:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'http://httpbin.org/gzip')
>>> r.status
200
>>> r.data
'{\n  "headers": {\n    "Host": "httpbin.org",\n    "Accept-Encoding": "identity",\n    "Connection": "close"\n  },\n  "origin": "217.238.83.99",\n  "gzipped": true,\n  "method": "GET"\n}'

And requests 1.2.3:

>>> requests.get('http://httpbin.org/gzip')
<Response [200]>

Can you give me an exact URL which is failing? Preferably one that doesn't require registering for an API key.

@maxcountryman
Author

Nope: you'll need to register for Photobucket and set up an OAuth application in order to access the API. This is a regression from requests 1.2.0 as previously stated.

@maxcountryman
Author

Here's the response from curl, which is the best I can do for you. This exact URL yields the above error. To reproduce, you may need to set up an application with Photobucket or model the response.

$ curl -v -L "http://api.photobucket.com/album/zentropa?<snip>"
* About to connect() to api.photobucket.com port 80 (#0)
*   Trying 209.17.80.224...
* connected
* Connected to api.photobucket.com (209.17.80.224) port 80 (#0)
> GET /album/zentropa?<snip> HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8x zlib/1.2.5
> Host: api.photobucket.com
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 301 Redirect
< Date: Sun, 21 Jul 2013 14:55:04 GMT
< Server: Apache
< Set-Cookie: flash=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; domain=.photobucket.com
< Location: http://api1260.photobucket.com/album/zentropa?<snip>
< Content-Length: 507
< Connection: close
< Content-Type: application/json
< 
* Closing connection #0
* Issue another request to this URL: 'http://api1260.photobucket.com/album/zentropa?<snip>'
* About to connect() to api1260.photobucket.com port 80 (#0)
*   Trying 209.17.68.33...
* connected
* Connected to api1260.photobucket.com (209.17.68.33) port 80 (#0)
> GET /album/zentropa?<snip> HTTP/1.0
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8x zlib/1.2.5
> Host: api1260.photobucket.com
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Sun, 21 Jul 2013 14:55:04 GMT
< Server: Apache
< Set-Cookie: flash=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; domain=.photobucket.com
< Set-Cookie: dMess=%5B%5D; expires=Sun, 21-Jul-2013 16:55:04 GMT; path=/; domain=.photobucket.com
< Set-Cookie: persisted=YTowOnt9; path=/album
< Content-Length: 2422
< Connection: close
< Content-Type: application/json
< 
* Closing connection #0

The same URL raises the reported exception with Requests 1.2.3; this is a regression from Requests 1.2.0 which does not raise the error against the URL.

>>> r = requests.get('http://api.photobucket.com/album/zentropa?<snip>')
>>> r.status_code
200
>>> requests.__version__
'1.2.0'
>>> r = requests.get('http://api.photobucket.com/album/zentropa?<snip>')
Traceback (most recent call last):
...
requests.packages.urllib3.exceptions.DecodeError: Received response with content-encoding: gzip, but failed to decode it.
>>> requests.__version__
'1.2.3'

@t-8ch
Contributor

t-8ch commented Jul 21, 2013

@maxcountryman Can you reproduce the issue on http://api.photobucket.com/ping?

Edit: removed stuff about proxy. Only the redirect is HTTP/1.0.

@maxcountryman
Author

There is no proxy: Photobucket redirects to silos, and it's on the same machine. I don't have time to test the ping endpoint, although I doubt it would reproduce the issue if it doesn't redirect. That's about all the help I can provide. I might suggest rolling back the commit that broke this in the meantime if the problem isn't obvious. Bear in mind that the browser and curl handle this same endpoint without issue, so this is clearly a problem for any consumer expecting urllib3 to behave in a sane way. HTH.

@shazow
Member

shazow commented Jul 21, 2013

@t-8ch Could it be something to do with switching from using gzip to the zlib module for decoding both?
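
(For reference, the difference being asked about, sketched with the stock library calls rather than the actual urllib3 decoder: the gzip module parses the gzip framing itself, while zlib only accepts it when given a widened window.)

import gzip, io, zlib

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'hello')
payload = buf.getvalue()  # gzip-framed bytes

# The gzip module handles the gzip header/trailer itself:
assert gzip.GzipFile(fileobj=io.BytesIO(payload)).read() == b'hello'

# zlib accepts the same framing only with a widened window (16 + MAX_WBITS);
# a plain zlib.decompress() on these bytes fails with "incorrect header check".
assert zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(payload) == b'hello'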

@maxcountryman I'd rather not start rolling back commits from previous releases until we fully understand the issue. If it seems like this is Photobucket-specific and not a widespread issue, we could wait until someone else who has more time to investigate it runs into the bug.

It's important to remember that urllib3 grows thanks to community contributions; we don't have any corporate sponsors paying for developer time to satisfy specific consumers (though this can be arranged if your employer is interested in sponsoring some specific development). :)

@t-8ch
Contributor

t-8ch commented Jul 21, 2013

@shazow I am not sure, but I don't think so.

@maxcountryman
Your curl example doesn't use gzip, and browsers implement a huge number of workarounds for every type of brokenness one can imagine.
If the data itself is not confidential, could you edit requests/packages/urllib3/response.py:HTTPResponse.CONTENT_DECODERS to be an empty list and then post the result of requests.get(url).content here, or somewhere we can get it?
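
Something along these lines (the URL stands in for the failing Photobucket URL from above):

import requests
from requests.packages.urllib3 import response as urllib3_response

# With no registered decoders, urllib3 hands the body back untouched,
# so we can see what Photobucket actually sends.
urllib3_response.HTTPResponse.CONTENT_DECODERS = []

url = 'http://api.photobucket.com/album/zentropa?<snip>'
print(repr(requests.get(url).content))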

You could also remove the try/except around the decoding, so that the underlying exception is re-raised and we see a proper message.
I will open a pull request to keep the original message in the re-raised exception.

It would also be great if you could use git bisect to identify the exact source of the regression. Try requests first, and if it points to a commit which updates urllib3, continue bisecting there.

@shazow
Member

shazow commented Oct 18, 2013

Looks like this issue stalled. Feel free to reopen if there are new developments. :)

@shazow shazow closed this as completed Oct 18, 2013
@shazow
Member

shazow commented Feb 13, 2014

Moving the conversation here from https://github.com/Runscope/requests-runscope/issues/6#issuecomment-34945852.

@maxcountryman wrote:

@shazow in particular, I think you should look at reversing this commit: d00d305

If I manually reverse those changes, it seems to correct the issue. i.e. this error: Error -3 while decompressing: incorrect header check

See also: http://stackoverflow.com/questions/3122145/zlib-error-error-3-while-decompressing-incorrect-header-check

I hope you will once again consider reversing the commit, as I had proposed previously. It would be greatly appreciated! :)

@maxcountryman It's not just a matter of "just revert this and everything is dandy."

  1. There are other cascading changes that could be affected.
  2. We could reintroduce a similar breaking change next week and unless there is a test, this exercise will be in vain.
  3. There could be other bugs in the reversed version.

@schlamar do you recall why we stopped using the gzip module?

@maxcountryman
Author

Applied against master, this clearly illustrates the problem:

diff --git a/test/test_response.py b/test/test_response.py
index ecfcbee..2f95d40 100644
--- a/test/test_response.py
+++ b/test/test_response.py
@@ -8,20 +8,17 @@ from urllib3.exceptions import DecodeError

 from base64 import b64decode

-# A known random (i.e, not-too-compressible) payload generated with:
-#    "".join(random.choice(string.printable) for i in xrange(512))
-#    .encode("zlib").encode("base64")
-# Randomness in tests == bad, and fixing a seed may not be sufficient.
+# zlib-failing payload
 ZLIB_PAYLOAD = b64decode(b"""\
-eJwFweuaoQAAANDfineQhiKLUiaiCzvuTEmNNlJGiL5QhnGpZ99z8luQfe1AHoMioB+QSWHQu/L+
-lzd7W5CipqYmeVTBjdgSATdg4l4Z2zhikbuF+EKn69Q0DTpdmNJz8S33odfJoVEexw/l2SS9nFdi
-pis7KOwXzfSqarSo9uJYgbDGrs1VNnQpT9f8zAorhYCEZronZQF9DuDFfNK3Hecc+WHLnZLQptwk
-nufw8S9I43sEwxsT71BiqedHo0QeIrFE01F/4atVFXuJs2yxIOak3bvtXjUKAA6OKnQJ/nNvDGKZ
-Khe5TF36JbnKVjdcL1EUNpwrWVfQpFYJ/WWm2b74qNeSZeQv5/xBhRdOmKTJFYgO96PwrHBlsnLn
-a3l0LwJsloWpMbzByU5WLbRE6X5INFqjQOtIwYz5BAlhkn+kVqJvWM5vBlfrwP42ifonM5yF4ciJ
-auHVks62997mNGOsM7WXNG3P98dBHPo2NhbTvHleL0BI5dus2JY81MUOnK3SGWLH8HeWPa1t5KcW
-S5moAj5HexY/g/F8TctpxwsvyZp38dXeLDjSQvEQIkF7XR3YXbeZgKk3V34KGCPOAeeuQDIgyVhV
-nP4HF2uWHA==""")
+eyJzdGF0dXMiOiJFeGNlcHRpb24iLCJtZXNzYWdlIjoiUmVkaXJlY3QgcmVxdWlyZWQgdG8gYWNjZ
+XNzIHVzZXIgZGF0YSIsImNvZGUiOjMwMSwiY29udGVudCI6eyJzdWJkb21haW4iOiJodHRwOlwvXC
+9hcGkxMjYwLnBob3RvYnVja2V0LmNvbSIsInVybCI6Imh0dHA6XC9cL2FwaTEyNjAucGhvdG9idWN
+rZXQuY29tXC9hbGJ1bVwvemVudHJvcGE/b2F1dGhfbm9uY2U9MDM1ZTU1MWZmOTk1MDIyZTMwM2E5
+MGVmNWE5YWYxNDI5NDQwNzcyMSZmb3JtYXQ9anNvbiZvYXV0aF9jb25zdW1lcl9rZXk9MTQ5ODMyN
+DAwJm9hdXRoX3RpbWVzdGFtcD0xMzkyMjY0MjkwJm9hdXRoX3NpZ25hdHVyZV9tZXRob2Q9SE1BQy
+1TSEExJm9hdXRoX3ZlcnNpb249MS4wJm9hdXRoX3Rva2VuPTM0LjIyNDc1ODlfMTM5MjI2NDI4OSZ
+vYXV0aF9zaWduYXR1cmU9VGE2ZmRrUDljYlV6OWhLc05sczJJTko2NktFJTNEIn0sImZvcm1hdCI6
+Impzb24iLCJtZXRob2QiOiJHRVQiLCJ0aW1lc3RhbXAiOjEzOTIyNjQyOTB9""")


 class TestLegacyResponse(unittest.TestCase):

@maxcountryman
Author

Looking at that, it seems like it's trying to decode non-compressed data, which leads me to believe that previously urllib3 was not considering this request's data as compressed, while in later versions (i.e. now) it does.

In fact, in Requests 1.2.0, this data exits the read method here: https://github.com/kennethreitz/requests/blob/d06908d655eca4467521c28d93a68d329d48d524/requests/packages/urllib3/response.py#L158

@shazow
Member

shazow commented Feb 13, 2014

which leads me to believe that previously urllib3 was not considering this request's data as compressed

Can you expand on what you mean by that? urllib3 decides whether the data is compressed simply by looking at the Content-Encoding header: https://github.com/shazow/urllib3/blob/master/urllib3/response.py#L169

Also, I get the same error by replacing the zlib payload with any kind of garbage (e.g. aaaaaaaaa).
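
(For reference, that kind of garbage reproduces the underlying zlib error directly:)

>>> import zlib
>>> zlib.decompress(b'aaaaaaaaa')
Traceback (most recent call last):
...
zlib.error: Error -3 while decompressing data: incorrect header check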

@schlamar
Contributor

At that time requests did the decompression on its own. Can someone spot any difference?
https://github.com/kennethreitz/requests/blob/d06908d655eca4467521c28d93a68d329d48d524/requests/utils.py#L349:L378

@maxcountryman
Author

@schlamar yes, the difference is that urllib3 falls out of the read method in the section I've highlighted above: with Requests 1.2.0 it never tries to decompress the response at all for this input.

@schlamar
Contributor

@maxcountryman Yes, that's obvious. However, requests at that time called the decompression method linked above. This method had a fallback to return the data untouched if decompression failed, but that fallback was flawed: https://github.com/kennethreitz/requests/issues/1249.
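
Roughly, that old fallback amounted to the following (a paraphrase with a made-up name, not the exact requests code):

import zlib

def decompress_with_fallback(data):
    # Try to gunzip the data; if that fails, silently hand the bytes back untouched.
    try:
        return zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data)
    except zlib.error:
        return data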

So I assume that Photobucket returns non-gzipped data while sending a Content-Encoding: gzip header (which would be a bug in their application or web server). You can check whether setting response.raw.decode_content = False and then calling response.raw.read() works.
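
For example (the URL again stands in for the failing one; stream=True so the body isn't consumed and decoded before you get to raw):

import requests

url = 'http://api.photobucket.com/album/zentropa?<snip>'
r = requests.get(url, stream=True)   # don't read (and decode) the body up front
r.raw.decode_content = False         # tell urllib3 to skip gzip/deflate decoding entirely
print(repr(r.raw.read()[:200]))      # readable JSON here would mean the body was never gzipped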

Maybe you could provide a raw HTTP response (including headers) from the Photobucket API; that would be really helpful.

@schlamar
Contributor

Maybe you could provide a raw HTTP response (including headers) from the Photobucket API; that would be really helpful.

I have just seen the curl example above. Interestingly, curl doesn't get gzip as the Content-Encoding. Maybe there is some notable difference in the request headers which might explain this behavior.

@schlamar
Contributor

Curl does not send an Accept-Encoding header by default. What happens if you run curl with the --compressed flag?

Alternatively, you can try using a requests session and removing Accept-Encoding from its headers attribute (https://github.com/kennethreitz/requests/blob/c042c08179cae74e7b2c520b33ea19881c0ec842/requests/sessions.py#L213).
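
For example (again with the failing URL from above):

import requests

s = requests.Session()
s.headers.pop('Accept-Encoding', None)   # mimic curl's default of not advertising compression
r = s.get('http://api.photobucket.com/album/zentropa?<snip>')
print(r.status_code, r.headers.get('Content-Encoding'))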

@schlamar
Contributor

@maxcountryman Please check out https://github.com/kennethreitz/requests/pull/1944, this should resolve this issue.
