okhttp fails with IOException: gzip finished without exhausting source but GZIPInputStream works #3759
This isn't easily reproducible without a test case or URL to test against. I'm going to guess it's one of two cases:
a) The server is incorrectly sending the original content length in the response. Is the content compressed on the fly? If so, it seems strange that the server can correctly predict the content length. [edit: this case is unlikely since you tested removing the length header] content-length -> [3384998203]
b) It's just serving a previously compressed file, which contains some trailing bytes (or the server is incorrectly adding something). You should be able to debug this case by looking at what is left in the buffer at the point Okio is about to throw. In this case Okio is doing an additional validity check that happens to expose a slightly corrupted file.
The hot take on this is that the server seems to be sending back a file that just happens to be compressed. As such, adding "Content-Encoding = gzip", while convenient for you, is testing an edge case and is more likely to break some normal web assumptions. Why not just handle this as you are doing, via GZIPInputStream on the original response?
That's what I've been forced to do, but when something doesn't work, I like to understand why, not just write it off as bad luck 😃
The
I've collected exactly that using Netty (I've been trying multiple clients to rule out a potential okhttp bug). I can post that, or if you tell me how to collect what's left in the buffer using okhttp, I will do that.
That makes it sound like the content length is incorrect. If it's gzip-encoded then the length should be the gzip'd size and not the uncompressed size. This seems like a pretty straightforward unit test to set up from the template so we can see exactly what you're doing.
I meant the compressed size only, not uncompressed. When I download the file, the archive (businesses.20180104.json.gz) size on disk is consistent with the reported one. I have a unit test; it doesn't fail. Of course, if I report the wrong content length in the response, it might, but I'm not testing that.
Last packet I captured from Netty. Wikipedia says:
Commons Compress supports this:
But I don't see anything in
Due to okhttp code reaching into the guts of other classes by abusing package access, it's not possible for me to easily update the code and find out if my theory is correct.
If the problem is the multiple streams, then I think you are hitting a subtle difference between using gzip to compress and store content, like tar archives of files on disk, and its usage in a webserver to compress a single resource served on the fly. These headers are no longer valid after OkHttp decompresses the content automatically etc.
I would argue that adding the Content-Encoding header to cause OkHttp to decompress these previously compressed files on the fly is the problem. It's been a great learning experience, but I don't think we should try to handle this case, as it doesn't appear to be a normal situation in the wild with real webservers.
I'd grudgingly agree, but I still don't know the problem. My goal here is to understand the issue, and get help from people who understand low-level networking better than I do. If the issue turns out to be GZIP non-compliance, I'll happily walk away knowing that I learned something. But without knowing the problem, the following statement can't be proven:
"it doesn't appear to be a normal situation in the wild with real webservers"
You've pointed to a real edge case of the Gzip spec that (I'm assuming) makes sense for a number of reasons when used for tar archives, e.g. .tar.gz: append-only additions of additional content, or something similar. I don't specifically know, and I can't see the stream marker (1f 8b) in the packet you captured from Netty to confirm. But pragmatically it doesn't appear to happen without a contrived situation, e.g. manually adding a header.
@yschimke
Put a breakpoint on line 3 of
That produced the attached dump, and I see that it starts with
I introduced this behavior and can explain it. Gzip is a self-terminating format. The content of the stream itself indicates where you've read everything. If ever there's data beyond the self-reported end, this data is effectively unreachable. This is potentially problematic for two reasons:
I made things strict to help detect problems like this. It's possible this check is too strict and we should silently ignore the extra data.
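For anyone trying to reproduce the strict check in isolation, here is a minimal Java sketch using the Okio 1.x Java API (the class name and the two-member payload are made up for illustration) that feeds a concatenation of two gzip members directly to Okio's GzipSource. Assuming the behaviour described in this thread, the leftover second member trips the check:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;
import okio.Buffer;
import okio.GzipSource;
import okio.Okio;

public class StrictCheckDemo {
  public static void main(String[] args) throws IOException {
    // Two independent gzip members appended back to back, like a pre-compressed
    // multi-member .gz file served by the server in this issue.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    for (String part : new String[] { "hello ", "world" }) {
      GZIPOutputStream gzip = new GZIPOutputStream(bytes);
      gzip.write(part.getBytes(StandardCharsets.UTF_8));
      gzip.finish(); // completes this member; the next iteration starts a new one
    }

    Buffer source = new Buffer().write(bytes.toByteArray());
    try {
      // GzipSource decodes the first member, then notices the underlying source
      // isn't exhausted and refuses to continue.
      System.out.println(Okio.buffer(new GzipSource(source)).readUtf8());
    } catch (IOException e) {
      System.out.println(e.getMessage()); // "gzip finished without exhausting source"
    }
  }
}
```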
@swankjesse It's the strict check that put me on the right track for debugging, so it's certainly useful. Other clients (Netty/Apache) silently ignore the data, causing a lot of hair tearing and "WTF is going on?" flying around. I suggest that, like Commons Compress, a variable should control whether concatenated members are decompressed. How does that sound?
(Sounding like a broken record.) I don't think we should change OkHttp behaviour here. I think what should probably happen is that OkHttp continues to process only the first stream until we find a mainstream HTTP server that actually encodes multiple streams on the fly. It isn't clear to me what the semantics of multiple streams are. Raw concatenation? Expose each stream for processing? Clients should continue to post-process these Gzip responses, so the one question for me is whether Okio should support it. That seems useful for completeness, and Okio could make a nice API for clients processing each stream in turn.
Why don’t you think that okhttp should complete the job, rather than doing it half-a$$ed? It makes no sense for clients to post-process Gzip responses themselves.
Because you haven't shown a non-contrived, real-world example. As I argued before, you are causing this by fudging the response headers; without that, everything is working in the wild.
Sorry, but this is as real as it gets. You may wish for all servers to be well-behaved HTTP citizens, but the real world doesn't work that way. Who sets the header is completely irrelevant here; what is relevant is okhttp parsing the GZIP stream in an incomplete manner. BTW, here's more real stuff, from RFC 1952 (section 2.2): "A gzip file consists of a series of 'members' (compressed data sets). ... The members simply appear one after another in the file, with no additional information before, between, or after them."
We don't have to agree with each other, that's fine. But let's not undermine the issue here.
I've been a bit blunt in my responses and obviously caused you some offense or frustration. That wasn't my aim. Stating the obvious: it's really up to Okio whether to handle this automatically, so there isn't really a fix for OkHttp here. I'm sure they would welcome a PR, and it seems like something that GzipSource in Okio should handle. At which point the whole argument is moot.
Yes, simple concatenation. A concatenation of valid gzip streams is also a valid gzip stream. Only decompressing the first gzip member and ignoring the subsequent data is non-compliant with the standard. It's not my place to say where in the code this should be fixed. However, waiting until a known bug causes a problem before you will consider fixing it does not seem like a good strategy to me.
Can fix. Probably in Okio. I'm under the impression this is also broken in Android’s GzipInputStream, from which GzipSource was derived.
@yschimke Commons Compress has test cases for this scenario: GZipTestCase.java. The file they seem to be using is multiple.gz.
I'm recording a mental note to test issue #1517 (comment) (specifically the sourceforge test URL) after a fix lands in Okio.
https://gist.github.com/swankjesse/025118cadfa93d7f0904cb96d80aaf68 I tried Chrome, Safari, Firefox, and curl. All but Chrome read only the first gzip member. Chrome also includes the subsequent member, but doesn’t properly gzip-decompress it, and so the resulting page is malformed. Because major web browsers don’t support this, I don’t think OkHttp should support it either.
Admittedly,
The major web browsers are not compliant with the standard. Their bugs should be fixed as well. |
@swankjesse Would you accept a PR for Okio that fixes it there? I'm not offering, just giving an option if it's hampering your usage. But I agree this isn't a priority for OkHttp directly.
It's a pretty tough thing to get right. For example, do we provide a dictionary for the 2nd member? Would be worthwhile tracking what GzipInputStream does.
@swankjesse I don't get the "Because major web browsers don’t support this, I don’t think OkHttp should support it either" argument. OkHttp isn't a browser, and even if it were, "I'll keep doing the wrong thing, even though I know it's wrong, just because others do it" doesn't make a lot of sense. FWIW, I submitted a PR for this bug in Netty, which was merged. I also reported it to Apache HttpClient, and they introduced tests to prove that they are compliant with the RFC. If you knowingly leave this bug in at this point, OkHttp will be the only major HTTP client that is non-compliant.
Dictionary? There are no dictionaries in gzip streams. It is trivial to get right. If you get more data after the end of the first gzip member, then simply do exactly the same thing you did at the start of the first member -- decode a gzip member. Repeat until the input is exhausted. gzip is not a self-terminating format, since there is no indication of the last member other than the exhaustion of the data.
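That member-by-member loop is essentially what java.util.zip.GZIPInputStream already does on recent JDKs, which is presumably why the GZIPInputStream workaround in this issue succeeds. A minimal sketch (class and helper names are illustrative; readAllBytes needs Java 9+):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatenatedGzipDemo {
  // Compress a single string into one gzip member.
  static byte[] gzip(String s) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(s.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    // A simple concatenation of valid gzip members is itself a valid gzip stream.
    ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
    concatenated.write(gzip("hello "));
    concatenated.write(gzip("world"));

    // GZIPInputStream keeps decoding members until the input is exhausted,
    // so this prints "hello world" instead of stopping after the first member.
    try (GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(concatenated.toByteArray()))) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}
```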
Alright, I’m on board. We’ll need to fix in Okio.
Also: are there tracking bugs for Chrome, Firefox, and Safari? They ought to fix this too, and if they don’t, we should learn why not.
My use case is a Java client, so I reported to all of the Java clients that I found having this issue (and fixed one, so I'm not merely complaining). I don't care about browsers, just like I don't care if some Python HTTP client in the wild has this issue too. Not saying those shouldn't be fixed, but I'm not gonna chase them for a fix.
Are concatenated gzip streams still on the table for okio? I'm trying to write an aggregating gateway which sequentially streams gzipped data from multiple sources to the client. The problem is that my client wouldn't be able to read it with okhttp/okio.
@Dkhusainov can you confirm Chrome or Firefox would interpret the data as you'd prefer here? The gist above will get you started.
Browsers have nothing to do with the issue. As mentioned above, the gzip spec dictates that multiple consecutive gzip payloads are a valid gzip payload. To not break backwards compatibility, this could be done with a separate option. This issue hanging in an unresolved state isn't helping anyone. Either close it so people who need it know it's not coming and will do it themselves, or give some estimate of when it will be implemented.
As you wish! We want to be consistent with browsers and so here we are.
It’s unfortunate that mere laziness had to be wrapped into a senseless excuse, “we want to be consistent with the browsers”, in order to avoid doing the right thing. OkHttp isn’t a browser. The author of the GZ protocol clearly stated that the behavior is wrong; the authors of OkHttp don’t know better than him.
Don't make personal attacks in this bug tracker. I've blocked @asarkar for 24 hours for accusing us of being lazy. Further violations of our code of conduct will result in a permanent block.
Recently I've been intermittently getting IOException: CRC checkEqual failure at okio.GzipSource.checkEqual(GzipSource.kt:197) when testing an Android app (very frustrating, as it can't be fixed from my side), and the issue discussed here seems to be the cause, so I really wish this known issue would be fixed by OkHttp, considering how widely used it is in the Android community - "As of Android 5.0, OkHttp is part of the Android platform and is used for all HTTP calls." It doesn't sound right to me that OkHttp should follow the behavior of browsers in this case. @swankjesse, could you guys please reconsider it?
@rsandx The OkHttp that is built into the Android platform is patched under AOSP, so fixing it here is unlikely to fix that one. OkHttp 3 & 4 are opt-in, give a lot of additional features, and are designed to set a higher bar for security and correctness. This includes following web standards (both formal RFCs and browser behaviour).
@rsandx Another option here if you feel so strongly about this: transparent gzip is disabled if you take care of it yourself. Maybe you can publish an open source library for better gzip support; it's obviously not just yourself affected by these issues. This is an example external compression interceptor that supports gzip and brotli. But unfortunately it still suffers from the current issue you face.
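A rough sketch of that do-it-yourself approach, assuming OkHttp 3.x and registration via OkHttpClient.Builder.addInterceptor (the class name is hypothetical): setting Accept-Encoding yourself turns off OkHttp's transparent gzip, and GZIPInputStream then copes with multi-member bodies:

```java
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import okhttp3.Interceptor;
import okhttp3.Response;
import okhttp3.ResponseBody;
import okio.Okio;

// Hypothetical application interceptor: request gzip explicitly (which disables
// OkHttp's transparent decompression) and decode the body with GZIPInputStream,
// which handles concatenated gzip members.
public final class ManualGzipInterceptor implements Interceptor {
  @Override public Response intercept(Chain chain) throws IOException {
    Response response = chain.proceed(chain.request().newBuilder()
        .header("Accept-Encoding", "gzip")
        .build());
    if (!"gzip".equalsIgnoreCase(response.header("Content-Encoding"))) return response;

    ResponseBody decompressed = ResponseBody.create(
        response.body().contentType(),
        -1L, // decompressed length is unknown
        Okio.buffer(Okio.source(new GZIPInputStream(response.body().byteStream()))));
    return response.newBuilder()
        .removeHeader("Content-Encoding")
        .removeHeader("Content-Length")
        .body(decompressed)
        .build();
  }
}
```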
@yschimke Thanks for the info. My app uses OkHttp 3 with Retrofit, and does have communication issues like this and "unexpected end of stream" errors from time to time, especially when the response body contains more data than a simple message. I'm really struggling to find a way to make the app work stably.
@rsandx what server are you using? Might be a bug where it isn't sending all the bytes it needs to send.
@swankjesse I'm using Flask for testing, and I'm aware people say "unexpected end of stream" is a server issue, but the same URL returns data correctly in browsers like Firefox and tools like Postman; of course I've tried various solutions (both server and app side) people suggested on Stack Overflow, and the issue still persists.
What kind of issue is this? Bug report. The issue template asks for a failing test so the bug can get fixed. Here’s an example: https://gist.github.com/swankjesse/981fcae102f513eb13ed
This issue can't be reproduced in a test. I'll do my best to explain.
As you can see, the server doesn't set a Content-Encoding = gzip header, so I do that in an interceptor. Each record is a newline-delimited JSON string that is inserted into Couchbase. There are around 12 million records in total. Using okhttp, processing fails after about 130,000 records with the exception in the title (IOException: gzip finished without exhausting source).
However, if I don't set the Content-Encoding header (thus skipping GzipSource), and wrap the input stream with GZIPInputStream, everything works as expected. I've also tried setting Transfer-Encoding = chunked on the response and removing the Content-Length header, but to no avail.

So, the question is: if GZIPInputStream doesn't have a problem, why does GzipSource? And since it does, why won't it report what it thinks is the issue? I have a test that runs on a smaller file, 100 records, and it works. I've seen #3457, but unlike the reporter, it's not possible for me to capture the hex body of a 3.4 GB stream.
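For context, the interceptor mentioned in the description presumably looks something like this hedged reconstruction (class name hypothetical; registered with OkHttpClient.Builder.addNetworkInterceptor so that OkHttp's transparent GzipSource then decodes the body on the way back up the chain):

```java
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

// Hypothetical network interceptor: the server omits Content-Encoding, so we add it
// to the response, and OkHttp's BridgeInterceptor applies its transparent gzip
// decoding (GzipSource); this is the setup that triggers the exception in this issue.
public final class ForceGzipDecodingInterceptor implements Interceptor {
  @Override public Response intercept(Chain chain) throws IOException {
    Response response = chain.proceed(chain.request());
    return response.newBuilder()
        .header("Content-Encoding", "gzip")
        .build();
  }
}
```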