Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gzip issue with jruby and Rack::ETag #571

Closed
jillesvangurp opened this issue Jun 6, 2013 · 7 comments
Closed

gzip issue with jruby and Rack::ETag #571

jillesvangurp opened this issue Jun 6, 2013 · 7 comments

Comments

@jillesvangurp
Copy link

I'm running in a weird issue that seems to be jruby specific where there is some interaction between Rack::ETag and Rack::Deflater.

The application below returns invalid gzip content:

use Rack::ETag
use Rack::Deflater

class TestApp
def call(env)
[200, {},["hello\n"]]
end
end

run TestApp.new

I test this using
$ rackup
$ wget -q -O- --header 'accept-Encoding: gzip' 'http://localhost:9292/' | gzcat

gzip: stdin: not in gzip format

As soon as I remove the Rack::ETag line everything is fine. All works fine on mri, it's just jruby that's problematic.

I'm using jruby-1.7.4 and have these gems installed:
jruby-launcher (1.0.16 java)
rack (1.5.2)
rack-protection (1.5.0)
rake (10.0.3)
sinatra (1.4.2)
tilt (1.4.1)

@srawlins
Copy link

srawlins commented Jul 8, 2013

This is happening during Rack::ETag's #digest_body right here:

parts = []
body.each { |part| parts << part }

This code assumes that body.each is not going to reuse an object that is passed in as |part|. However, Rack::Deflater::GzipStream combined with JRuby's Zlib::GzipWriter does exactly that.

What this means is:

  1. body.each passes the block as a Proc along to other code (Zlib::GzipWriter, and Rack::Deflater::GzipStream#write).
  2. This other code calls the block a first time. Now parts has one element, the first part that was yielded to the block.
  3. Before the block is called again, "other code" modifies the object that was passed in as part, which means the first and only element of parts is now different! (mind blown) It is (approximately) the next content that will be passed into the block.

Here is an example stand-alone script, so you can play around with JRuby and MRI's GzipStreams.

Three solutions exist, that I see:

  1. Change line 58 to instead be: body.each { |part| parts << part.dup }. There should not be a performance hit except in these really weird cases where body.each is doing bogus stuff like this.
  2. #digest_body could return [digest, body] instead of [digest, parts], but it seems like this return might be intentional, maybe to avoid computation twice.
  3. Wait for JRuby to maybe fix this, assuming it is a legitimate JRuby bug.

@jillesvangurp
Copy link
Author

Good catch. I wonder if it is necessary for Rack::ETag to duplicate the body as a string in memory like that at all. It seems rather wasteful, especially with large responses. Would it be possible to incrementally update the md5 digest like is done here: http://www.artima.com/forums/flat.jsp?forum=123&thread=40956. Something like:

    incr_digest = Digest::MD5.new()
    body.each { |part| incr_digest << part }
    digest = incr_digest.hexdigest

The only issue that I see with this is that digest_body currently returns both the digest and the parts array, which would need refactoring to just return the digest. The caller already has a ref to the body so it could just call .each again on it.

@srawlins
Copy link

srawlins commented Jul 9, 2013

I wonder if it is necessary for Rack::ETag to duplicate the body as a string in memory like that at all.

MRI (and others, I'm sure) use a copy-on-write policy when dup'ing Strings. I don't think there is hardly any performance hit here ( except in the case of Deflater preceding ETag). I would like someone else to confirm though.

The only issue that I see with this is that digest_body currently returns both the digest and the parts array, which would need refactoring to just return the digest. The caller already has a ref to the body so it could just call .each again on it.

Right, here I will defer to a Rack maintainer to explain whether and why parts was returned. My guess is that Rack::ETag just assumes that body.each might be doing real work (if body is an Enumerable of Strings, then no work is being done, but in generator-type body objects, like in this case, real work might be done). Since Rack::ETag knows that body.each will be run at some point in the future, it just passes its own result on up.

@srawlins
Copy link

srawlins commented Jul 9, 2013

Oops, my bad. Solution 1 (body.each { |part| parts << part.dup }) doesn't work. JRuby's GzipWriter is doing something really weird so that neither #dup, #clone work. The only solutions that work are along the lines of:

  • body.each { |part| parts << "" + part }
  • body.each { |part| parts << "%s" % part }

Which are way too round-about to commit to Rack::ETag. I think a bug needs to be opened against JRuby.

@jillesvangurp
Copy link
Author

Your understanding of this issue is much better than mine, perhaps best if you file the bug against Jruby?

My current understanding of the issue is that with etags the header needs to go out before the body. So, if jruby modifies the parts while the etag is also consuming the parts and keeping a reference, I can easily see some conflict arising. That's why I brought up incremental hashing above, since there'd be no need for the logic to keep references to the parts and it would be alright for Jruby to do its thing (still assumes the etag logic sees the part first). Also, the current logic concatenates the parts into a new string after all parts have been yielded before calculating the hash, which sounds redundant to me even if ruby does it pretty efficiently.

In any case, the issue not pressing for me anymore. I now handle compression in nginx.

@raggi
Copy link
Member

raggi commented Dec 28, 2013

Reusing buffers in #each is not valid. The specification for body#each clearly states that it must yield usable strings for the body.

As far as some of the other proposals here, none of them address the issue that in this case the fault is outside of rack.

@jillesvangurp
Copy link
Author

Filed a bug against jruby for this: jruby/jruby#1371

ecbypi added a commit to Vermonster/wmsb that referenced this issue Jan 28, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants