Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failing spec for parsing invalid query bytes #624

Closed
wants to merge 1 commit into from
Closed

Add failing spec for parsing invalid query bytes #624

wants to merge 1 commit into from

Conversation

sorentwo
Copy link

One of our mobile client's has an intermittent bug which causes it to send requests with binary uploads, but with the Content-Type header incorrectly set to application/x-www-form-urlencoded. As a result when Rack::Utils attempts to parse the query params it throws an ArgumentError: invalid byte sequence in UTF-8 exception. Here is the relevant slice of the stack trace:

[GEM_ROOT]/gems/rack-1.4.5/lib/rack/utils.rb:104:in `normalize_params`
[GEM_ROOT]/gems/rack-1.4.5/lib/rack/utils.rb:96:in `block in parse_nested_query`
[GEM_ROOT]/gems/rack-1.4.5/lib/rack/utils.rb:93:in `each`
[GEM_ROOT]/gems/rack-1.4.5/lib/rack/utils.rb:93:in `parse_nested_query`
[GEM_ROOT]/gems/rack-1.4.5/lib/rack/request.rb:332:in `parse_query`
[GEM_ROOT]/gems/actionpack-3.2.15/lib/action_dispatch/http/request.rb:275:in `parse_query`
[GEM_ROOT]/gems/rack-1.4.5/lib/rack/request.rb:209:in `POST`

I've looked into forcing an alternate encoding or swallowing the error, but there are a lot of different approaches possible, all with different support and performance considerations. Considering Rack supports Ruby 1.8 through 2.1 I would like some feedback on how you would approach the matter.

Some possible solutions I see are:

  1. Strip any characters that aren't UTF-8 using either String.encode or Iconv, depending on Ruby version.
  2. Catch the exception and skip parsing the params entirely.

Unfortunately I've seen this error thousands of times, and there is no way for me to stop the exception from happening inside my application's stack.

@sorentwo
Copy link
Author

I've inserted a middleware before ParamsParser to force a legitimate Content-Type. It is a bit of a hack, but it prevents the underlying problem.

I'm not sure how common a problem this is, but it is really easy to replicate. Without investigating the stack trace and headers it would appear to be an obscure encoding error.

@tenderlove
Copy link
Member

Hi, I'm not 100% sure how to handle this. I don't think invalid unicode character should make Rack blow up. At the very least, we should hand the data to the app somehow.

Second, I'm not sure the test is 100% accurate. In the string literal will be tagged as the file's encoding, but I think (I'm not 100% sure) that the strings passed to this function come off the socket and will be tagged as ASCII-8BIT. So the test should probably be this:

should "parse nested queries with invalid bytes correctly" do
  x = "fo\255[]=1"
  x.force_encoding Encoding::ASCII_8BIT
  Rack::Utils.parse_nested_query(x)
end

It still blows up, just not in the same place. Here is a potential fix:

diff --git a/lib/rack/utils.rb b/lib/rack/utils.rb
index 43bbef3..d4b730a 100644
--- a/lib/rack/utils.rb
+++ b/lib/rack/utils.rb
@@ -91,7 +91,7 @@ module Rack
       params = KeySpaceConstrainedParams.new

       (qs || '').split(d ? /[#{d}] */n : DEFAULT_SEP).each do |p|
-        k, v = p.split('=', 2).map { |s| unescape(s) }
+        k, v = p.split('=', 2).map { |s| unescape(s, Encoding::ASCII_8BIT) }

         normalize_params(params, k, v)
       end
@@ -107,9 +107,12 @@ module Rack

       return if k.empty?

+      k.force_encoding Encoding::UTF_8 if k
+
       if after == ""
         params[k] = v
       elsif after == "[]"
+        v.force_encoding Encoding::UTF_8 if v
         params[k] ||= []
         raise TypeError, "expected Array (got #{params[k].class.name}) for param `#{k}'" unless params[k].is_a?(Array)
         params[k] << v

We tag the strings with the encoding after they've been parsed. It won't blow up in Rack anymore. You'll be able to get the data, but the encoding may potentially be invalid:

require 'rack/utils'

x = "fo\255[]=1"
x.force_encoding Encoding::ASCII_8BIT
z = Rack::Utils.parse_nested_query(x)
p z
z.each do |k,v|
  p [k, k.encoding, k.valid_encoding?]
  v.each do |v|
    p [v, v.encoding, v.valid_encoding?]
  end
end

output:

[aaron@higgins rack (master)]$ ruby -I lib x.rb 
{"fo\xAD"=>["1"]}
["fo\xAD", #<Encoding:UTF-8>, false]
["1", #<Encoding:UTF-8>, true]
[aaron@higgins rack (master)]$

It may be invalid, but at least the application has a chance to do something with the data. For example:

require 'rack/utils'

x = "fo\255[]=1"
x.force_encoding Encoding::ASCII_8BIT
z = Rack::Utils.parse_nested_query(x)
p z
z.each do |k,v|
  unless k.valid_encoding? # clean up invalid strings
    k = k.dup.encode!(:invalid => :replace)
  end
  p [k, k.encoding, k.valid_encoding?]
  v.each do |v|
    p [v, v.encoding, v.valid_encoding?]
  end
end

Output:

[aaron@higgins rack (master)]$ ruby -I lib x.rb 
{"fo\xAD"=>["1"]}
["fo�", #<Encoding:UTF-8>, true]
["1", #<Encoding:UTF-8>, true]
[aaron@higgins rack (master)]$

Or the application could just choose to raise an exception.

I think this is the way to go, but I'm open to suggestions! :-)

@leahneukirchen
Copy link
Member

So, how do I write an app which requires latin-1 encoded parameter strings?

@tenderlove
Copy link
Member

@chneukirchen We should probably allow people to pass an encoding to parse_nested_params (which would then get passed to unescape, and then to the method on URI), but I think it's a different problem than this issue. Even if you're allowed to specify the encoding, that does not mean that bytes coming off the wire will be valid for that encoding.

@sorentwo
Copy link
Author

sorentwo commented Dec 6, 2013

@tenderlove With the patch applied and the original test an ArgumentError is still thrown:

ArgumentError: invalid byte sequence in UTF-8
    /Users/pselbert/Work/Code/rack/lib/rack/utils.rb:93:in `split': Rack::Utils - should parse nested queries with invalid bytes correctly

Changing the test over to force the encoding, as you proposed in turn throws a Bacon::Error for an empty specification. Adding an actual expectation fixed the problem:

  should "parse nested queries with invalid bytes correctly" do
    x = "fo\255[]=1".force_encoding(Encoding::ASCII_8BIT)
    Rack::Utils.parse_nested_query(x).should.equal("fo\xAD" => ["1"])
  end

The effected Rails app has a request spec that covers the scenario (and passes with the Content-Type correcting middleware. I went ahead and applied the patch to the bundled Rack gem, removed the middleware, and ran the spec again. Right back to the invalid byte sequence in UTF-8 error again.

In the request spec the body is pure binary data, it is a portion of a png file. It may not be possible to force encoding—perhaps Rack should avoid parameter parsing entirely in this situation?

@tenderlove
Copy link
Member

Rack should avoid parameter parsing entirely in this situation?

After thinking about it for a while, I came to the same conclusion. If we don't know the encoding of the parameters, we can't parse it at all. force_encoding to binary, then parsing is the wrong solution because the characters we use in the parser could have a different meaning depending on the actual encoding of the string (e.g. a codepage where the ascii character value of [ means something else).

@sorentwo
Copy link
Author

@tenderlove I haven't forgotten about this. I'll take a stab at bypassing parsing altogether sometime today or tomorrow.

@bf4
Copy link

bf4 commented Dec 13, 2013

fyi whitequark/rack-utf8_sanitizer#7

and note that Rack specifies

"The input stream is an IO-like object which contains the raw HTTP POST data. When applicable, its external encoding must be “ASCII-8BIT” and it must be opened in binary mode, for Ruby 1.9 compatibility. The input stream must respond to gets, each, read and rewind."

@@ -204,6 +204,10 @@ def kcodeu
message.should.equal "expected Array (got String) for param `y'"
end

should "parse nested queries with invalid bytes correctly" do
Rack::Utils.parse_nested_query("fo\255[]=1")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this missing the test that it doesn't raise an ArgumentError. I recently wrote something like this for a Rails app I work on, though in rspec

# coding: utf-8
require 'spec_helper'
require_dependency Rails.root.join('lib/string_formatter').to_s

describe StringFormatter do
  it '#hash_values_as_utf8' do
    h = {
      :foo => "Schäfbernt\xED",
      'bar' => 1,
      'baz' => :qux,
    }.with_indifferent_access
    expect {
      h[:foo].present?
    }.to raise_error(ArgumentError) {|e|
      e.message == "invalid byte sequence in UTF-8"
    }
    StringFormatter.hash_values_as_utf8!(h)
    expect(h[:foo].present?).to be_true
    expect(h[:foo]).to eq("Schäfbernt")
    expect(h[:bar]).to eq(1)
    expect(h[:baz]).to eq(:qux)
  end
end

and the impl

# coding: utf-8
module StringFormatter
  module_function

  def hash_values_as_utf8!(hash)
    hash.each do |key, value|
      next unless value.is_a?(String)
      hash[key] = StringFormatter.as_utf8(value, '')
    end
  end

  def as_utf8(str, replace = '_')
    if str.respond_to?(:encode!)
      if defined?(JRUBY_VERSION)
        # don't blow up https://github.com/jruby/jruby/issues/375
        str = as_ascii(str)
      else
        # Converting it to a higher higher character set (UTF-16) and then
        #  back (to UTF-8) ensures that you will strip away invalid or undefined byte sequences.
        str.encode!(Encoding::UTF_16LE, :invalid => :replace, :undef => :replace, :replace => replace)
      end
      str.encode!(Encoding::UTF_8)
    else
      str
    end
  rescue => e
    Airbrake.notify(e)
    force_utf8_encoding(str)
  end

  # help out copy and pasting errors of good-looking email addresses
  # by stripping out non-ASCII characters
  def as_ascii(str)
    if str.respond_to?(:to_ascii)
      str.to_ascii
    else
      # avoids invalid multi-byte escape error
      ascii_text = str.encode( Encoding::ASCII, invalid: :replace, undef: :replace, replace: '' )
      # see http://www.ruby-forum.com/topic/183413
      pattern = Regexp.new('[\x80-\xff]', nil, 'n')
      ascii_text.gsub(pattern, '')
    end
  end
end

For 1.8, I have seen something like the below using BINARY aka ASCII-8BIT

  # Encodes a string from encoding "from" to encoding "to" in
  # a way that works for both ruby 1.8 and 1.9
  def convert_string_encoding(to, from, str)
    if "1.9".respond_to?(:force_encoding)
      str = str.dup if str.frozen?
      str.encode(to, from, :undef => :replace)
    else
      require 'iconv'
      Iconv.conv(to, from, str)
    end
  end

@raggi raggi added this to the Rack 1.6 milestone Jul 12, 2014
@raggi
Copy link
Member

raggi commented Nov 26, 2014

RE: "Unfortunately I've seen this error thousands of times, and there is no way for me to stop the exception from happening inside my application's stack." @sorentwo

Yes, you can write a middleware that peeks at the request body to see if it is indeed form data, or if it's arbitrary binary. A magic bytes scanner should get you most of the way there, and you can do this before anyone ever creates a Rack::Request. If you can classify the mime type from the request data, just correct the header and pass it on to the app, if you find form data that isn't valid form data, fire a 400.

https://github.com/minad/mimemagic or https://github.com/blackwinter/ruby-filemagic should get you started.

@sorentwo
Copy link
Author

I should have followed this up (months ago), but that is exactly what I did. Our app only saw this for certain endpoints so my solution was a bit more ad-hoc.

Clearly my efforts here aren't getting anywhere, and there are workarounds available, so I'm going to close this out.

@sorentwo sorentwo closed this Nov 27, 2014
@raggi
Copy link
Member

raggi commented Nov 27, 2014

There's just not a lot we can do about invalid requests without breaking users apps. It's not valid for us, being a generic middle-of-the-stack library to enforce corrections on bad requests. Do you have some recommendation?

@sorentwo
Copy link
Author

No recommendations really. I was being serious that I wasn't getting anywhere, I hadn't touched the issue in nearly a year.

With middleware it is possible to prevent this issue. Also, the tools available to handle encoding issues vary so much between Ruby versions that it seems easier to apply for a particular application.

@raggi
Copy link
Member

raggi commented Nov 28, 2014

I think we should add this to the known-issues file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants