Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI.unescape "extension" fails with Unicode input #32183

Merged
merged 1 commit into from Mar 7, 2018

Conversation

@kivikakk
Copy link
Contributor

@kivikakk kivikakk commented Mar 7, 2018

Summary

This is a fix for a bug @tenderlove and I have been stepping through together.

URI.unescape in Rails 4.2 throws an Encoding::CompatibilityError if the (UTF-8 tagged) argument contains actual Unicode characters. This doesn't happen on 3.x; compare the monkey-patches: Turns out this was a patch made in our own application. Looks like we should be able to just pull this across.

3.x patched

      def unescape(str, escaped = /%[a-fA-F\d]{2}/)
        # TODO: Are we actually sure that ASCII == UTF-8?
        # YK: My initial experiments say yes, but let’s be sure please
        enc = str.encoding
        enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
        str.dup.force_encoding(Encoding::ASCII_8BIT).gsub(escaped) { [$&[1, 2].hex].pack(‘C’) }.force_encoding(enc)
      end

4.2

    def unescape(str, escaped = /%[a-fA-F\d]{2}/)
      # TODO: Are we actually sure that ASCII == UTF-8?
      # YK: My initial experiments say yes, but let’s be sure please
      enc = str.encoding
      enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
      str.gsub(escaped) { [$&[1, 2].hex].pack(‘C’) }.force_encoding(enc)
    end

The issue is that [$&[1, 2].hex].pack('C') returns an ASCII-8BIT tagged string, which we then fail to gsub into str. This wasn't a problem in the 3.x patched variant where the string was tagged as ASCII-8BIT anyway.

This PR opens by correcting the test; parser.escape(str) returns an US-ASCII (!) tagged string, so parser.unescape succeeds for similar reasons as why the 3.x patched variant succeeded. This corrects the test to resemble the actual use-case: passing UTF-8 tagged strings into URI.unescape.

@kivikakk kivikakk changed the title URI.unescape "extension" URI.unescape "extension" fails with Unicode input Mar 7, 2018
@matthewd
Copy link
Member

@matthewd matthewd commented Mar 7, 2018

str.dup.force_encoding(Encoding::ASCII_8BIT)

I can't find that in 3-2-stable 😕

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

I can't find that in 3-2-stable 😕

Ugh, don't tell me this is in one of our own patches to activesupport … investigating.

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

Not only is it from a patch in activesupport made in our application, it was me who made the change nearly two years ago. My goodness.

capture d ecran 2018-03-07 a 12 31 30

So this has been in Rails forever — I think it's a valid bug nonetheless.

@matthewd
Copy link
Member

@matthewd matthewd commented Mar 7, 2018

Yeah, I agree it sounds like a sensible change.. that's why I tried to blame-chase why it'd gone away 😅

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

Haha, right? I'll let the latest commit fail CI so we can red–green this properly, then push the fix commit.

Previously, URI.enscape could handle Unicode input (without any actual
escaped characters), or input with escaped characters (but no actual
Unicode characters) - not both.

    URI.unescape("\xe3\x83\x90")  # => "バ"
    URI.unescape("%E3%83%90")  # => "バ"
    URI.unescape("\xe3\x83\x90%E3%83%90")  # =>
                                         # Encoding::CompatibilityError

We need to let `gsub` handle this for us, and then force back to the
original encoding of the input.  The result String will be mangled if
the percent-encoded characters don't conform to the encoding of the
String itself, but that goes without saying.

Signed-off-by: Ashe Connor <ashe@kivikakk.ee>
@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

This is ready for review. There's two failing builds (one required), but it looks like it's a timeout (and I can't restart it).

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

Someone reran the failing builds! Thanks! ❤️

@eileencodes eileencodes merged commit 6a54932 into rails:master Mar 7, 2018
2 checks passed
@eileencodes
Copy link
Member

@eileencodes eileencodes commented Mar 7, 2018

Thanks @kivikakk ❤️

@tenderlove
Copy link
Member

@tenderlove tenderlove commented Mar 7, 2018

@kivikakk it seems like this code is almost entirely lifted from upstream. Does this seem like something we should upstream so we can rm the monkeypatch in Rails?

@eileencodes
Copy link
Member

@eileencodes eileencodes commented Mar 7, 2018

@tenderlove this is Rails 😉

@eileencodes
Copy link
Member

@eileencodes eileencodes commented Mar 7, 2018

Oh gosh you meant upstream to Ruby. 😳

@kivikakk kivikakk deleted the uri-ext-fix branch Mar 7, 2018
@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 7, 2018

I dug into it a bit — URI.unescape was deprecated in 2009 (lol), see ruby/ruby@238b979. CGI.uenscape is recommended instead, which doesn't have this bug (it calls string.tr(…).b before gsub. Should we still consider submitting a patch?

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 8, 2018

Patch opened upstream: https://bugs.ruby-lang.org/issues/14586

After discussion with @tenderlove, it appears the following is the case:

  • This problem exists in upstream Ruby. No-one is currently relying on the behaviour introduced by this PR, since it's always been broken.
  • Since we're monkey-patching Ruby itself, we should try to make the fix there instead of monkey-patching.

So the best course of action looks like:

  • Back this PR out before anyone comes to rely on the newly-fixed behaviour.
  • Try to get the patch across the line in Ruby.
  • Once (if) it's in Ruby-next, reintroduce this monkey-patch; this way the following combinations will work: old Rails, new Ruby; new Rails, old Ruby; new Rails, new Ruby. (Only old Ruby, old Rails won't work.)
  • Keep the monkey-patch in our application. We can remove it if we get a new Rails with the monkey-patch.

Alternatively, we can just keep this PR in and hope Ruby might take the patch, but keeping the diff of core libraries between Ruby and Rails seems like something to aim for.

How does this sound?

@matthewd
Copy link
Member

@matthewd matthewd commented Mar 8, 2018

CGI.unescape is recommended instead

Do we call this monkey-patched method internally? If so, should we maybe be using CGI.unescape instead?

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 8, 2018

Do we call this monkey-patched method internally? If so, should we maybe be using CGI.unescape instead?

Nope — this is only a convenience patch that our users might be relying on.

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 8, 2018

This has been patched in Ruby! 🎉

So, I think we can keep this PR in.

@tenderlove
Copy link
Member

@tenderlove tenderlove commented Mar 8, 2018

@kivikakk I think even with the patch in Ruby, this line will never un-monkeypatch URI because the default encoding is UTF-8, but parser.escape will return a US-ASCII string (causing parser.unescape to return a wrongly encoded US-ASCII string). Would you mind fixing that?

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 9, 2018

Good catch, I'll open a PR shortly. 👍

@kivikakk
Copy link
Contributor Author

@kivikakk kivikakk commented Mar 9, 2018

#32210

jasonroelofs added a commit to jasonroelofs/rails that referenced this issue Aug 9, 2018
Previously, URI.enscape could handle Unicode input (without any actual
escaped characters), or input with escaped characters (but no actual
Unicode characters) - not both.

    URI.unescape("\xe3\x83\x90")  # => "バ"
    URI.unescape("%E3%83%90")  # => "バ"
    URI.unescape("\xe3\x83\x90%E3%83%90")  # =>
                                         # Encoding::CompatibilityError

We need to let `gsub` handle this for us, and then force back to the
original encoding of the input.  The result String will be mangled if
the percent-encoded characters don't conform to the encoding of the
String itself, but that goes without saying.

Signed-off-by: Ashe Connor <ashe@kivikakk.ee>

This is a backport of rails#32183
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants