
Small optimization of 1.9 unescape. We should make sure that inbound ASCII always means UTF-8. It seems so based on a quick survey of common browsers, but let's be sure
1 parent b8af484 commit 16ee4b4d1b125bd3edb5c191d58c7afdf6d3232e @wycats wycats committed Jun 4, 2010
Showing with 6 additions and 2 deletions.
  1. +6 −2 activesupport/lib/active_support/core_ext/uri.rb
8 activesupport/lib/active_support/core_ext/uri.rb
@@ -6,11 +6,15 @@
str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" # Ni-ho-nn-go in UTF-8, means Japanese.
parser = URI::Parser.new
unless str == parser.unescape(parser.escape(str))
URI::Parser.class_eval do
remove_method :unescape
- def unescape(str, escaped = @regexp[:ESCAPED])
- enc = (str.encoding == Encoding::US_ASCII) ? Encoding::UTF_8 : str.encoding
+ def unescape(str, escaped = /%[a-fA-F\d]{2}/)
+ # TODO: Are we actually sure that ASCII == UTF-8?
mat813 added a note Jun 5, 2010

It's actually part of UTF-8's definition:

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters.

That means that if Ruby says a string is US_ASCII, it should contain only 7-bit characters, so it is safe to change the encoding to UTF-8.
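
The subset relationship described in this note can be checked directly on Ruby 1.9 or later. This is only a minimal sketch: the sample string and variable names are illustrative and do not come from the commit.

```ruby
# A 7-bit-only string tagged as US-ASCII (the sample value is hypothetical).
ascii = "hello%20world".dup.force_encoding(Encoding::US_ASCII)

# Re-tagging the same bytes as UTF-8 changes no bytes and yields a valid
# string, because every US-ASCII byte is also a one-octet UTF-8 sequence.
utf8 = ascii.dup.force_encoding(Encoding::UTF_8)

same_bytes  = ascii.bytes.to_a == utf8.bytes.to_a
still_valid = utf8.valid_encoding?
seven_bit   = ascii.ascii_only?
```

Note that this only holds when the string really is 7-bit; `ascii_only?` is one way to confirm that before relying on the re-tagging.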

+ # YK: My initial experiments say yes, but let's be sure please
+ enc = str.encoding
+ enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(enc)
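
To see what the patched method body does in isolation, the same three lines can be wrapped in a standalone helper. This is a sketch under assumptions: the name `unescape_utf8` and the sample input are hypothetical and are not part of Active Support or of this commit.

```ruby
# Hypothetical standalone helper mirroring the patched method body.
def unescape_utf8(str, escaped = /%[a-fA-F\d]{2}/)
  enc = str.encoding
  # Treat inbound US-ASCII as UTF-8, as the commit does.
  enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
  # Replace each %XX escape with the raw byte, then re-tag the result.
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(enc)
end

# Percent-encoded form of the UTF-8 test string from the diff, tagged US-ASCII.
escaped_input = "%E6%97%A5%E6%9C%AC%E8%AA%9E".dup.force_encoding(Encoding::US_ASCII)
decoded = unescape_utf8(escaped_input)
```

With a US-ASCII input, `decoded` comes back tagged as UTF-8 rather than US-ASCII, which is what lets the round-trip comparison at the top of the file succeed.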

0 comments on commit 16ee4b4
