`invalid multibyte escape` with regexp literals #134

Closed
yujinakayama opened this Issue · 8 comments

4 participants

@yujinakayama

bbatsov/rubocop#796

I've tried to fix this, but I'm not sure of the best way.

Here are the points I've investigated so far:

@whitequark
Owner

@yujinakayama Could you please compose a table indicating which combinations of regexp bodies, regexp encoding options and source encoding options result in a deviation of parser's behavior from ruby's behavior? This would speed up solving of this issue a lot.

@yujinakayama

I'm really confused by the combinations of the parameters. :scream:

Note that Parser was run on utf-8 source in this verification.

https://gist.github.com/yujinakayama/9511027

@whitequark
Owner

@yujinakayama Awesome, thanks! I'll take a look, but do not expect a quick solution. It's such a horrible mess that it'll take a while to untangle.

@mbj
Collaborator

Noise: I pray for Ruby to drop non-UTF-8 encoded source. I really have hope for rubinius-x here.

@whitequark whitequark added the bug label
@whitequark
Owner

@yujinakayama Fascinating. I've removed all the cases where Ruby and parser have identical behavior and categorized the leftover ones:

| Case | Magic comment | Regexp | Ruby (Re encoding or error) | Parser (Re.new() encoding or error) | Parser regexp node |
|---|---|---|---|---|---|
| 1 | utf-8 | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | us-ascii | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | ascii-8bit | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | koi8-r | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | utf-8 | `/À/n` | option 'n' != source 'UTF-8' | U+00C0 from UTF-8 to ASCII-8BIT | `(str "À") (regopt :n)` |
| 1 | utf-8 | `/д/n` | option 'n' != source 'UTF-8' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 2 | ascii-8bit | `/À/u` | option 'u' != source 'ASCII-8BIT' | `"\xC3"` from ASCII-8BIT to UTF-8 | `(str "\xC3\x80") (regopt :u)` |
| 2 | ascii-8bit | `/д/u` | option 'u' != source 'ASCII-8BIT' | `"\xD0"` from ASCII-8BIT to UTF-8 | `(str "\xD0\xB4") (regopt :u)` |
| 3 | koi8-r | `/д/u` | option 'u' != source 'KOI8-R' | UTF-8 | `(str "д") (regopt :u)` |
| 3 | koi8-r | `/д/n` | option 'n' != source 'KOI8-R' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 3 | koi8-r | `/д/` | KOI8-R | UTF-8 | `(str "д") (regopt)` |
| 3 | koi8-r | `/\xff/` | KOI8-R | invalid multibyte escape: `/\xff/` | `(str "\\xff") (regopt)` |
| 4 | us-ascii | `/\xff/` | ASCII-8BIT | invalid multibyte escape: `/\xff/` | `(str "\\xff") (regopt)` |

@whitequark
Owner

So the cases are:

  1. /abc/u and /abc/n. Ruby always encodes a /u regexp in UTF-8. What do we do here? Nothing. Parser doesn't interpret regexp options, so to emulate /u the consumer must pass a UTF-8 string to Regexp.new. Note that you must also pass Regexp::FIXEDENCODING to reproduce the Ruby parser's behavior; encoding the source string in UTF-8 is not enough on its own. The same applies to /n, which, by the way, is US-ASCII for some obscure reason.
  2. This is trickier, but should probably be handled by the consumer, too. ASCII-8BIT inputs produce ASCII-8BIT strings in the source, so this case can be detected and indicated.
  3. This cannot and will not be reproduced with the current architecture of Parser, as Parser internally works only with Unicode or binary strings.
  4. This seems to be a bug.
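
For cases 1 and 2, here is a minimal sketch of what the consumer-side handling could look like. The helper names `emulate_u_regexp` and `binary_body?` are hypothetical and not part of Parser; the sketch assumes the consumer has already pulled the body string out of the `(str ...)` child of the regexp node.

```ruby
# Hypothetical consumer-side helper (an assumption, not Parser API).
# Builds a Regexp that behaves like a /u literal: the body is UTF-8 and
# Regexp::FIXEDENCODING pins the regexp to that encoding.
def emulate_u_regexp(body)
  Regexp.new(body.encode(Encoding::UTF_8), Regexp::FIXEDENCODING)
end

# Hypothetical check for case 2: a body that came from an ASCII-8BIT
# source arrives as a binary string and can be flagged by its encoding.
def binary_body?(body)
  body.encoding == Encoding::ASCII_8BIT
end

plain = Regexp.new("abc")        # ASCII-only body, no encoding flag
fixed = emulate_u_regexp("abc")  # same body, pinned to UTF-8

puts plain.encoding              # US-ASCII
puts fixed.encoding              # UTF-8
puts fixed.fixed_encoding?       # true
puts binary_body?("\xC3\x80".b)  # true
```

Without the flag, an ASCII-only body yields a US-ASCII regexp regardless of the string's own encoding, which matches the Parser column of the case 1 rows. Note that `String#encode` raises for a binary body containing non-ASCII bytes, so a case 2 body would need to be detected with `binary_body?` before calling `emulate_u_regexp`.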
@whitequark
Owner

@yujinakayama Actually, this is not a bug. You're parsing that file in the wrong mode.

Look:

$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
$ ruby -e '/\xff/'
-e:1: invalid multibyte escape: /\xff/
$ ./bin/ruby-parse --19 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --19 -e 'if /\xff/ =~ foo; end'
(if
  (match-with-lvasgn
    (regexp
      (str "\\xff")
      (regopt))
    (send nil :foo)) nil nil)
$ ./bin/ruby-parse --20 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --20 -e 'if /\xff/ =~ foo; end'
Failed on: (fragment:0)
/home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `initialize': invalid multibyte escape: /\xff/ (RegexpError)
        from /home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `new'
        [snip]

That file wouldn't actually run under 2.0, and parser in 1.9 mode handles it just fine.
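
The `RegexpError` in the 2.0 run is raised by `Regexp.new`, and the same error can be reproduced in plain Ruby. This is a minimal sketch; `try_regexp` is a hypothetical helper, and the body is forced to UTF-8 on the assumption that this matches the default source encoding of Ruby 2.0 and later.

```ruby
# Hypothetical helper (not part of Parser): try to build a Regexp from a
# regexp body and return either the Regexp or the RegexpError raised.
def try_regexp(source)
  # Treat the body as UTF-8, as a literal in a UTF-8 source file would be.
  Regexp.new(source.dup.force_encoding(Encoding::UTF_8))
rescue RegexpError => e
  e
end

# Single quotes keep the backslash: the body is the four characters \xff.
bad  = try_regexp('\xff')
good = try_regexp('xff')

puts bad.class    # RegexpError
puts bad.message  # message mentions "invalid multibyte escape"
puts good.class   # Regexp
```

A consumer that builds regexps from parsed source could rescue `RegexpError` in this way and report the offending literal instead of aborting.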

@whitequark whitequark closed this