
`invalid multibyte escape` with regexp literals #134

yujinakayama opened this Issue Feb 28, 2014 · 8 comments

4 participants



I've tried to fix this, but I'm not sure of the best way.

Here are the points I've investigated so far:


@yujinakayama Could you please compose a table showing which combinations of regexp bodies, regexp encoding options, and source encoding options make Parser's behavior deviate from Ruby's? That would speed up solving this issue a lot.


I'm really confused by the combinations of the parameters. 😱

Note that Parser was run on utf-8 source in this verification.


@yujinakayama Awesome, thanks! I'll take a look, but do not expect a quick solution. It's such a horrible mess that it'll take a while to untangle.

mbj commented Mar 12, 2014

Noise: I pray for Ruby dropping non-UTF-8 encoded source. I really have hope for rubinius-x here.

whitequark added the bug label Apr 13, 2014

@yujinakayama Fascinating. I've removed all the cases where Ruby and parser have identical behavior and categorized the leftover ones:

| # | Magic comment | Regexp | Ruby (regexp encoding or error) | Parser (encoding or error) | Parser regexp node |
|---|---|---|---|---|---|
| 1 | utf-8 | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | us-ascii | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | ascii-8bit | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | koi8-r | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | utf-8 | `/À/n` | option 'n' != source 'UTF-8' | U+00C0 from UTF-8 to ASCII-8BIT | `(str "À") (regopt :n)` |
| 1 | utf-8 | `/д/n` | option 'n' != source 'UTF-8' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 2 | ascii-8bit | `/À/u` | option 'u' != source 'ASCII-8BIT' | `"\xC3"` from ASCII-8BIT to UTF-8 | `(str "\xC3\x80") (regopt :u)` |
| 2 | ascii-8bit | `/д/u` | option 'u' != source 'ASCII-8BIT' | `"\xD0"` from ASCII-8BIT to UTF-8 | `(str "\xD0\xB4") (regopt :u)` |
| 3 | koi8-r | `/д/u` | option 'u' != source 'KOI8-R' | UTF-8 | `(str "д") (regopt :u)` |
| 3 | koi8-r | `/д/n` | option 'n' != source 'KOI8-R' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 3 | koi8-r | `/д/` | KOI8-R | UTF-8 | `(str "д") (regopt)` |
| 3 | koi8-r | `/\xff/` | KOI8-R | invalid multibyte escape: `/\xff/` | `(str "\\xff") (regopt)` |
| 4 | us-ascii | `/\xff/` | ASCII-8BIT | invalid multibyte escape: `/\xff/` | `(str "\\xff") (regopt)` |

So the cases are:

  1. /abc/u and /abc/n. Ruby always encodes a /u regexp in UTF-8. What do we do here? Nothing. Parser doesn't interpret regexp options, so the consumer must pass it a UTF-8 string in order to emulate /u. Note that you must use Regexp::FIXEDENCODING to reproduce the Ruby parser's behavior; just encoding the source in UTF-8 is not enough. The same applies to /n, which, by the way, is US-ASCII for some obscure reason.
  2. This is trickier, but should probably be handled by the consumer, too. ASCII-8BIT inputs will produce ASCII-8BIT strings in the source, so this case can be detected and flagged.
  3. This cannot and will not be reproduced with the current architecture of Parser, as Parser internally works only with Unicode or binary strings.
  4. This seems to be a bug.
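
To make case 1 concrete, here is a minimal sketch of what a consumer might do when it sees `(regopt :u)`. The helper name `regexp_from_node` and its arguments are hypothetical and not part of Parser's API; the sketch only relies on `Regexp.new`, `Regexp::FIXEDENCODING`, and `String#encode` from core Ruby.

```ruby
# Hypothetical consumer-side helper (not part of Parser): rebuild a Regexp
# from the body string and the option symbols found in the AST.
def regexp_from_node(body, options)
  flags = 0
  if options.include?(:u)
    # Emulate /u: make sure the body is UTF-8 and pin the encoding.
    # Without FIXEDENCODING an ASCII-only body falls back to US-ASCII.
    body = body.encode(Encoding::UTF_8)
    flags |= Regexp::FIXEDENCODING
  end
  Regexp.new(body, flags)
end

plain = regexp_from_node("abc", [])    # like /abc/
fixed = regexp_from_node("abc", [:u])  # like /abc/u

puts "plain: #{plain.encoding}, fixed_encoding? #{plain.fixed_encoding?}"
puts "fixed: #{fixed.encoding}, fixed_encoding? #{fixed.fixed_encoding?}"
```

The first regexp reports US-ASCII even though its body is a UTF-8 string, which is why encoding the source alone does not reproduce `/abc/u`.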

@yujinakayama Actually, this is not a bug. You're parsing that file in the wrong mode.


$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
$ ruby -e '/\xff/'
-e:1: invalid multibyte escape: /\xff/
$ ./bin/ruby-parse --19 -e '/xff/'
  (str "xff")
$ ./bin/ruby-parse --19 -e 'if /\xff/ =~ foo; end'
      (str "\\xff")
    (send nil :foo)) nil nil)
$ ./bin/ruby-parse --20 -e '/xff/'
  (str "xff")
$ ./bin/ruby-parse --20 -e 'if /\xff/ =~ foo; end'
Failed on: (fragment:0)
/home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `initialize': invalid multibyte escape: /\xff/ (RegexpError)
        from /home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `new'

That file wouldn't actually run under 2.0, and parser in 1.9 mode handles it just fine.
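
The difference between the two modes can be sketched in plain Ruby, assuming it comes down to the encoding of the string handed to `Regexp.new` (Ruby 2.0 made UTF-8 the default source encoding). The helper `compile_result` below is hypothetical and only illustrates that assumption; it is not how Parser's builder is written.

```ruby
# Hypothetical helper: compile a regexp body under a given string encoding.
# Returns the resulting Regexp's encoding, or the error message on failure.
def compile_result(body, encoding)
  Regexp.new(body.dup.force_encoding(encoding)).encoding
rescue RegexpError => e
  e.message
end

body = '\xff' # four ASCII characters: backslash, x, f, f

utf8_result   = compile_result(body, Encoding::UTF_8)
binary_result = compile_result(body, Encoding::BINARY)

puts "UTF-8 body:  #{utf8_result}"
puts "binary body: #{binary_result}"
```

With a UTF-8 body the `\xff` escape names a byte that can never appear in valid UTF-8, so compilation fails; with a binary body the same escape is accepted.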

whitequark closed this Apr 17, 2014