`invalid multibyte escape` with regexp literals #134

yujinakayama · 2014-02-28T11:53:36Z

I've tried to fix this, but I'm not sure of the best way.

Here are the points I've investigated so far:

- Regexp literals that include a character code that is invalid in the source's encoding raise an error (e.g. `/\xff/` in UTF-8 source).
- The Ruby documentation says that the US-ASCII encoding rejects character codes with the 8th bit set, but in practice Ruby accepts such a regexp and converts it to ASCII-8BIT encoding.
  - https://gist.github.com/yujinakayama/ac6b20a2b34e0d215f38
- My trial-and-error branch: https://github.com/yujinakayama/parser/tree/fix-regexp-encoding
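
For reference, the first point can be reproduced outside Parser with plain `Regexp.new`. This is only a minimal sketch, assuming a reasonably recent Ruby (2.3 or later for `String.new` with `encoding:`); `compile_result` is a hypothetical helper name used for illustration, not something from Parser.

```ruby
# Hypothetical helper (not part of Parser): try to compile a regexp body and
# return either the resulting Regexp's encoding or the error message.
def compile_result(body)
  Regexp.new(body).encoding
rescue RegexpError => e
  e.message
end

# The escape \xff denotes a byte that can never appear in valid UTF-8,
# so a UTF-8 encoded body is rejected when the Regexp is built.
utf8_body = String.new('\xff', encoding: Encoding::UTF_8)
utf8_result = compile_result(utf8_body)

# The same four characters tagged as binary (ASCII-8BIT) compile fine.
binary_body = String.new('\xff', encoding: Encoding::BINARY)
binary_result = compile_result(binary_body)

puts utf8_result
puts binary_result
```

The only difference between the two calls is the encoding tag on the body string, which is why the source encoding matters here.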


whitequark · 2014-02-28T11:59:58Z

@yujinakayama Could you please compose a table indicating which combinations of regexp bodies, regexp encoding options and source encoding options result in a deviation of parser's behavior from ruby's behavior? This would speed up solving of this issue a lot.

bbatsov · 2014-03-12T07:45:30Z

@yujinakayama Ping.

yujinakayama · 2014-03-12T17:06:27Z

I'm really confused by the combinations of the parameters. 😱

Note that Parser was run on utf-8 source in this verification.

https://gist.github.com/yujinakayama/9511027

whitequark · 2014-03-12T17:08:44Z

@yujinakayama Awesome, thanks! I'll take a look, but do not expect a quick solution. It's such a horrible mess that it'll take a while to untangle.

mbj · 2014-03-12T17:51:11Z

Noise: I pray for Ruby dropping non-UTF-8 encoded source. I really have hope for rubinius-x here.

whitequark · 2014-04-17T03:53:58Z

@yujinakayama Fascinating. I've removed all the cases where Ruby and parser have identical behavior and categorized the leftover ones:

| # | Magic comment | Regexp | Ruby (Regexp encoding or error) | Parser (`Regexp.new` encoding or error) | Parser regexp node |
|---|---|---|---|---|---|
| 1 | utf-8 | `/abc/u` | UTF-8 | US-ASCII | `(str "abc")` `(regopt :u)` |
| 1 | us-ascii | `/abc/u` | UTF-8 | US-ASCII | `(str "abc")` `(regopt :u)` |
| 1 | ascii-8bit | `/abc/u` | UTF-8 | US-ASCII | `(str "abc")` `(regopt :u)` |
| 1 | koi8-r | `/abc/u` | UTF-8 | US-ASCII | `(str "abc")` `(regopt :u)` |
| 1 | utf-8 | `/À/n` | option 'n' != source 'UTF-8' | U+00C0 from UTF-8 to ASCII-8BIT | `(str "À")` `(regopt :n)` |
| 1 | utf-8 | `/д/n` | option 'n' != source 'UTF-8' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д")` `(regopt :n)` |
| 2 | ascii-8bit | `/À/u` | option 'u' != source 'ASCII-8BIT' | "\xC3" from ASCII-8BIT to UTF-8 | `(str "\xC3\x80")` `(regopt :u)` |
| 2 | ascii-8bit | `/д/u` | option 'u' != source 'ASCII-8BIT' | "\xD0" from ASCII-8BIT to UTF-8 | `(str "\xD0\xB4")` `(regopt :u)` |
| 3 | koi8-r | `/д/u` | option 'u' != source 'KOI8-R' | UTF-8 | `(str "д")` `(regopt :u)` |
| 3 | koi8-r | `/д/n` | option 'n' != source 'KOI8-R' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д")` `(regopt :n)` |
| 3 | koi8-r | `/д/` | KOI8-R | UTF-8 | `(str "д")` `(regopt)` |
| 3 | koi8-r | `/\xff/` | KOI8-R | invalid multibyte escape: /\xff/ | `(str "\\xff")` `(regopt)` |
| 4 | us-ascii | `/\xff/` | ASCII-8BIT | invalid multibyte escape: /\xff/ | `(str "\\xff")` `(regopt)` |

whitequark · 2014-04-17T04:12:39Z

So the cases are:

1. `/abc/u` and `/abc/n`. Ruby always encodes a `/u` regexp in UTF-8. What do we do here? Nothing. Parser doesn't interpret regexp options, so the consumer must pass a UTF-8 string to `Regexp.new` in order to emulate `/u`. Note that you must use `Regexp::FIXEDENCODING` in order to reproduce the Ruby parser's behavior; just encoding the source in UTF-8 is not sufficient. The same goes for `/n`, which, by the way, is US-ASCII for some obscure reason.
2. This is more tricky, but should probably be handled by the consumer, too. ASCII-8BIT inputs will produce ASCII-8BIT strings in the source, so this case can be detected and flagged.
3. This cannot and will not be reproduced with the current architecture of Parser, as Parser internally works only with Unicode or binary strings.
4. This seems to be a bug.
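
Case 1 could be handled on the consumer side roughly as follows. This is a minimal sketch, not Parser's API: `build_regexp` is a hypothetical helper, and it assumes the consumer has already pulled the body string and the option symbols (such as `:u`) out of the `regexp` node.

```ruby
# Hypothetical consumer-side helper (not part of Parser): build a Regexp
# from a literal's body and its option symbols, emulating /u.
def build_regexp(body, options)
  flags = 0
  flags |= Regexp::IGNORECASE if options.include?(:i)
  flags |= Regexp::EXTENDED   if options.include?(:x)
  flags |= Regexp::MULTILINE  if options.include?(:m)

  if options.include?(:u)
    # A UTF-8 string alone is not enough for an ASCII-only body;
    # FIXEDENCODING pins the Regexp to the encoding of the string.
    body = body.encode(Encoding::UTF_8)
    flags |= Regexp::FIXEDENCODING
  end

  Regexp.new(body, flags)
end

plain = build_regexp("abc", [])
fixed = build_regexp("abc", [:u])

puts plain.encoding
puts fixed.encoding
puts fixed.fixed_encoding?
```

Without the flag, an ASCII-only body yields a US-ASCII regexp, which matches the Parser column of the table; with it, the result has the same encoding as the `/abc/u` literal. The sketch does not try to handle case 2, where the body arrives as an ASCII-8BIT string containing non-ASCII bytes.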

whitequark · 2014-04-17T04:24:19Z

@yujinakayama Actually, this is not a bug. You're parsing that file in the wrong mode.

Look:

$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
$ ruby -e '/\xff/'
-e:1: invalid multibyte escape: /\xff/
$ ./bin/ruby-parse --19 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --19 -e 'if /\xff/ =~ foo; end'
(if
  (match-with-lvasgn
    (regexp
      (str "\\xff")
      (regopt))
    (send nil :foo)) nil nil)
$ ./bin/ruby-parse --20 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --20 -e 'if /\xff/ =~ foo; end'
Failed on: (fragment:0)
/home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `initialize': invalid multibyte escape: /\xff/ (RegexpError)
        from /home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `new'
        [snip]

That file wouldn't actually run under 2.0, and parser in 1.9 mode handles it just fine.

whitequark added the bug label Apr 13, 2014

whitequark closed this as completed Apr 17, 2014
