invalid multibyte escape with regexp literals #134

Closed
yujinakayama opened this issue Feb 28, 2014 · 8 comments

@yujinakayama
Contributor

rubocop/rubocop#796

I've tried to fix this, but I'm not sure of the best way.

Here are the points I've investigated so far:

@whitequark
Owner

@yujinakayama Could you please compose a table indicating which combinations of regexp bodies, regexp encoding options, and source encoding options cause Parser's behavior to deviate from Ruby's? That would speed up solving this issue a lot.

@bbatsov
Contributor

bbatsov commented Mar 12, 2014

@yujinakayama Ping.

@yujinakayama
Contributor Author

I'm really confused by the combinations of the parameters. 😱

Note that Parser was run on utf-8 source in this verification.

https://gist.github.com/yujinakayama/9511027

@whitequark
Owner

@yujinakayama Awesome, thanks! I'll take a look, but do not expect a quick solution. It's such a horrible mess that it'll take a while to untangle.

@mbj
Collaborator

mbj commented Mar 12, 2014

Noise: I pray for Ruby to drop non-UTF-8 encoded source. I really have hope for rubinius-x here.

@whitequark added the bug label Apr 13, 2014

@whitequark
Owner

@yujinakayama Fascinating. I've removed all the cases where Ruby and parser have identical behavior and categorized the leftover ones:

| # | Magic comment | Regexp | Ruby (Re encoding or error) | Parser (Re.new() encoding or error) | Parser regexp node |
|---|---------------|--------|-----------------------------|-------------------------------------|--------------------|
| 1 | utf-8 | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | us-ascii | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | ascii-8bit | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | koi8-r | `/abc/u` | UTF-8 | US-ASCII | `(str "abc") (regopt :u)` |
| 1 | utf-8 | `/À/n` | option 'n' != source 'UTF-8' | U+00C0 from UTF-8 to ASCII-8BIT | `(str "À") (regopt :n)` |
| 1 | utf-8 | `/д/n` | option 'n' != source 'UTF-8' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 2 | ascii-8bit | `/À/u` | option 'u' != source 'ASCII-8BIT' | `"\xC3"` from ASCII-8BIT to UTF-8 | `(str "\xC3\x80") (regopt :u)` |
| 2 | ascii-8bit | `/д/u` | option 'u' != source 'ASCII-8BIT' | `"\xD0"` from ASCII-8BIT to UTF-8 | `(str "\xD0\xB4") (regopt :u)` |
| 3 | koi8-r | `/д/u` | option 'u' != source 'KOI8-R' | UTF-8 | `(str "д") (regopt :u)` |
| 3 | koi8-r | `/д/n` | option 'n' != source 'KOI8-R' | U+0434 from UTF-8 to ASCII-8BIT | `(str "д") (regopt :n)` |
| 3 | koi8-r | `/д/` | KOI8-R | UTF-8 | `(str "д") (regopt)` |
| 3 | koi8-r | `/\xff/` | KOI8-R | `invalid multibyte escape: /\xff/` | `(str "\\xff") (regopt)` |
| 4 | us-ascii | `/\xff/` | ASCII-8BIT | `invalid multibyte escape: /\xff/` | `(str "\\xff") (regopt)` |

@whitequark
Owner

So the cases are:

  1. /abc/u and /abc/n. Ruby always encodes a /u regexp in UTF-8. What do we do here? Nothing. Parser doesn't interpret regexp options, so it is up to the consumer to pass a UTF-8 string to Regexp.new in order to emulate /u. Note that you must use Regexp::FIXEDENCODING to reproduce the Ruby parser's behavior; merely encoding the source string in UTF-8 is not enough. The same applies to /n, which, by the way, is US-ASCII for some obscure reason.
  2. This is trickier, but should probably be handled by the consumer, too. ASCII-8BIT inputs will produce ASCII-8BIT strings in the source, so the mismatch can be detected and indicated.
  3. This cannot and will not be reproduced with the current architecture of Parser, as Parser internally works only with Unicode or binary strings.
  4. This seems to be a bug.
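
As a rough consumer-side sketch of cases 1 and 2 (not something Parser provides): `build_regexp` is a hypothetical helper name, `source` stands for the body string taken from the `(str ...)` node, and `opts` for the symbols in the `(regopt ...)` node. It assumes the file containing it is UTF-8 encoded.

```ruby
# Hypothetical consumer-side helper; not part of Parser's API.
# source: regexp body string, as found in the (str ...) node
# opts:   option symbols from the (regopt ...) node, e.g. [:u] or [:n]
def build_regexp(source, opts)
  if opts.include?(:u)
    # Case 2: a binary body with non-ASCII bytes conflicts with /u,
    # so report it instead of guessing an encoding.
    if source.encoding == Encoding::ASCII_8BIT && !source.ascii_only?
      raise ArgumentError, "regexp option 'u' used with an ASCII-8BIT body"
    end
    # Case 1: a UTF-8 string alone still yields a US-ASCII regexp for an
    # ASCII-only body; FIXEDENCODING is what pins it to UTF-8.
    Regexp.new(source.encode(Encoding::UTF_8), Regexp::FIXEDENCODING)
  elsif opts.include?(:n)
    Regexp.new(source, Regexp::NOENCODING)
  else
    Regexp.new(source)
  end
end

p(/abc/u.encoding)                     # what Ruby's own parser produces
p Regexp.new("abc").encoding           # what a plain Regexp.new produces
p build_regexp("abc", [:u]).encoding   # emulating /u
p build_regexp("abc", [:n]).encoding   # emulating /n
```

Case 3 is deliberately left out, since the original source encoding is no longer available at that point.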

@whitequark
Owner

@yujinakayama Actually, this is not a bug. You're parsing that file in the wrong mode.

Look:

```
$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux]
$ ruby -e '/\xff/'
-e:1: invalid multibyte escape: /\xff/
$ ./bin/ruby-parse --19 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --19 -e 'if /\xff/ =~ foo; end'
(if
  (match-with-lvasgn
    (regexp
      (str "\\xff")
      (regopt))
    (send nil :foo)) nil nil)
$ ./bin/ruby-parse --20 -e '/xff/'
(regexp
  (str "xff")
  (regopt))
$ ./bin/ruby-parse --20 -e 'if /\xff/ =~ foo; end'
Failed on: (fragment:0)
/home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `initialize': invalid multibyte escape: /\xff/ (RegexpError)
        from /home/whitequark/Work/parser/lib/parser/builders/default.rb:726:in `new'
        [snip]
```

That file wouldn't actually run under 2.0, and parser in 1.9 mode handles it just fine.
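
The same error can be reproduced in plain Ruby, without Parser. The sketch below assumes a UTF-8 source file (the default since Ruby 2.0); `compile_report` is a hypothetical helper that returns either the compiled regexp's encoding or the error message.

```ruby
# Hypothetical helper: compile a regexp body and return its encoding,
# or the error message if Regexp.new rejects it.
def compile_report(source)
  Regexp.new(source).encoding
rescue RegexpError => e
  e.message
end

utf8_body   = "\\xff"    # the four characters \xff, tagged UTF-8
binary_body = "\\xff".b  # the same characters, tagged ASCII-8BIT

# 0xFF is not a valid UTF-8 byte, so the UTF-8 body is rejected.
p compile_report(utf8_body)

# As a binary string the escape is a valid single byte.
p compile_report(binary_body)
```

Only the encoding tag on the body string differs between the two calls; the characters are identical.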
