Perl - Allow any non-whitespace character to delimit regexes #974

dblessing · 2018-08-15T15:26:00Z

In Perl, any non-whitespace character can be a regex delimiter. Obviously this doesn't allow
for that, but it does allow for a specific case seen recently where some Perl code had
a comma for a delimiter. In absence of this fix the lexer can get stuck later on other
rules.

dblessing · 2018-08-15T16:16:00Z

Before merging this I want to think deeper about how we can support any non-whitespace character. It would be far safer to follow that Perl rule so we don't get into trouble later on in the lexer.

Maybe rule /m(\S)[^\1]*?\1[egimosx]*/, re_tok?

It's a bit naive - doesn't allow for escaping the delimiter character. But I'm not even sure if you can escape all delimiters in Perl. Obviously you can when using / or other special characters. We can leave the other rules in place for both backward compatibility and to allow for the obvious characters to be escaped. This would be the final rule so I think it's OK for it to be slightly imperfect.

That said, need to consider a bit more whether there are any negative implications here.

stanhu · 2018-08-15T16:23:44Z

@dblessing Thanks! Should we add this case to the Perl example?

dblessing · 2018-08-15T16:31:21Z

@stanhu I'd like to. It's a bit difficult to reproduce without a big example. If we only provided a small string with this then we'd never catch it as the catastrophic backtracking doesn't occur. How much of the example should we paste in?

Also, I'm interested if you have any opinions on the regex in my last comment. Does it look safe?

dblessing · 2018-08-15T16:33:28Z

Hmm, that rule is still seems a little sketchy. It will assuming anything starting with m is a regex. 🤔
But maybe that's OK. It's not like Ruby where a variable name could start with an m (would have $ prefixed.

stanhu · 2018-08-15T19:05:53Z

@dblessing I agree, we should modify the rules to be able to handle any non-whitespace character. Right now if we change the delimiter to a or some random other value, we still run into the same problem.

Hmm, that rule is still seems a little sketchy. It will assuming anything starting with m is a regex.

I think that rule is sketchy. It matches way too much: http://rubular.com/r/Rryq4uMANA

I'm not sure why you have the [^\1]. What about m(\S).*\1[egimosx]* instead?
http://rubular.com/r/U0bv7zoUtk

In addition, looking at http://perldoc.perl.org/perlop.html, I see that the pattern m/PATTERN/msixpodualngc has this text:

If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.

I see that we use gimosx, so I'm wondering if we need to expand this to msixpodualngc as well.

stanhu · 2018-08-15T19:08:05Z

With Perl, it's tricky that we can't tell a variable like matesta is a match expression or a variable name.

dblessing · 2018-08-15T19:24:26Z

@stanhu Yeah, that should work. I don't think we need to worry about the rest of those regex rules in Perl because I'm thinking we should just leave the other rules in this lexer to handle those cases for now. That way 1) we don't break backward compatibility on anything (there are lots of different cases, and some are multiline and some aren't and idk why) 2) we don't have to handle escaped delimiter characters here if we're only handling this final case.

I agree we should update the egimosx to include all the options. I noticed that when I was looking at Perl docs. The lexer uses egimosx in all the other rules.

I'll update the PR.

stanhu · 2018-08-15T19:28:44Z

I also see that the regex rules are tried first before the variable rules. We may need to exclude a leading $ as well.

dblessing · 2018-08-15T19:35:00Z

👍 I'll test that case.

stanhu · 2018-08-15T19:37:02Z

~~^[^\$](\S).*\1[egimosx]* seems to work for me: http://rubular.com/r/IWh6EJouLh~~

UPDATE: Oops, I forgot the m. That doesn't work.

^[^$]?m(\S).*\1[egimosx]* seems to work: http://rubular.com/r/IWh6EJouLh

stanhu · 2018-08-15T20:04:53Z

lib/rouge/lexers/perl.rb

+
+        # Perl allows any non-whitespace character to delimit
+        # a regex when `m` is used.
+        rule %r(m(\S).*\1[msixpodualngc]*), re_tok


^[^$]?m(\S).*\1[msixpodualngc]* seems to work: http://rubular.com/r/leuEDyo2s4

lulalala · 2018-08-16T10:56:37Z

pygments was able to parse this because it took same extra steps:

https://bitbucket.org/birkenfeld/pygments-main/src/7941677dc77d4f2bf0bbd6140ade85a9454b8b80/pygments/lexers/perl.py?at=default&fileviewer=file-view-default#perl.py-526

Line 526 first catches normal name tokens, and in line 527 a brackets_callback is triggered. This method will look for the next location when the separator appears again.

Perl is more complex than I though @@ (with the m/ms/rx syntax)

dblessing · 2018-08-16T14:13:19Z

@stanhu No matter what, I couldn't get it to work with the ^[^$]?. It seems like that would be necessary. I added some examples to the sample file and it highlights correctly. Rubular says it matches - http://rubular.com/r/eTUZ1vONbS, but it doesn't in Rouge:

lexer: perl
stack: [:root]
stream: "$moduloOperation = 2"
  trying #<Rule /#.*?$/>
  trying #<Rule /^=[a-zA-Z0-9]+\s+.*?\n=cut/m>
  trying #<Rule /(?:case|continue|do|else|elsif|for|foreach|if|last|my|next|our|redo|reset|then|unless|until|while|use|print|new|BEGIN|CHECK|INIT|END|return)\b/>
  trying #<Rule /(format)(\s+)([a-zA-Z0-9_]+)(\s*)(=)(\s*\n)/>
  trying #<Rule /(?:eq|lt|gt|le|ge|ne|not|and|or|cmp)\b/>
  trying #<Rule /s\/(\\\\|\\\/|[^\/])*\/(\\\\|\\\/|[^\/])*\/[egimosx]*/>
  trying #<Rule /s!(\\\\|\\!|[^!])*!(\\\\|\\!|[^!])*![egimosx]*/>
  trying #<Rule /s\\(\\\\|[^\\])*\\(\\\\|[^\\])*\\[egimosx]*/>
  trying #<Rule /s@(\\\\|\\@|[^@])*@(\\\\|\\@|[^@])*@[egimosx]*/>
  trying #<Rule /s%(\\\\|\\%|[^%])*%(\\\\|\\%|[^%])*%[egimosx]*/>
  trying #<Rule /s{(\\\\|\\}|[^}])*}\s*/>
  trying #<Rule /s<(\\\\|\\>|[^>])*>\s*/>
  trying #<Rule /s\[(\\\\|\\\]|[^\]])*\]\s*/>
  trying #<Rule /s\((\\\\|\\\)|[^\)])*\)\s*/>
  trying #<Rule /m?\/(\\\\|\\\/|[^\/\n])*\/[gcimosx]*/>
  trying #<Rule /m(?=[\/!\\{<\[\(@%\$])/>
  trying #<Rule /m(\S).*\1[msixpodualngc]*/>
  trying #<Rule /((?<==~)|(?<=\())\s*\/(\\\\|\\\/|[^\/])*\/[msixpodualngc]*/>
  trying #<Rule /\s+/>
  trying #<Rule /(?:abs|accept|alarm|atan2|bind|binmode|bless|caller|chdir|chmod|chomp|chop|chown|chr|chroot|close|closedir|connect|continue|cos|crypt|dbmclose|dbmopen|defined|delete|die|dump|each|endgrent|endhostent|endnetent|endprotoent|endpwent|endservent|eof|eval|exec|exists|exit|exp|fcntl|fileno|flock|fork|format|formline|getc|getgrent|getgrgid|getgrnam|gethostbyaddr|gethostbyname|gethostent|getlogin|getnetbyaddr|getnetbyname|getnetent|getpeername|getpgrp|getppid|getpriority|getprotobyname|getprotobynumber|getprotoent|getpwent|getpwnam|getpwuid|getservbyname|getservbyport|getservent|getsockname|getsockopt|glob|gmtime|goto|grep|hex|import|index|int|ioctl|join|keys|kill|last|lc|lcfirst|length|link|listen|local|localtime|log|lstat|map|mkdir|msgctl|msgget|msgrcv|msgsnd|my|next|no|oct|open|opendir|ord|our|pack|package|pipe|pop|pos|printf|prototype|push|quotemeta|rand|read|readdir|readline|readlink|readpipe|recv|redo|ref|rename|require|reverse|rewinddir|rindex|rmdir|scalar|seek|seekdir|select|semctl|semget|semop|send|setgrent|sethostent|setnetent|setpgrp|setpriority|setprotoent|setpwent|setservent|setsockopt|shift|shmctl|shmget|shmread|shmwrite|shutdown|sin|sleep|socket|socketpair|sort|splice|split|sprintf|sqrt|srand|stat|study|substr|symlink|syscall|sysopen|sysread|sysseek|system|syswrite|tell|telldir|tie|tied|time|times|tr|truncate|uc|ucfirst|umask|undef|unlink|unpack|unshift|untie|utime|values|vec|wait|waitpid|wantarray|warn|write)\b/>
  trying #<Rule /((__(DATA|DIE|WARN)__)|(STD(IN|OUT|ERR)))\b/>
  trying #<Rule /<<([\'"]?)([a-zA-Z_][a-zA-Z0-9_]*)\1;?\n.*?\n\2\n/m>
  trying #<Rule /__END__\b/>
  trying #<Rule /\$\^[ADEFHILMOPSTWX]/>
  trying #<Rule /\$[\\"'\[\]&`+*.,;=%~?@$!<>(^\|\/-](?!\w)/>
  trying #<Rule /[-+\/*%=<>&^\|!\\~]=?/>
  trying #<Rule /[$@%#]+/>
    got "$"
    yielding Name.Variable, "$"
    pushing :varname
lexer: perl
stack: [:root, :varname]
stream: "moduloOperation = 2;"
  trying #<Rule /\s+/>
  trying #<Rule /\{/>
  trying #<Rule /\)|,/>
  entering mixin name_common
  trying #<Rule /\w+::/>
  trying #<Rule /[\w:]+/>
    got "moduloOperation"
    yielding Name.Variable, "moduloOperation"
    popping stack: 1

dblessing · 2018-08-16T14:17:50Z

Ah, of course. It's because the rule doesn't match with the $ at the beginning so it keeps going. Then, the varname rule catches it and pushes the varname state. It means the regex rule doesn't ever get a chance to eval on the moduloOperation piece again.

Cool 👍

stanhu · 2018-08-16T15:15:25Z

@dblessing I'm a little confused by your last comment. Is ^[^$] necessary or not?

dblessing · 2018-08-16T15:26:36Z

@stanhu No, it's not required, and actually causes things to break. The reason it's not required is because of what I said in my last comment.

stanhu · 2018-08-16T16:04:30Z

@dblessing Great, thanks.

Perl allows regexes to be delimited by any non-whitespace character when `m` is used. The new rule covers that case. Previously only certain special characters were recognized.

dblessing · 2018-08-16T18:13:06Z

Released as 3.2.1

This reverts commit 27a4bfb, reversing changes made to 7f027b1.

The previous rules for strings used a regular expression that combined different elements of the string. This could, in a pathological case, cause Rouge to get stuck trying to lex the code. This problem can be fixed by lexing a quoted string in a separate state to `:root`. In addition to adding states for quoted string, this commit also adds support for basic string interpolation. The basic code is lifted from the Ruby lexer. This fixes rouge-ruby#974.

The previous rules for strings used a regular expression that combined different elements of the string. This could, in a pathological case, cause Rouge to get stuck trying to lex the code. This problem can be fixed by lexing a quoted string in a separate state to `:root`. In addition to adding states for quoted string, this commit also adds support for basic string interpolation. The basic code is lifted from the Ruby lexer. This fixes #974.

dblessing force-pushed the dblessing-patch-3 branch from e5b00db to 7b76cf7 Compare August 15, 2018 19:35

stanhu reviewed Aug 15, 2018

View reviewed changes

dblessing force-pushed the dblessing-patch-3 branch from 7b76cf7 to e304731 Compare August 16, 2018 14:09

dblessing force-pushed the dblessing-patch-3 branch from e304731 to f4e6051 Compare August 16, 2018 17:55

Adds a new rule covering Perl Regexes delimited by non-whitespace

aa42510

Perl allows regexes to be delimited by any non-whitespace character when `m` is used. The new rule covers that case. Previously only certain special characters were recognized.

dblessing force-pushed the dblessing-patch-3 branch from f4e6051 to aa42510 Compare August 16, 2018 17:57

dblessing changed the title ~~Add comma as a regex delimiter~~ Perl - Allow any non-whitespace character to delimit regexes Aug 16, 2018

Release 3.2.1

3ba03cc

dblessing merged commit 27a4bfb into master Aug 16, 2018

miparnisari mentioned this pull request Nov 3, 2018

Perl highlighter broken for object with many keys #981

Closed

zoidyzoidzoid added a commit to zoidyzoidzoid/rouge that referenced this pull request Nov 23, 2018

Revert "Merge pull request rouge-ruby#974 from jneen/dblessing-patch-3"

da57450

This reverts commit 27a4bfb, reversing changes made to 7f027b1.

pyrmont mentioned this pull request Jun 9, 2019

Fix backtracking issues, add string interpolation in Perl lexer #1161

Merged

tancnle deleted the dblessing-patch-3 branch September 22, 2021 12:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perl - Allow any non-whitespace character to delimit regexes #974

Perl - Allow any non-whitespace character to delimit regexes #974

dblessing commented Aug 15, 2018

dblessing commented Aug 15, 2018 •

edited

Loading

stanhu commented Aug 15, 2018

dblessing commented Aug 15, 2018

dblessing commented Aug 15, 2018 •

edited

Loading

stanhu commented Aug 15, 2018

stanhu commented Aug 15, 2018

dblessing commented Aug 15, 2018

stanhu commented Aug 15, 2018 •

edited

Loading

dblessing commented Aug 15, 2018

stanhu commented Aug 15, 2018 •

edited

Loading

stanhu Aug 15, 2018

lulalala commented Aug 16, 2018

dblessing commented Aug 16, 2018

dblessing commented Aug 16, 2018

stanhu commented Aug 16, 2018

dblessing commented Aug 16, 2018

stanhu commented Aug 16, 2018

dblessing commented Aug 16, 2018

Perl - Allow any non-whitespace character to delimit regexes #974

Perl - Allow any non-whitespace character to delimit regexes #974

Conversation

dblessing commented Aug 15, 2018

dblessing commented Aug 15, 2018 • edited Loading

stanhu commented Aug 15, 2018

dblessing commented Aug 15, 2018

dblessing commented Aug 15, 2018 • edited Loading

stanhu commented Aug 15, 2018

stanhu commented Aug 15, 2018

dblessing commented Aug 15, 2018

stanhu commented Aug 15, 2018 • edited Loading

dblessing commented Aug 15, 2018

stanhu commented Aug 15, 2018 • edited Loading

stanhu Aug 15, 2018

Choose a reason for hiding this comment

lulalala commented Aug 16, 2018

dblessing commented Aug 16, 2018

dblessing commented Aug 16, 2018

stanhu commented Aug 16, 2018

dblessing commented Aug 16, 2018

stanhu commented Aug 16, 2018

dblessing commented Aug 16, 2018

dblessing commented Aug 15, 2018 •

edited

Loading

dblessing commented Aug 15, 2018 •

edited

Loading

stanhu commented Aug 15, 2018 •

edited

Loading

stanhu commented Aug 15, 2018 •

edited

Loading