Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl - Allow any non-whitespace character to delimit regexes #974

Merged
merged 2 commits into from
Aug 16, 2018

Conversation

dblessing
Copy link
Collaborator

In Perl, any non-whitespace character can be a regex delimiter. Obviously this doesn't allow
for that, but it does allow for a specific case seen recently where some Perl code had
a comma for a delimiter. In absence of this fix the lexer can get stuck later on other
rules.

@dblessing
Copy link
Collaborator Author

dblessing commented Aug 15, 2018

Before merging this I want to think deeper about how we can support any non-whitespace character. It would be far safer to follow that Perl rule so we don't get into trouble later on in the lexer.

Maybe rule /m(\S)[^\1]*?\1[egimosx]*/, re_tok?

It's a bit naive - doesn't allow for escaping the delimiter character. But I'm not even sure if you can escape all delimiters in Perl. Obviously you can when using / or other special characters. We can leave the other rules in place for both backward compatibility and to allow for the obvious characters to be escaped. This would be the final rule so I think it's OK for it to be slightly imperfect.

That said, need to consider a bit more whether there are any negative implications here.

@stanhu
Copy link
Contributor

stanhu commented Aug 15, 2018

@dblessing Thanks! Should we add this case to the Perl example?

@dblessing
Copy link
Collaborator Author

@stanhu I'd like to. It's a bit difficult to reproduce without a big example. If we only provided a small string with this then we'd never catch it as the catastrophic backtracking doesn't occur. How much of the example should we paste in?

Also, I'm interested if you have any opinions on the regex in my last comment. Does it look safe?

@dblessing
Copy link
Collaborator Author

dblessing commented Aug 15, 2018

Hmm, that rule is still seems a little sketchy. It will assuming anything starting with m is a regex. 🤔
But maybe that's OK. It's not like Ruby where a variable name could start with an m (would have $ prefixed.

@stanhu
Copy link
Contributor

stanhu commented Aug 15, 2018

@dblessing I agree, we should modify the rules to be able to handle any non-whitespace character. Right now if we change the delimiter to a or some random other value, we still run into the same problem.

Hmm, that rule is still seems a little sketchy. It will assuming anything starting with m is a regex.

I think that rule is sketchy. It matches way too much: http://rubular.com/r/Rryq4uMANA

I'm not sure why you have the [^\1]. What about m(\S).*\1[egimosx]* instead?
http://rubular.com/r/U0bv7zoUtk

In addition, looking at http://perldoc.perl.org/perlop.html, I see that the pattern m/PATTERN/msixpodualngc has this text:

If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.

I see that we use gimosx, so I'm wondering if we need to expand this to msixpodualngc as well.

@stanhu
Copy link
Contributor

stanhu commented Aug 15, 2018

With Perl, it's tricky that we can't tell a variable like matesta is a match expression or a variable name.

@dblessing
Copy link
Collaborator Author

@stanhu Yeah, that should work. I don't think we need to worry about the rest of those regex rules in Perl because I'm thinking we should just leave the other rules in this lexer to handle those cases for now. That way 1) we don't break backward compatibility on anything (there are lots of different cases, and some are multiline and some aren't and idk why) 2) we don't have to handle escaped delimiter characters here if we're only handling this final case.

I agree we should update the egimosx to include all the options. I noticed that when I was looking at Perl docs. The lexer uses egimosx in all the other rules.

I'll update the PR.

@stanhu
Copy link
Contributor

stanhu commented Aug 15, 2018

I also see that the regex rules are tried first before the variable rules. We may need to exclude a leading $ as well.

@dblessing
Copy link
Collaborator Author

👍 I'll test that case.

@stanhu
Copy link
Contributor

stanhu commented Aug 15, 2018

^[^\$](\S).*\1[egimosx]* seems to work for me: http://rubular.com/r/IWh6EJouLh

UPDATE: Oops, I forgot the m. That doesn't work.

^[^$]?m(\S).*\1[egimosx]* seems to work: http://rubular.com/r/IWh6EJouLh


# Perl allows any non-whitespace character to delimit
# a regex when `m` is used.
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^[^$]?m(\S).*\1[msixpodualngc]* seems to work: http://rubular.com/r/leuEDyo2s4

@lulalala
Copy link

pygments was able to parse this because it took same extra steps:

https://bitbucket.org/birkenfeld/pygments-main/src/7941677dc77d4f2bf0bbd6140ade85a9454b8b80/pygments/lexers/perl.py?at=default&fileviewer=file-view-default#perl.py-526

Line 526 first catches normal name tokens, and in line 527 a brackets_callback is triggered. This method will look for the next location when the separator appears again.

Perl is more complex than I though @@ (with the m/ms/rx syntax)

@dblessing
Copy link
Collaborator Author

@stanhu No matter what, I couldn't get it to work with the ^[^$]?. It seems like that would be necessary. I added some examples to the sample file and it highlights correctly. Rubular says it matches - http://rubular.com/r/eTUZ1vONbS, but it doesn't in Rouge:

lexer: perl
stack: [:root]
stream: "$moduloOperation = 2"
  trying #<Rule /#.*?$/>
  trying #<Rule /^=[a-zA-Z0-9]+\s+.*?\n=cut/m>
  trying #<Rule /(?:case|continue|do|else|elsif|for|foreach|if|last|my|next|our|redo|reset|then|unless|until|while|use|print|new|BEGIN|CHECK|INIT|END|return)\b/>
  trying #<Rule /(format)(\s+)([a-zA-Z0-9_]+)(\s*)(=)(\s*\n)/>
  trying #<Rule /(?:eq|lt|gt|le|ge|ne|not|and|or|cmp)\b/>
  trying #<Rule /s\/(\\\\|\\\/|[^\/])*\/(\\\\|\\\/|[^\/])*\/[egimosx]*/>
  trying #<Rule /s!(\\\\|\\!|[^!])*!(\\\\|\\!|[^!])*![egimosx]*/>
  trying #<Rule /s\\(\\\\|[^\\])*\\(\\\\|[^\\])*\\[egimosx]*/>
  trying #<Rule /s@(\\\\|\\@|[^@])*@(\\\\|\\@|[^@])*@[egimosx]*/>
  trying #<Rule /s%(\\\\|\\%|[^%])*%(\\\\|\\%|[^%])*%[egimosx]*/>
  trying #<Rule /s{(\\\\|\\}|[^}])*}\s*/>
  trying #<Rule /s<(\\\\|\\>|[^>])*>\s*/>
  trying #<Rule /s\[(\\\\|\\\]|[^\]])*\]\s*/>
  trying #<Rule /s\((\\\\|\\\)|[^\)])*\)\s*/>
  trying #<Rule /m?\/(\\\\|\\\/|[^\/\n])*\/[gcimosx]*/>
  trying #<Rule /m(?=[\/!\\{<\[\(@%\$])/>
  trying #<Rule /m(\S).*\1[msixpodualngc]*/>
  trying #<Rule /((?<==~)|(?<=\())\s*\/(\\\\|\\\/|[^\/])*\/[msixpodualngc]*/>
  trying #<Rule /\s+/>
  trying #<Rule /(?:abs|accept|alarm|atan2|bind|binmode|bless|caller|chdir|chmod|chomp|chop|chown|chr|chroot|close|closedir|connect|continue|cos|crypt|dbmclose|dbmopen|defined|delete|die|dump|each|endgrent|endhostent|endnetent|endprotoent|endpwent|endservent|eof|eval|exec|exists|exit|exp|fcntl|fileno|flock|fork|format|formline|getc|getgrent|getgrgid|getgrnam|gethostbyaddr|gethostbyname|gethostent|getlogin|getnetbyaddr|getnetbyname|getnetent|getpeername|getpgrp|getppid|getpriority|getprotobyname|getprotobynumber|getprotoent|getpwent|getpwnam|getpwuid|getservbyname|getservbyport|getservent|getsockname|getsockopt|glob|gmtime|goto|grep|hex|import|index|int|ioctl|join|keys|kill|last|lc|lcfirst|length|link|listen|local|localtime|log|lstat|map|mkdir|msgctl|msgget|msgrcv|msgsnd|my|next|no|oct|open|opendir|ord|our|pack|package|pipe|pop|pos|printf|prototype|push|quotemeta|rand|read|readdir|readline|readlink|readpipe|recv|redo|ref|rename|require|reverse|rewinddir|rindex|rmdir|scalar|seek|seekdir|select|semctl|semget|semop|send|setgrent|sethostent|setnetent|setpgrp|setpriority|setprotoent|setpwent|setservent|setsockopt|shift|shmctl|shmget|shmread|shmwrite|shutdown|sin|sleep|socket|socketpair|sort|splice|split|sprintf|sqrt|srand|stat|study|substr|symlink|syscall|sysopen|sysread|sysseek|system|syswrite|tell|telldir|tie|tied|time|times|tr|truncate|uc|ucfirst|umask|undef|unlink|unpack|unshift|untie|utime|values|vec|wait|waitpid|wantarray|warn|write)\b/>
  trying #<Rule /((__(DATA|DIE|WARN)__)|(STD(IN|OUT|ERR)))\b/>
  trying #<Rule /<<([\'"]?)([a-zA-Z_][a-zA-Z0-9_]*)\1;?\n.*?\n\2\n/m>
  trying #<Rule /__END__\b/>
  trying #<Rule /\$\^[ADEFHILMOPSTWX]/>
  trying #<Rule /\$[\\"'\[\]&`+*.,;=%~?@$!<>(^\|\/-](?!\w)/>
  trying #<Rule /[-+\/*%=<>&^\|!\\~]=?/>
  trying #<Rule /[$@%#]+/>
    got "$"
    yielding Name.Variable, "$"
    pushing :varname
lexer: perl
stack: [:root, :varname]
stream: "moduloOperation = 2;"
  trying #<Rule /\s+/>
  trying #<Rule /\{/>
  trying #<Rule /\)|,/>
  entering mixin name_common
  trying #<Rule /\w+::/>
  trying #<Rule /[\w:]+/>
    got "moduloOperation"
    yielding Name.Variable, "moduloOperation"
    popping stack: 1

screen shot 2018-08-16 at 9 09 20 am

@dblessing
Copy link
Collaborator Author

Ah, of course. It's because the rule doesn't match with the $ at the beginning so it keeps going. Then, the varname rule catches it and pushes the varname state. It means the regex rule doesn't ever get a chance to eval on the moduloOperation piece again.

Cool 👍

@stanhu
Copy link
Contributor

stanhu commented Aug 16, 2018

@dblessing I'm a little confused by your last comment. Is ^[^$] necessary or not?

@dblessing
Copy link
Collaborator Author

@stanhu No, it's not required, and actually causes things to break. The reason it's not required is because of what I said in my last comment.

@stanhu
Copy link
Contributor

stanhu commented Aug 16, 2018

@dblessing Great, thanks.

Perl allows regexes to be delimited by any non-whitespace
character when `m` is used. The new rule covers that case. Previously
only certain special characters were recognized.
@dblessing dblessing changed the title Add comma as a regex delimiter Perl - Allow any non-whitespace character to delimit regexes Aug 16, 2018
@dblessing dblessing merged commit 27a4bfb into master Aug 16, 2018
@dblessing
Copy link
Collaborator Author

Released as 3.2.1

zoidyzoidzoid added a commit to zoidyzoidzoid/rouge that referenced this pull request Nov 23, 2018
pyrmont added a commit to pyrmont/rouge that referenced this pull request Jun 9, 2019
The previous rules for strings used a regular expression that combined
different elements of the string. This could, in a pathological case,
cause Rouge to get stuck trying to lex the code. This problem can be
fixed by lexing a quoted string in a separate state to `:root`.

In addition to adding states for quoted string, this commit also adds
support for basic string interpolation. The basic code is lifted from
the Ruby lexer. This fixes rouge-ruby#974.
pyrmont added a commit that referenced this pull request Jun 9, 2019
The previous rules for strings used a regular expression that combined
different elements of the string. This could, in a pathological case,
cause Rouge to get stuck trying to lex the code. This problem can be
fixed by lexing a quoted string in a separate state to `:root`.

In addition to adding states for quoted string, this commit also adds
support for basic string interpolation. The basic code is lifted from
the Ruby lexer. This fixes #974.
@tancnle tancnle deleted the dblessing-patch-3 branch September 22, 2021 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants