Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl highlighter broken for object with many keys #981

Closed
miparnisari opened this issue Sep 6, 2018 · 19 comments · Fixed by #1160
Closed

Perl highlighter broken for object with many keys #981

miparnisari opened this issue Sep 6, 2018 · 19 comments · Fixed by #1160

Comments

@miparnisari
Copy link
Contributor

perl_bug

@miparnisari
Copy link
Contributor Author

miparnisari commented Sep 8, 2018

Ugh it's because it is interpreting the last sentence as a regex.

# Perl allows any non-whitespace character to delimit a regex when `m` is used.
rule %r(m(\S).*\1[msixpodualngc]*), re_tok

@dblessing
Copy link
Collaborator

dblessing commented Sep 8, 2018 via email

@miparnisari
Copy link
Contributor Author

This was introduced by #974

@zoidyzoidzoid
Copy link
Contributor

Yeah, reverting that PR fixes it. Let me know if there's anyway I can help fix it.

@miparnisari
Copy link
Contributor Author

I propose that we take a practical approach - who uses letters as delimiters? Nobody in their right mind. So, let's only allow symbols.

I.e. instead of having

rule %r(m(\S).*\1[msixpodualngc]*), re_tok

We could have

rule %r(m([\/\\!\{\}\(\)@<>,;%&]).*\1[msixpodualngc]*), re_tok

@dblessing
Copy link
Collaborator

dblessing commented Nov 24, 2018 via email

@miparnisari
Copy link
Contributor Author

miparnisari commented Nov 25, 2018

I see. What if we move this section towards the very end of the root?

        # Perl allows any non-whitespace character to delimit
        # a regex when `m` is used.
        rule %r(m(\S).*\1[msixpodualngc]*), re_tok
        rule %r(((?<==~)|(?<=\())\s*/(\\\\|\\/|[^/])*/[msixpodualngc]*),
          re_tok, :balanced_regex

I just tried this and it fixes the problem.

EDIT: ah no wait it doesn't, it shows the regexes incorrectly...

@zoidyzoidzoid
Copy link
Contributor

So the current bug with incorrect syntax highlighting is super minor compared to the backtrack error, but it's affecting us at Booking.com using GitLab too.

I agree that avoiding all kinds of catastrophic backtracks would be great though.

I think @miparnisari's suggestion is great, and I've locally done the following

-        rule %r(m(\S).*\1[msixpodualngc]*), re_tok
+        rule %r(m([^A-Za-z_]).*\1[msixpodualngc]*), re_tok

which fixes the following example

my $eg => {
    hello => "world",
    mo => "money"
}

zoidyzoidzoid added a commit to zoidyzoidzoid/rouge that referenced this issue Nov 27, 2018
Related fun reading: https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

> With the m you can use any pair of non-whitespace (ASCII) characters as
> delimiters. This is particularly useful for matching path names that
> contain "/" , to avoid LTS (leaning toothpick syndrome). If "?" is the
> delimiter, then a match-only-once rule applies, described in m?PATTERN?
> below. If "'" (single quote) is the delimiter, no variable interpolation is
> performed on the PATTERN. When using a delimiter character valid in an
> identifier, whitespace is required after the m.

I tried to find some code in the Perl source code to understand how
they're differentiated from, but didn't have much luck.

https://metacpan.org/pod/PPI#Background

I'll add some local samples for the Perl code though.

Fixes: rouge-ruby#981

Signed-off-by: William Stewart <zoidbergwill@gmail.com>
@miparnisari
Copy link
Contributor Author

I don't see the point of moving these rules around if they break syntax highlighting for regexes. The whole point of those rules is to highlight regexes! :P

@zoidyzoidzoid
Copy link
Contributor

So my suggestion, based on your first suggestion, still highlights regexes, and doesn't move things around too much.

I still need to see how badly it fails if someone uses a really strange regex delimiter with a file as big as the one that caused the Gitlab issue, like a letter someone can use for a variable, which I think is what it's trying to highlight in your original version.

@henrebotha
Copy link

I'd much rather have no regex highlighting than have broken hash highlighting.

@labster
Copy link
Contributor

labster commented Jun 8, 2019

As far as I can tell, the current rule matches all sorts of things are interpreted as starting a regex:

  • The built-in functions map and mkdir
  • A method call, with the method name beginning with the letter m
  • Any hash key written as a bareword that begins with the letter m
  • Any bareword in any context beginning with the letter m, which are definitely not regular expressions interpreted in a Perl compiler.

While Perl does allow any characters to delimit regexes, notice this from perlop:

When using a character valid in an identifier, whitespace is required after the m.

For here:
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
So maybe you meant to say:

rule %r(m\s*([^\w\s]).*\1[msixpodualngc]*), re_tok
rule %r(m\s+(\w).*\1[msixpodualngc]*), re_tok

And in fact, the whitespace is optional on all of the quote-like operators, like q, qq, qw, qx (quoting), tr and y(transliteration), and s (substitution) as well as m (match), so adding \s* should apply to other lines like...
rule /(q|qq|qw|qr|qx)([^a-zA-Z0-9])(.|\n)*?\2/, Str::Other
rule %r(s/(\\\\|\\/|[^/])*/(\\\\|\\/|[^/])*/[msixpodualngc]*), re_tok
And the non-existent transliteration support.

When working on long legacy code files, the current state of highlighting is much more likely to be wrong than correct as errors tend to pile up or misquote hundreds of lines. The current state here provides so much more noise than signal, that removing Perl highlighting support entirely would make my code review work easier. But I think it's fixable.

@pyrmont
Copy link
Contributor

pyrmont commented Jun 8, 2019

Hi all :) I wanted to apologise for having this be outstanding for so long. We've recently got things moving again on the project but have generally been focusing on PRs. I'm afraid I hadn't noticed this issue until @labster's comment.

I will say the current behaviour sounds pretty bad. I'll have a look myself and try to report back ASAP on what approach looks best to take. Thanks for all the points raised so far; it's all been helpful for me in understanding better what's gone wrong.

@jneen
Copy link
Member

jneen commented Jun 8, 2019

I should also note that in a lexer pretty much every + or * should be made non-greedy (+? or *?), since it's very rare that we want to look all the way to the end of a file for the ending delimiter.

pyrmont added a commit to pyrmont/rouge that referenced this issue Jun 9, 2019
Perl allows arbitrary non-whitespace delimiters in regular expressions.
This commit fixes the rule to capture these regular expressions so that
it does not capture identifiers, hash keys and other elements of syntax
that begin with 'm' but are not regular expressions. It fixes rouge-ruby#981.
@pyrmont
Copy link
Contributor

pyrmont commented Jun 9, 2019

OK, there's a lot going on here.

TLDR: I basically think @labster's suggestion is correct and it's been submitted as PR #1160.


Problem 1: Regular expressions with arbitrary delimiters

As has been discussed, Perl allows for regular expressions to be defined with arbitrary non-whitespace delimiters. The initial fix for this led to broken behaviour where it would capture certain elements of syntax that began with m and were immediately followed by a word character (such as the example given in the OP).

One way of addressing this problem is to not extend Rouge's syntax highlighting to cover all these possible expressions. In particular, Rouge could only cover delimiters that are not word characters. This has the virtue of simplicity. Given how rarely these types of regular expressions are used, this seems like it might be a worthwhile trade-off. Chroma, the popular syntax highlighter written in Go, takes this approach.

However, this is really an admission of defeat and, moreover, an unnecessary one. The difficulty of lexing this kind of construct is being able to distinguish between regular expressions and other syntax. But this concern is based on a misunderstanding of how these regular expressions work in Perl.

As @labster noted, delimiters that are valid characters for an identifier must be separated by whitespace from the m character that begins a regular expression. The problem can be solved by having two rules: one for delimiters that are valid characters in an identifier and one for delimiters that aren't.

PR #1160 follows @labster's suggestion. It modifies that code to allow for backslash escape characters, to use non-greedy matches and to match across lines.

Problem 2: Rouge hanging on parsing backtick-quoted strings

As @dblessing noted, the impetus for the original fix was a report made to GitLab about source code that was causing Rouge to hang indefinitely. Honestly, that discussion is quite long and so I might be wrong about this but it looked to me like the analysis misunderstood what was causing the problem.

The file that caused Rouge to hang was scripts.pm, a file that featured a regular expression delimited by a ,. The problem is not actually the lack of regular expression support. This makes the problem visible but the issue is that the Rouge rule for backticks can, in a pathological case, cause Ruby to hang.

As best as I can determine, this problem is caused by the combination of character classes in the rule for backtick-quoted strings being mashed together in a single regular expression. The problem can be solved by using a separate state for these strings. This is common in other lexers but is not the approach that the Perl lexer takes (neither does the one in Pygments on which Rouge's is based).

This problem is not merely theoretical. If the PR proposed above is accepted, a pathological case can be fed through to Rouge by removing the space between a regular expression and a character that can be part of a valid identifier (eg. ma).

For this reason, PR #1161 has been submitted that fixes this problem. I have not been able to craft a pathological case for single-quoted strings and double-quotes strings but I would presume they can be similarly affected.

This also raises the question of whether any rule that tries to capture an expression using delimiters (whether a regular expression or a quoted string) would suffer from this issue. I'm not sure. I worry that it would but if anyone else is confident about the answer to this, please let us know.

Conclusion

I have submitted PR #1160 to address the issue identified by @miparnisari and #1161 to provide a more comprehensive fix for #981. These PRs are intended to work, and be accepted, together.

@labster
Copy link
Contributor

labster commented Jun 9, 2019

Thanks for the quick response, at least from my point of view. I'd have proposed the PR myself, but I still need to learn Ruby, and I'm not quite up for that level of yak shaving today. I think that in general, there are still improvements to be made to the Perl lexer, which is still about half the length of the Ruby lexer. I may try out some small PRs there in the future, or maybe even attempt a Perl 6 lexer. But great progress for now, thanks!

pyrmont added a commit that referenced this issue Jun 9, 2019
Perl allows arbitrary non-whitespace delimiters in regular expressions.
This commit fixes the rule to capture these regular expressions so that
it does not capture identifiers, hash keys and other elements of syntax
that begin with 'm' but are not regular expressions. It fixes #981.
@pyrmont
Copy link
Contributor

pyrmont commented Jun 9, 2019

@labster Yeah, definitely think Perl doesn't get much love when it comes to a lot of the syntax highlighting libraries. Hopefully we can make it better :)

My knowledge of Perl is undoubtedly worse than your knowledge of Ruby but please feel free to suggest any changes you think of!

@zoidyzoidzoid
Copy link
Contributor

Thank you so much for the fixes, folks. So excited to see this issue closed. 🙏

Can we get a release, so I can make an MR to GitLab to upgrade their version of rouge?

@pyrmont
Copy link
Contributor

pyrmont commented Jun 11, 2019

@zoidbergwill New release just went out :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants