Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elpi lexer has pathological backtracking #2053

Closed
kurtmckee opened this issue Jan 28, 2022 · 10 comments · Fixed by #2061
Closed

Elpi lexer has pathological backtracking #2053

kurtmckee opened this issue Jan 28, 2022 · 10 comments · Fixed by #2061
Labels
S-major severity: major T-bug type: a bug
Milestone

Comments

@kurtmckee
Copy link
Contributor

The Elpi lexer has pathological backtracking that can be triggered with the following minimal input:

from pygments.lexers.elpi import ElpiLexer

list(ElpiLexer().get_tokens("a" * 30))

Only lowercase characters trigger pathological backtracking. Uppercase characters and digits will not.

The culprit is at line 62.

@gares I'm not familiar with Elpi, so if you can jump in to address this, great! Otherwise I'll be able to take a crack at it after this weekend.

@gares
Copy link
Contributor

gares commented Jan 28, 2022

Thanks for pinning this down. What I'm trying to do is to make it so that, for c matched by constant_re:

  • c takes Text; unless
  • c starts with a capital, in which case it takes Name.Variable, line 61; or
  • c is followed by \, in which case it takes Name.Variable, line 62.

My lexer is probably silly, but I don't know how to improve it. So help is welcome.

Examples:

  • foobar is Text
  • -> is Text
  • Foobar is Name.Variable
  • foobar\ is Name.Variable
  • Foobar\ is Name.Variable as well (so line 62 is anyway wrong, the lookahead should be [A-Za-z_])
  • ->\ is Text

@birkenfeld
Copy link
Member

The problem is that constant_re has an alternative {}{}* with lcase_re and idcharstarns_re, the latter of which contains repetitions as well. So in effect you have (something+)* which has catastrophic backtracking properties.

Can the outer * be just removed here, or at least replaced by a ??

@gares
Copy link
Contributor

gares commented Jan 28, 2022

I would say it's legit, but I would put something* then, not something+?.

I read between the lines that these regexp are not compiled to an DFA. If they were all these expressions would collapse to the same. :-/

@birkenfeld
Copy link
Member

No; DFA regex engines don't support features like backreferences or look-behind assertions, which Python's re does.

@jeanas
Copy link
Contributor

jeanas commented Feb 3, 2022

idcharstarns_re reads (reformatted):

idcharstarns_re = rf"({idchar_re}+|\.{lcase_re}+)"

The relevant alternative of constant_re is

{lcase_re}{idcharstarns_re}*

This sounds like it's not matching what it intends. For example, it can match “a.a5 where 5 matches idchar_re; the . works as long as there is a lowercase letter after it but it doesn't require more. What is the intended rule here?

@gares
Copy link
Contributor

gares commented Feb 3, 2022

it should NOT match foo. but should match foo.bar and foo

@jeanas
Copy link
Contributor

jeanas commented Feb 3, 2022

OK, but should it match foo.5?

@gares
Copy link
Contributor

gares commented Feb 3, 2022

no, that should be lexed as foo followed by another token .5 (a number)

@gares
Copy link
Contributor

gares commented Feb 3, 2022

As is most PL, idents can contain number and some other special chars, but can't begin with a number.
Namespaces are idents separated by . (which cannot occur inside an ident).

@jeanas
Copy link
Contributor

jeanas commented Feb 3, 2022

I've opened #2061, but I'm not very sure about it. Could you take a look, please? Thank you.

@Anteru Anteru added this to the 2.12.0 milestone Feb 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-major severity: major T-bug type: a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants