Elpi lexer has pathological backtracking #2053

kurtmckee · 2022-01-28T13:43:59Z

The Elpi lexer has pathological backtracking that can be triggered with the following minimal input:

from pygments.lexers.elpi import ElpiLexer

list(ElpiLexer().get_tokens("a" * 30))

Only lowercase characters trigger pathological backtracking. Uppercase characters and digits will not.

The culprit is at line 62.

@gares I'm not familiar with Elpi, so if you can jump in to address this, great! Otherwise I'll be able to take a crack at it after this weekend.

gares · 2022-01-28T14:24:53Z

Thanks for pinning this down. What I'm trying to do is to make it so that, for c matched by constant_re:

c takes Text; unless
c starts with a capital, in which case it takes Name.Variable, line 61; or
c is followed by \, in which case it takes Name.Variable, line 62.

My lexer is probably silly, but I don't know how to improve it. So help is welcome.

Examples:

foobar is Text
-> is Text
Foobar is Name.Variable
foobar\ is Name.Variable
Foobar\ is Name.Variable as well (so line 62 is wrong anyway, the lookahead should be [A-Za-z_])
->\ is Text
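The rules and examples above can be restated as a small decision function. This is only a sketch of the intended classification, not the lexer's code: the function name `classify`, the `followed_by_backslash` flag, and the `[A-Za-z_]` start class (taken from the remark about the lookahead) are assumptions made for illustration.

```python
import re

# Hypothetical helper: an identifier-like constant starts with a letter or underscore.
IDENT_START = re.compile(r"[A-Za-z_]")


def classify(constant, followed_by_backslash=False):
    """Return the token type described in the comment above (illustrative only)."""
    # A constant starting with a capital letter is a variable.
    if constant[:1].isupper():
        return "Name.Variable"
    # An identifier-like constant followed by a backslash is a bound variable.
    if followed_by_backslash and IDENT_START.match(constant):
        return "Name.Variable"
    # Everything else, including symbols such as "->", stays Text.
    return "Text"


if __name__ == "__main__":
    for constant, bs in [("foobar", False), ("->", False), ("Foobar", False),
                         ("foobar", True), ("Foobar", True), ("->", True)]:
        print(constant + ("\\" if bs else ""), "is", classify(constant, bs))
```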

birkenfeld · 2022-01-28T15:01:07Z

The problem is that constant_re has an alternative {}{}* with lcase_re and idcharstarns_re, the latter of which contains repetitions as well. So in effect you have (something+)* which has catastrophic backtracking properties.
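To see the (something+)* effect in isolation, here is a minimal sketch using plain re. The character classes below are simplified stand-ins, not the lexer's real definitions, and the trailing lookahead for a backslash is an assumption modelled on the rule described earlier in the thread. The input contains no backslash, so the match has to fail, and the engine tries every way of splitting the run of letters first.

```python
import re
import time

# Simplified stand-ins for the lexer's building blocks (assumptions, not the real source).
lcase_re = r"[a-z]"
idchar_re = r"[a-zA-Z0-9_]"
idcharstarns_re = rf"(?:{idchar_re}+|\.{lcase_re}+)"

# Nested quantifiers, then a lookahead that cannot succeed on an input without a backslash.
nested = re.compile(rf"{lcase_re}{idcharstarns_re}*(?=\\)")


def time_match(pattern, text):
    """Return (match object or None, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the time for a failed match grows very quickly with n.
    for n in (12, 14, 16, 18):
        result, elapsed = time_match(nested, "a" * n)
        print(n, result, f"{elapsed:.4f}s")
```

Each extra character roughly multiplies the number of ways the run can be split between the inner + and the outer *, which is why a short input such as "a" * 30 is enough to stall the lexer.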

Can the outer * just be removed here, or at least replaced by a ? quantifier?

gares · 2022-01-28T17:33:18Z

I would say it's legit, but I would put something* then, not something+?.
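A quick way to check that flattening is harmless: the sketch below compares a nested (something+)* pattern with the flat something* form on a few inputs. Both patterns are simplified assumptions (a single character class, no namespace alternative, a backslash lookahead), not the lexer's actual rules.

```python
import re

# Assumed, simplified shapes of the two variants under discussion.
nested = re.compile(r"[a-z](?:[a-z0-9_]+)*(?=\\)")  # (something+)* shape
flat = re.compile(r"[a-z][a-z0-9_]*(?=\\)")         # something* shape


def matched_text(pattern, text):
    """Return the matched prefix, or None if the pattern does not match."""
    m = pattern.match(text)
    return m.group() if m else None


def same_result(text):
    """True when both variants accept or reject the input identically."""
    return matched_text(nested, text) == matched_text(flat, text)


if __name__ == "__main__":
    for sample in ["foo\\", "foo_1\\", "foo", "f\\", "Foo\\"]:
        print(repr(sample), same_result(sample))
    # The flat form rejects a long run without a backslash in linear time.
    print(flat.match("a" * 10000))
```

The two forms describe the same language, but the flat one gives the backtracking engine only one way to consume each character.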

I read between the lines that these regexps are not compiled to a DFA. If they were, all these expressions would collapse to the same thing. :-/

birkenfeld · 2022-01-28T18:20:20Z

No; DFA regex engines don't support features like backreferences or look-behind assertions, which Python's re does.

jeanas · 2022-02-03T11:15:49Z

idcharstarns_re reads (reformatted):

idcharstarns_re = rf"({idchar_re}+|\.{lcase_re}+)"

The relevant alternative of constant_re is

{lcase_re}{idcharstarns_re}*

This sounds like it's not matching what it intends. For example, it can match a.a5, where 5 matches idchar_re; the . works as long as there is a lowercase letter after it, but it doesn't require more. What is the intended rule here?

gares · 2022-02-03T15:57:25Z

it should NOT match foo. but should match foo.bar and foo

jeanas · 2022-02-03T16:01:44Z

OK, but should it match foo.5?

gares · 2022-02-03T16:46:45Z

no, that should be lexed as foo followed by another token .5 (a number)

gares · 2022-02-03T16:47:41Z

As in most PLs, idents can contain numbers and some other special chars, but can't begin with a number.
Namespaces are idents separated by . (which cannot occur inside an ident).
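One way to write that rule without nested repetition is "ident, then zero or more groups of dot plus ident". The sketch below is an illustration under assumptions, not necessarily what #2061 does: the identifier class is reduced to letters, digits and underscore, identifiers are assumed to start with a lowercase letter, and the names `ident`, `constant` and `lex_constant` are made up for the example.

```python
import re

# Assumed, simplified identifier: lowercase start, then letters, digits or underscore.
ident = r"[a-z][A-Za-z0-9_]*"

# A namespaced constant: idents separated by single dots.
# Each repetition must begin with ".", which cannot occur inside an ident,
# so there is only one way to split any input and no exponential backtracking.
constant = re.compile(rf"{ident}(?:\.{ident})*")


def lex_constant(text):
    """Return the longest constant at the start of text, or None."""
    m = constant.match(text)
    return m.group() if m else None


if __name__ == "__main__":
    for sample in ["foo", "foo.bar", "foo.", "foo.5", "5foo"]:
        print(repr(sample), "->", repr(lex_constant(sample)))
```

With this shape, foo. and foo.5 both yield only foo, leaving the dot (or .5) for the next token, which is the behaviour described in the replies above.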

jeanas · 2022-02-03T18:22:15Z

I've opened #2061, but I'm not very sure about it. Could you take a look, please? Thank you.

kurtmckee mentioned this issue Jan 28, 2022

Support comments in JSON #2049

Merged

birkenfeld mentioned this issue Feb 1, 2022

Remove now redundant re.UNICODE and (?u) #2058

Merged

birkenfeld added T-bug type: a bug S-major severity: major labels Feb 1, 2022

jeanas mentioned this issue Feb 3, 2022

Elpi: fix catastrophic backtracking #2061

Merged

birkenfeld closed this as completed in #2061 Feb 3, 2022

Anteru added this to the 2.12.0 milestone Feb 20, 2022
