
Better Lexer.analyse_text #2005

Closed

Conversation

not-my-profile
Contributor

While playing with the new demo I introduced in #1999 I noticed that lexer guessing based solely on text leaves much to be desired. It works fine for clear-cut cases like shebangs and DOCTYPEs, but when there isn't such a dead giveaway and you actually have to guess, it's very often wrong.

pygments.lexers.guess_lexer(text) is currently powered by Lexer.analyse_text(text), which returns a float from 0 to 1 indicating how confident the lexer is that the text is in its language (1 meaning it's certainly the right lexer and 0 meaning it's certainly not).
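To make that contract concrete, here is a minimal self-contained sketch of the kind of ranking guess_lexer performs. The helper names are mine and this is a simplification, not Pygments' actual implementation:

```python
import re

# Hypothetical analysers mimicking the analyse_text contract:
# each returns a float in [0, 1], and the highest score wins.
def analyse_c(text):
    return 0.1 if re.search(r'^\s*#include [<"]', text, re.MULTILINE) else 0.0

def analyse_python(text):
    return 0.5 if 'import ' in text else 0.0

def toy_guess(text, analysers):
    # Every lexer scores the text independently; positions are never consulted.
    return max(analysers, key=lambda name: analysers[name](text))

print(toy_guess('#include <stdio.h>', {'C': analyse_c, 'Python': analyse_python}))  # → C
```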

IMO the clear problem with this is that most patterns can occur anywhere in the text, and the position at which they occur currently isn't used for lexer ranking. To illustrate this with an example:

  • #include <stdio.h> is correctly recognized as C
  • <test></test> is correctly recognized as XML
  • The following is also correctly recognized as XML:
    <test>
    #include <stdio.h>
    </test>
  • The following however is wrongly recognized as XML:
    #include <stdio.h>
    
    void main(){
            printf("<test></test>");
    }
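The position information that would disambiguate the last two snippets is already available from the matches themselves. A quick self-contained check (the regex is the one from CLexer quoted below; the variable names are mine):

```python
import re

# Both snippets contain an #include and a <test> tag; only the
# position of the first match tells them apart.
c_pattern = re.compile(r'^\s*#include [<"]', re.MULTILINE)

xml_wrapped = '<test>\n#include <stdio.h>\n</test>'
c_source = '#include <stdio.h>\n\nvoid main(){\n    printf("<test></test>");\n}'

print(c_pattern.search(xml_wrapped).start())  # 7: only on the second line
print(c_pattern.search(c_source).start())     # 0: the very first thing in the file
print(xml_wrapped.index('<test>'))            # 0: the tag opens the file
print(c_source.index('<test>'))               # 45: the tag is buried in a string literal
```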

That is because the code currently looks like:

class CLexer(CFamilyLexer):
    def analyse_text(text):
        if re.search(r'^\s*#include [<"]', text, re.MULTILINE):
            return 0.1

class XmlLexer(RegexLexer):
    def analyse_text(text):
        if looks_like_xml(text):
            return 0.45  # less than HTML

Since the analyse_text API doesn't propagate the positions of the matches, only one of the last two examples can ever be guessed correctly.

Please note that C and XML were only examples; you can have the same problem with pretty much any pair of Pygments' >500 languages. Another example would be:

  • import sqlite3
    # ...
    cur.execute("INSERT INTO items (text) VALUES (?)", (text,))
    which should be recognized as Python, versus
  • INSERT INTO items (text) VALUES ("import sqlite")
    which should be recognized as SQL.

So my idea is to improve the language guessing in Pygments by deprecating Lexer.analyse_text in favor of a new method Lexer.recognize_text that yields the positions of matches so that they can be factored into the scoring.

To give you two examples of how analyse_text would be converted to recognize_text:

# before:
def analyse_text(text):
    if re.search(r'^\s*#include [<"]', text, re.MULTILINE):
        return 0.1
# after:
def recognize_text(text):
    # yields (match, score) pairs; the match is None when the pattern is absent
    yield re.search(r'^\s*#include [<"]', text, re.MULTILINE), 0.1
# before:
def analyse_text(text):
    rv = 0
    if 'import ' in text:
        rv += 0.5
    if 'print ' in text:
        rv += 0.3
    return rv
# after:
def recognize_text(text):
    # str.find returns -1 when the substring is absent, signalling "no match"
    yield text.find('import '), 0.5
    yield text.find('print '), 0.3

Since there are currently about 150 analyse_text implementations, converting them all at once isn't feasible. This draft PR demonstrates how we could introduce recognize_text in a backwards-compatible manner, allowing us to convert the analyse_text methods one by one, improving lexer guessing incrementally.
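For illustration, here is one shape such a compatibility shim could take: a toy base class where the old analyse_text is derived from the new recognize_text. This is only a sketch under my own assumptions, not necessarily what the draft PR does:

```python
import re

class Lexer:
    """Toy base class sketching one possible compatibility shim."""

    def recognize_text(self, text):
        # New-style hook: subclasses yield (position, score) pairs.
        return iter(())

    def analyse_text(self, text):
        # Old entry point, kept working: collapse the positional matches
        # back into a single 0..1 float so existing callers are unaffected.
        best = 0.0
        for pos, score in self.recognize_text(text):
            if pos is not None and pos >= 0:
                best = max(best, min(score, 1.0))
        return best

class CLexer(Lexer):
    def recognize_text(self, text):
        m = re.search(r'^\s*#include [<"]', text, re.MULTILINE)
        yield (m.start() if m else -1), 0.1

print(CLexer().analyse_text('#include <stdio.h>\n'))  # → 0.1
```

Converted lexers would override only recognize_text; unconverted ones keep their hand-written analyse_text until someone migrates them.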

What do you think?

@Anteru
Collaborator

Anteru commented Dec 29, 2021

Yes, analyse_text is quite broken. I did discuss the issues here: #1702 and I think something like guesslang (mentioned #1950) might be a better direction by now. I'm not sure how much it's worth to "keep investing" in analyse_text improvements vs. "just" using guesslang and accepting the fact that most of the >500 languages we know about won't be reliably recognized. Maybe @birkenfeld has an opinion here?

@not-my-profile
Contributor Author

I'm not sure how much it's worth to "keep investing" in analyse_text improvements

As far as I can see, the analyse_text API hasn't been changed since 2006, when it was introduced in c9330c8. Of course, trying to get reasonable matches with an API that maps the existence of some pattern to a fixed floating-point number between 0 and 1 is doomed to fail. I think factoring in match positions would lead to a major improvement in guess quality.

Yes, guesslang is neat, but it only supports ~50 languages, which is just 10% of the languages Pygments can highlight. While adding guesslang would certainly improve the guessing for the popular languages, Pygments supports many more exotic languages for which getting the necessary amount of machine-learning training data might be difficult. And I feel like one of the appeals of Pygments is that it's very easy to add support for another language.

@Anteru
Collaborator

Anteru commented Dec 29, 2021

Don't get me wrong, I do see your point; I'm just looking at the state we're in right now, which is a "mostly useless" analyse_text implementation (judging by the number of bugs) that needs to be maintained separately from the lexer itself. There's also the problem that we're making a global guess, which means stacking every lexer against every other, instead of having a reasonable subset to select from (for instance, only disambiguating between the languages that use .cl as the file extension).

Looking a bit closer at the method you propose, I wonder if we can pick expressions stochastically from the lexer itself and use those for ranking. The main reason not to use the lexer itself and see if it produces errors is that it's too slow (is that still true if we abort at the first error token?), and maybe there's a way to pick a "reasonably selective" regex and use that for recognize_text?

Note that some languages don't use regex matches to decide, though. I wrote a few analyse_text implementations that look at the distribution of characters (to recognize Brainfuck) or at the presence of certain characters. How would those work with recognize_text?

@not-my-profile
Contributor Author

not-my-profile commented Dec 31, 2021

There's also the problem that we're making a global guess [..] instead of having a reasonable subset to select from (for instance, only disambiguating between all languages using .cl as the file ending.)

I'm confused ... isn't that what pygments.lexers.guess_lexer_for_filename is doing? (Having pygments.lexers.guess_lexer still makes sense for when you don't know the file extension.)

I wonder if we can pick expressions stochastically from the lexer itself and use those for ranking.

Of course we could do that, but I seriously doubt it would result in high-quality guesses, because keywords, for example, are only a soft indicator ... they can always occur in strings. I think our best bet is to hand-write regexes that are likely to occur at the top of the file (with some score) and then pick whichever lexer recognizes something first in the text (also factoring in said score).

The main reason to not use the lexer itself and see if it produces errors is because it's too slow (is that still true if we abort at the first error token?)

import pygments.lexers
import pygments.token

with open('pygments/lexer.py') as f:
    text = f.read()

error = []
ok = []

# Run every lexer over the file and record whether it emits an Error token.
for cls in pygments.lexers._iter_lexerclasses():
    for (ttype, _) in cls().get_tokens(text):
        if ttype == pygments.token.Error:
            error.append(cls.__name__)
            break
    else:  # no Error token was produced
        ok.append(cls.__name__)

print('ok', len(ok))
print('error', len(error))

Running that took 7 seconds for me and produced:

ok 125
error 397

meaning 125 lexers did not produce any error for pygments/lexer.py, which IMO shows that we can forget about that approach.

maybe there's a way to pick a "reasonably selective" regex and use that for recognize_text?

I think good guessing requires knowledge about how files in the programming language are usually structured. That information cannot be extracted from lexers.

I wrote a few analyse_text implementations which look at the distributions of characters (to recognize Brainfuck) or certain characters.

>>> from pygments.lexers.esoteric import BrainfuckLexer
>>> BrainfuckLexer().analyse_text('<!DOCTYPE html>++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.')
1.0

That's exactly what I mean. analyse_text should not return 1 if it's not absolutely certain.

I think simply using a regex like ([-+<>[\]]\s*){10,} would be better. This way guess_lexer would have access to the start position and could give precedence to other lexers that recognized something earlier (e.g. an import statement).
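To check that this regex behaves as intended on the mixed example above, a quick self-contained verification (variable names are mine):

```python
import re

# A run of at least ten Brainfuck operator characters,
# optionally separated by whitespace.
bf_run = re.compile(r'([-+<>[\]]\s*){10,}')

text = ('<!DOCTYPE html>++++++++++[>+++++++>++++++++++>+++>+<<<<-]'
        '>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.')

m = bf_run.search(text)
# The first qualifying run starts at offset 14 (the '>' closing the
# DOCTYPE), while the '<!DOCTYPE' giveaway sits at offset 0, so a
# position-aware guesser could let the HTML/XML lexer take precedence.
print(m.start())  # → 14
```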
