Better Lexer.analyse_text
#2005
Conversation
As far as I can see, guesslang is neat, but it only supports ~50 languages, which is just 10% of the languages Pygments can highlight. While adding guesslang would certainly improve the guessing for the popular languages, Pygments supports many more exotic languages for which getting the necessary amount of machine-learning training data might be difficult. And I feel like one of the appeals of Pygments is that it's very easy to add support for another language.
Don't get me wrong, I do see your point; I'm just looking at the state we're in right now, which is "mostly useless".

Looking a bit closer at the method you propose, I wonder if we can pick expressions stochastically from the lexer itself and use those for ranking. The main reason not to use the lexer itself and check whether it produces errors is that it's too slow (is that still true if we abort at the first error token?), and maybe there's a way to pick a "reasonably selective" regex and use that for ranking. Note that some languages don't do regex matches to decide what to do, though. I wrote a few
I'm confused ... isn't that what
Of course we could do that, but I seriously doubt it would result in high-quality guesses, because keywords, for example, are just a soft indicator ... they can always occur in strings. I think our best bet is to hand-write regexes that are likely to occur at the top of the file (with some score) and then pick whichever lexer thinks it recognizes something first in the text (also factoring in said score).
```python
import pygments.lexers
import pygments.token

# Run every lexer over this file and record which ones
# produce an Error token.
with open('pygments/lexer.py') as f:
    text = f.read()

error = []
ok = []
for cls in pygments.lexers._iter_lexerclasses():
    for (ttype, _) in cls().get_tokens(text):
        if ttype == pygments.token.Error:
            error.append(cls.__name__)
            break
    else:
        ok.append(cls.__name__)

print('ok', len(ok))
print('error', len(error))
```

Running that took 7 seconds for me and produced:
meaning 125 lexers did not produce any error token for this input.
I think good guessing requires knowledge about how files in the programming language are usually structured. That information cannot be extracted from lexers.
```pycon
>>> from pygments.lexers.esoteric import BrainfuckLexer
>>> BrainfuckLexer().analyse_text('<!DOCTYPE html>++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.')
1.0
```

That's exactly what I mean. I think simply using a regex like
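For instance, a position check along these lines would keep a DOCTYPE buried inside other content from counting, while still rewarding one at the very start of the file (`looks_like_html` is a hypothetical helper written for this discussion, not part of Pygments):

```python
import re

def looks_like_html(text):
    # Hypothetical position-aware check: only treat the text as HTML
    # if the DOCTYPE appears at the very start of the document.
    m = re.search(r'<!DOCTYPE\s+html', text, re.IGNORECASE)
    return m is not None and m.start() == 0

print(looks_like_html('<!DOCTYPE html><p>hi</p>'))   # → True
print(looks_like_html('x = "<!DOCTYPE html>"'))      # → False
```

Under such a scheme, the Brainfuck snippet above (which literally begins with `<!DOCTYPE html>`) would be claimed by the HTML lexer because its marker matches at position 0, regardless of how many `+` and `>` characters follow.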
While playing with the new demo I introduced in #1999, I noticed that lexer guessing based solely on text leaves much to be desired. It works fine for clear-cut cases like shebangs and DOCTYPEs, but when there isn't such a dead giveaway and you actually have to guess, it's very often wrong.
`pygments.lexers.guess_lexer(text)` is currently powered by `Lexer.analyse_text(text)`, which returns a float from 0 to 1 indicating how much the lexer thinks it recognizes its language (1 meaning it's certainly the right lexer, 0 meaning it's certainly not the right lexer).

IMO the clear problem with this is that most patterns can occur anywhere in the text, and the position they occur in currently isn't used for lexer ranking. To illustrate this with an example:
`#include <stdio.h>` is correctly recognized as C, and `<test></test>` is correctly recognized as XML. That is because the code currently looks like:
Since the `analyse_text` API doesn't propagate the positions of the matches, only one of the last two examples will ever be guessed correctly.

Please note that C and XML were only examples; you can have the same problem with pretty much any pair of Pygments' >500 languages. Another example would be:
So my idea is to improve the language guessing in Pygments by deprecating `Lexer.analyse_text` in favor of a new method `Lexer.recognize_text` that yields the indexes of matches so that they can be factored into the scoring. To give you two examples of how `analyse_text` would be converted to `recognize_text`:

Since there are currently about 150 `analyse_text` implementations, converting them all at once isn't feasible. This draft PR demonstrates how we could introduce `recognize_text` in a backwards-compatible manner, allowing us to convert the `analyse_text` methods one by one, improving lexer guessing incrementally.

What do you think?