-
Notifications
You must be signed in to change notification settings - Fork 4.1k
Extract common lexer code into helpers #79214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -21,6 +19,8 @@ protected AbstractLexer(SourceText text) | |||
this.TextWindow = new SlidingTextWindow(text); | |||
} | |||
|
|||
protected int LexemeStartPosition => this.TextWindow.LexemeStartPosition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intent is to move LexemeStartPosition into lexer, so that only the lexer cares about lexemes, and the textwindow only cares about being a fast streaming sequence of chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a lexeme? Is that like a token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of, and i can probably doc. It's "the entity the lexer is currently producing". This is commonly the text of BOTH trivias AND tokens (without its trivia).
It's what you generally expect to get back if you ask the Token/Trivia for its .Text
property (not .FullText
, and not .ValueText
).
Ignoring things like directives, the lexer generally is pointing at some position in the source. And it will 'start' lexing a 'lexeme' at that point. It consumes forward, based on certain rules about what it is currently consuming, until it 'finishes' that lexeme. At which point it generates a result (token or trivia in the majority case). That result is given a Kind
, Text
, and potentially other bits and bobs attached to it.
The goal here is to make the sliding-text-window care absolutely not one whit about lexer concepts, and keep itself only in the domain of making character-retrieval efficient. So lexemes and the like move up entirely to the lexer. This actually simplifies a bunch, and makes it harder to get things wrong.
FOr example, in the last year, there was a tweak to the sliding text window to allow it to look backwards. However, because the window itself was tracking lexemes, it could get into a corrupt state when it did that, leading to bad results being returned upwards in edge-case scenarios. THis split would help avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR:
It's the smallest piece of Text
hte lexer grabs out as an individual string to jam into either a Token
or Trivia
. it is indivisible.
=> TextWindow.GetText(intern: false); | ||
|
||
protected string GetInternedLexemeText() | ||
=> TextWindow.GetText(intern: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these helpers are here because GetText implicitly uses LexemeStartPosition. Once that is removed from the text window itself, it will need to be passed in (as the start position to read from, up to the text window's current position). So this means instead of having to update a huge number of sites, only this site is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: i wanted all lexeme-oriented operations to have that in their name. It's not at all evident what "TextWindow.GetText" or "TextWindow.Width" even means. Names like "CurrentLexemeWidth" are much clearer that it refers to the length of the current token being lexed out.
@@ -467,7 +467,7 @@ private void ScanSyntaxToken(ref TokenInfo info) | |||
{ | |||
var atDotPosition = this.TextWindow.Position; | |||
if (atDotPosition >= 1 && | |||
atDotPosition == this.TextWindow.LexemeStartPosition) | |||
atDotPosition == this.LexemeStartPosition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mechanical updates of TextWindow.LexemeStartPosition to this.LexemeStartPosition
@@ -636,12 +636,12 @@ private void ScanSyntaxToken(ref TokenInfo info) | |||
this.AddError(TextWindow.Position + 1, width: 1, ErrorCode.ERR_ExpectedVerbatimLiteral); | |||
|
|||
this.ScanToEndOfLine(); | |||
info.Text = TextWindow.GetText(false); | |||
info.Text = this.GetNonInternedLexemeText(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mechanical update of TextWindow.GetText(true/false)
to GetInternedLexemeText()/GetNonInternedLexemeText()
@@ -60,7 +60,7 @@ private void ScanStringLiteral(ref TokenInfo info, bool inDirective) | |||
//String and character literals can contain any Unicode character. They are not limited | |||
//to valid UTF-16 characters. So if we get the SlidingTextWindow's sentinel value, | |||
//double check that it was not real user-code contents. This will be rare. | |||
Debug.Assert(TextWindow.Width > 0); | |||
Debug.Assert(this.CurrentLexemeWidth > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mechanical update of TextWindow.Width to this.CurrentLexemeWidth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (commit 26). Kindly reminder to squash, thanks!
Extracted from #79205 to make that PR simpler.
That PR ends up making a few changes that end up cleaning up lexing a lot: