Skip to content

Extract common lexer code into helpers #79214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jul 1, 2025

Conversation

CyrusNajmabadi
Copy link
Member

@CyrusNajmabadi CyrusNajmabadi commented Jul 1, 2025

Extracted from #79205 to make that PR simpler.

That PR ends up making a few changes that end up cleaning up lexing a lot:

  1. responsibility for lexeme tracking moves to the lexer, from the text window. the text window now just concerns itself with being a fast stream of characters from teh original source text. This also simplifies a bunch of code to boot.
  2. text window gets less mutable state (with data showing why that is ok), simplifying lifetimes and array management.
  3. because of '2', text window can move to nicer abstractions (like ArraySegment/Span/etc) to make segment processing operations simpler. This helps ensure less mistakes and makes code simpler.

@@ -21,6 +19,8 @@ protected AbstractLexer(SourceText text)
this.TextWindow = new SlidingTextWindow(text);
}

protected int LexemeStartPosition => this.TextWindow.LexemeStartPosition;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intent is to move LexemeStartPosition into lexer, so that only the lexer cares about lexemes, and the textwindow only cares about being a fast streaming sequence of chars.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a lexeme? Is that like a token?

Copy link
Member Author

@CyrusNajmabadi CyrusNajmabadi Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of, and i can probably doc. It's "the entity the lexer is currently producing". This is commonly the text of BOTH trivias AND tokens (without its trivia).

It's what you generally expect to get back if you ask the Token/Trivia for its .Text property (not .FullText, and not .ValueText).

Ignoring things like directives, the lexer generally is pointing at some position in the source. And it will 'start' lexing a 'lexeme' at that point. It consumes forward, based on certain rules about what it is currently consuming, until it 'finishes' that lexeme. At which point it generates a result (token or trivia in the majority case). That result is given a Kind, Text, and potentially other bits and bobs attached to it.

The goal here is to make the sliding-text-window care absolutely not one whit about lexer concepts, and keep itself only in the domain of making character-retrieval efficient. So lexemes and the like move up entirely to the lexer. This actually simplifies a bunch, and makes it harder to get things wrong.

FOr example, in the last year, there was a tweak to the sliding text window to allow it to look backwards. However, because the window itself was tracking lexemes, it could get into a corrupt state when it did that, leading to bad results being returned upwards in edge-case scenarios. THis split would help avoid that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR:

It's the smallest piece of Text hte lexer grabs out as an individual string to jam into either a Token or Trivia. it is indivisible.

=> TextWindow.GetText(intern: false);

protected string GetInternedLexemeText()
=> TextWindow.GetText(intern: true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these helpers are here because GetText implicitly uses LexemeStartPosition. Once that is removed from the text window itself, it will need to be passed in (as the start position to read from, up to the text window's current position). So this means instead of having to update a huge number of sites, only this site is updated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: i wanted all lexeme-oriented operations to have that in their name. It's not at all evident what "TextWindow.GetText" or "TextWindow.Width" even means. Names like "CurrentLexemeWidth" are much clearer that it refers to the length of the current token being lexed out.

@@ -467,7 +467,7 @@ private void ScanSyntaxToken(ref TokenInfo info)
{
var atDotPosition = this.TextWindow.Position;
if (atDotPosition >= 1 &&
atDotPosition == this.TextWindow.LexemeStartPosition)
atDotPosition == this.LexemeStartPosition)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mechanical updates of TextWindow.LexemeStartPosition to this.LexemeStartPosition

@@ -636,12 +636,12 @@ private void ScanSyntaxToken(ref TokenInfo info)
this.AddError(TextWindow.Position + 1, width: 1, ErrorCode.ERR_ExpectedVerbatimLiteral);

this.ScanToEndOfLine();
info.Text = TextWindow.GetText(false);
info.Text = this.GetNonInternedLexemeText();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mechanical update of TextWindow.GetText(true/false) to GetInternedLexemeText()/GetNonInternedLexemeText()

@@ -60,7 +60,7 @@ private void ScanStringLiteral(ref TokenInfo info, bool inDirective)
//String and character literals can contain any Unicode character. They are not limited
//to valid UTF-16 characters. So if we get the SlidingTextWindow's sentinel value,
//double check that it was not real user-code contents. This will be rare.
Debug.Assert(TextWindow.Width > 0);
Debug.Assert(this.CurrentLexemeWidth > 0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mechanical update of TextWindow.Width to this.CurrentLexemeWidth

@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review July 1, 2025 16:28
@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner July 1, 2025 16:28
@jcouv jcouv self-assigned this Jul 1, 2025
Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (commit 26). Kindly reminder to squash, thanks!

@CyrusNajmabadi CyrusNajmabadi merged commit f10909a into dotnet:main Jul 1, 2025
24 checks passed
@CyrusNajmabadi CyrusNajmabadi deleted the lexerHelpers branch July 1, 2025 23:58
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Jul 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants