Extract common lexer code into helpers #79214

CyrusNajmabadi · 2025-07-01T16:04:20Z

Extracted from #79205 to make that PR simpler.

That PR makes a few changes that clean up lexing a lot:

1. Responsibility for lexeme tracking moves from the text window to the lexer. The text window now just concerns itself with being a fast stream of characters from the original source text. This also simplifies a bunch of code to boot.
2. The text window gets less mutable state (with data showing why that is OK), simplifying lifetimes and array management.
3. Because of (2), the text window can move to nicer abstractions (like ArraySegment/Span/etc.) to make segment processing operations simpler. This helps ensure fewer mistakes and makes the code simpler.

CyrusNajmabadi · 2025-07-01T16:20:53Z

src/Compilers/CSharp/Portable/Parser/AbstractLexer.cs

@@ -21,6 +19,8 @@ protected AbstractLexer(SourceText text)
            this.TextWindow = new SlidingTextWindow(text);
        }

+        protected int LexemeStartPosition => this.TextWindow.LexemeStartPosition;


The intent is to move LexemeStartPosition into the lexer, so that only the lexer cares about lexemes, and the text window only cares about being a fast streaming sequence of chars.

What is a lexeme? Is that like a token?

Sort of, and I can probably doc it. It's "the entity the lexer is currently producing". This is commonly the text of BOTH trivia AND tokens (for a token, its text without its trivia).

It's what you generally expect to get back if you ask the Token/Trivia for its .Text property (not .FullText, and not .ValueText).

Ignoring things like directives, the lexer generally is pointing at some position in the source. And it will 'start' lexing a 'lexeme' at that point. It consumes forward, based on certain rules about what it is currently consuming, until it 'finishes' that lexeme. At which point it generates a result (token or trivia in the majority case). That result is given a Kind, Text, and potentially other bits and bobs attached to it.

The goal here is to make the sliding-text-window care absolutely not one whit about lexer concepts, and keep itself only in the domain of making character-retrieval efficient. So lexemes and the like move up entirely to the lexer. This actually simplifies a bunch, and makes it harder to get things wrong.
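A minimal sketch of that split, not the actual Roslyn code: `CharWindow` and `MiniLexer` are hypothetical names, the whitespace-versus-everything-else rule is a toy, and the real SlidingTextWindow reads from a SourceText rather than a string. The window only knows about characters and a position; the lexer alone remembers where the current lexeme started.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the text window: a forward stream of chars, nothing else.
public sealed class CharWindow
{
    private readonly string _text;

    public CharWindow(string text) => _text = text;

    public int Position { get; private set; }

    public bool IsAtEnd => Position >= _text.Length;

    // Returns char.MaxValue as a sentinel at end of input (an assumption of this sketch).
    public char PeekChar() => IsAtEnd ? char.MaxValue : _text[Position];

    public void AdvanceChar()
    {
        if (!IsAtEnd)
            Position++;
    }

    // The caller supplies the start; the window has no notion of a "lexeme".
    public string GetText(int start) => _text.Substring(start, Position - start);
}

// Hypothetical lexer: it owns the lexeme start position.
public sealed class MiniLexer
{
    private readonly CharWindow _window;
    private int _lexemeStartPosition;

    public MiniLexer(string text) => _window = new CharWindow(text);

    // Toy rule: a run of whitespace is trivia, a run of anything else is a token.
    public IEnumerable<(string Kind, string Text)> Lex()
    {
        while (!_window.IsAtEnd)
        {
            // 'Start' a lexeme at the current position.
            _lexemeStartPosition = _window.Position;
            bool isTrivia = char.IsWhiteSpace(_window.PeekChar());

            // Consume forward until the character class changes.
            while (!_window.IsAtEnd && char.IsWhiteSpace(_window.PeekChar()) == isTrivia)
                _window.AdvanceChar();

            // 'Finish' the lexeme and hand back its kind and text.
            yield return (isTrivia ? "Trivia" : "Token", _window.GetText(_lexemeStartPosition));
        }
    }
}
```

Because `CharWindow` holds no lexeme state, moving its position around cannot leave a stale lexeme start behind; only `MiniLexer` can get that wrong, and it sets the field in exactly one place.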

For example, in the last year, there was a tweak to the sliding text window to allow it to look backwards. However, because the window itself was tracking lexemes, it could get into a corrupt state when it did that, leading to bad results being returned upwards in edge-case scenarios. This split would help avoid that.

TLDR:

It's the smallest piece of text the lexer grabs out as an individual string to jam into either a Token or Trivia. It is indivisible.

CyrusNajmabadi · 2025-07-01T16:22:14Z

src/Compilers/CSharp/Portable/Parser/AbstractLexer.cs

+            => TextWindow.GetText(intern: false);
+
+        protected string GetInternedLexemeText()
+            => TextWindow.GetText(intern: true);


These helpers are here because GetText implicitly uses LexemeStartPosition. Once that is removed from the text window itself, it will need to be passed in (as the start position to read from, up to the text window's current position). So this means that instead of having to update a huge number of sites, only this site is updated.

Note: I wanted all lexeme-oriented operations to have that in their name. It's not at all evident what "TextWindow.GetText" or "TextWindow.Width" even mean. Names like "CurrentLexemeWidth" make it much clearer that they refer to the length of the current token being lexed out.
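As a rough sketch of where these helpers are headed once the window stops tracking the lexeme start: the lexer owns the start position and forwards it from a single place. Everything here is an assumption for illustration, not the real SlidingTextWindow API: the `TextWindowSketch` and `LexerSketch` types, the `GetText(int start, bool intern)` overload, the `Scan` demo method, and the use of `string.Intern` as a stand-in for whatever interning the compiler actually does.

```csharp
using System;

// Hypothetical window: the start position is an explicit argument, not window state.
public sealed class TextWindowSketch
{
    private readonly string _text;

    public TextWindowSketch(string text) => _text = text;

    public int Position { get; set; }

    public string GetText(int start, bool intern)
    {
        var text = _text.Substring(start, Position - start);
        // string.Intern is only a stand-in for the real interning strategy.
        return intern ? string.Intern(text) : text;
    }
}

// Hypothetical lexer base: all lexeme-oriented members say "Lexeme" in their name.
public class LexerSketch
{
    protected readonly TextWindowSketch TextWindow;
    protected int LexemeStartPosition;

    public LexerSketch(string text) => TextWindow = new TextWindowSketch(text);

    protected int CurrentLexemeWidth => TextWindow.Position - LexemeStartPosition;

    // The only two sites that need to know how the start position reaches the window.
    protected string GetNonInternedLexemeText()
        => TextWindow.GetText(LexemeStartPosition, intern: false);

    protected string GetInternedLexemeText()
        => TextWindow.GetText(LexemeStartPosition, intern: true);

    // Demo only: pretend a lexeme spanning [start, end) was just scanned.
    public (int Width, string Text) Scan(int start, int end)
    {
        LexemeStartPosition = start;
        TextWindow.Position = end;
        return (CurrentLexemeWidth, GetNonInternedLexemeText());
    }
}
```

Call sites keep writing `GetNonInternedLexemeText()` or `CurrentLexemeWidth` and never see the start position being threaded through.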

CyrusNajmabadi · 2025-07-01T16:22:34Z

src/Compilers/CSharp/Portable/Parser/Lexer.cs

@@ -467,7 +467,7 @@ private void ScanSyntaxToken(ref TokenInfo info)
                    {
                        var atDotPosition = this.TextWindow.Position;
                        if (atDotPosition >= 1 &&
-                            atDotPosition == this.TextWindow.LexemeStartPosition)
+                            atDotPosition == this.LexemeStartPosition)


Mechanical update of TextWindow.LexemeStartPosition to this.LexemeStartPosition.

CyrusNajmabadi · 2025-07-01T16:23:02Z

src/Compilers/CSharp/Portable/Parser/Lexer.cs

@@ -636,12 +636,12 @@ private void ScanSyntaxToken(ref TokenInfo info)
                            this.AddError(TextWindow.Position + 1, width: 1, ErrorCode.ERR_ExpectedVerbatimLiteral);

                            this.ScanToEndOfLine();
-                            info.Text = TextWindow.GetText(false);
+                            info.Text = this.GetNonInternedLexemeText();


Mechanical update of TextWindow.GetText(true/false) to GetInternedLexemeText()/GetNonInternedLexemeText().

CyrusNajmabadi · 2025-07-01T16:23:40Z

src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs

@@ -60,7 +60,7 @@ private void ScanStringLiteral(ref TokenInfo info, bool inDirective)
                    //String and character literals can contain any Unicode character. They are not limited
                    //to valid UTF-16 characters. So if we get the SlidingTextWindow's sentinel value,
                    //double check that it was not real user-code contents. This will be rare.
-                    Debug.Assert(TextWindow.Width > 0);
+                    Debug.Assert(this.CurrentLexemeWidth > 0);


Mechanical update of TextWindow.Width to this.CurrentLexemeWidth.

jcouv

LGTM (commit 26). Kind reminder to squash, thanks!

CyrusNajmabadi and others added 23 commits June 30, 2025 11:18

Move field into lexer

40ad7f0

more work

0ea5fbc

Update cache

a472f31

All done

3d1fc7f

Remove

fd45aec

Doc

16573f2

Fix

e1eb1b4

Fix

362c1cb

Make sure we disable tracing

d8a8748

Fix test

712f2f5

Fixup

a852cfd

Simplify

8f278a3

Add tracing

15fbaec

Add tracing

daf9f80

Add docs

ebf24ab

remove

54e6e27

remove

284cf83

Fixup test

e9a740c

Simplify

f6817ee

Check both ends

1f0ec4a

Docs

e2885e4

cleanup

7fc3b85

Inline method

cedf899

github-actions bot added the Area-Compilers label Jul 1, 2025

CyrusNajmabadi added 2 commits July 1, 2025 09:06

revert

f58146f

revert

dcae96b

CyrusNajmabadi commented Jul 1, 2025

revert

00d74c8

CyrusNajmabadi commented Jul 1, 2025

CyrusNajmabadi marked this pull request as ready for review July 1, 2025 16:28

CyrusNajmabadi requested a review from a team as a code owner July 1, 2025 16:28

RikkiGibson approved these changes Jul 1, 2025

jcouv self-assigned this Jul 1, 2025

jcouv approved these changes Jul 1, 2025

CyrusNajmabadi merged commit f10909a into dotnet:main Jul 1, 2025
24 checks passed

CyrusNajmabadi deleted the lexerHelpers branch July 1, 2025 23:58

dotnet-policy-service bot added this to the Next milestone Jul 1, 2025

dotnet-bot mentioned this pull request Jul 4, 2025

[Automated] PRs inserted in VS build main-10803.87 #79253

Closed
