Add location attributes to token object #374
Conversation
We should record both start and end locations of a token. For example, the Babel parser uses:

```js
loc: {
  start: {line, column, index},
  end: {line, column, index},
},
```

while the Esprima parser uses a simple array of start/end indexes:

```js
range: [15, 20],
```
I'm also not sure whether we actually need line and column numbers. Do you have a concrete plan on how you're planning to use this info?
The line and column numbers can also be calculated afterwards once the index is known: one just needs to count lines up to that index in the original source code.
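As a sketch of that on-demand computation (the helper name and shape are mine, not from the codebase), counting line breaks up to the index gives both numbers:

```typescript
// Hypothetical helper: derive 1-based line and column from a character index
// by counting the line breaks that occur before it.
function lineColFromIndex(source: string, index: number): { line: number; col: number } {
  const before = source.slice(0, index);
  // \r\n must be listed first so it counts as a single break
  const breaks = before.match(/\r\n|\r|\n/g) || [];
  const line = breaks.length + 1;
  // Column is the distance from the last break; -1 (no break) yields index + 1
  const lastBreak = Math.max(before.lastIndexOf("\n"), before.lastIndexOf("\r"));
  const col = index - lastBreak;
  return { line, col };
}
```

This trades a small per-error cost for not computing positions on every token.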
You haven't responded to my question of how you're planning to use this location information. Currently this is all effectively dead code, as we don't use this info anywhere. To judge whether this is a good approach, I'd need to know where exactly you're planning to head with this.
I know that the start/end indexes would be useful for associating comments with AST nodes. So I'm on board with that.
For the line and column numbers I can currently only think of one usage: reporting errors. For example, if our Tokenizer currently fails, we report `Parse error: Unexpected "some text"`. It would be nice to say at which line and column the problem happened. But for that we don't really need to store the line and column positions inside each token - we could instead compute them on demand, which would save us the overhead of computing line & col positions that we wouldn't use most of the time. But perhaps you had some different use case in mind?
I also think it would be better to group this data inside one field of a Token instead of adding 4 new fields. That should also simplify doing operations with this location info later. Instead of having to extract 4 fields from a token to pass them to a `doSomethingWithLocation()` function, one would only need to extract a single field.
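The suggested grouping could look something like this (the field and type names here are illustrative, not part of the PR):

```typescript
// A single nested location field instead of four flat fields on the token.
interface TokenLocation {
  start: { line: number; col: number; index: number };
  end: { line: number; col: number; index: number };
}

interface Token {
  type: string;
  text: string;
  location: TokenLocation; // one field, passed around as a unit
}

// Consumers take the whole location object at once:
function doSomethingWithLocation(loc: TokenLocation): string {
  return `${loc.start.line}:${loc.start.col}-${loc.end.line}:${loc.end.col}`;
}
```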
src/lexer/TokenizerEngine.ts (outdated)

```ts
    this.line++;
    lastIndex = LINEBREAK_REGEX.lastIndex;
  }
  this.col = token.length - lastIndex + 1;
```
This code is assuming that line breaks are always one character long, which isn't true. Notably there's also no test for `\r\n` line breaks.
I also find the logic in this code pretty hard to wrap my head around. Like, why do we increase the line number before the loop and then also inside the loop... wouldn't that lead to counting one line too many? And why do we set `lastIndex` to 1 before the loop? Perhaps it's all correct... but it's really hard to figure out.

How about a different approach:

- Use `String.split()` to split the string into multiple lines.
- You'd get the number of newlines from the array length - 1.
- You'd get the column position from the length of the first item in the array.
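A sketch of that split-based approach (the helper and its signature are hypothetical; I take the column after a multi-line token from the last fragment, on the assumption that the tracked column should point just past the token):

```typescript
// Advance a (line, col) position across a token's text.
// Splitting on /\r\n|\r|\n/ handles \n, \r, and \r\n in one pass.
function advance(token: string, line: number, col: number): { line: number; col: number } {
  const parts = token.split(/\r\n|\r|\n/);
  if (parts.length === 1) {
    // No line breaks: stay on the same line, move the column forward
    return { line, col: col + token.length };
  }
  return {
    line: line + parts.length - 1,           // one new line per break
    col: parts[parts.length - 1].length + 1, // 1-based column after the last break
  };
}
```

No regex state (`lastIndex`) survives between calls, which avoids the test/exec pitfalls discussed below.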
The `+ 1` is actually for the 1-based col index, not for the line match length.

Setting the `lastIndex` to 1 is to account for the line break char, but you're right that it doesn't account for `\r\n`.
The initial line increment is because `REGEX.test` and `REGEX.exec` have an interaction where the test counts as the first exec - i.e. if there is only one line break, the exec loop would never start.
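That test/exec interaction can be seen in a small standalone snippet (not the project's code): with the `/g` flag, `test()` advances the regex's `lastIndex`, so a subsequent `exec()` starts searching after the first match.

```typescript
// A global regex shares lastIndex between test() and exec().
const re = /\n/g;
const input = "one\ntwo";

const found = re.test(input); // matches the only line break, lastIndex -> 4
const next = re.exec(input);  // searches from index 4 onward: nothing left
```

So after one `test()`, an exec loop over a single-break string never runs, which is what the pre-loop increment compensates for.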
I suppose you're referring to this comment-placement issue: #365. Another whitespace-preserving related issue is #329. I think these both might be better solved by bringing back the

Perhaps though it would be better to leave this line/column number part completely out for now. As we don't yet know if and how exactly we're going to use it, it's hard to design an API for a future unknown usage. I think it might be better to add this line/col-numbers support together with the feature that actually uses it. It's not really that complex of a functionality to bundle together with another feature, and perhaps we end up not using it at all.
Force-pushed from cb37ca7 to 41fbe74
Adds the following location attributes to tokens:

- `begin`: raw index in overall query string
- `end`: where token ends in query string