
Conversation

Collaborator

@inferrinizzard inferrinizzard commented Aug 4, 2022

Adds the following location attributes to tokens:

  • begin: raw start index of the token in the overall query string
  • end: raw index in the query string where the token ends
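For illustration, assuming end is the exclusive index just past the token's last character (the description above doesn't specify inclusive vs. exclusive):

// Hypothetical example: for the query "SELECT foo",
// the token "foo" would carry { begin: 7, end: 10 }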

@inferrinizzard inferrinizzard self-assigned this Aug 4, 2022
@inferrinizzard inferrinizzard changed the title from "add raw string index location to tokens" to "Add location attributes to token object" Aug 4, 2022
Collaborator

@nene nene left a comment


We should record both the start and end locations of a token. For example, the Babel parser uses:

loc: {
  start: {line, column, index},
  end: {line, column, index},
},

Meanwhile, the Esprima parser uses a simple array of start and end indexes:

range: [15, 20],

I'm also not sure whether we actually need line and column numbers. Do you have a concrete plan for how you're going to use this info?

The line and column numbers can also be calculated afterwards when the index is known: one just needs to count the lines up to that index in the original source code.
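A minimal sketch of that on-demand calculation, in TypeScript (names here are illustrative, not from this PR; assumes \n line breaks):

// Derive a 1-based line and column from a raw character index
// by counting the newlines that precede it.
function lineColAt(source: string, index: number): { line: number; col: number } {
  const before = source.slice(0, index).split('\n');
  return {
    line: before.length,                       // newlines before index, plus 1
    col: before[before.length - 1].length + 1, // chars after the last newline, 1-based
  };
}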

@inferrinizzard inferrinizzard changed the base branch from lexer/rename-token-attributes to master August 4, 2022 14:16
Collaborator

@nene nene left a comment


You haven't responded to my question of how you're planning to use this location information. Currently this is all effectively dead code, as we don't use this info anywhere. To judge whether this is a good approach, I'd need to know where exactly you're planning to head with this.

I know that the start/end indexes would be useful for associating comments with AST nodes. So I'm on board with that.

For the line and column numbers I can currently only think of one use: reporting errors. For example, when our Tokenizer fails, we report Parse error: Unexpected "some text". It would be nice to also say at which line and column the problem happened. But for that we don't really need to store line and column positions inside each token - we could instead compute them on demand, which would save us the overhead of computing line & col positions that we wouldn't use most of the time. But perhaps you had a different use case in mind?

I also think it would be better to group this data inside one field of a Token instead of adding 4 new fields. That should also simplify operations on this location info later: instead of having to extract 4 fields from a token to pass them to a doSomethingWithLocation() function, one would only need to extract a single field.
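One possible shape for that single field, sketched in TypeScript (field names are illustrative, not from this PR):

interface Position {
  line: number;   // 1-based
  column: number; // 1-based
  index: number;  // 0-based raw offset into the query string
}

interface TokenLocation {
  start: Position;
  end: Position;
}

// A token would then carry a single field, e.g. token.location,
// which can be passed around as one value:
// doSomethingWithLocation(token.location);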

this.line++;
lastIndex = LINEBREAK_REGEX.lastIndex;
}
this.col = token.length - lastIndex + 1;
Collaborator


This code assumes that line breaks are always one character long, which isn't true. Notably, there's also no test for \r\n line breaks.

I also find the logic in this code pretty hard to wrap my head around. Like, why do we increase the line number before the loop and then also inside the loop... wouldn't that lead to counting one line too many? And why do we set lastIndex to 1 before the loop? Perhaps it's all correct... but it's really hard to figure out.

How about a different approach (sketched after the list):

  • Use String.split() to split the string into multiple lines.
  • You'd get the number of newlines from the array length - 1.
  • You'd get the column position from the length of the last item in the array.
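A minimal sketch of that approach (assuming \n line breaks, and using the last split segment for the column, since the token's final line is where the next token continues):

// Advance a running 1-based (line, col) position across one token's text.
function advance(pos: { line: number; col: number }, token: string): void {
  const parts = token.split('\n');
  pos.line += parts.length - 1; // number of newlines inside the token
  if (parts.length > 1) {
    pos.col = parts[parts.length - 1].length + 1; // column restarts after the last newline
  } else {
    pos.col += token.length; // no newline: just move right
  }
}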

Collaborator Author


The + 1 is actually for the 1-based column index, not for the line-break match length.
Setting lastIndex to 1 is to account for the newline character, but you're right that it doesn't account for \r\n.

Collaborator Author


The initial line increment is because REGEX.test and REGEX.exec have an interaction where the test counts as the first exec - i.e. if there is only 1 line break, the exec loop would never start.
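That interaction is observable with any global regex, since test() and exec() share the same lastIndex state (a small illustration, not code from this PR):

const LINEBREAK_REGEX = /\n/g;
const text = 'line one\nline two'; // exactly 1 line break

LINEBREAK_REGEX.test(text); // true - and advances lastIndex past the match
LINEBREAK_REGEX.exec(text); // null - no second "\n", so an exec loop never runs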

Collaborator Author

inferrinizzard commented Aug 8, 2022

The line and col attributes could be used to make formatting easier in the preserve modes, such as in that other PR regarding the placement of line comments that were previously inline - the raw start and end attributes alone wouldn't encode which line the comment was originally on.

Collaborator

nene commented Aug 8, 2022

I suppose you're referring to this comment-placement issue: #365

Another whitespace-preserving related issue is #329

I think both of these might be better solved by bringing back the precedingWhitespace field to tokens. Though it might also be that storing line numbers is a better approach. However, the current implementation won't work for that, as we'd need to also store the line number of the token's end (so we could compare it with the next token's start to determine if there are newlines in between).
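With both start and end line numbers stored, that comparison could be as simple as this sketch (hypothetical Token shape, not the one in this PR):

interface TokenWithLines {
  startLine: number;
  endLine: number;
}

// Number of line breaks between two adjacent tokens.
function newlinesBetween(prev: TokenWithLines, next: TokenWithLines): number {
  return next.startLine - prev.endLine;
}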

Perhaps, though, it would be better to leave the line/column number part out completely for now. As we don't yet know if and how exactly we're going to use it, it's hard to design an API for an unknown future usage. I think it might be better to add line/col-number support together with the feature that actually uses it. It's not that complex a piece of functionality to bundle with another feature, and perhaps we'll end up not using it at all.

@inferrinizzard inferrinizzard force-pushed the lexer/tokenizer-location branch from cb37ca7 to 41fbe74 August 10, 2022 00:29
@inferrinizzard inferrinizzard marked this pull request as ready for review August 10, 2022 00:30
@inferrinizzard inferrinizzard merged commit cb58777 into master Aug 10, 2022
@nene nene deleted the lexer/tokenizer-location branch August 30, 2022 07:44