Add location attributes to token object #374
Conversation
We should record both start and end locations of a token. For example, the Babel parser uses:

```js
loc: {
  start: {line, column, index},
  end: {line, column, index},
},
```

while the Esprima parser uses a simple array of start/end indexes:

```js
range: [15, 20],
```
I'm also not sure whether we actually need line and column numbers. Do you have a concrete plan on how you're planning to use this info?
The line and column numbers can also be calculated afterwards once the index is known: one just needs to count lines up to that index in the original source code.
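As a sketch of that on-demand computation (the helper name and shape are mine, not from the codebase), counting line breaks up to the index gives both numbers:

```typescript
// Hypothetical helper: derive 1-based line and column from a character index
// by counting the line breaks that occur before it.
function lineColFromIndex(source: string, index: number): { line: number; col: number } {
  const before = source.slice(0, index);
  // \r\n must be listed first so it counts as a single break
  const breaks = before.match(/\r\n|\r|\n/g) || [];
  const line = breaks.length + 1;
  // Column is the distance from the last break; -1 (no break) yields index + 1
  const lastBreak = Math.max(before.lastIndexOf("\n"), before.lastIndexOf("\r"));
  const col = index - lastBreak;
  return { line, col };
}
```

This trades a small per-error cost for not computing positions on every token.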
You haven't responded to my question of how you're planning to use this location information. Currently this is all effectively dead code, as we don't use this info anywhere. To judge whether this is a good approach, I'd need to know where exactly you're planning to head with this.
I know that the start/end indexes would be useful for associating comments with AST nodes. So I'm on board with that.
For the line and column numbers I can currently only think of one usage: reporting errors. For example, if our Tokenizer currently fails, we report `Parse error: Unexpected "some text"`. It would be nice to say at which line and column the problem happened. But for that we don't really need to store the line and column positions inside each token - we could instead compute them on demand, which would save us the overhead of computing line & col positions that we wouldn't use most of the time. But perhaps you had some different use case in mind?
I also think it would be better to group this data inside one field of a Token instead of adding 4 new fields. That should also simplify doing operations with this location info later. Instead of having to extract 4 fields from a token to pass them to a `doSomethingWithLocation()` function, one would only need to extract a single field.
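The suggested grouping could look something like this (the field and type names here are illustrative, not part of the PR):

```typescript
// A single nested location field instead of four flat fields on the token.
interface TokenLocation {
  start: { line: number; col: number; index: number };
  end: { line: number; col: number; index: number };
}

interface Token {
  type: string;
  text: string;
  location: TokenLocation; // one field, passed around as a unit
}

// Consumers take the whole location object at once:
function doSomethingWithLocation(loc: TokenLocation): string {
  return `${loc.start.line}:${loc.start.col}-${loc.end.line}:${loc.end.col}`;
}
```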
src/lexer/TokenizerEngine.ts (outdated)

```ts
    this.line++;
    lastIndex = LINEBREAK_REGEX.lastIndex;
  }
  this.col = token.length - lastIndex + 1;
```
This code is assuming that line breaks are always one character long, which isn't true. Notably there's also no test for `\r\n` line breaks.
I also find the logic in this code pretty hard to wrap my head around. Like, why do we increase the line number before the loop and then also inside the loop... wouldn't that lead to counting one line too many? And why do we set `lastIndex` to 1 before the loop? Perhaps it's all correct... but it's really hard to figure out.

How about a different approach:

- Use `String.split()` to split the string into multiple lines.
- You'd get the number of newlines from the array length - 1.
- You'd get the column position from the length of the first item in the array.
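A sketch of that split-based approach (the helper and its signature are hypothetical; I take the column after a multi-line token from the last fragment, on the assumption that the tracked column should point just past the token):

```typescript
// Advance a (line, col) position across a token's text.
// Splitting on /\r\n|\r|\n/ handles \n, \r, and \r\n in one pass.
function advance(token: string, line: number, col: number): { line: number; col: number } {
  const parts = token.split(/\r\n|\r|\n/);
  if (parts.length === 1) {
    // No line breaks: stay on the same line, move the column forward
    return { line, col: col + token.length };
  }
  return {
    line: line + parts.length - 1,           // one new line per break
    col: parts[parts.length - 1].length + 1, // 1-based column after the last break
  };
}
```

No regex state (`lastIndex`) survives between calls, which avoids the test/exec pitfalls discussed below.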
The `+ 1` is actually for the 1-based col index, not for the line match length.

Setting the `lastIndex` to 1 is to account for the line break char, but you're right that it doesn't account for `\r\n`.
The initial line increment is because `REGEX.test` and `REGEX.exec` have an interaction where the test counts as the first exec - i.e. if there is only one line break, the exec loop would never start.
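That test/exec interaction can be seen in a small standalone snippet (not the project's code): with the `/g` flag, `test()` advances the regex's `lastIndex`, so a subsequent `exec()` starts searching after the first match.

```typescript
// A global regex shares lastIndex between test() and exec().
const re = /\n/g;
const input = "one\ntwo";

const found = re.test(input); // matches the only line break, lastIndex -> 4
const next = re.exec(input);  // searches from index 4 onward: nothing left
```

So after one `test()`, an exec loop over a single-break string never runs, which is what the pre-loop increment compensates for.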
I suppose you're referring to this comment-placement issue: #365. Another whitespace-preserving related issue is #329. I think these both might be better solved by bringing back the

Perhaps though it would be better to leave this line/column number part completely out for now. As we don't yet know if and how exactly we're going to use it, it's hard to design an API for a future unknown usage. I think it might be better to add this line/col-numbers support together with the feature that actually uses it. It's not really that complex of a functionality to bundle together with another feature, and perhaps we end up not using it at all.
Force-pushed from cb37ca7 to 41fbe74
Adds the following location attributes to tokens:

- `begin`: raw index in overall query string
- `end`: where token ends in query string