distinguish `identifier` from word token
#114
Merged
What
This PR simply splits `identifier` into two different nonterminals, `identifier` and `word_identifier`. Both have the same content, so this does not change the meaning of the grammar.
Why
I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.
A perennial problem for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload `identifier` so that it can match a Semgrep metavariable, but each way of getting there presents an issue:
1. Overloading `interpolation_expression` (which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coding every place where an `interpolation_expression` appears to also allow an `identifier`, which would be many places. It's better to go to the `identifier` nonterminal itself.
2. Making `identifier` possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression `$Pack`. This is an interpolation expression, but augmenting `identifier` in this way will cause it to parse as a metavariable `$P`, followed by an identifier `ack`. This persists because tree-sitter does not support regex assertions, including word boundary assertions.
3. Making `identifier` possibly a `choice` of an external token would allow us to take care of the distinction in the scanner. This would ordinarily be possible, except that the `identifier` symbol is also what is known as the word token in tree-sitter (see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction). tree-sitter's logic does not allow the word token to be a nonterminal, meaning that it is impossible to augment `identifier` with another nonterminal.
How:
Of these solutions, it is easiest to fix 3. This PR simply separates out `identifier` into `word_identifier`, a symbol that is meant to be used as the word token, and `identifier`, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" the `identifier` symbol and change it without changing the word token, thus allowing `identifier` to be a nonterminal that includes an external symbol.
That's what this PR does.
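As a rough illustration of the split, here is a minimal sketch in the style of a `grammar.js` file. It is not the actual diff: the identifier regex is a simplified ASCII placeholder, the `semgrep_metavariable` external token and the `semgrepVariant` override are hypothetical, and the two fallback helpers exist only so the snippet loads outside the tree-sitter CLI (which normally provides `grammar` and `choice` as globals).

```javascript
// Fallbacks so the sketch runs standalone; inside a real grammar.js the
// tree-sitter CLI supplies `grammar` and `choice` itself.
const defineGrammar =
  typeof grammar === "function" ? grammar : (spec) => spec;
const choiceOf =
  typeof choice === "function"
    ? choice
    : (...members) => ({ type: "CHOICE", members });

// Upstream shape after the split: two names, one definition.
const upstream = {
  name: "julia_sketch",
  // The word token points at the terminal, not at `identifier`.
  word: ($) => $.word_identifier,
  rules: {
    source_file: ($) => $.identifier,
    identifier: ($) => $.word_identifier,
    // Placeholder regex; the real Julia identifier rule is broader.
    word_identifier: (_) => /[A-Za-z_][A-Za-z_0-9]*/,
  },
};

// Hypothetical downstream override: only `identifier` changes, so the
// word token stays a plain terminal.
const semgrepVariant = {
  ...upstream,
  name: "julia_semgrep_sketch",
  // Assumed external token, to be recognized by a custom scanner.
  externals: ($) => [$.semgrep_metavariable],
  rules: {
    ...upstream.rules,
    identifier: ($) => choiceOf($.word_identifier, $.semgrep_metavariable),
  },
};

// A real grammar.js would export this via `module.exports`.
const sketch = defineGrammar(semgrepVariant);
```

The point of the design is that the override touches `rules.identifier` only; `word` still refers to `word_identifier`, so the keyword-extraction constraint is not violated.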
Test plan:
`tree-sitter test`, and the grammar should be the same.
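For reference, the partial-match problem described for `$Pack` can be reproduced outside tree-sitter with a toy longest-match lexer. This is only a sketch under assumptions: the metavariable pattern (`$` followed by uppercase letters, digits, or underscores) and the ASCII identifier pattern are simplified stand-ins, and the `lex` helper is hypothetical rather than anything tree-sitter exposes.

```javascript
// Assumed metavariable shape: "$" then uppercase letters, digits, underscores.
const METAVARIABLE = /^\$[A-Z_][A-Z_0-9]*/;
// Simplified ASCII stand-in for a Julia identifier.
const IDENTIFIER = /^[A-Za-z_][A-Za-z_0-9]*/;

// Toy lexer: at each position, take the longest prefix match of either
// pattern. No word-boundary assertion is available to reject "$P" in "$Pack".
function lex(input) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const candidates = [
      ["metavariable", METAVARIABLE.exec(rest)],
      ["identifier", IDENTIFIER.exec(rest)],
    ].filter(([, match]) => match !== null);
    if (candidates.length === 0) {
      throw new Error(`cannot lex: ${rest}`);
    }
    // Prefer the longest match at this position.
    candidates.sort((a, b) => b[1][0].length - a[1][0].length);
    const [kind, match] = candidates[0];
    tokens.push({ kind, text: match[0] });
    rest = rest.slice(match[0].length);
  }
  return tokens;
}

console.log(lex("$Pack"));
// → [ { kind: 'metavariable', text: '$P' }, { kind: 'identifier', text: 'ack' } ]
```

Moving the decision into an external scanner avoids this, since the scanner can look at the character after the candidate metavariable before accepting it.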