Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinguish identifier from word token #114

Merged

Conversation

brandonspark
Copy link
Contributor

What

This PR simply isolates identifier into two different nonterminals, identifier and word_identifier. Both have the same content, so this does not change the meaning of the grammar.

Why

I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.

A perennial problem that has been happening for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload identifier to be able to match a Semgrep metavariable, but this presents an issue for several reasons:

  1. identifiers can appear wherever an interpolation expression can, but interpolation expressions cannot appear anywhere an identifier does. This means that overloading interpolation_expression (which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coded anywhere an interpolation_expression appears to allow an identifier, which would be many places. It's better to go to the identifier nonterminal itself.
  2. trying to make identifier possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression $Pack. This is an interpolation expression, but augmenting identifier in this way will cause it to parse as a metavariable $P, followed by an identifier ack. This persists because tree-sitter does not support regex assertions, including word boundary assertions.
  3. Another approach is to make identifier possibly a choice of an external token, which would allow us to take care of the distinction in the scanner. This would ordinarily be possible, with the exception of the fact that the identifier symbol is actually also what is known as the word token in tree-sitter ( see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction ). This is coded in the logic of tree-sitter to be unable to be a nonterminal, meaning that it is impossible to augment identifier with another nonterminal.

How:

Of these solutions, it is easiest to fix 3. This PR simply separates out identifier into word_identifier, a symbol that is meant to be used as the word token, and identifier, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" the identifier symbol and change it, without changing the word token, thus allowing identifier to be a nonterminal of an external symbol.

That's what this PR does.

Test plan:

tree-sitter test, and the grammar should be the same.

@brandonspark brandonspark changed the title Brandon/separate word identifier distinguish identifier from word token Aug 16, 2023
@brandonspark brandonspark force-pushed the brandon/separate-word-identifier branch from e546ad6 to 17c0832 Compare August 16, 2023 20:10
@aryx
Copy link
Contributor

aryx commented Aug 17, 2023

argh, I can't even review the PR, github can now show the diff ... even if I select just the diff for grammar.js ... grrrr

@brandonspark
Copy link
Contributor Author

@aryx here's a screenshot that should help:
image

@aryx
Copy link
Contributor

aryx commented Aug 17, 2023

it works now! I could see the diff.

@aryx aryx merged commit bb7e587 into tree-sitter:master Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants