distinguish `identifier` from word token #114

brandonspark · 2023-08-16T19:57:55Z

What

This PR simply isolates identifier into two different nonterminals, identifier and word_identifier. Both have the same content, so this does not change the meaning of the grammar.

Why

I am a program analysis engineer at Semgrep, and one thing we do quite often is manipulate tree-sitter grammars so that we can parse slightly modified grammars of programming languages, usually by extending certain constructs.

A perennial problem that has been happening for the Julia language is that Semgrep metavariables cannot always be put in the same place as identifiers, even though this is desirable (and intuitive) behavior for Semgrep users. Our usual method for ensuring this behavior is to overload identifier to be able to match a Semgrep metavariable, but this presents an issue for several reasons:

identifiers can appear wherever an interpolation expression can, but interpolation expressions cannot appear anywhere an identifier does. This means that overloading interpolation_expression (which we are currently doing) is not general enough. We could change the grammar, but this would require hard-coded anywhere an interpolation_expression appears to allow an identifier, which would be many places. It's better to go to the identifier nonterminal itself.
trying to make identifier possibly match a Semgrep metavariable regex can produce a partial match in the middle of such an identifier, for instance in the interpolation expression $Pack. This is an interpolation expression, but augmenting identifier in this way will cause it to parse as a metavariable $P, followed by an identifier ack. This persists because tree-sitter does not support regex assertions, including word boundary assertions.
Another approach is to make identifier possibly a choice of an external token, which would allow us to take care of the distinction in the scanner. This would ordinarily be possible, with the exception of the fact that the identifier symbol is actually also what is known as the word token in tree-sitter ( see https://tree-sitter.github.io/tree-sitter/creating-parsers#keyword-extraction ). This is coded in the logic of tree-sitter to be unable to be a nonterminal, meaning that it is impossible to augment identifier with another nonterminal.

How:

Of these solutions, it is easiest to fix 3. This PR simply separates out identifier into word_identifier, a symbol that is meant to be used as the word token, and identifier, which is set to be the same as it. Notably, however, this means that Semgrep can "hack" the identifier symbol and change it, without changing the word token, thus allowing identifier to be a nonterminal of an external symbol.

That's what this PR does.

Test plan:

tree-sitter test, and the grammar should be the same.

aryx · 2023-08-17T07:57:05Z

argh, I can't even review the PR, github can now show the diff ... even if I select just the diff for grammar.js ... grrrr

brandonspark · 2023-08-17T16:24:05Z

@aryx here's a screenshot that should help:

aryx · 2023-08-17T18:33:12Z

it works now! I could see the diff.

brandonspark changed the title ~~Brandon/separate word identifier~~ distinguish identifier from word token Aug 16, 2023

brandonspark force-pushed the brandon/separate-word-identifier branch from e546ad6 to 17c0832 Compare August 16, 2023 20:10

separate identifier from word token

1145ece

brandonspark force-pushed the brandon/separate-word-identifier branch from e1d4433 to 1145ece Compare August 16, 2023 20:18

brandonspark mentioned this pull request Aug 16, 2023

feat(julia): allow idents as metavariables semgrep/ocaml-tree-sitter-semgrep#442

Merged

1 task

aryx requested a review from savq August 17, 2023 07:34

aryx approved these changes Aug 18, 2023

View reviewed changes

aryx merged commit bb7e587 into tree-sitter:master Aug 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distinguish `identifier` from word token #114

distinguish `identifier` from word token #114

brandonspark commented Aug 16, 2023

aryx commented Aug 17, 2023

brandonspark commented Aug 17, 2023

aryx commented Aug 17, 2023

distinguish identifier from word token #114

distinguish identifier from word token #114

Conversation

brandonspark commented Aug 16, 2023

What

Why

How:

Test plan:

aryx commented Aug 17, 2023

brandonspark commented Aug 17, 2023

aryx commented Aug 17, 2023

distinguish `identifier` from word token #114

distinguish `identifier` from word token #114