Tokenize symbols and keywords into namespace and name parts #31

dannyfreeman · 2022-12-08T18:43:58Z

Attempts to address issue #21 and issue #28.

A brief overview of the changes:
kwd_lit and sym_lit will now contain up to 3 new nodes:

(kwd_lit
  ns: (kwd_ns)
  delimiter: "/"
  name: (kwd_name))

(sym_lit
  ns: (sym_ns)
  delimiter: "/"
  name: (sym_name))

Only the (kwd_name) and (sym_name) are required.
sym_lit nodes can still contain metadata nodes (this is unchanged).

NoahTheDuke · 2022-12-08T21:29:18Z

I can give this a test-run.

sogaiu · 2022-12-09T04:41:43Z

One thing that I think likely won't work all of the time is this bit.

Here some examples in the wild:

where parsing doesn't succeed.

dannyfreeman · 2022-12-09T12:49:33Z

@sogaiu I had a feeling that $._ws was going to cause an issue but wasn't sure what the issue would be.

There is an issue of precedence that comes up with the / character when trying to use it for so many things. The $._ws token there was the first (incorrect) way I found to resolve one of them.

The introduction of keyword names and namespaces further complicates things. I believe I can address it using precedence rules. I need to take some time to understand them then apply them correctly in this situation.

dannyfreeman · 2022-12-09T13:16:10Z

I've removed the $._ws, but it comes with some trade offs, see the commit message for 6880025

Edit: The only way I can think of to really solve some of the false positives like /asdf would be with a lookahead mechanism, which could express something like,

match the single character / as a symbol, if it is followed by whitespace or a closing bracket, but do not capture the following character.

But tree-sitter is not capable of that. Instead it relies on the branched parsing and precedence rules. I still plan to play with those a little more today.

sogaiu · 2022-12-09T14:06:45Z

@dannyfreeman

I tried parsing with 6880025 and 1183880 (essentially the same, right?) over the collection of Clojure source from clojars I've fetched locally.

They didn't produce any more easily detectable parse errors than the master branch (8c23e0e) does. This seems like an improvement over 8154310, which yielded > 120 files that weren't handled [1].

Regarding:

(/////) ;; 1 token per /

/asdf ;; 2 tokens: / and asdf

/asdf/hjkl ;; 2 token: / and asfd/hjkl

My memory is a bit fuzzy, but I think I didn't put much effort into handling incorrect code:

tree-sitter-clojure/grammar.js

Lines 1 to 4 in 8c23e0e

    
           // one aim is to try to parse what is correct (in the sense of 
        
           // officially supported), but also be looser in parsing additional 
        
           // things.  this is more or less in line with advice from tree-sitter 
        
           // folks.

I don't recall precisely where I got this impression, but I think maxbrunsfeld may have mentioned it some number of times. I think I found it hard enough just trying to handle what's correct (and even determining what's correct can be tricky for a spec-less language like Clojure) with what tree-sitter is capable of. Though perhaps it depends on the specifics a bit.

[1] I didn't go through all of the files that weren't handled but I think I looked at over half of them. Almost all of them were of the type with / being followed by a closing delimiter. There were two other cases I noticed. One is the file that the master branch fails on (it contains a zero byte, which tree-sitter cannot (yet?) handle) and the other is a template file which isn't really legitimate Clojure code anyway as it has sequences of characters like {{name}}.

dannyfreeman · 2022-12-09T14:32:06Z

I tried parsing with 6880025 and 1183880 (essentially the same, right?)

The later is just a comment change. I already posted a link to 6880025 so I didn't want to squash the two.

over the collection of Clojure source from clojars I've fetched locally.

I need to figure out how to do this :)

I don't recall precisely where I got this impression, but I think maxbrunsfeld may have mentioned it some number of times. I think I found it hard enough just trying to handle what's correct (and even determining what's correct can be tricky for a spec-less language like Clojure) with what tree-sitter is capable of. Though perhaps it depends on the specifics a bit.

I have that impression of tree-sitter as well: parse correct code but not necessarily worry about incorrect code. The outcome of this change, if left as is, would be that extra tokens are generated for invalid symbols beginning with a / character. I think that is rare and not something to be found in published code, linters like clj-kondo will flag them as invalid and clojure will throw errors. From a syntax-highlighting perspective, the extra tokens don't matter. They will just be highlighted as symbols.

If using treesitter for code navigation it might make things act funny on invalid code like ///. Imagine a move forward over expression command, such that | is the cursor. Executing it once in this scenario, |/// would move the cursor over one character instead of the group: /|//. But that code is is invalid so I wouldn't necessarily expect it to work perfectly anyways.

dannyfreeman · 2022-12-09T16:35:44Z

I have figured out the precedence conflicts, the latest commit 19eb4e7 should now split keywords into ns/name parts as well.

Something I'm not sure about: I've set : and :: to both be identified by the name kwd_marker. Should they be identified separately? Something like : is kwd_marker and :: is kwd_auto_resolve_marker or kwd_aliased_marker? I would still place them under the same marker: field in kwd_lit nodes.

sogaiu · 2022-12-10T01:31:02Z

@dannyfreeman More later, but regarding testing over clojars code, please check your matrix.org account.

sogaiu · 2022-12-10T05:31:21Z

@dannyfreeman I tried testing 19eb4e7. This time the checking revealed 36 unexpected files that didn't quite parse.

One issue seems to be not handling :/ appropriately. That does happen in the wild and it is accepted at the repl as a keyword AFAICT.

Another thing I noticed is stuff like :/usr/bin or :/path/from/request/to/a/file/or/folder not parsing:

These types of things are also accepted by my local clojure repl.

Don't have a whole lot to say yet regarding the : and :: point. Will try to think more on it.

sogaiu · 2022-12-10T09:11:04Z

parse correct code but not necessarily worry about incorrect code. The outcome of this change, if left as is, would be that extra tokens are generated for invalid symbols beginning with a / character. I think that is rare and not something to be found in published code, linters like clj-kondo will flag them as invalid and clojure will throw errors. From a syntax-highlighting perspective, the extra tokens don't matter. They will just be highlighted as symbols.

Yeah, although I do take hints from highlighting to notice whether I've made some kinds of mistakes, I tend to rely more on clj-kondo via things like flycheck-clj-kondo. I think I had things set up so that the mode line changed color when clj-kondo had something to report. I think that resulted in noticing problems with the code sooner than otherwise so it was much easier to figure out what I'd done wrong as the amount of "change" since the last valid state would tend to be on the smaller side (and it was more likely that I could recall what changes I might have made). Very much appreciate clj-kondo being as fast as it is.

If using treesitter for code navigation it might make things act funny on invalid code like ///. Imagine a move forward over expression command, such that | is the cursor. Executing it once in this scenario, |/// would move the cursor over one character instead of the group: /|//. But that code is is invalid so I wouldn't necessarily expect it to work perfectly anyways.

Yeah, although as a user I would prefer if things worked well even in the face of invalid code, having tried my hand at providing that sort of experience and not having succeeded that well, my current sense is that learning to pay close attention to when code becomes invalid (at least from the parsing perspective) and fixing things as soon as possible is currently more practical. May be some day things will be different though :)

dannyfreeman · 2022-12-10T17:26:22Z

One issue seems to be not handling :/ appropriately. That does happen in the wild and it is accepted at the repl as a keyword AFAICT.

Another thing I noticed is stuff like :/usr/bin or :/path/from/request/to/a/file/or/folder not parsing

This makes sense, I think I have an idea of how I can handle this now too.

Keywords I think are one of the more difficult things to deal with in Clojure, because just about any string can be made into one, but not all of them can be represented in source. One of my favorites: (name (keyword " "))

dannyfreeman · 2022-12-10T20:54:59Z

Now that I have figured out how to perform the bulk testing, 7cfbef6 seems like a good state for the parser. I'll update the test files soon to conform to the new grammar in this branch, and be sure that some of my discovered edge cases are included.

dannyfreeman · 2022-12-13T14:32:40Z

I think this is pretty close to ready, outside of the differences in tree-sitter versions that needs to be fixed.

I've added prettier to format the grammar.js file automatically.

Something else I've added in that is more in line with the rest of the grammar:

Keyword delimiters (: and ::) do not have named nodes, just fields.
The same for the / delimiters in namespaced keywords and symbols. The can be referred to by position or field name (marker and delimiter respectively).

These fields don't show up in the web-ui or in parsing output, but they are picked up by queries. Emacs also displays them weird in treesit-explor-moode, but that does not prevent them from being correctly picked up in queries there as well.

Attempts to address issue sogaiu#21 and issue sogaiu#28

Accounts for things like '(+ - * /) but creates individual symbol tokens for invalid clojure code like (/////) ;; 1 token per / /asdf ;; 2 tokens: / and asdf /asdf/hjkl ;; 2 token: / and asfd/hjkl ; Correct comment, squash this commit later

Adds runtime conflict resolution for keywords, when one keywords can match both keyword rules, treesitter will prefer using the _kwd_qualified rule over _kwd_unqualified

Also fix regex typo, missing escape char

Tree-sitter does not like parsing empty chars, which is what a null byte registers as. This fixes that by forcing the *_NAMERSPACED_NAMES to require at least one character to be picked up See tree-sitter/tree-sitter#98 This commit was squashed This is the commit message 2: Don't require at least 1 char for namespaced symbol names. This allows for symbols like `this-one/` to be parsed without error. They occur frequently in leiningen templates, which are not valid clojure files, but are still common enough to account for. Note that clojure repls will not accpet them as valid symbols. The reader throws and error This is the commit message 3 Revert "Don't require at least 1 char for namespaced symbol names." The leiningnen templates are not valid clojure code, so we won't make a special exception in the parsing for them, they can just be invalid. Parsing still works, we just get an error node in the `sym_lit` nodes when we fail to matach a `sym_name`

These can still be queried, but they don't appear in parse tree results. This makes the corpus tests a little simpler to maintain, and it's more in line with other markers in the grammar

Not really necessary in the long run, only in the context of reviewing the code

Uses emacs 29 with this .dir-locals.el ((auto-mode-alist . (("\\.js\\'" . js-mode))) (nil . ((js-indent-align-list-continuation . t) (js-indent-level . 2))))

Spun off of clojure-emacs/clojure-mode#644 in clojure-mode, this commit adds a basic readme and work in progress clojure-ts-mode. Currently syntax highlighting is working and requires tree-sitter-clojure built from this PR: sogaiu/tree-sitter-clojure#31 and emacs built from the emacs-29 branch with tree-sitter installed.

sogaiu/tree-sitter-clojure#31 was merged. It addressed two changes required for clojure-ts-mode to work properly: - Forcing tree-sitter to properly identify symbol names in symbols with metadata - Distinguishing namespace and name parts of symbols and keywords.

dannyfreeman mentioned this pull request Dec 8, 2022

Parsing drops symbols after metadata #21

Closed

dannyfreeman force-pushed the master branch 2 times, most recently from 245aa19 to 09d8068 Compare December 9, 2022 16:27

dannyfreeman force-pushed the master branch 2 times, most recently from 6d56739 to 52e2db8 Compare December 14, 2022 14:46

dannyfreeman added 10 commits December 14, 2022 09:48

POC: Tokenize symbol into namespace and name parts

ce62901

Attempts to address issue sogaiu#21 and issue sogaiu#28

Tokenize keyword in namespace and name parts

4eef107

Adds runtime conflict resolution for keywords, when one keywords can match both keyword rules, treesitter will prefer using the _kwd_qualified rule over _kwd_unqualified

Don't use an alias for $.kwd_marker, reduce duplication

319e45c

Account for weird keywords that start with /

80a63dd

Remove dyanamic precedence, no longer needed to resolve conflicts

fd5e743

Also fix regex typo, missing escape char

Remove named symbols for delimiters and keyword markers

e3940a0

These can still be queried, but they don't appear in parse tree results. This makes the corpus tests a little simpler to maintain, and it's more in line with other markers in the grammar

Fix corpus tests to match new grammar

8b85a5b

Add tests for division symbol edge cases

e4d396c

dannyfreeman force-pushed the master branch from 52e2db8 to 702d471 Compare December 14, 2022 14:51

dannyfreeman marked this pull request as ready for review December 14, 2022 14:51

dannyfreeman changed the title ~~POC: Tokenize symbol into namespace and name parts~~ Tokenize symbol into namespace and name parts Dec 14, 2022

dannyfreeman mentioned this pull request Dec 14, 2022

POC: treesitter powered clojure-mode clojure-emacs/clojure-mode#644

Closed

12 tasks

dannyfreeman force-pushed the master branch 2 times, most recently from 264545a to 7f4a6de Compare December 15, 2022 15:42

dannyfreeman added 2 commits December 15, 2022 11:30

Remove some dev comments

9729bda

Not really necessary in the long run, only in the context of reviewing the code

Reorder rules so that dependent rules come later

1744342

dannyfreeman force-pushed the master branch 2 times, most recently from a75d209 to ed35da1 Compare December 16, 2022 15:55

Formatting

d3ec295

Uses emacs 29 with this .dir-locals.el ((auto-mode-alist . (("\\.js\\'" . js-mode))) (nil . ((js-indent-align-list-continuation . t) (js-indent-level . 2))))

dannyfreeman force-pushed the master branch 2 times, most recently from c797b24 to b535e48 Compare December 18, 2022 14:41

dannyfreeman added 3 commits December 18, 2022 09:42

Define emacs local vars for code formatting in grammar.js

8824d3f

Separate rules for keywords leading with / char, inline them

d99a3ba

Rename SYMBOL_NS_DELIMITER, it is used in keywords as well

298d20e

dannyfreeman force-pushed the master branch from b535e48 to 298d20e Compare December 18, 2022 14:43

Autoindent NUMBER

a8d584e

sogaiu mentioned this pull request Dec 19, 2022

clojure syntax highlight queries #30

Closed

dannyfreeman mentioned this pull request Dec 19, 2022

Upcoming change to tree-sitter-clojure nvim-treesitter/nvim-treesitter#3998

Closed

Remove extra line

4d39170

dannyfreeman changed the title ~~Tokenize symbol into namespace and name parts~~ Tokenize symbols and keywords into namespace and name parts Dec 22, 2022

dannyfreeman mentioned this pull request Dec 27, 2022

Update tree-sitter-cli to 0.20.6 #26

Closed

dannyfreeman merged commit 7732550 into sogaiu:master Jan 5, 2023

Akeboshiwind mentioned this pull request Jan 23, 2023

Clojure: Defword on symbol finds no definition (treesitter) Olical/conjure#457

Closed

sogaiu mentioned this pull request Mar 13, 2023

How can I add keyword and namespace highlighting? #28

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tokenize symbols and keywords into namespace and name parts #31

Tokenize symbols and keywords into namespace and name parts #31

dannyfreeman commented Dec 8, 2022 •

edited

Loading

NoahTheDuke commented Dec 8, 2022

sogaiu commented Dec 9, 2022 •

edited

Loading

dannyfreeman commented Dec 9, 2022

dannyfreeman commented Dec 9, 2022 •

edited

Loading

sogaiu commented Dec 9, 2022 •

edited

Loading

dannyfreeman commented Dec 9, 2022

dannyfreeman commented Dec 9, 2022 •

edited

Loading

sogaiu commented Dec 10, 2022

sogaiu commented Dec 10, 2022 •

edited

Loading

sogaiu commented Dec 10, 2022

dannyfreeman commented Dec 10, 2022

dannyfreeman commented Dec 10, 2022

dannyfreeman commented Dec 13, 2022

Tokenize symbols and keywords into namespace and name parts #31

Tokenize symbols and keywords into namespace and name parts #31

Conversation

dannyfreeman commented Dec 8, 2022 • edited Loading

NoahTheDuke commented Dec 8, 2022

sogaiu commented Dec 9, 2022 • edited Loading

dannyfreeman commented Dec 9, 2022

dannyfreeman commented Dec 9, 2022 • edited Loading

sogaiu commented Dec 9, 2022 • edited Loading

dannyfreeman commented Dec 9, 2022

dannyfreeman commented Dec 9, 2022 • edited Loading

sogaiu commented Dec 10, 2022

sogaiu commented Dec 10, 2022 • edited Loading

sogaiu commented Dec 10, 2022

dannyfreeman commented Dec 10, 2022

dannyfreeman commented Dec 10, 2022

dannyfreeman commented Dec 13, 2022

dannyfreeman commented Dec 8, 2022 •

edited

Loading

sogaiu commented Dec 9, 2022 •

edited

Loading

dannyfreeman commented Dec 9, 2022 •

edited

Loading

sogaiu commented Dec 9, 2022 •

edited

Loading

dannyfreeman commented Dec 9, 2022 •

edited

Loading

sogaiu commented Dec 10, 2022 •

edited

Loading