Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenize symbols and keywords into namespace and name parts #31

Merged
merged 18 commits into from
Jan 5, 2023

Conversation

dannyfreeman
Copy link
Collaborator

@dannyfreeman dannyfreeman commented Dec 8, 2022

Attempts to address issue #21 and issue #28.

A brief overview of the changes:
kwd_lit and sym_lit will now contain up to 3 new nodes:

(kwd_lit
  ns: (kwd_ns)
  delimiter: "/"
  name: (kwd_name))
(sym_lit
  ns: (sym_ns)
  delimiter: "/"
  name: (sym_name))

Only the (kwd_name) and (sym_name) are required.
sym_lit nodes can still contain metadata nodes (this is unchanged).

@NoahTheDuke
Copy link

I can give this a test-run.

@sogaiu
Copy link
Owner

sogaiu commented Dec 9, 2022

@dannyfreeman
Copy link
Collaborator Author

@sogaiu I had a feeling that $._ws was going to cause an issue but wasn't sure what the issue would be.

There is an issue of precedence that comes up with the / character when trying to use it for so many things. The $._ws token there was the first (incorrect) way I found to resolve one of them.

The introduction of keyword names and namespaces further complicates things. I believe I can address it using precedence rules. I need to take some time to understand them then apply them correctly in this situation.

@dannyfreeman
Copy link
Collaborator Author

dannyfreeman commented Dec 9, 2022

I've removed the $._ws, but it comes with some trade offs, see the commit message for 6880025

Edit: The only way I can think of to really solve some of the false positives like /asdf would be with a lookahead mechanism, which could express something like,

match the single character / as a symbol, if it is followed by whitespace or a closing bracket, but do not capture the following character.

But tree-sitter is not capable of that. Instead it relies on the branched parsing and precedence rules. I still plan to play with those a little more today.

@sogaiu
Copy link
Owner

sogaiu commented Dec 9, 2022

@dannyfreeman

I tried parsing with 6880025 and 1183880 (essentially the same, right?) over the collection of Clojure source from clojars I've fetched locally.

They didn't produce any more easily detectable parse errors than the master branch (8c23e0e) does. This seems like an improvement over 8154310, which yielded > 120 files that weren't handled [1].

Regarding:

(/////) ;; 1 token per /

/asdf ;; 2 tokens: / and asdf

/asdf/hjkl ;; 2 token: / and asfd/hjkl

My memory is a bit fuzzy, but I think I didn't put much effort into handling incorrect code:

// one aim is to try to parse what is correct (in the sense of
// officially supported), but also be looser in parsing additional
// things. this is more or less in line with advice from tree-sitter
// folks.

I don't recall precisely where I got this impression, but I think maxbrunsfeld may have mentioned it some number of times. I think I found it hard enough just trying to handle what's correct (and even determining what's correct can be tricky for a spec-less language like Clojure) with what tree-sitter is capable of. Though perhaps it depends on the specifics a bit.


[1] I didn't go through all of the files that weren't handled but I think I looked at over half of them. Almost all of them were of the type with / being followed by a closing delimiter. There were two other cases I noticed. One is the file that the master branch fails on (it contains a zero byte, which tree-sitter cannot (yet?) handle) and the other is a template file which isn't really legitimate Clojure code anyway as it has sequences of characters like {{name}}.

@dannyfreeman
Copy link
Collaborator Author

I tried parsing with 6880025 and 1183880 (essentially the same, right?)

The later is just a comment change. I already posted a link to 6880025 so I didn't want to squash the two.

over the collection of Clojure source from clojars I've fetched locally.

I need to figure out how to do this :)

I don't recall precisely where I got this impression, but I think maxbrunsfeld may have mentioned it some number of times. I think I found it hard enough just trying to handle what's correct (and even determining what's correct can be tricky for a spec-less language like Clojure) with what tree-sitter is capable of. Though perhaps it depends on the specifics a bit.

I have that impression of tree-sitter as well: parse correct code but not necessarily worry about incorrect code. The outcome of this change, if left as is, would be that extra tokens are generated for invalid symbols beginning with a / character. I think that is rare and not something to be found in published code, linters like clj-kondo will flag them as invalid and clojure will throw errors. From a syntax-highlighting perspective, the extra tokens don't matter. They will just be highlighted as symbols.

If using treesitter for code navigation it might make things act funny on invalid code like ///. Imagine a move forward over expression command, such that | is the cursor. Executing it once in this scenario, |/// would move the cursor over one character instead of the group: /|//. But that code is is invalid so I wouldn't necessarily expect it to work perfectly anyways.

@dannyfreeman
Copy link
Collaborator Author

dannyfreeman commented Dec 9, 2022

I have figured out the precedence conflicts, the latest commit 19eb4e7 should now split keywords into ns/name parts as well.

Something I'm not sure about: I've set : and :: to both be identified by the name kwd_marker. Should they be identified separately? Something like : is kwd_marker and :: is kwd_auto_resolve_marker or kwd_aliased_marker? I would still place them under the same marker: field in kwd_lit nodes.

@sogaiu
Copy link
Owner

sogaiu commented Dec 10, 2022

@dannyfreeman More later, but regarding testing over clojars code, please check your matrix.org account.

@sogaiu
Copy link
Owner

sogaiu commented Dec 10, 2022

@dannyfreeman I tried testing 19eb4e7. This time the checking revealed 36 unexpected files that didn't quite parse.

One issue seems to be not handling :/ appropriately. That does happen in the wild and it is accepted at the repl as a keyword AFAICT.

Another thing I noticed is stuff like :/usr/bin or :/path/from/request/to/a/file/or/folder not parsing:

These types of things are also accepted by my local clojure repl.

Don't have a whole lot to say yet regarding the : and :: point. Will try to think more on it.

@sogaiu
Copy link
Owner

sogaiu commented Dec 10, 2022

parse correct code but not necessarily worry about incorrect code. The outcome of this change, if left as is, would be that extra tokens are generated for invalid symbols beginning with a / character. I think that is rare and not something to be found in published code, linters like clj-kondo will flag them as invalid and clojure will throw errors. From a syntax-highlighting perspective, the extra tokens don't matter. They will just be highlighted as symbols.

Yeah, although I do take hints from highlighting to notice whether I've made some kinds of mistakes, I tend to rely more on clj-kondo via things like flycheck-clj-kondo. I think I had things set up so that the mode line changed color when clj-kondo had something to report. I think that resulted in noticing problems with the code sooner than otherwise so it was much easier to figure out what I'd done wrong as the amount of "change" since the last valid state would tend to be on the smaller side (and it was more likely that I could recall what changes I might have made). Very much appreciate clj-kondo being as fast as it is.

If using treesitter for code navigation it might make things act funny on invalid code like ///. Imagine a move forward over expression command, such that | is the cursor. Executing it once in this scenario, |/// would move the cursor over one character instead of the group: /|//. But that code is is invalid so I wouldn't necessarily expect it to work perfectly anyways.

Yeah, although as a user I would prefer if things worked well even in the face of invalid code, having tried my hand at providing that sort of experience and not having succeeded that well, my current sense is that learning to pay close attention to when code becomes invalid (at least from the parsing perspective) and fixing things as soon as possible is currently more practical. May be some day things will be different though :)

@dannyfreeman
Copy link
Collaborator Author

One issue seems to be not handling :/ appropriately. That does happen in the wild and it is accepted at the repl as a keyword AFAICT.

Another thing I noticed is stuff like :/usr/bin or :/path/from/request/to/a/file/or/folder not parsing

This makes sense, I think I have an idea of how I can handle this now too.

Keywords I think are one of the more difficult things to deal with in Clojure, because just about any string can be made into one, but not all of them can be represented in source. One of my favorites: (name (keyword " "))

@dannyfreeman
Copy link
Collaborator Author

Now that I have figured out how to perform the bulk testing, 7cfbef6 seems like a good state for the parser. I'll update the test files soon to conform to the new grammar in this branch, and be sure that some of my discovered edge cases are included.

@dannyfreeman
Copy link
Collaborator Author

I think this is pretty close to ready, outside of the differences in tree-sitter versions that needs to be fixed.

I've added prettier to format the grammar.js file automatically.

Something else I've added in that is more in line with the rest of the grammar:

Keyword delimiters (: and ::) do not have named nodes, just fields.
The same for the / delimiters in namespaced keywords and symbols. The can be referred to by position or field name (marker and delimiter respectively).

These fields don't show up in the web-ui or in parsing output, but they are picked up by queries. Emacs also displays them weird in treesit-explor-moode, but that does not prevent them from being correctly picked up in queries there as well.

Accounts for things like '(+ - * /)

but creates individual symbol tokens for invalid clojure code like

(/////) ;; 1 token per /

/asdf ;; 2 tokens: / and asdf

/asdf/hjkl ;; 2 token: / and asfd/hjkl

; Correct comment, squash this commit later
Adds runtime conflict resolution for keywords, when one keywords can
match both keyword rules, treesitter will prefer using the
_kwd_qualified rule over _kwd_unqualified
Tree-sitter does not like parsing empty chars, which is what a null
byte registers as.

This fixes that by forcing the *_NAMERSPACED_NAMES to require
at least one character to be picked up

See
tree-sitter/tree-sitter#98

This commit was squashed

This is the commit message 2:

Don't require at least 1 char for namespaced symbol names.

This allows for symbols like `this-one/` to be parsed without error.
They occur frequently in leiningen templates, which are not valid
clojure files, but are still common enough to account for. Note that
clojure repls will not accpet them as valid symbols. The reader throws
and error

This is the commit message 3

Revert "Don't require at least 1 char for namespaced symbol names."

The leiningnen templates are not valid clojure code, so we won't make
a special exception in the parsing for them, they can just be invalid.
Parsing still works, we just get an error node in the `sym_lit` nodes
when we fail to matach a `sym_name`
These can still be queried, but they don't appear in parse tree
results. This makes the corpus tests a little simpler to maintain,
and it's more in line with other markers in the grammar
@dannyfreeman dannyfreeman changed the title POC: Tokenize symbol into namespace and name parts Tokenize symbol into namespace and name parts Dec 14, 2022
Not really necessary in the long run, only in the context of reviewing
the code
Uses emacs 29 with this .dir-locals.el

((auto-mode-alist . (("\\.js\\'" . js-mode)))
 (nil . ((js-indent-align-list-continuation . t)
         (js-indent-level . 2))))
dannyfreeman added a commit to clojure-emacs/clojure-ts-mode that referenced this pull request Dec 20, 2022
Spun off of clojure-emacs/clojure-mode#644
in clojure-mode, this commit adds a basic readme and work in progress
clojure-ts-mode. Currently syntax highlighting is working and requires
tree-sitter-clojure built from this PR:
sogaiu/tree-sitter-clojure#31
and emacs built from the emacs-29 branch with tree-sitter installed.
@dannyfreeman dannyfreeman changed the title Tokenize symbol into namespace and name parts Tokenize symbols and keywords into namespace and name parts Dec 22, 2022
@dannyfreeman dannyfreeman merged commit 7732550 into sogaiu:master Jan 5, 2023
dannyfreeman added a commit to clojure-emacs/clojure-ts-mode that referenced this pull request Jan 8, 2023
sogaiu/tree-sitter-clojure#31 was merged.
It addressed two changes required for clojure-ts-mode to work properly:

- Forcing tree-sitter to properly identify symbol names in symbols with
metadata
- Distinguishing namespace and name parts of symbols and keywords.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants