-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenize symbols and keywords into namespace and name parts #31
Conversation
I can give this a test-run. |
One thing that I think likely won't work all of the time is this bit. Here some examples in the wild:
where parsing doesn't succeed. |
@sogaiu I had a feeling that There is an issue of precedence that comes up with the The introduction of keyword names and namespaces further complicates things. I believe I can address it using precedence rules. I need to take some time to understand them then apply them correctly in this situation. |
I've removed the $._ws, but it comes with some trade offs, see the commit message for 6880025 Edit: The only way I can think of to really solve some of the false positives like
But tree-sitter is not capable of that. Instead it relies on the branched parsing and precedence rules. I still plan to play with those a little more today. |
I tried parsing with 6880025 and 1183880 (essentially the same, right?) over the collection of Clojure source from clojars I've fetched locally. They didn't produce any more easily detectable parse errors than the master branch (8c23e0e) does. This seems like an improvement over 8154310, which yielded > 120 files that weren't handled [1]. Regarding:
My memory is a bit fuzzy, but I think I didn't put much effort into handling incorrect code: tree-sitter-clojure/grammar.js Lines 1 to 4 in 8c23e0e
I don't recall precisely where I got this impression, but I think maxbrunsfeld may have mentioned it some number of times. I think I found it hard enough just trying to handle what's correct (and even determining what's correct can be tricky for a spec-less language like Clojure) with what tree-sitter is capable of. Though perhaps it depends on the specifics a bit. [1] I didn't go through all of the files that weren't handled but I think I looked at over half of them. Almost all of them were of the type with |
The later is just a comment change. I already posted a link to 6880025 so I didn't want to squash the two.
I need to figure out how to do this :)
I have that impression of tree-sitter as well: parse correct code but not necessarily worry about incorrect code. The outcome of this change, if left as is, would be that extra tokens are generated for invalid symbols beginning with a If using treesitter for code navigation it might make things act funny on invalid code like |
245aa19
to
09d8068
Compare
I have figured out the precedence conflicts, the latest commit 19eb4e7 should now split keywords into ns/name parts as well. Something I'm not sure about: I've set |
@dannyfreeman More later, but regarding testing over clojars code, please check your matrix.org account. |
@dannyfreeman I tried testing 19eb4e7. This time the checking revealed 36 unexpected files that didn't quite parse. One issue seems to be not handling Another thing I noticed is stuff like
These types of things are also accepted by my local clojure repl. Don't have a whole lot to say yet regarding the |
Yeah, although I do take hints from highlighting to notice whether I've made some kinds of mistakes, I tend to rely more on clj-kondo via things like flycheck-clj-kondo. I think I had things set up so that the mode line changed color when clj-kondo had something to report. I think that resulted in noticing problems with the code sooner than otherwise so it was much easier to figure out what I'd done wrong as the amount of "change" since the last valid state would tend to be on the smaller side (and it was more likely that I could recall what changes I might have made). Very much appreciate clj-kondo being as fast as it is.
Yeah, although as a user I would prefer if things worked well even in the face of invalid code, having tried my hand at providing that sort of experience and not having succeeded that well, my current sense is that learning to pay close attention to when code becomes invalid (at least from the parsing perspective) and fixing things as soon as possible is currently more practical. May be some day things will be different though :) |
This makes sense, I think I have an idea of how I can handle this now too. Keywords I think are one of the more difficult things to deal with in Clojure, because just about any string can be made into one, but not all of them can be represented in source. One of my favorites: |
Now that I have figured out how to perform the bulk testing, 7cfbef6 seems like a good state for the parser. I'll update the test files soon to conform to the new grammar in this branch, and be sure that some of my discovered edge cases are included. |
I think this is pretty close to ready, outside of the differences in tree-sitter versions that needs to be fixed. I've added prettier to format the grammar.js file automatically. Something else I've added in that is more in line with the rest of the grammar: Keyword delimiters ( These fields don't show up in the web-ui or in parsing output, but they are picked up by queries. Emacs also displays them weird in |
6d56739
to
52e2db8
Compare
Accounts for things like '(+ - * /) but creates individual symbol tokens for invalid clojure code like (/////) ;; 1 token per / /asdf ;; 2 tokens: / and asdf /asdf/hjkl ;; 2 token: / and asfd/hjkl ; Correct comment, squash this commit later
Adds runtime conflict resolution for keywords, when one keywords can match both keyword rules, treesitter will prefer using the _kwd_qualified rule over _kwd_unqualified
Also fix regex typo, missing escape char
Tree-sitter does not like parsing empty chars, which is what a null byte registers as. This fixes that by forcing the *_NAMERSPACED_NAMES to require at least one character to be picked up See tree-sitter/tree-sitter#98 This commit was squashed This is the commit message 2: Don't require at least 1 char for namespaced symbol names. This allows for symbols like `this-one/` to be parsed without error. They occur frequently in leiningen templates, which are not valid clojure files, but are still common enough to account for. Note that clojure repls will not accpet them as valid symbols. The reader throws and error This is the commit message 3 Revert "Don't require at least 1 char for namespaced symbol names." The leiningnen templates are not valid clojure code, so we won't make a special exception in the parsing for them, they can just be invalid. Parsing still works, we just get an error node in the `sym_lit` nodes when we fail to matach a `sym_name`
These can still be queried, but they don't appear in parse tree results. This makes the corpus tests a little simpler to maintain, and it's more in line with other markers in the grammar
264545a
to
7f4a6de
Compare
Not really necessary in the long run, only in the context of reviewing the code
a75d209
to
ed35da1
Compare
Uses emacs 29 with this .dir-locals.el ((auto-mode-alist . (("\\.js\\'" . js-mode))) (nil . ((js-indent-align-list-continuation . t) (js-indent-level . 2))))
c797b24
to
b535e48
Compare
Spun off of clojure-emacs/clojure-mode#644 in clojure-mode, this commit adds a basic readme and work in progress clojure-ts-mode. Currently syntax highlighting is working and requires tree-sitter-clojure built from this PR: sogaiu/tree-sitter-clojure#31 and emacs built from the emacs-29 branch with tree-sitter installed.
sogaiu/tree-sitter-clojure#31 was merged. It addressed two changes required for clojure-ts-mode to work properly: - Forcing tree-sitter to properly identify symbol names in symbols with metadata - Distinguishing namespace and name parts of symbols and keywords.
Attempts to address issue #21 and issue #28.
A brief overview of the changes:
kwd_lit
andsym_lit
will now contain up to 3 new nodes:Only the
(kwd_name)
and(sym_name)
are required.sym_lit
nodes can still contain metadata nodes (this is unchanged).