Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an string_contents node #52

Open
mauricioszabo opened this issue Jun 15, 2023 · 13 comments
Open

Add an string_contents node #52

mauricioszabo opened this issue Jun 15, 2023 · 13 comments

Comments

@mauricioszabo
Copy link

First, thanks for this great work - I like it!

So, I am working on injecting JS into cljs code like this: (js* "console.log(~{});", 10) (that's a valid ClojureScript code). This will parse as (sym_lit ... (string_lit) (num_lit)).

What I want is to inject a Javascript grammar into this string_lit, but unfortunately, this doesn't work because what will be captured by the JS grammar is "console.log(~{});", and not only the inner contents.

Is it easy/possible to break down (string_lit) into "\"" (string_contents) "\""?

@sogaiu
Copy link
Owner

sogaiu commented Jun 16, 2023

As you can see in this issue for clojure-ts-mode, I have some interest in possibly having "injections" function in some capacity.

In that issue it's about regex literals and a regex grammar so it's a bit different but it seems possible there is some overlap.

I confess I don't understand the necessity of the grammar having to change to make things work out [1], but this may be more a reflection of my ignorance than anything else (^^;

Would you mind elaborating?

@dannyfreeman Do you have any thoughts about this?


[1] For example, can't the leading and trailing double quotes be removed / ignored by the code using tree-sitter / tree-sitter-clojure (in this case may be that would be editor code)?

@mauricioszabo
Copy link
Author

@sogaiu Sure, I can elaborate!

So, injections mean "passing a tree-sitter node to another grammar and use that grammar to do highlighting". In this case, there is no "node" to pass, because there is no "internal node" to the string_lit - the full node contains the string with the quotes.

As an example on how it works on Ruby: this code:

a = <<js
console.log(#{b});
js

is highlighted as Javascript, but with the exception of #{b} (that's the interpolation char). In Ruby grammar, this is a

(heredoc_body 
  (heredoc_content) 
  (interpolation 
    (identifier)) 
  (heredoc_content) 
  (heredoc_end))

We basically add an "injection point" to the heredoc_body, but only actually pass the nodes that are heredoc_content

The problem is that Pulsar (and AFAIK, other editors) don't actually send text to injections (I think NeoVIM uses queries to define what is going to be injected, and I don't see a way to query for the inner content only).

Finally, some editors (Pulsar included) also highlight the punctuation - like, the " gets a scope like punctuation.string.double - and it's also not possible to query for the quotes like I would do with other grammars like Ruby, for example.

@sogaiu
Copy link
Owner

sogaiu commented Jun 16, 2023

Thanks for the explanation.

It may take me a bit to digest it appropriately.

I'm not confident yet but it looks like Emacs 29+ can use a specific parser on ranges of text using treesit-parser-set-included-ranges. I looked at the source code I have for that and IIUC it uses ts_parser_set_included_ranges. That looks like something that Neovim also uses, though I don't know in what way (^^;

Some thoughts I have about this at the moment are:

  1. Do existing editors really not have a way to work with ranges of text for injection?
  2. If tree-sitter-clojure is to be modified, how can this be done technically?
  3. What are the consequences to existing users?

For the first question, I presume you know Pulsar pretty well, so we have an answer for at least one editor (at least for its current capabilities). I think some more investigation / querying is needed to assess the situation for other editors (e.g. Neovim, Helix, etc.) [1]. I have some contact with some folks that are knowledgeable about Neovim (and some folks for Emacs), but not really for any other editor.

For the second question, I did a little bit of experimentation here so there is something to look at. Perhaps you might try out the following to see how it works for your situation?

diff --git a/grammar.js b/grammar.js
index 372ce71..7dd2d4c 100644
--- a/grammar.js
+++ b/grammar.js
@@ -145,13 +145,11 @@ const KEYWORD_MARK =
 const AUTO_RESOLVE_MARK =
       token("::");
 
-const STRING =
-      token(seq('"',
-                repeat(regex('[^"\\\\]')),
+const STRING_CONTENT =
+      token(seq(repeat(regex('[^"\\\\]')),
                 repeat(seq("\\",
                            regex("."),
-                           repeat(regex('[^"\\\\]')))),
-                '"'));
+                           repeat(regex('[^"\\\\]'))))));
 
 // XXX: better to match \o378 as a single item
 const OCTAL_CHAR =
@@ -333,7 +331,9 @@ module.exports = grammar({
     choice(KEYWORD_MARK, AUTO_RESOLVE_MARK),
 
     str_lit: $ =>
-    STRING,
+    seq(field('open', '"'),
+        field('value', alias(STRING_CONTENT, $.str_content)),
+        field('close', '"')),
 
     char_lit: $ =>
     CHARACTER,
@@ -421,7 +421,9 @@ module.exports = grammar({
 
     regex_lit: $ =>
     seq(field('marker', "#"),
-        STRING),
+        field('open', '"'),
+        field('value', alias(STRING_CONTENT, $.regex_content)),
+        field('close', '"')),
 
     read_cond_lit: $ =>
     seq(repeat($._metadata_lit),

I used the following command to generate a new parser.c:

tree-sitter generate --abi 13 --no-bindings

Then I tested it with:

$ tree-sitter parse string.clj 
(source [0, 0] - [1, 0]
  (str_lit [0, 0] - [0, 18]
    value: (str_content [0, 1] - [0, 17])))
$ tree-sitter.master query query-string-content.scm string.clj 
string.clj
  pattern: 0
    capture: 0 - content, start: (0, 1), end: (0, 17), text: `hi i am a string`

string.clj was:

"hi i am a string"

query-string-content.scm was:

(str_lit
  value: (str_content) @content)

For the third question, at the very least I would need to run some tests to see if the results are less accurate, whether there's much of a performance impact, etc. There is also the question of whether it would cause breakage and what that means to end users. I don't really have answers to these bits yet.


[1] I presume editors are likely to be the main users of injection capabilities, but perhaps there are other types of code that might as well...

@mauricioszabo
Copy link
Author

Sure, I'll try this out during the week and report here if it works :)

Another thing that this could be useful that I missed on my original post - Pulsar have an "expand selection" that basically selects nodes. With the current implementation, it selects the full string, quotes included, where in other grammars it first selects the contents, then the contents with the quotes. Again, something very Pulsar-specific too...

@sogaiu
Copy link
Owner

sogaiu commented Jun 18, 2023

Another thing that this could be useful that I missed on my original post - Pulsar have an "expand selection" that basically selects nodes. With the current implementation, it selects the full string, quotes included, where in other grammars it first selects the contents, then the contents with the quotes. Again, something very Pulsar-specific too...

I agree it's nicer if handling things for Clojure is more in line with other languages. However, I don't know how much this should factor into the final decision -- perhaps it's more of a bonus.

Still, I think making a change along the lines you suggested does cause the changed nodes to be structurally more similar to some other things in the grammar (e.g. this). Not sure if this is that significant but making note of it for potential future discussion.

I think improved highlighting to aid with perceiving [1] interop constructs [2] can be significant though. If it turns out that changing the grammar is what is needed to get this working well for enough folks, possibly it will be worth it, even if it causes breakage.

BTW, I noticed this construct in C's grammar. Does Pulsar's selection handling work for this in the desired manner? I think that construct is similar in form to how strings are done in tree-sitter-clojure at the moment, i.e. the delimiters are part of a single token.

Perhaps most grammars ended up not making the delimiters a part of a token for string-like constructs -- I guess that's what you've been finding while working on Pulsar :)

I know that in tree-sitter-janet-simple, I did something similar to what is done in tree-sitter-clojure for a few things, but may be that is not particularly common.

Some other grammars that took an approach similar to tree-sitter-clojure include:

But IIUC the authors of those may have studied tree-sitter-clojure initially :P

Some other things that might be doing something similar:

I didn't go through all of these, but the above were some that turned up in a brief search.


[1] I suppose that is only really true if colors make a difference for one's perception. I notice it in my own case, but I understand that's not necessarily so for everyone.

[2] If we get improved perception of regex content along with it, that seems like it's not a bad deal either.

@sogaiu
Copy link
Owner

sogaiu commented Jun 20, 2023

For Neovim it looks like there is an offset! directive:

    offset!                                      treesitter-directive-offset!

        Takes the range of the captured node and applies an offset. This will
        generate a new range object for the captured node as
        metadata[capture_id].range.

        Parameters:
            {capture_id}
            {start_row}
            {start_col}
            {end_row}
            {end_col}
        Example:

((identifier) @constant (#offset! @constant 0 1 0 -1))

(Don't know how to link to it directly, but it's the second "directive" listed here (at least at the time of this writing).)

May be this is an example of it being used:

; lit-html style template interpolation
; <a @click=${e => console.log(e)}>
; <a @click="${e => console.log(e)}">
((attribute
  (quoted_attribute_value (attribute_value) @javascript))
  (#lua-match? @javascript "%${")
  (#offset! @javascript 0 2 0 -1))
((attribute
  (attribute_value) @javascript)
  (#lua-match? @javascript "%${")
  (#offset! @javascript 0 2 0 -2))

So perhaps in nvim-treesitter / Neovim's case it is not necessary to change the grammar to support accessing the part of the string that's not the delimiters for the purposes of injection. If that is the case, may be they've succeeded in a technically feasible approach that doesn't involve getting some unspecified number of grammars to change.

I don't know what percent of grammars benefit from being able to work with injections and of those how many of them don't separate the delimiters of string-like constructs from the "interior" of the string, but I know currently there are well over 200 grammars.

I haven't seen any recommendations in the official tree-sitter docs regarding structuring nodes in one's grammar to work better with injections, may be it could be suggested as an addition. Though at this point I wonder how much good it would do.

@sogaiu
Copy link
Owner

sogaiu commented Jun 21, 2023

I didn't see anything similar to offset! among either Helix's or Lapce's files.

Cursorless had these lines, but I didn't figure out quite what they did.

@dannyfreeman
Copy link
Collaborator

I have been working a lot with javascript lately (unfortunately), and have been looking into adding "child" parsers for things like JS template strings to emacs. I found myself wanting something like this for template strings, but also realized that neovim has the "offset" mechanisms in it and I think I find that more useful than an extra named child node.

Something that might be useful is having anonymous nodes in the string for the " characters, similar to what we do for lists. That would also open up the possibility for a contents field for the inner part of the string.

Maybe at that point we might as well name the string contents node, but I'm not sure if that will affect grammar consumer more than an anonymous field for the string contents. Named node or not I believe clojure-ts-mode in emacs will require updates to accommodate a change like this (which is fine)

@sogaiu
Copy link
Owner

sogaiu commented Aug 25, 2023

Thanks a lot for taking a look and commenting!

I hope to wrap my head around the commit you mentioned elsewhere soon -- may be it will be helpful in informing what to do for this issue.

@sogaiu
Copy link
Owner

sogaiu commented Aug 25, 2023

Something that might be useful is having anonymous nodes in the string for the " characters, similar to what we do for lists. That would also open up the possibility for a contents field for the inner part of the string.

That's close to this diff from above, right?

diff --git a/grammar.js b/grammar.js
index 372ce71..7dd2d4c 100644
--- a/grammar.js
+++ b/grammar.js
@@ -145,13 +145,11 @@ const KEYWORD_MARK =
 const AUTO_RESOLVE_MARK =
       token("::");
 
-const STRING =
-      token(seq('"',
-                repeat(regex('[^"\\\\]')),
+const STRING_CONTENT =
+      token(seq(repeat(regex('[^"\\\\]')),
                 repeat(seq("\\",
                            regex("."),
-                           repeat(regex('[^"\\\\]')))),
-                '"'));
+                           repeat(regex('[^"\\\\]'))))));
 
 // XXX: better to match \o378 as a single item
 const OCTAL_CHAR =
@@ -333,7 +331,9 @@ module.exports = grammar({
     choice(KEYWORD_MARK, AUTO_RESOLVE_MARK),
 
     str_lit: $ =>
-    STRING,
+    seq(field('open', '"'),
+        field('value', alias(STRING_CONTENT, $.str_content)),
+        field('close', '"')),
 
     char_lit: $ =>
     CHARACTER,
@@ -421,7 +421,9 @@ module.exports = grammar({
 
     regex_lit: $ =>
     seq(field('marker', "#"),
-        STRING),
+        field('open', '"'),
+        field('value', alias(STRING_CONTENT, $.regex_content)),
+        field('close', '"')),
 
     read_cond_lit: $ =>
     seq(repeat($._metadata_lit),

Maybe at that point we might as well name the string contents node, but I'm not sure if that will affect grammar consumer more than an anonymous field for the string contents.

I sketched out the diff above with an aim toward consistency with what I perceived to be with what's being done in the rest of the grammar (could be missing something important though) [1].

I didn't think about how it might affect a grammar consumer (^^; If you have further thoughts on that I'd be interested in hearing them. May be the above changes can be tried out in a branch with clojure-ts-mode?


[1] Hmm, I didn't think about empty string content...I wonder if that could be an issue with tree-sitter. There are some empty string related issues, but I'm not clear on what those are. Here is a PR that looks like it might be headed toward preventing empty string tokens...I wonder if making the value field optional could be part of a solution if there is really a problem...

@dannyfreeman
Copy link
Collaborator

Yeah that diff is along the lines of what I was thinking. I will have to try out clojure-ts-mode with those changes soon and see what that does. If the field is optional I would think that solves the empty string part but I can't be sure. I will try something out in the next couple of days here and report back.

@dannyfreeman
Copy link
Collaborator

Emacs 30 will have support for offset ranges now and the ability to easily define nested parsers with an offset. emacs-mirror/emacs@b892da5

From an Emacs perspective this isn't an issue anymore. As noted above neovim also solves for this.

@sogaiu
Copy link
Owner

sogaiu commented Sep 21, 2023

Thanks for the update!

Perhaps some other editors will choose to offer similar functionality down the line. I think there have been some discussions about trying to share some query info among editors and predicates (like #offset!) seem like they'd be related to that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants