Cannot parse the lines as I want #3291

yunusey · 2023-01-23T22:09:13Z

Hello everyone,

I am writing a parser for taking notes in Neovim. You can check out my repository. I am currently working on different kinds of text: so far I have normal text, bold text, and underlined text. Bold text is written between '<' and '>', so the format is "<bold text>". That part works. However, when you write '<' as a plain symbol, it is parsed as though it starts bold text. My shortened code is:

// grammar.js
module.exports = grammar({
  name: 'cp',

  extras: $ => [' '],
  rules: $ => $.identifier,
  
  rules: {

    source_file: $ => repeat($._definition),

    _definition: $ => choice(
      $.definition,
      $.normal_definition
    ),

    starter: $ => seq(
      repeat($.number),
      choice(
        "#",
        // Other starters
      )
    ),

    _symbol: $ => choice(
      "#",
      // Other symbols...
    ),

    symbol: $ => prec(1, choice('<', '>', $._symbol)),

    definition: $ => prec(2, seq(
      optional("\n"),
      $.starter,
      repeat($.text)
    )),

    normal_definition: $ => prec(1, seq(
      '\n',
      repeat('\t'),
    )),

    text: $ => prec(3, choice(
      $.bold_text,
      $.identifier,
      $.expression,
      $.underlined_text,
    )),

    expression: $ => prec(5, seq(
      $.text,
      $.symbol
    )),

    identifier: $ => prec(4, choice(
      /[a-zA-Z0-9_]\w*/
    )),

    number: $ => choice(
      /[0-9]/
    ),

    bold_identifier: $ => prec(6, choice(
      $.identifier,
      ' ',
      $.bsymbol,
    )),

    bsymbol: $ => prec(6, $._symbol),

    bold_symbol_s: $ => prec(6, '<'),
    bold_symbol_e: $ => prec(6, '>'),

    bold_text: $ => prec(6, seq(
      $.bold_symbol_s,
      repeat($.bold_identifier),
      $.bold_symbol_e,
      optional(' ')
    )),

    underlined_identifier: $ => choice(
      $.identifier,
      ' ',
      $.symbol
    ),

    underline_symbol: $ => ':',

    underlined_text: $ => seq(
      $.underline_symbol,
      repeat($.underlined_identifier),
      $.underline_symbol,
    ),
  }
});

Example file to parse:

# This is something :something cool: <This is a bold text>

The output of tree-sitter parse example.txt:

(source_file [0, 0] - [1, 0]
  (definition [0, 0] - [0, 58]
    (starter [0, 0] - [0, 1])
    (text [0, 2] - [0, 6]
      (identifier [0, 2] - [0, 6]))
    (text [0, 7] - [0, 9]
      (identifier [0, 7] - [0, 9]))
    (text [0, 10] - [0, 19]
      (identifier [0, 10] - [0, 19]))
    (text [0, 20] - [0, 38]
      (expression [0, 20] - [0, 38]
        (text [0, 20] - [0, 36]
          (underlined_text [0, 20] - [0, 36]
            (underline_symbol [0, 20] - [0, 21])
            (underlined_identifier [0, 21] - [0, 30]
              (identifier [0, 21] - [0, 30]))
            (underlined_identifier [0, 30] - [0, 31])
            (underlined_identifier [0, 31] - [0, 35]
              (identifier [0, 31] - [0, 35]))
            (underline_symbol [0, 35] - [0, 36])))
        (symbol [0, 37] - [0, 38])))
    (text [0, 38] - [0, 42]
      (identifier [0, 38] - [0, 42]))
    (text [0, 43] - [0, 45]
      (identifier [0, 43] - [0, 45]))
    (text [0, 46] - [0, 47]
      (identifier [0, 46] - [0, 47]))
    (text [0, 48] - [0, 52]
      (identifier [0, 48] - [0, 52]))
    (text [0, 53] - [0, 58]
      (expression [0, 53] - [0, 58]
        (text [0, 53] - [0, 57]
          (identifier [0, 53] - [0, 57]))
        (symbol [0, 57] - [0, 58]))))
  (normal_definition [0, 58] - [1, 0]))

I tried everything to make sure that bold_text has the highest precedence. If I am using the prec([number], rule) function incorrectly, please correct me.

Conclusion:
Basically, I want to use '<' and '>' both as plain symbols and as the delimiters of bold_text. If there is anything you can suggest, I would like to hear it.

Thank you!

ahelwer · 2023-01-23T22:38:03Z

lexical precedence

yunusey · 2023-01-23T22:52:31Z

> lexical precedence

Thank you for your answer! I still did not understand what I should do. Can you be more specific, please?

ahelwer · 2023-01-23T23:20:12Z

https://tree-sitter.github.io/tree-sitter/creating-parsers
ctrl+f lexical precedence
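
For readers following along: that section of the documentation describes how the lexer picks a token when several could match at the same position. The sketch below is a simplified model in plain JavaScript, not tree-sitter's implementation. It only models two of the documented tie-breakers (explicit precedence given inside token(), then match length), and the rule names, regexes, and precedence values are assumptions made for illustration.

```javascript
// Simplified model of token selection, NOT tree-sitter's actual code.
// Assumption: two hypothetical tokens compete at the same position.
const candidates = [
  { name: 'symbol_lt', pattern: /^</, prec: 0 },
  // Hypothetical single-token bold text: '<', anything but '<', '>', newline, then '>'.
  { name: 'bold_text', pattern: /^<[^<>\n]*>/, prec: 1 },
];

// Pick the matching rule with the highest precedence; break ties by match length.
function pickToken(input, rules) {
  let best = null;
  for (const rule of rules) {
    const m = rule.pattern.exec(input);
    if (!m) continue;
    const cand = { name: rule.name, text: m[0], prec: rule.prec };
    if (
      best === null ||
      cand.prec > best.prec ||
      (cand.prec === best.prec && cand.text.length > best.text.length)
    ) {
      best = cand;
    }
  }
  return best;
}

console.log(pickToken('<bold> rest', candidates).name); // bold_text
console.log(pickToken('< rest', candidates).name);      // symbol_lt
```

In a real grammar.js the equivalent knob is a precedence wrapped inside token(), which only helps when the whole bold span is a single token.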

yunusey · 2023-01-23T23:36:16Z

> https://tree-sitter.github.io/tree-sitter/creating-parsers ctrl+f lexical precedence

Thank you so much, I had tried that after your first message but hadn't figured it out. Now, I did!

yunusey · 2023-01-24T01:53:29Z

Another issue has now shown up. As I mentioned before, this is a parser for a note-taking plugin, so '<', the identifiers, and '>' should be declared as separate rules. However, since I used token(), I can no longer use $.bold_symbol_s. Although the parser finds where the bold_text nodes are, I cannot highlight '<' and '>' differently from the text inside. I tried this query:

(bold_text
  "<" @comment
  ">" @comment) @kojl.boldtext

It gave me this error:

Caused by:
    Query error at 12:3. Impossible pattern:
      "<" @comment

So, basically, this kind of query does not work with a token rule either: when I wrote the rule without token(), the query did what it should. My question is: is there a way to highlight specific parts of a token rule, or is there a way to use other rules inside token()?

ahelwer · 2023-01-24T13:51:39Z

What do your bold_text and related rules look like now?

yunusey · 2023-01-25T00:12:34Z

So by using token:

bold_text: $ => {
  const identifier = /[a-zA-Z0-9_]\w*/
  const space = ' '
  const bsymbol = choice(
    "#",
    "*",
    "-",
    "$",
    "~",
    ",",
    ".",
    "%",
    ';',
    '+',
    '`',
    '!',
    '@',
    '^',
    '&',
    '(',
    ')',
    '{',
    '}',
    '[',
    ']',
    '\\',
    '|',
    '/',
    '?',
    '\'',
    '=',
    '"',
  )
  return token(seq(
    "<",
    repeat(
      choice(
        identifier,
        space,
        bsymbol
      )
    ),
    ">",
    // optional(space)
  ))
},

However, since I want to highlight "<" and ">" differently from the text inside, I need them as separate rules.
Before switching to token(), I had this:

bold_text: $ => prec.left(1, seq(
  $.bold_symbol_s,
  repeat(choice(
    $.bold_identifier
  )),
  $.bold_symbol_e
)),

However, as soon as I type '<', it is identified as bold_text, which I don't want. I want it to be identified as bold_text only if there is a '>' after all the identifiers. Since this will run in Neovim, which re-parses the buffer continuously, this is really important to me.
I have one more question: do you think I should write every rule as a function? If so, is there a way to parse text using the function names?
Thanks in advance!

ahelwer · 2023-01-25T01:56:20Z

Okay, so your grammar is not LR(1). You need a substantial amount of lookahead until you can determine whether the < token is a simple < or a bold start token. What language is this, did you develop it yourself? Usually in language design you have to keep an eye on ease of parsing so make choices accordingly. For example, what if a user uses a regular <, types some text, then uses a regular > later on? Any time you see a < you will have to read to the end of the file to determine whether there is a corresponding >.

I figure there are two ways to do it, either you use dynamic precedence to trigger GLR parsing or you use an external scanner.
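
To make the lookahead point concrete, here is a minimal sketch in plain JavaScript of the scan that has to happen at every '<'. It is not tree-sitter code and not an external scanner (those are written in C); the function names and the rules "bold text closes on the same line" and "a second '<' cancels the first" are assumptions made for illustration.

```javascript
// Hypothetical helper: return the index of the closing '>' if the '<' at
// `start` opens bold text, or -1 if it should be treated as a plain symbol.
// Assumptions: bold text is single-line and cannot contain another '<'.
function findBoldEnd(line, start) {
  if (line[start] !== '<') return -1;
  for (let i = start + 1; i < line.length; i++) {
    if (line[i] === '>') return i;                      // closing delimiter found
    if (line[i] === '<' || line[i] === '\n') return -1; // give up: plain '<'
  }
  return -1; // ran out of input without seeing '>'
}

// Split a line into text, symbol, and bold_text pieces using the scan above.
function classify(line) {
  const out = [];
  let i = 0;
  while (i < line.length) {
    const end = findBoldEnd(line, i);
    if (end !== -1) {
      out.push({ type: 'bold_text', text: line.slice(i, end + 1) });
      i = end + 1;
    } else if (line[i] === '<' || line[i] === '>') {
      out.push({ type: 'symbol', text: line[i] });
      i += 1;
    } else {
      // Collect plain text up to the next '<' or '>'.
      let j = i;
      while (j < line.length && line[j] !== '<' && line[j] !== '>') j++;
      out.push({ type: 'text', text: line.slice(i, j) });
      i = j;
    }
  }
  return out;
}

console.log(classify('1 < 2 and <bold> text').map(t => t.type));
```

The inner loop in findBoldEnd is unbounded in the worst case, which is exactly what a single token of lookahead cannot express. Note that under these assumed rules an input such as `a < b > c` would still be classified as bold text, which is the ambiguity described above.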

yunusey · 2023-01-25T02:52:11Z

> Okay, so your grammar is not LR(1). You need a substantial amount of lookahead until you can determine whether the < token is a simple < or a bold start token. What language is this, did you develop it yourself? Usually in language design you have to keep an eye on ease of parsing so make choices accordingly. For example, what if a user uses a regular <, types some text, then uses a regular > later on? Any time you see a < you will have to read to the end of the file to determine whether there is a corresponding >.
>
> I figure there are two ways to do it, either you use dynamic precedence to trigger GLR parsing or you use an external scanner.

Oh, I got you. I'll look into what I can do about it, thanks! Yes, this is a language I am developing. I assumed that when I run my parser through nvim-treesitter, it re-parses the document on every change. I think my problem is, as you mentioned, related to dynamic precedence. Although I read the documentation on the website, I didn't pay much attention to that part. I will take another look at it.

Thank you so much, you helped a lot!

ahelwer · 2023-01-25T04:53:22Z

You can take inspiration from markdown, which uses the * character to denote bold and italic text. If a user wants to write just * they need to escape it (like I did up above) with \*.
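
As a rough illustration of why escaping makes parsing easier, the sketch below tokenizes a line under an assumed rule: `\<` and `\>` are literal symbols, and any unescaped '<' or '>' is a bold delimiter. This is plain JavaScript, not a tree-sitter grammar, and the token names are hypothetical.

```javascript
// Hypothetical tokenizer: with an escape rule, each decision needs only the
// current character and the one after it, never a scan to the end of the line.
function tokenize(line) {
  const tokens = [];
  let text = '';
  // Emit any pending plain text as one token.
  const flush = () => {
    if (text) {
      tokens.push({ type: 'text', text });
      text = '';
    }
  };
  let i = 0;
  while (i < line.length) {
    const ch = line[i];
    const next = line[i + 1];
    if (ch === '\\' && (next === '<' || next === '>')) {
      flush();
      tokens.push({ type: 'symbol', text: next }); // escaped: literal symbol
      i += 2;
    } else if (ch === '<') {
      flush();
      tokens.push({ type: 'bold_start', text: ch });
      i += 1;
    } else if (ch === '>') {
      flush();
      tokens.push({ type: 'bold_end', text: ch });
      i += 1;
    } else {
      text += ch;
      i += 1;
    }
  }
  flush();
  return tokens;
}

console.log(tokenize('1 \\< 2 <bold>').map(t => t.type));
```

The trade-off is on the user's side: a plain '<' now has to be typed as `\<`, but the delimiter tokens stay separate and can be highlighted on their own.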

yunusey · 2023-01-25T05:11:19Z

> You can take inspiration from markdown, which uses the * character to denote bold and italic text. If a user wants to write just * they need to escape it (like I did up above) with \*.

I had actually taken a look at it, but I'll check it again to see how it uses these precedence functions. Thank you again!
