External lexer for `extras` comment #884

haxscramper · 2021-01-18T19:32:33Z

haxscramper
Jan 18, 2021

I'm trying out different features of tree-sitter and am having trouble with extras, specifically when they are used in conjunction with externals. I have a simple grammar:

module.exports = grammar({
    name: "simple_4",
    externals: $ => [$.COMMENT],
    extras: $ => [$.COMMENT, /[\s\n]/],
    rules: {
        main: $ => repeat($.stmt),
        stmt: $ => /[0-9]/
    }
});

It corresponds to a language that consists of single-digit integer literals, with optional comments of the form {some comment text}. Comments are parsed using an external lexer, defined as follows:

bool tree_sitter_simple_4_external_scanner_scan(
    void*       payload,
    TSLexer*    lexer,
    const bool* valid_symbols) {
    printf(
        "Lex at [%c]\n",
        (lexer->lookahead == '\n' ? 'n' : lexer->lookahead));

    if (valid_symbols[COMMENT] && lexer->lookahead == '{') {
        printf("  At comment start\n");
        while (lexer->lookahead != '}' && lexer->lookahead != '\0') {
            lexer->advance(lexer, false);
        }
        lexer->advance(lexer, false);
        lexer->mark_end(lexer);
        lexer->result_symbol = COMMENT;
        return true;
    }
    return false;
}

Full source code

#include <stdio.h>
#include <tree_sitter/parser.h>


enum Tok
{
    COMMENT
};

void* tree_sitter_simple_4_external_scanner_create() {
    return NULL;
}

bool tree_sitter_simple_4_external_scanner_scan(
    void*       payload,
    TSLexer*    lexer,
    const bool* valid_symbols) {
    printf(
        "Lex at [%c]\n",
        (lexer->lookahead == '\n' ? 'n' : lexer->lookahead));

    if (valid_symbols[COMMENT] && lexer->lookahead == '{') {
        printf("  At comment start\n");
        while (lexer->lookahead != '}' && lexer->lookahead != '\0') {
            lexer->advance(lexer, false);
        }
        lexer->advance(lexer, false);
        lexer->mark_end(lexer);
        lexer->result_symbol = COMMENT;
        return true;
    }
    return false;
}

unsigned tree_sitter_simple_4_external_scanner_serialize(
    void* payload,
    char* buffer) {
    return 0;
}

void tree_sitter_simple_4_external_scanner_deserialize(
    void*       payload,
    const char* buffer,
    unsigned    length) {
}

void tree_sitter_simple_4_external_scanner_destroy(void* payload) {
}

The parser works as expected for inputs like 1 1 1 1 or 1{}, correctly detecting the extra comment node and ignoring whitespace. But when I add a space after the 1 (e.g. 1 {}), the lexer fails.

To be more specific: I added a printf to show which character the scanner starts at, and for 1 {} the comment's opening character { is never shown, meaning lexing is never attempted from {.

I'm not entirely sure, but I might have misunderstood something about how extras rules interact, or about custom scanner implementations, so sorry in advance if this is an obvious mistake on my end.

$ tree-sitter --version
tree-sitter 0.17.3

Answered by maxbrunsfeld

Jan 22, 2021


carueda · 2021-01-22T02:14:39Z

carueda
Jan 22, 2021

Not an expert at all here; I just started exploring Tree-Sitter myself, but I have some quick suggestions:

- Name your enum COMMENT (the tool is likely expecting that, per the documented example).
- The body of your while under that if will never be executed, according to the conditions involved.

(Again, just quick reactions; I haven't looked at the overall logic in any detail at all, but I hope that helps.)


haxscramper · 2021-01-22T09:24:26Z

haxscramper
Jan 22, 2021
Author

I updated the grammar to use the uppercase COMMENT name, though it didn't change anything. The while loop works as expected, capturing the whole {test} body when executed (including the delimiting {}). I originally left out the output of the test run to save space, but here is what parsing 1{test} yields:

Lex at [1]
Lex at [{]
  At comment start
Lex at [n]

(main [0, 0] - [1, 0]
  (stmt [0, 0] - [0, 1])
  (COMMENT [0, 1] - [0, 7]))

The AST is correct as far as I can tell, and the comment is correctly recognized. And here is the output for 1 {test} (with a space):

Lex at [1]
Lex at [ ]
Lex at [ ]
Lex at [t]
Lex at [e]
Lex at [s]
Lex at [t]
Lex at [}]
Lex at [n]
Lex at [ ]
Lex at [t]
Lex at [e]
Lex at [s]
Lex at [t]
Lex at [}]
Lex at [n]
Lex at [n]
Lex at [n]

(main [0, 0] - [1, 0]
  (stmt [0, 0] - [0, 1])
  (ERROR [0, 2] - [0, 8]
    (ERROR [0, 2] - [0, 8])))

I'm also including the output of tree-sitter -d. I still can't figure out what the issue is, but maybe someone can spot it.

new_parse
process version:0, version_count:1, state:1, row:1, col:0
lex_external state:1, row:1, column:0
Lex at [1]
lex_internal state:0, row:1, column:0
  consume character:'1'
lexed_lookahead sym:stmt, size:1
shift state:2
process version:0, version_count:1, state:2, row:1, col:1
lex_external state:1, row:1, column:1
Lex at [ ]
lex_internal state:0, row:1, column:1
  skip character:' '
lex_external state:1, row:1, column:1
Lex at [ ]
lex_internal state:0, row:1, column:1
  skip character:' '
skip_unrecognized_character
  consume character:'{'
lex_external state:1, row:1, column:3
Lex at [t]
lex_internal state:0, row:1, column:3
  consume character:'t'
lex_external state:1, row:1, column:4
Lex at [e]
lex_internal state:0, row:1, column:4
  consume character:'e'
lex_external state:1, row:1, column:5
Lex at [s]
lex_internal state:0, row:1, column:5
  consume character:'s'
lex_external state:1, row:1, column:6
Lex at [t]
lex_internal state:0, row:1, column:6
  consume character:'t'
lex_external state:1, row:1, column:7
Lex at [}]
lex_internal state:0, row:1, column:7
  consume character:'}'
lex_external state:1, row:1, column:8
Lex at [n]
lex_internal state:0, row:1, column:8
  skip character:10
lexed_lookahead sym:ERROR, size:7, character:'{'
detect_error
resume version:0
process version:0, version_count:1, state:0, row:1, col:1
lex_external state:1, row:1, column:1
Lex at [ ]
lex_internal state:0, row:1, column:1
  skip character:' '
skip_unrecognized_character
  consume character:'{'
lex_external state:1, row:1, column:3
Lex at [t]
lex_internal state:0, row:1, column:3
  consume character:'t'
lex_external state:1, row:1, column:4
Lex at [e]
lex_internal state:0, row:1, column:4
  consume character:'e'
lex_external state:1, row:1, column:5
Lex at [s]
lex_internal state:0, row:1, column:5
  consume character:'s'
lex_external state:1, row:1, column:6
Lex at [t]
lex_internal state:0, row:1, column:6
  consume character:'t'
lex_external state:1, row:1, column:7
Lex at [}]
lex_internal state:0, row:1, column:7
  consume character:'}'
lex_external state:1, row:1, column:8
Lex at [n]
lex_internal state:0, row:1, column:8
  skip character:10
lexed_lookahead sym:ERROR, size:7, character:'{'
skip_token symbol:ERROR
process version:0, version_count:1, state:0, row:1, col:8
lex_external state:1, row:1, column:8
Lex at [n]
lex_internal state:0, row:1, column:8
  skip character:10
lexed_lookahead sym:end, size:1
recover_to_previous state:2, depth:2
recover_eof
process version:1, version_count:2, state:2, row:1, col:8
lex_external state:1, row:1, column:8
Lex at [n]
lex_internal state:0, row:1, column:8
  skip character:10
lexed_lookahead sym:end, size:1
reduce sym:main, child_count:1
accept
select_smaller_error symbol:main, over_symbol:ERROR
done

1 reply

carueda Jan 22, 2021

Oops, I misread the if/while conditions.

maxbrunsfeld · 2021-01-22T17:07:56Z

maxbrunsfeld
Jan 22, 2021
Maintainer

The names in the token enum don’t matter; the order corresponds to the ordering of the externals array.
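
To illustrate the point about ordering, here is a minimal sketch. It assumes a grammar whose externals array is [$.COMMENT, $.SPACE]; the enum and function names below are hypothetical and deliberately differ from the grammar's names to show that only the positions are matched up.

```c
#include <stdbool.h>

/* Hypothetical names: only the positions matter.
 * Index 0 pairs with the first entry of `externals` ($.COMMENT),
 * index 1 with the second entry ($.SPACE). */
enum TokenType {
    TOK_BRACE_COMMENT, /* externals[0] */
    TOK_BLANK          /* externals[1] */
};

/* valid_symbols is indexed by position in the externals array, so this
 * reports whether the parser would accept a comment at this point. */
static bool comment_is_valid(const bool *valid_symbols) {
    return valid_symbols[TOK_BRACE_COMMENT];
}
```

Reordering the enum without reordering externals (or the other way round) would make the scanner report the wrong token, which is why the two lists are best kept side by side.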

I think your external scanner just needs to handle white space. White space is not automatically skipped, because sometimes it plays a role in scanners’ token selection logic.
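
A rough sketch of what "handling white space" could look like follows. It does not use the real TSLexer; MockLexer and the mock_* helpers are hypothetical stand-ins so the logic can be compiled and tried in isolation. In an actual scanner the corresponding calls would be lexer->advance(lexer, true) for skipped characters and lexer->advance(lexer, false) for characters that belong to the token. Using iswspace from the standard <wctype.h> header (rather than comparing against ' ' only) is an assumption made here so that tabs and newlines are skipped as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/* Stand-in for the parts of TSLexer used here; NOT the real tree-sitter type. */
typedef struct {
    const char *input;  /* NUL-terminated source text */
    size_t pos;         /* index of the current lookahead character */
    size_t token_start; /* moves forward whenever a character is skipped */
    size_t token_end;   /* set by mock_mark_end */
} MockLexer;

static int32_t mock_lookahead(const MockLexer *lx) {
    return (unsigned char)lx->input[lx->pos];
}

/* Mirrors advance(lexer, skip): skip == true leaves the character out of the token. */
static void mock_advance(MockLexer *lx, bool skip) {
    if (lx->input[lx->pos] == '\0') return; /* never move past the end of input */
    lx->pos++;
    if (skip) lx->token_start = lx->pos;
}

static void mock_mark_end(MockLexer *lx) {
    lx->token_end = lx->pos;
}

/* Skips leading whitespace, then scans a {...} comment if one starts here. */
static bool scan_comment(MockLexer *lx) {
    while (iswspace((wint_t)mock_lookahead(lx))) {
        mock_advance(lx, true);
    }
    if (mock_lookahead(lx) != '{') return false;
    while (mock_lookahead(lx) != '}' && mock_lookahead(lx) != '\0') {
        mock_advance(lx, false);
    }
    mock_advance(lx, false); /* consume the closing brace */
    mock_mark_end(lx);
    return true;
}
```

Without the first loop, a call that starts on a space returns false immediately, which matches the Lex at [ ] lines in the debug output above: the scanner is invoked, but it gives up before it ever sees the { character.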


haxscramper · 2021-01-22T17:45:08Z

haxscramper
Jan 22, 2021
Author

@maxbrunsfeld yes, that solved the main issue, but now the $.SPACE token is present in the generated parse tree. If I understand correctly, that is because it is now represented by the named rule SPACE, and I'm not sure how to make an external token into an anonymous node.

Updated grammar

module.exports = grammar({
    name: "simple_4",
    externals: $ => [$.COMMENT, $.SPACE],
    extras: $ => [$.COMMENT, $.SPACE, /\n/],
    rules: {
        main: $ => repeat($.stmt),
        stmt: $ => /[0-9]/
    }
});

And the lexer code:

bool tree_sitter_simple_4_external_scanner_scan(
    void*       payload,
    TSLexer*    lexer,
    const bool* valid_symbols) {
    if (valid_symbols[COMMENT] && lexer->lookahead == '{') {
        while (lexer->lookahead != '}' && lexer->lookahead != '\0') {
            lexer->advance(lexer, false);
        }

        lexer->advance(lexer, false);
        lexer->mark_end(lexer);
        lexer->result_symbol = COMMENT;
        return true;
    } else if (valid_symbols[SPACE] && lexer->lookahead == ' ') {
        lexer->advance(lexer, true);
        lexer->mark_end(lexer);
        lexer->result_symbol = SPACE;
        return true;
    } else {
        return false;
    }
}

For 1 {test} the following tree is generated:

(main [0, 0] - [1, 0]
  (stmt [0, 0] - [0, 1])
  (SPACE [0, 2] - [0, 2])
  (COMMENT [0, 2] - [0, 8]))

In this particular case SPACE is not an issue, but for larger inputs it would be problematic, even if I compacted multiple spaces into one token (left out to simplify the example).

8 replies

haxscramper Jan 22, 2021
Author

Yes, that's what I was doing in the original grammar (extras: $ => [$.COMMENT, /[\s\n]/]), but then the scanner never even starts on the comment character.

maxbrunsfeld Jan 22, 2021
Maintainer

If externals and extras both contain COMMENT, then the external scanner should run for every single token (because comments are always valid).

haxscramper Jan 22, 2021
Author

Yes, it should, but it doesn't; see the output I added in this comment. There is no Lex at [{] anywhere in the second output (parsing 1 {test}), while it is present in the first one (when I remove the space from the input: 1{test}).

maxbrunsfeld Jan 25, 2021
Maintainer

Yes, it does, as you can see in the output:

Lex at [ ]

There is just a space before the { character. As I said in the original answer, your scanner function just isn't handling whitespace.

haxscramper Jan 25, 2021
Author

Oh, yes, I misunderstood your suggestion about skipping over characters. Adding a while loop to unconditionally skip all whitespace in the lexer solved the issue; 1 {test} is now correctly recognized.

haxscramper · 2021-01-25T18:57:30Z

haxscramper
Jan 25, 2021
Author

For the sake of completeness, here is the full main scanner function with the edits:

bool tree_sitter_simple_4_external_scanner_scan(
    void*       payload,
    TSLexer*    lexer,
    const bool* valid_symbols) {
+    while (lexer->lookahead == ' ') {
+        lexer->advance(lexer, true);
+    }

    if (valid_symbols[COMMENT] && lexer->lookahead == '{') {
        while (lexer->lookahead != '}' && lexer->lookahead != '\0') {
            lexer->advance(lexer, false);
        }

        lexer->advance(lexer, false);
        lexer->mark_end(lexer);
        lexer->result_symbol = COMMENT;
        return true;
    }
    return false;
}

The original grammar and the rest of the code are left unchanged.

