Skip to content

External scanners should be able to produce tokens without calling advance #316

Closed
@movermeyer

Description

@movermeyer

Introduction

I was trying to make a parser for YAML, by implementing the grammar expressed by libyaml (the same grammar is also implemented by other YAML parsers as well):

Notice that there is the rule:
block_sequence ::= BLOCK-SEQUENCE-START (BLOCK-ENTRY block_node?)* BLOCK-END

BLOCK-SEQUENCE-START is a zero-width token that is generated by scanners when the indentation increases. It is important to make the grammar unambiguous.

I tried to create an external scanner that produced this token. However, no matter what I did, tree-sitter parse would always return an ERROR node.
I eventually realized that this was because external scanners cannot return a token without first calling advance.


Example

This is a trivial grammar that should match a file containing a single dash using an external scanner:

grammar.js

module.exports = grammar({
  name: 'example',

  externals: $ => [
    $.ZERO_WIDTH_TOKEN,
  ],

  rules: {
    document: $ => seq($.ZERO_WIDTH_TOKEN, $.dash),
    dash: $ => '-'
  }
});

src/scanner.cc

#include <tree_sitter/parser.h>

namespace {

  enum TokenType {
    ZERO_WIDTH_TOKEN
  };

  struct Scanner {
    Scanner() {}

    unsigned serialize(char *buffer) {
      return 0;
    }

    void deserialize(const char *buffer, unsigned length) {
    }

    bool scan(TSLexer *lexer, const bool *valid_symbols) {
      lexer->result_symbol = ZERO_WIDTH_TOKEN;
      return true;
    }
  };

}

extern "C" {
  void *tree_sitter_example_external_scanner_create() {
    return new Scanner();
  }

  bool tree_sitter_example_external_scanner_scan(void *payload, TSLexer *lexer,
                                              const bool *valid_symbols) {
    Scanner *scanner = static_cast<Scanner *>(payload);
    return scanner->scan(lexer, valid_symbols);
  }

  unsigned tree_sitter_example_external_scanner_serialize(void *payload, char *buffer) {
    Scanner *scanner = static_cast<Scanner *>(payload);
    return scanner->serialize(buffer);
  }

  void tree_sitter_example_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
    Scanner *scanner = static_cast<Scanner *>(payload);
    scanner->deserialize(buffer, length);
  }

  void tree_sitter_example_external_scanner_destroy(void *payload) {
    Scanner *scanner = static_cast<Scanner *>(payload);
    delete scanner;
  }
}
$ tree-sitter generate && tree-sitter parse --debug file_with_dash.txt
new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_external state:1, row:0, column:0
lex_internal state:0, row:0, column:0
  consume character:'-'
lexed_lookahead sym:dash, size:1
detect_error
resume version:0
process version:0, version_count:1, state:0, row:0, col:0
skip_token symbol:dash
process version:0, version_count:1, state:0, row:0, col:1
lex_external state:1, row:0, column:1
lex_internal state:0, row:0, column:1
  skip character:10
lexed_lookahead sym:end, size:1
recover_eof
done
(ERROR [0, 0] - [1, 0]
  (dash [0, 0] - [0, 1]))
file_with_dash.txt	0 ms	(ERROR [0, 0] - [1, 0])

I would have expected this to match fine.

Especially since it is possible to match zero-width rules if you don't use an external scanner:

grammar.js

module.exports = grammar({
  name: 'example',

  rules: {
    document: $ => seq($.zero_width_token, $.dash),
    zero_width_token: $ => '',
    dash: $ => '-'
  }
});
$ tree-sitter generate && tree-sitter parse --debug file_with_dash.txt
new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_internal state:3, row:0, column:0
lexed_lookahead sym:zero_width_token, size:0
shift state:2
process version:0, version_count:1, state:2, row:0, col:0
lex_internal state:0, row:0, column:0
  consume character:'-'
lexed_lookahead sym:dash, size:1
shift state:4
process version:0, version_count:1, state:4, row:0, col:1
lex_internal state:0, row:0, column:1
  skip character:10
lexed_lookahead sym:end, size:1
reduce sym:document, child_count:2
accept
done
(document [0, 0] - [1, 0]
  (zero_width_token [0, 0] - [0, 0])
  (dash [0, 0] - [0, 1]))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions