Description
Introduction
I was trying to write a parser for YAML by implementing the grammar expressed by libyaml (the same grammar is implemented by several other YAML parsers as well):
Notice that it contains the rule:
block_sequence ::= BLOCK-SEQUENCE-START (BLOCK-ENTRY block_node?)* BLOCK-END
BLOCK-SEQUENCE-START is a zero-width token that the scanner generates when the indentation increases. It is important for keeping the grammar unambiguous.
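For example, in a document like the following (an illustrative snippet of my own, not taken from the grammar), the scanner emits BLOCK-SEQUENCE-START just before the first "-", where the indentation increases relative to the enclosing mapping key:
key:
  - first
  - second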
I tried to create an external scanner that produces this token. However, no matter what I did, tree-sitter parse would always return an ERROR node. I eventually realized that this was because external scanners cannot return a token without first calling advance.
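For contrast, external tokens that consume at least one character work as expected. Here is a minimal sketch of such a scan function (NEWLINE_TOKEN and the decision to consume a newline are hypothetical, not part of the reproduction below):
// Hypothetical variant: lexer->advance() is called before returning,
// so the resulting external token has non-zero width.
bool scan(TSLexer *lexer, const bool *valid_symbols) {
  if (valid_symbols[NEWLINE_TOKEN] && lexer->lookahead == '\n') {
    lexer->advance(lexer, false);          // consume the newline character
    lexer->result_symbol = NEWLINE_TOKEN;  // token now spans one character
    return true;
  }
  return false;
}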
Example
Here is a trivial grammar, using an external scanner, that should match a file containing a single dash:
grammar.js
module.exports = grammar({
  name: 'example',

  externals: $ => [
    $.ZERO_WIDTH_TOKEN,
  ],

  rules: {
    document: $ => seq($.ZERO_WIDTH_TOKEN, $.dash),
    dash: $ => '-'
  }
});
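(Note that the externals array must list the tokens in the same order as the TokenType enum in the scanner, so ZERO_WIDTH_TOKEN corresponds to enum value 0 below.)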
src/scanner.cc
#include <tree_sitter/parser.h>

namespace {

enum TokenType {
  ZERO_WIDTH_TOKEN
};

struct Scanner {
  Scanner() {}

  // No state to persist between scans.
  unsigned serialize(char *buffer) {
    return 0;
  }

  void deserialize(const char *buffer, unsigned length) {
  }

  bool scan(TSLexer *lexer, const bool *valid_symbols) {
    // Return a zero-width token: result_symbol is set and true is returned
    // without ever calling lexer->advance().
    lexer->result_symbol = ZERO_WIDTH_TOKEN;
    return true;
  }
};

}

extern "C" {

void *tree_sitter_example_external_scanner_create() {
  return new Scanner();
}

bool tree_sitter_example_external_scanner_scan(void *payload, TSLexer *lexer,
                                               const bool *valid_symbols) {
  Scanner *scanner = static_cast<Scanner *>(payload);
  return scanner->scan(lexer, valid_symbols);
}

unsigned tree_sitter_example_external_scanner_serialize(void *payload, char *buffer) {
  Scanner *scanner = static_cast<Scanner *>(payload);
  return scanner->serialize(buffer);
}

void tree_sitter_example_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  Scanner *scanner = static_cast<Scanner *>(payload);
  scanner->deserialize(buffer, length);
}

void tree_sitter_example_external_scanner_destroy(void *payload) {
  Scanner *scanner = static_cast<Scanner *>(payload);
  delete scanner;
}

}
$ tree-sitter generate && tree-sitter parse --debug file_with_dash.txt
new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_external state:1, row:0, column:0
lex_internal state:0, row:0, column:0
consume character:'-'
lexed_lookahead sym:dash, size:1
detect_error
resume version:0
process version:0, version_count:1, state:0, row:0, col:0
skip_token symbol:dash
process version:0, version_count:1, state:0, row:0, col:1
lex_external state:1, row:0, column:1
lex_internal state:0, row:0, column:1
skip character:10
lexed_lookahead sym:end, size:1
recover_eof
done
(ERROR [0, 0] - [1, 0]
(dash [0, 0] - [0, 1]))
file_with_dash.txt 0 ms (ERROR [0, 0] - [1, 0])
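Note in the debug output that lex_external runs first, but the lookahead that comes out is the internal lexer's dash token rather than ZERO_WIDTH_TOKEN, and the parser immediately goes into error recovery (detect_error, recover_eof).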
I would have expected this to parse without an error, especially since it is possible to match zero-width rules when no external scanner is involved:
grammar.js
module.exports = grammar({
  name: 'example',

  rules: {
    document: $ => seq($.zero_width_token, $.dash),
    zero_width_token: $ => '',
    dash: $ => '-'
  }
});
$ tree-sitter generate && tree-sitter parse --debug file_with_dash.txt
new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_internal state:3, row:0, column:0
lexed_lookahead sym:zero_width_token, size:0
shift state:2
process version:0, version_count:1, state:2, row:0, col:0
lex_internal state:0, row:0, column:0
consume character:'-'
lexed_lookahead sym:dash, size:1
shift state:4
process version:0, version_count:1, state:4, row:0, col:1
lex_internal state:0, row:0, column:1
skip character:10
lexed_lookahead sym:end, size:1
reduce sym:document, child_count:2
accept
done
(document [0, 0] - [1, 0]
(zero_width_token [0, 0] - [0, 0])
(dash [0, 0] - [0, 1]))