Skip to content
This repository has been archived by the owner on Feb 6, 2019. It is now read-only.

tree-sitter crashes because of U+FEFF in test file #49

Open
th-we opened this issue Nov 20, 2018 · 3 comments
Open

tree-sitter crashes because of U+FEFF in test file #49

th-we opened this issue Nov 20, 2018 · 3 comments

Comments

@th-we
Copy link

th-we commented Nov 20, 2018

When running tree-sitter test on this repo, the command will hang and eat more and more memory until it crashes.

In the test case, there is a "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) character before the test code. When removing it, the test passes.

I also tried with tree-sitter-cli compiled from the git master branch instead of from the version published on npmjs.com, but results were identical.

@th-we
Copy link
Author

th-we commented Nov 20, 2018

Actually, it turns out that while the U+FEFF is a hard to spot problem, it doesn't have to be something this obscure. I added another test case, which also shows the problem:

============================================
Crashing test
============================================

{}

---

(data)

@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Nov 20, 2018

I haven't had a chance yet to clone this locally and reproduce, but I think you're probably hitting tree-sitter/tree-sitter#98. I just now remembered to change the title of that issue, since it was originally pretty obscure.

I noticed that there's what looks to be an empty string token in your grammar: https://github.com/th-we/tree-sitter-crash-demo/blob/master/grammar.js#L7. Is that intentional? The usual way to write that construct with Tree-sitter would be:

data: $ => $._quoted_value,
_quoted_value: $ => seq('"', optional($.value), '"'),
value: $ => /a[^"]*/

That way, you wouldn't get nodes of zero size.

@th-we
Copy link
Author

th-we commented Nov 21, 2018

While it's true that with optional() the problem does not occur, the tree would change, i.e. there would be no value node for the empty value (""). I used to include the quotes in the value nodes, but those nodes should be the injection point for another grammar.

I see the following workarounds:

  • Include the quotes in the injected grammar. But that would be a bit like including <script> tags in the JavaScript grammar.
  • Introduce a special emtpy_value node i.e. like
    data: $ => choice($._quoted_value, $.empty_value),
    _quoted_value: $ => seq('"', optional($.value), '"'),
    value: $ => /a[^"]*/,
    emtpy_value: $ => '""'
    
    It's inconsistent to include the quotes in one case and exclude them in another, but in the grammar cson file it can at least be made sure that the quotes are always highlighted the same.

There might be more tree-sitterish ways to solve this.

And by the way: Many thanks for this work of yours. It gives me basic linting for an obscure language with a number of pitfalls, some of which I can catch using tree-sitter now.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants