Comment node includes trailing \r #36

Closed
AndreasArvidsson opened this issue Aug 10, 2023 · 13 comments · Fixed by #45

Comments

@AndreasArvidsson
Contributor

When using CRLF line endings, comments include a trailing \r:

# hello
foo: "bar"

node.text: # hello\r

@wenkokke
Owner

This is another scanner bug, unfortunately.

@AndreasArvidsson
Contributor Author

Could you elaborate on that? I'm not quite sure what the scanner means in this context.

This is not the first bug where we've had leading or trailing whitespace on a node. Would it be worth doing a unit test that checks for leading and/or trailing whitespace?

@wenkokke
Owner

Scanner means the code that does the lexing; see scanner.cc. It's a bunch of C++ code that implements a custom lexer for TalonScript, and it's where you need to handle any features of the language that are tricky to express as grammars—e.g., indentation sensitivity or lookahead.

@wenkokke
Owner

Would it be worth doing a unit test that checks for leading and/or trailing whitespace?

I'd be happy to accept a PR with such tests.

@pokey
Contributor

pokey commented Aug 12, 2023

Can you not just tweak the comment regex?

comment: ($) => token(/#.*?/),

@wenkokke
Owner

I'm not sure what purpose that serves, because afaik comment tokens are lexed by the scanner. I guess you could try replacing . by [^\r\n]?
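The effect of swapping `.` for `[^\r\n]` can be sketched with plain JavaScript regexes. This is only an illustration, not the grammar itself: it assumes the grammar's `.` stops only at `\n` (which is what the reported `# hello\r` suggests), so `[^\n]` stands in for it here because JavaScript's own `.` already excludes `\r`. The variable names are made up for the example.

```javascript
// Stand-ins for the two patterns. `[^\n]` mimics a `.` that only stops at \n
// (an assumption inferred from the reported node text, not confirmed here).
const currentPattern = /#[^\n]*/;
// The suggested tweak: also stop at \r.
const proposedPattern = /#[^\r\n]*/;

// The example from the issue, written with CRLF line endings.
const source = '# hello\r\nfoo: "bar"\r\n';

const currentMatch = source.match(currentPattern)[0];
const proposedMatch = source.match(proposedPattern)[0];

console.log(JSON.stringify(currentMatch)); // "# hello\r"
console.log(JSON.stringify(proposedMatch)); // "# hello"
```

With LF-only input both patterns match the same text, so the change should only affect CRLF files.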

@pokey
Contributor

pokey commented Aug 12, 2023

Yeah I was thinking something like that

@pokey
Contributor

pokey commented Nov 7, 2023

So is this fixed by #42?

@wolfmanstout
Contributor

@pokey I'm not sure ... @AndreasArvidsson can you retest this?

I considered adding a unit test for this but it's not easy to capture using the built-in tree-sitter testing system, which doesn't include tests for node contents. I think we'd need to set up a separate unit test, e.g. using the Node.js API -- I'm sure this is easy but I'm just not very familiar with Node.js so it wasn't trivial for me.

@AndreasArvidsson
Contributor Author

@wolfmanstout The problem is still there, but slightly changed: node.text is now "# hello\r\n".

A unit test for this should definitely be doable with Node.js.
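A rough sketch of what such a node-content check could look like follows. The helper name `findCommentsWithTrailingLineBreak` is hypothetical, and it relies only on the `type`, `text`, and `children` properties that Node.js tree-sitter nodes expose. To keep the sketch runnable without the native bindings, it is exercised against a hand-built stand-in tree; the node type names `source_file` and `assignment` in that stand-in are placeholders, not taken from the grammar.

```javascript
// Collect the text of every comment node that ends in a line-break character.
function findCommentsWithTrailingLineBreak(rootNode) {
  const offenders = [];
  const stack = [rootNode];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.type === "comment" && /[\r\n]$/.test(node.text)) {
      offenders.push(node.text);
    }
    stack.push(...node.children);
  }
  return offenders;
}

// Hand-built stand-in for a parsed tree (placeholder node types).
// With the real parser, the root would instead come from something like
// parser.parse('# hello\r\nfoo: "bar"\r\n').rootNode.
const fakeRoot = {
  type: "source_file",
  text: '# hello\r\nfoo: "bar"\r\n',
  children: [
    { type: "comment", text: "# hello\r", children: [] },
    { type: "assignment", text: 'foo: "bar"', children: [] },
  ],
};

// A passing test would expect this list to be empty.
console.log(findCommentsWithTrailingLineBreak(fakeRoot));
```

Run against CRLF input, a check like this would have flagged both the original `# hello\r` and the later `# hello\r\n`.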

@wolfmanstout
Contributor

FWIW, @wenkokke's suggestion above would probably work. Although comments are declared as an external, they are still parsed by that regex. This follows the Python implementation pattern. I guess there is some subtle difference, assuming Python doesn't have the same behavior.

@wolfmanstout
Contributor

Okay, I have a draft of a fix out:
#45

Before this is merged, I want to point out that the Python tree-sitter grammar has the exact same behavior (I tested it). Should Cursorless just be robust to this instead?

@wenkokke
Owner

wenkokke commented Nov 25, 2023

Before this is merged, I want to point out that the Python tree-sitter grammar has the exact same behavior (I tested it). Should Cursorless just be robust to this instead?

I based the scanner and grammar on the Python grammar, so they might actually welcome your changes there as well.

Might be less "this behavior is endemic" and more "that's where I copied it from".

That said, probably makes sense for Cursorless to be robust to this.
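On the consumer side, being robust to this could be as small as trimming line-break characters from the end of a comment's text before using it. The sketch below is an assumption about how that might look, not Cursorless's actual code; `stripTrailingLineBreak` is a hypothetical helper.

```javascript
// Remove any run of \r or \n characters at the very end of the text.
// Spaces and tabs are left alone so other trailing content is not hidden.
function stripTrailingLineBreak(text) {
  return text.replace(/[\r\n]+$/, "");
}

console.log(JSON.stringify(stripTrailingLineBreak("# hello\r"))); // "# hello"
console.log(JSON.stringify(stripTrailingLineBreak("# hello\r\n"))); // "# hello"
console.log(JSON.stringify(stripTrailingLineBreak("# hello"))); // "# hello"
```

This handles both variants reported in the thread and leaves already-clean text unchanged.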
