Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite the grammar #29

Merged
merged 30 commits into from
Mar 2, 2021
Merged

rewrite the grammar #29

merged 30 commits into from
Mar 2, 2021

Conversation

tek
Copy link
Contributor

@tek tek commented Feb 26, 2021

hello 👋 I rewrote the grammar and it's working quite nicely.
There's still some stuff to do, but at this point I'm opening this PR to get some advice on one specific problem that seems impossible to me.

The issue is preprocessor macros, as in:

f = do
  do a
#ifdef foo
  a
#else
    a <- a
  a
#endif

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again.
I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

@rewinfrey @maxbrunsfeld @philipturnbull

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

As I understand tree-sitter's behavior there is no such option for manual controlling the parser's state. The parser makes this automatically and there only one option to correctly save the external scanner state through serialization and restore through deserialization. In the deserialization moment the external scanner gets the buffer with a state to which the parser decided to make the restore.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

hm, bummer. I looked at how the C grammar handles #ifdefs, but it's not apparent to me that there is a similar situation.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

I naively imagine functionality in tree-sitter.h where I can call get_current_state and receive the state number, then later call shift(old_state) and it will magically continue parsing 😇 guess there's more to it than that

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

@tek regarding this PR contents:

$ git fetch origin refs/pull/29/head
...................................................................
 * branch            refs/pull/29/head -> FETCH_HEAD

$ git log --stat FETCH_HEAD
commit 5b7947c562b9b2b5d844c04351eacc331a6bbb86

    structure grammar into multiple files

 grammar.js          |   1313 +-
 grammar/basic.js    |     98 +
 grammar/class.js    |     99 +
 grammar/data.js     |    147 +
 grammar/decl.js     |     81 +
 grammar/exp.js      |    241 +
 grammar/id.js       |    126 +
 grammar/import.js   |     30 +
 grammar/misc.js     |     12 +
 grammar/module.js   |     37 +
 grammar/pat.js      |     95 +
 grammar/pattern.js  |     32 +
 grammar/type.js     |    201 +
 grammar/util.js     |     66 +
 src/grammar.json    |   7391 +++--
 src/node-types.json |     17 +-
 src/parser.c        | 298845 +++++++++++++++++++---------------------------------
 17 files changed, 154611 insertions(+), 154220 deletions(-)

commit 62d7ebf592a9491d4d44273fceb1390b85d148b1 (refs/pull/29/head)

    rewrite the grammar

...................................................................

  63 files changed, 296573 insertions(+), 484667 deletions(-)

With such amount of changes it's too hard to understand what is an adaptation of the repo to the current tree-sitter version and what is improvements of the grammar. Also splitting the grammar.js to several files in my opinion is controversial change.
For better chances to successful approval of changes to this repo by maintainers I'd suggest to split the work into several PRs:

  1. Remove tree-sitter generated stuff from the repo, there are plans to do this in the tree-sitter repo, issue 930:

    Cleanup - Remove generated files from all the grammar repos in the tree-sitter org

  2. Adapt the repo to the current tree-sitter version, move corpus to test/corpus, etc.
  3. Improvements to the grammar.js and related files.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

@ahlinc it's a complete rewrite, I wouldn't know how to split this according to your suggestion.

  1. got it!
  2. since there is no overlap with the original tests, how would that look?

Why is it controversial to split the files?

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again.
I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

@tek I'm not Haskell expert so if you can give here a minimal correct version of a Haskell program with such preprocessor construction, may be I'll been able to help you with a way how to handle your preprocessor issue.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

{-# language CPP #-}

f :: IO Int
f = do
  _ <- do
    pure ()
#ifdef foo
  pure 1
#else
    pure ()
  pure 1
#endif

@ahlinc this should compile!

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

2. since there is no overlap with the original tests, how would that look?

Didn't know that, it's hard to understand in the huge PR.

Why is it controversial to split the files?

It's just my opinion, I didn't see such splits in any tree-sitter grammar repos and it's easier for me to navigate in a one file by a word search instead of several files. Also it seems that grammar.js files looks too tricky for IDEs and I don't know any that would completely help with the navigation in them. Use of Typescript for tree-sitter grammars I didn't tried yet.

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

@tek thanks for the program snippet, I'll take a look today or this weekend.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

I just now split the files after working on the project for a while, because it was quite unergonomic to navigate with your suggested word search, since lots of rules have strings like exp and pat in them. I'm not married to the idea, if it creates problems I'll change it back.

Why do you suspect this will be problematic with IDEs?

What about Typescript? Are you suggesting to use it?

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

@tek thanks for the program snippet, I'll take a look today or this weekend.

awesome, thanks!

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

At least coc-tsserver can navigate to identifiers in the project (though since $ is used everywhere, that's very few)

@ahlinc
Copy link

ahlinc commented Feb 26, 2021

Why do you suspect this will be problematic with IDEs?

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

What about Typescript? Are you suggesting to use it?

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

@tek
Copy link
Contributor Author

tek commented Feb 26, 2021

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

Is this a general problem or does it relate to this project, specifically the file separation?

In the former case, I can't imagine how an IDE would be able to navigate to rules since they are universally referenced as attributes of the $ object, which is unknown in the project due to being part of tree-sitter-cli's machinery.

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

So you would give the $ a type? Could you make that refer to top level attributes?

@tek
Copy link
Contributor Author

tek commented Feb 27, 2021

What do I need to configure for tree-sitter test finding the language for a highlights test? I placed a haskell file in test/highlight and added this to package.json:

  "tree-sitter": {
    "scope": "source.hs",
    "file-types": [
      "hs"
    ],
    "highlights": [
      "queries/highlights.scm"
    ],
    "injection-regex": "^(hs|haskell)$"
  }

but it says No language found for path.

@maxbrunsfeld
Copy link
Contributor

It might have to do with finding the parser repo itself? https://tree-sitter.github.io/tree-sitter/syntax-highlighting#paths

This UX could probably be made smoother.

@tek
Copy link
Contributor Author

tek commented Feb 27, 2021

@maxbrunsfeld I have

  "parser-directories": [
    "/home/tek/code/tek/js"
  ],

in my config, and the repo path is /home/tek/code/tek/js/tree-sitter-haskell. that should be correct, no?

@ahlinc
Copy link

ahlinc commented Feb 28, 2021

About processing preprocessor directives

I was able to break mentioned https://github.com/tree-sitter/tree-sitter-c C language parser with a simple C program where a directive splits function call rule. @tek this looks the same type of problem that you are trying to solve in this Haskell parser.
Example:

#include <stdio.h>

#define FOO

int main(int argc, char const *argv[])
{
#ifdef FOO
    printf
#endif
    ("Hello world!\n");
    return 0;
}

With current tree-sitter abilities it looks for me impossible to implement correct tree-sitter parser in a one tree-sitter's language defined by a one grammar.js file that would be able to process all possible combinations of rules and their splits by preprocessor rules. Even languages like C/C++ don't process preprocessor directives in the one parser, they splits such processing into two independent phases. The C/C++ preporcessor has one important feature that it decides that preprocessor directive in everything till the end of line and when preprocessor replaces condition directives it replaces them with \n so it's impossible to make split of the grammar token what simplifies situation a bit, but I can imagine a language where some preprocessor directives may split anything, any token that consist more than a one character.

What I can suggest with current tree-sitter abilities is to:

  1. Consider such Haskell code as multi language code: https://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents
  2. Don't try to parse preprocessor directives in a one tree-sitter language definition. It seems impossible for now to write correct parser implementation and external scanners can't help in any way.
  3. Implement preprocessor directives processing in a separate parser like this implemented in real C/C++ compilers.
  4. Separate preprocessor parser may help to make decision what code blocks are valid for some program state or detect code blocks and produce their combinations that may be parsed separately.

All above is just my opinion and understanding how tree-sitter works and what limitations it has for now.
Also I believe that it's possible to improve tree-sitter so it will be able to parse even such code with preprocessor constructions.

@tek
Copy link
Contributor Author

tek commented Feb 28, 2021

I had the same impression about the C grammar. Thanks for your suggestion with the multi-language feature, I'll try to implement something with that.

Howevery, I'd suggest to not make this part of this PR.

@maxbrunsfeld do you think it would be feasible to add functionality to the scanner API (or grammar) with which a state shift could be transparently enforced in order to reset to a previous preprocessor directive? Any other additional information that might help?

@maxbrunsfeld
Copy link
Contributor

I think we’re always going to have to treat the preprocessor in an approximate way. It’s ok IMO if certain preprocessor patterns cause parse errors. Hopefully the errors will be recoverable and we’ll get useful results the vast majority of the time.

@tek
Copy link
Contributor Author

tek commented Feb 28, 2021

@maxbrunsfeld so if the parser would execute a shift from an external stimulus, would it corrupt its overall state, or the parse tree?

@tek
Copy link
Contributor Author

tek commented Feb 28, 2021

those named precedences are quite the game changer!

@tek
Copy link
Contributor Author

tek commented Feb 28, 2021

question: what is the general policy when one rule can be substituted for another, but is more restrictive?
Like in this case, the simpletype rule can be parsed by the btype rule, but not all parts of the latter are legal in the positions where the former is used. The parse tree is identical, so is it praxis to use the more general rule, to avoid conflicts and be more permissive? or is it better to map the language precisely?

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

@maxbrunsfeld I added that when I was dealing with the preprocessor directives. Since the indentations would have to be reset on an #else, I just pushed a copy onto the stack on an #if. But I only realized afterwards that the same would have to be done with the external parser state. Thanks for reminding me, this can now be reverted!

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

@maxbrunsfeld @patrickt invisible tokens are inlined, double vector is removed, I renamed lots of user-facing nodes and all tests green. if you're satisfied, please merge!

@maxbrunsfeld maxbrunsfeld merged commit 267da70 into tree-sitter:master Mar 2, 2021
@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

🚀

@maxbrunsfeld
Copy link
Contributor

Just FYI, I've been doing squash merges on these grammar repos lately, since they contain generated files, to avoid the repo size growing too fast.

Thanks for the awesome work @tek!

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

makes sense. it's been a pleasure!

@tek tek deleted the rewrite branch March 2, 2021 20:30
@patrickt
Copy link
Contributor

patrickt commented Mar 2, 2021

Huge thanks, @tek! This is a real step forward for the Haskell ecosystem at large, since this is (I think) the only working GHC Haskell parser outside of GHC itself!

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

omg 😂

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

@maxbrunsfeld @patrickt Am I supposed to be committing further changes to master?

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

thanks!

@rewinfrey
Copy link
Member

Thank you so much @tek for the amazing rewrite! This is huge for the Haskell community, wonderful work 👏 ❤️

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

@rewinfrey very kind, thank you! ❤️

@patrickt
Copy link
Contributor

patrickt commented Mar 2, 2021

@tek Re. master: it’s your call. No one’s consuming this repository as of yet, so I don’t see any huge problem with pushing small fixes directly to master. Bigger features etc. are nice to have as PRs.

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

@patrickt sure thing, I was mainly asking whether I'm permitted!

@patrickt
Copy link
Contributor

patrickt commented Mar 2, 2021

Yup! I’ve given you maintainer privileges, so you should be able to do most things. Give me a shout if you need anything.

@tek
Copy link
Contributor Author

tek commented Mar 2, 2021

will do, thanks!

@felixroos
Copy link

Hello everyone, I have just found this thread after unsuccessfully trying to use the https://www.npmjs.com/package/tree-sitter-haskell package, which had its latest publish 5 years ago. Would it be possible to publish a new version containing the changes in this PR? Or would this involve additional work that is specific to the npm package?

@tek
Copy link
Contributor Author

tek commented Dec 21, 2023

@maxbrunsfeld you wanna add my account to that package's maintainers?

@felixroos
Copy link

@tek thanks!

For anyone landing here: There is also a prebuilt wasm file in this repo. It can be re-built via tree-sitter build-wasm .
When I do it on my machine it is 4,5MB, while the prebuilt version is 3MB, so it might not be the newest version?

FYI I built this visualizer: https://felixroos.github.io/haskell-tree-sitter-playground/

@tek
Copy link
Contributor Author

tek commented Dec 30, 2023

very nice!

@lancejpollard
Copy link

@felixroos So as per the linked issue above, can we somehow npm install this package these days? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants