rewrite the grammar #29

tek · 2021-02-26T02:13:41Z

hello 👋 I rewrote the grammar and it's working quite nicely.
There's still some stuff to do, but at this point I'm opening this PR to get some advice on one specific problem that seems impossible to me.

The issue is preprocessor macros, as in:

f = do
  do a
#ifdef foo
  a
#else
    a <- a
  a
#endif

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again.
I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

tek · 2021-02-26T12:32:33Z

@rewinfrey @maxbrunsfeld @philipturnbull

ahlinc · 2021-02-26T12:35:44Z

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

As I understand tree-sitter's behavior there is no such option for manual controlling the parser's state. The parser makes this automatically and there only one option to correctly save the external scanner state through serialization and restore through deserialization. In the deserialization moment the external scanner gets the buffer with a state to which the parser decided to make the restore.

tek · 2021-02-26T12:40:38Z

hm, bummer. I looked at how the C grammar handles #ifdefs, but it's not apparent to me that there is a similar situation.

tek · 2021-02-26T12:45:52Z

I naively imagine functionality in tree-sitter.h where I can call get_current_state and receive the state number, then later call shift(old_state) and it will magically continue parsing 😇 guess there's more to it than that

ahlinc · 2021-02-26T13:24:57Z

@tek regarding this PR contents:

$ git fetch origin refs/pull/29/head
...................................................................
 * branch            refs/pull/29/head -> FETCH_HEAD

$ git log --stat FETCH_HEAD
commit 5b7947c562b9b2b5d844c04351eacc331a6bbb86

    structure grammar into multiple files

 grammar.js          |   1313 +-
 grammar/basic.js    |     98 +
 grammar/class.js    |     99 +
 grammar/data.js     |    147 +
 grammar/decl.js     |     81 +
 grammar/exp.js      |    241 +
 grammar/id.js       |    126 +
 grammar/import.js   |     30 +
 grammar/misc.js     |     12 +
 grammar/module.js   |     37 +
 grammar/pat.js      |     95 +
 grammar/pattern.js  |     32 +
 grammar/type.js     |    201 +
 grammar/util.js     |     66 +
 src/grammar.json    |   7391 +++--
 src/node-types.json |     17 +-
 src/parser.c        | 298845 +++++++++++++++++++---------------------------------
 17 files changed, 154611 insertions(+), 154220 deletions(-)

commit 62d7ebf592a9491d4d44273fceb1390b85d148b1 (refs/pull/29/head)

    rewrite the grammar

...................................................................

  63 files changed, 296573 insertions(+), 484667 deletions(-)

With such amount of changes it's too hard to understand what is an adaptation of the repo to the current tree-sitter version and what is improvements of the grammar. Also splitting the grammar.js to several files in my opinion is controversial change.
For better chances to successful approval of changes to this repo by maintainers I'd suggest to split the work into several PRs:

Remove tree-sitter generated stuff from the repo, there are plans to do this in the tree-sitter repo, issue 930:

Cleanup - Remove generated files from all the grammar repos in the tree-sitter org
Adapt the repo to the current tree-sitter version, move corpus to test/corpus, etc.
Improvements to the grammar.js and related files.

tek · 2021-02-26T13:30:20Z

@ahlinc it's a complete rewrite, I wouldn't know how to split this according to your suggestion.

got it!
since there is no overlap with the original tests, how would that look?

Why is it controversial to split the files?

ahlinc · 2021-02-26T13:32:13Z

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again.
I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

@tek I'm not Haskell expert so if you can give here a minimal correct version of a Haskell program with such preprocessor construction, may be I'll been able to help you with a way how to handle your preprocessor issue.

tek · 2021-02-26T13:36:24Z

{-# language CPP #-}

f :: IO Int
f = do
  _ <- do
    pure ()
#ifdef foo
  pure 1
#else
    pure ()
  pure 1
#endif

@ahlinc this should compile!

ahlinc · 2021-02-26T13:42:35Z

2. since there is no overlap with the original tests, how would that look?

Didn't know that, it's hard to understand in the huge PR.

Why is it controversial to split the files?

It's just my opinion, I didn't see such splits in any tree-sitter grammar repos and it's easier for me to navigate in a one file by a word search instead of several files. Also it seems that grammar.js files looks too tricky for IDEs and I don't know any that would completely help with the navigation in them. Use of Typescript for tree-sitter grammars I didn't tried yet.

ahlinc · 2021-02-26T13:45:16Z

@tek thanks for the program snippet, I'll take a look today or this weekend.

tek · 2021-02-26T13:49:02Z

I just now split the files after working on the project for a while, because it was quite unergonomic to navigate with your suggested word search, since lots of rules have strings like exp and pat in them. I'm not married to the idea, if it creates problems I'll change it back.

Why do you suspect this will be problematic with IDEs?

What about Typescript? Are you suggesting to use it?

tek · 2021-02-26T13:49:14Z

@tek thanks for the program snippet, I'll take a look today or this weekend.

awesome, thanks!

tek · 2021-02-26T13:56:23Z

At least coc-tsserver can navigate to identifiers in the project (though since $ is used everywhere, that's very few)

ahlinc · 2021-02-26T14:28:21Z

Why do you suspect this will be problematic with IDEs?

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

What about Typescript? Are you suggesting to use it?

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

tek · 2021-02-26T14:38:13Z

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

Is this a general problem or does it relate to this project, specifically the file separation?

In the former case, I can't imagine how an IDE would be able to navigate to rules since they are universally referenced as attributes of the $ object, which is unknown in the project due to being part of tree-sitter-cli's machinery.

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

So you would give the $ a type? Could you make that refer to top level attributes?

tek · 2021-02-27T13:52:36Z

What do I need to configure for tree-sitter test finding the language for a highlights test? I placed a haskell file in test/highlight and added this to package.json:

  "tree-sitter": {
    "scope": "source.hs",
    "file-types": [
      "hs"
    ],
    "highlights": [
      "queries/highlights.scm"
    ],
    "injection-regex": "^(hs|haskell)$"
  }

but it says No language found for path.

maxbrunsfeld · 2021-02-27T20:20:47Z

It might have to do with finding the parser repo itself? https://tree-sitter.github.io/tree-sitter/syntax-highlighting#paths

This UX could probably be made smoother.

tek · 2021-02-27T20:36:24Z

@maxbrunsfeld I have

  "parser-directories": [
    "/home/tek/code/tek/js"
  ],

in my config, and the repo path is /home/tek/code/tek/js/tree-sitter-haskell. that should be correct, no?

ahlinc · 2021-02-28T16:36:43Z

About processing preprocessor directives

I was able to break mentioned https://github.com/tree-sitter/tree-sitter-c C language parser with a simple C program where a directive splits function call rule. @tek this looks the same type of problem that you are trying to solve in this Haskell parser.
Example:

#include <stdio.h>

#define FOO

int main(int argc, char const *argv[])
{
#ifdef FOO
    printf
#endif
    ("Hello world!\n");
    return 0;
}

With current tree-sitter abilities it looks for me impossible to implement correct tree-sitter parser in a one tree-sitter's language defined by a one grammar.js file that would be able to process all possible combinations of rules and their splits by preprocessor rules. Even languages like C/C++ don't process preprocessor directives in the one parser, they splits such processing into two independent phases. The C/C++ preporcessor has one important feature that it decides that preprocessor directive in everything till the end of line and when preprocessor replaces condition directives it replaces them with \n so it's impossible to make split of the grammar token what simplifies situation a bit, but I can imagine a language where some preprocessor directives may split anything, any token that consist more than a one character.

What I can suggest with current tree-sitter abilities is to:

Consider such Haskell code as multi language code: https://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents
Don't try to parse preprocessor directives in a one tree-sitter language definition. It seems impossible for now to write correct parser implementation and external scanners can't help in any way.
Implement preprocessor directives processing in a separate parser like this implemented in real C/C++ compilers.
Separate preprocessor parser may help to make decision what code blocks are valid for some program state or detect code blocks and produce their combinations that may be parsed separately.

All above is just my opinion and understanding how tree-sitter works and what limitations it has for now.
Also I believe that it's possible to improve tree-sitter so it will be able to parse even such code with preprocessor constructions.

tek · 2021-02-28T18:56:53Z

I had the same impression about the C grammar. Thanks for your suggestion with the multi-language feature, I'll try to implement something with that.

Howevery, I'd suggest to not make this part of this PR.

@maxbrunsfeld do you think it would be feasible to add functionality to the scanner API (or grammar) with which a state shift could be transparently enforced in order to reset to a previous preprocessor directive? Any other additional information that might help?

maxbrunsfeld · 2021-02-28T19:12:07Z

I think we’re always going to have to treat the preprocessor in an approximate way. It’s ok IMO if certain preprocessor patterns cause parse errors. Hopefully the errors will be recoverable and we’ll get useful results the vast majority of the time.

tek · 2021-02-28T19:23:55Z

@maxbrunsfeld so if the parser would execute a shift from an external stimulus, would it corrupt its overall state, or the parse tree?

tek · 2021-02-28T20:31:22Z

those named precedences are quite the game changer!

tek · 2021-02-28T23:07:47Z

question: what is the general policy when one rule can be substituted for another, but is more restrictive?
Like in this case, the simpletype rule can be parsed by the btype rule, but not all parts of the latter are legal in the positions where the former is used. The parse tree is identical, so is it praxis to use the more general rule, to avoid conflicts and be more permissive? or is it better to map the language precisely?

tek · 2021-03-02T18:57:10Z

@maxbrunsfeld I added that when I was dealing with the preprocessor directives. Since the indentations would have to be reset on an #else, I just pushed a copy onto the stack on an #if. But I only realized afterwards that the same would have to be done with the external parser state. Thanks for reminding me, this can now be reverted!

tek · 2021-03-02T20:26:33Z

@maxbrunsfeld @patrickt invisible tokens are inlined, double vector is removed, I renamed lots of user-facing nodes and all tests green. if you're satisfied, please merge!

tek · 2021-03-02T20:28:35Z

🚀

maxbrunsfeld · 2021-03-02T20:28:39Z

Just FYI, I've been doing squash merges on these grammar repos lately, since they contain generated files, to avoid the repo size growing too fast.

Thanks for the awesome work @tek!

tek · 2021-03-02T20:29:03Z

makes sense. it's been a pleasure!

patrickt · 2021-03-02T21:10:26Z

Huge thanks, @tek! This is a real step forward for the Haskell ecosystem at large, since this is (I think) the only working GHC Haskell parser outside of GHC itself!

tek · 2021-03-02T21:12:05Z

omg 😂

tek · 2021-03-02T21:47:38Z

@maxbrunsfeld @patrickt Am I supposed to be committing further changes to master?

tek · 2021-03-02T21:55:55Z

thanks!

rewinfrey · 2021-03-02T22:07:31Z

Thank you so much @tek for the amazing rewrite! This is huge for the Haskell community, wonderful work 👏 ❤️

tek · 2021-03-02T22:12:56Z

@rewinfrey very kind, thank you! ❤️

patrickt · 2021-03-02T23:27:07Z

@tek Re. master: it’s your call. No one’s consuming this repository as of yet, so I don’t see any huge problem with pushing small fixes directly to master. Bigger features etc. are nice to have as PRs.

tek · 2021-03-02T23:32:18Z

@patrickt sure thing, I was mainly asking whether I'm permitted!

patrickt · 2021-03-02T23:35:58Z

Yup! I’ve given you maintainer privileges, so you should be able to do most things. Give me a shout if you need anything.

tek · 2021-03-02T23:42:10Z

will do, thanks!

felixroos · 2023-12-21T09:42:46Z

Hello everyone, I have just found this thread after unsuccessfully trying to use the https://www.npmjs.com/package/tree-sitter-haskell package, which had its latest publish 5 years ago. Would it be possible to publish a new version containing the changes in this PR? Or would this involve additional work that is specific to the npm package?

tek · 2023-12-21T12:54:53Z

@maxbrunsfeld you wanna add my account to that package's maintainers?

felixroos · 2023-12-30T16:20:29Z

@tek thanks!

For anyone landing here: There is also a prebuilt wasm file in this repo. It can be re-built via tree-sitter build-wasm .
When I do it on my machine it is 4,5MB, while the prebuilt version is 3MB, so it might not be the newest version?

FYI I built this visualizer: https://felixroos.github.io/haskell-tree-sitter-playground/

tek · 2023-12-30T16:23:31Z

very nice!

lancejpollard · 2024-01-22T19:35:35Z

@felixroos So as per the linked issue above, can we somehow npm install this package these days? Thanks!

tek mentioned this pull request Feb 26, 2021

tree-sitter-haskell rewrite tree-sitter/tree-sitter#944

Closed

tek added 5 commits March 1, 2021 00:45

rewrite the grammar

382ebfe

refactor number literals

77fece3

fix quasiquotes

e018931

fix comments after where on same line

c2c4a33

be more precise about splices

1e35bc7

tek added 5 commits March 2, 2021 20:49

more naturally readable node names

a29a63c

suspend macos CI

355ccf1

remove indent stack restoration for cpp

d7cfe7a

inline reserved characters

8a714d2

fix regression

7c99bdf

maxbrunsfeld approved these changes Mar 2, 2021

View reviewed changes

maxbrunsfeld merged commit 267da70 into tree-sitter:master Mar 2, 2021

tek deleted the rewrite branch March 2, 2021 20:30

rewinfrey mentioned this pull request Mar 2, 2021

Update on the state of the grammar #27

Closed

tek mentioned this pull request Jan 19, 2024

Can't npm install tree-sitter-haskell on Mac M3 Node.js v20.10? #110

Closed

rewrite the grammar #29

rewrite the grammar #29

Conversation

tek commented Feb 26, 2021

tek commented Feb 26, 2021

ahlinc commented Feb 26, 2021

tek commented Feb 26, 2021

tek commented Feb 26, 2021

ahlinc commented Feb 26, 2021 • edited Loading

tek commented Feb 26, 2021

ahlinc commented Feb 26, 2021

tek commented Feb 26, 2021

ahlinc commented Feb 26, 2021

ahlinc commented Feb 26, 2021

tek commented Feb 26, 2021

tek commented Feb 26, 2021

tek commented Feb 26, 2021

ahlinc commented Feb 26, 2021

tek commented Feb 26, 2021

tek commented Feb 27, 2021

maxbrunsfeld commented Feb 27, 2021

tek commented Feb 27, 2021

ahlinc commented Feb 28, 2021 • edited Loading

About processing preprocessor directives

tek commented Feb 28, 2021

maxbrunsfeld commented Feb 28, 2021

tek commented Feb 28, 2021

tek commented Feb 28, 2021

tek commented Feb 28, 2021

tek commented Mar 2, 2021 • edited Loading

tek commented Mar 2, 2021

tek commented Mar 2, 2021

maxbrunsfeld commented Mar 2, 2021

tek commented Mar 2, 2021

patrickt commented Mar 2, 2021

tek commented Mar 2, 2021

tek commented Mar 2, 2021

tek commented Mar 2, 2021

rewinfrey commented Mar 2, 2021

tek commented Mar 2, 2021

patrickt commented Mar 2, 2021

tek commented Mar 2, 2021

patrickt commented Mar 2, 2021

tek commented Mar 2, 2021

felixroos commented Dec 21, 2023

tek commented Dec 21, 2023

felixroos commented Dec 30, 2023

tek commented Dec 30, 2023

lancejpollard commented Jan 22, 2024

ahlinc commented Feb 26, 2021 •

edited

Loading

ahlinc commented Feb 28, 2021 •

edited

Loading

tek commented Mar 2, 2021 •

edited

Loading