Rewrite the grammar once again #120

Merged
merged 1 commit into main from rewrite-2 on May 4, 2024

Conversation

tek
Contributor

@tek tek commented Mar 24, 2024

I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that
have accumulated over the years.
The new implementation provides a significant set of improvements, listed below.

Since the repo has become uncloneable in some
situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for
every commit, as some other grammars do.
However, since the current situation is already bad, it seems that the only way out would be to reset the history, which
would break consumers like nixpkgs that rely on being able to access older commits.
An alternative could be to use a new Github location, but that would also be awkward.
In any case, it's probably better to handle this later.

I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly
named.
For example, I added header / imports / declarations for the top-level structure.
Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the
Haskell discourse
to get some more feedback, so I can use the opportunity of introducing breaking changes to
improve the grammar for users.

Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?

I'd appreciate opinions and feedback!

Please take a look, @414owen @amaanq @wenkokke


  • Parses the GHC codebase!

    I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.

    This used to break horribly in many files because explicit brace layouts weren't supported very well.

  • Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:

    Old:

    effects: 32ms
    postgrest: 91ms
    ivory: 224ms
    polysemy: 84ms
    semantic: 1336ms
    haskell-language-server: 532ms
    flatparse: 45ms
    

    New:

    effects: 29ms
    postgrest: 64ms
    ivory: 178ms
    polysemy: 70ms
    semantic: 692ms
    haskell-language-server: 390ms
    flatparse: 36ms
    

    GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app – execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.

  • Smaller size of the shared object.

    tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

  • Smaller size of the generated parser.c: 24MB -> 14MB.

  • Significantly faster time to generate, and slightly faster build.

    On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

  • All terminals now have proper text nodes when possible, like the . in modules. Fixes Include . from qualified modules and variables #102, Include ! from strictness annotations #107, Qualified/unqualified module paths colored differently #115 (partially?).

  • Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes Instance with associated type, following TH top level splice, misparsed as function #89, Components parser as type when they are not #105, Incorrect parse due to top-level splices #111.

  • Comments aren't pulled into preceding layouts anymore. Fixes Comments following function included in function pattern #82, Incorrect parse for function with where-clause and comments  #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)

  • Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes Grammar defines trailing whitespace as part of lambda case statement #74.

  • Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes exp_section_right not parsed when containing a hash #108.

  • Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes Typed Template Haskell quotations / splices not handled correctly #116.

  • Explicit brace layouts are now handled correctly. Fixes Misparse of explicit-braced code #92.

  • Function application with multiple block arguments is handled correctly.
    Example:

    ```haskell
    a = do \ a -> a
        case a of
          a -> a
        do a + a
        if
       | a -> a
       | a -> a
      *
      do a
    ```
  • Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.

  • Haddock comments have dedicated nodes now.

  • Use named precedences instead of closely replicating the GHC parser's productions.

  • Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.

  • Fixed a CPP bug where a mid-line #endif would be a false positive.

  • CPP only matches legal directives now.

  • Generally more lenient parsing than GHC, including in the presence of errors:

    • Missing closing tokens at EOF are tolerated for:
      • CPP
      • Comment
      • TH Quotation
    • Multiple semicolons in some positions like if/then
    • Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions
  • List comprehensions can have multiple sets of qualifiers (ParallelListComp).

  • Deriving clauses after GADTs don't require layout anymore.

  • Newtype instance heads are working properly now.

  • Escaping newlines in CPP works now.

  • One remaining issue is that qualified left sections that contain infix ops are broken: (a + a A.+). I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. Update: solved this by implementing qualified operator lookahead.

  • Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
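
For readers unfamiliar with the technique, here is a minimal sketch of bitmap-based character classification in C. The table below is a hand-written example marking ASCII digits; the repo's generated tables cover Unicode category sets and are produced by the Haskell program mentioned above, so none of the names here are taken from the actual sources:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// One bit per code point: bit c is set when character c belongs to the set.
// Example table marking the ASCII digits '0'..'9' (code points 48..57).
static const uint32_t digit_bitmap[2] = { 0x00000000, 0x03FF0000 };

// Word index is c / 32, bit index is c % 32; out-of-range characters are rejected.
static bool bitmap_member(uint32_t c, const uint32_t *bitmap, uint32_t limit) {
  if (c >= limit) return false;
  return (bitmap[c >> 5] >> (c & 31)) & 1;
}

int main(void) {
  assert(bitmap_member('7', digit_bitmap, 64));
  assert(!bitmap_member('a', digit_bitmap, 64));
  return 0;
}
```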

@clason

clason commented Mar 24, 2024

I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for
every commit, as some other grammars do.

In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.

Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?

You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.

Important question: will queries remain compatible?

@tek
Contributor Author

tek commented Mar 24, 2024

In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.

Ah, didn't know that. How do you generate the parser from that file?

You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?

Important question: will queries remain compatible?

Some of them will, but it's gonna depend on the result of the survey how contained the breakage will be.
I think most expression queries will still work, but types and top level structures should be affected.

@clason

clason commented Mar 24, 2024

Ah, didn't know that. How do you generate the parser from that file?

Just tree-sitter generate src/grammar.json (little known fact).

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

Possibly because the output dir has changed? But that should be fixed upstream.

I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?

Yes, the Github release artifacts (only; it shouldn't be committed to the repo).

Some of them will, but it's gonna depend on the result of the survey how contained the breakage will be.
I think most expression queries will still work, but types and top level structures should be affected.

Thanks for the heads-up. I hope the queries here can serve as a guide for nvim-treesitter.

@414owen
Contributor

414owen commented Mar 24, 2024

Wow @tek this sounds incredible, clearly a labour of love, and I'm hugely grateful that you undertook such a huge amount of work for the community. The sheer quantity of listed improvements is astounding.

I'll try my best to look over it, but given the large number of grammar changes, and that I've mostly just dealt with the scanner in the past, you're clearly the one who needs to decide when this is ready :)

🚀 🚀 🚀

-- This should be an error, since a conid followed by two dots parses as the
-- qualified composition operator, but we allow it since there's no way to use
-- an operator here, like it would be in a section.
a = [A..a]
Contributor

Very nice.

Next step is to fix this in GHC's parser, yeah?

λ> [False..True]

<interactive>:1:2: error:
    Not in scope: ‘False..’

Contributor Author

ah 😅 GHC makes up for it by permitting .. to be referred to as an operator if you use a qualifying module, like a A... a 😁

@wenkokke
Contributor

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

The reason is that web-tree-sitter doesn't support all of libc, and there's a file in the web-tree-sitter source that determines exactly what symbols are supported.

Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.

@clason

clason commented Mar 24, 2024

That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?

@tek
Contributor Author

tek commented Mar 24, 2024

@414owen thanks so much! I definitely went a bit overboard with the obsession over the last few weeks 😁
I don't think it's necessary to review the scanner in detail – running that fuzzer should be enough to complement the tests.

@tek
Contributor Author

tek commented Mar 24, 2024

Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.

Ah, I'll make sure to bring the wasm build and release up to speed then

@tek
Contributor Author

tek commented Mar 24, 2024

FYI @wenkokke it appears that the file with the exports was refactored in tree-sitter/tree-sitter@d290051, now not containing standard library symbols anymore

edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?

examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.B.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/tempate-haskell.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/template-haskell.B.hs
Contributor

🎉 fantastic

"semi",
"cpp-else",
"cpp",
};
Contributor

Remind me to make a PR tying this together with the enum more tightly, using an X macro
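
For reference, a minimal sketch of the X-macro pattern being suggested, using the symbol names from the snippet above; the enum constant names are made up for illustration and are not taken from the scanner:

```c
// Single source of truth: each X(...) entry declares one symbol.
#define SYMBOLS(X) \
  X(SEMI, "semi") \
  X(CPP_ELSE, "cpp-else") \
  X(CPP, "cpp")

// Expand once into the enum ...
enum Symbol {
#define AS_ENUM(id, name) id,
  SYMBOLS(AS_ENUM)
#undef AS_ENUM
};

// ... and once into the parallel name table, so the two cannot drift apart.
static const char *symbol_names[] = {
#define AS_NAME(id, name) name,
  SYMBOLS(AS_NAME)
#undef AS_NAME
};
```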

src/scanner.c Outdated
static int32_t peek(uint32_t n) {
  if (state->lookahead->offset + n < state->lookahead->len) return unsafe_peek(n);
  else {
    advance_before(n);
Contributor

I'm a bit uneasy about advancing inside a function named peek. Is there a good reason for this?

Contributor Author

@tek tek Mar 24, 2024

What would you consider a better name?

It's returning the nth lookahead character without advancing over it, if it hasn't been advanced over before.
I think it's possible that I've rearranged the steps in a way that the second part never happens, by now.

So, the peek part applies to the n only, I guess. The functions peek{0,1,2} embody this concept a bit more clearly, maybe.

Contributor

Well, it looks like unsafe_peek_abs implements a pure peek with n lookahead, and I'm not entirely sure what peek does. It has a few layers, and I don't understand them. They seem to try to advance in relation to a parameter and the lookahead offset?

Contributor Author

@tek tek Mar 31, 2024

for the record:

peek(n) := return the lookahead character at offset + n without advancing over it, but advance over all preceding characters if necessary (offset + n may have been advanced over before, in which case it is returned directly from our buffer). Or alternatively: ensure that our buffer contains offset + n, advancing the tree-sitter lookahead as far as necessary if it doesn't, copying characters to our buffer.
peek_unsafe(n) := return the lookahead character at offset + n without advancing; return 0 if the lookahead buffer is too small
peek_unsafe_abs(n) := return the lookahead character at index n without advancing; return 0 if the lookahead buffer is too small
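
To make this concrete, here is a rough, self-contained sketch of that buffered-lookahead scheme in C; the type, field, and function shapes are assumptions for illustration and bounds checks are omitted, so this is not the actual scanner code:

```c
#include <stdint.h>

// Assumed, simplified shape of the lookahead buffer.
typedef struct {
  int32_t  data[512];  // characters that have already been advanced over
  uint32_t len;        // how many characters the buffer currently holds
  uint32_t offset;     // reference point: peek(n) refers to data[offset + n]
} Lookahead;

// Stand-in for advancing tree-sitter's lexer and returning the next character;
// the real scanner reads lexer->lookahead and calls lexer->advance instead.
static int32_t lexer_advance(void) { return 0; }

// Pure reads: return 0 when the requested index has not been buffered yet.
static int32_t peek_unsafe_abs(const Lookahead *la, uint32_t i) {
  return i < la->len ? la->data[i] : 0;
}

static int32_t peek_unsafe(const Lookahead *la, uint32_t n) {
  return peek_unsafe_abs(la, la->offset + n);
}

// Buffering read: ensure data[offset + n] has been consumed, advancing and
// copying characters into the buffer if necessary, then return it.
static int32_t peek(Lookahead *la, uint32_t n) {
  while (la->len <= la->offset + n) la->data[la->len++] = lexer_advance();
  return peek_unsafe(la, n);
}
```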

@wenkokke
Contributor

That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?

This isn't a WASM issue, but rather an emscripten feature that lets you not ship the entirety of libc and libc++ for your web app.

That means that tree-sitter, as the entry point, bundles libc and libc++ and must pick what parts they want. They generally pick only those parts they need for their library itself to run.

@wenkokke
Contributor

FYI @wenkokke it appears that the file with the exports was refactored in tree-sitter/tree-sitter@d290051, now not containing standard library symbols anymore

edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?

Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.

@clason

clason commented Mar 25, 2024

They generally pick only those parts they need for their library itself to run.

Not true; they also expose (more and more) for external scanners to use. Hence my recommendation to raise an upstream issue.

@mrcjkb

mrcjkb commented Mar 25, 2024

Important question: will queries remain compatible?

@clason Since I've worked on a lot of the queries in nvim-treesitter, and also maintain neotest-haskell, which also uses tree-sitter queries, I'll be taking part in the survey.
Personally, I'm quite excited about some of the proposed changes and wouldn't mind backward compatibility breaking. And of course, I'd be happy to adjust the queries in nvim-treesitter.

@clason

clason commented Mar 25, 2024

@mrcjkb Thank you, then I don't need to worry :) (To be clear, I didn't want to argue for not changing the queries; I just wanted a heads-up how much breakage I should expect so I can brace for it.)

src/scanner.c Outdated
// --------------------------------------------------------------------------------------------------------

static void advance_over(uint32_t n) {
  for (uint32_t i = state->lookahead->len; i <= state->lookahead->offset + n; i++) S_ADVANCE;
Contributor

@414owen 414owen Mar 25, 2024

I'm not sure why we're advancing with relation to the peek position, and I'm not convinced this for loop is right anyway. Shouldn't the length of the lookahead be a limit, not a starting state?

Contributor Author

The docs now explain it, but for the record:

lookahead is the range that's already been advanced over, so this starts at its end.
offset is the reference point for indexing lookahead that can be reset at any point for convenience, so the target index is offset + n.
If that is smaller than lookahead's current size, the target index has already been consumed and the loop immediately returns.

Contributor

@414owen 414owen Apr 3, 2024

Okay, so there's an internal state, and a tree-sitter state.

advance now pushes the current character to the internal-state lookahead, and advances the tree-sitter lexer.

peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary.

So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?

By the looks of it reset_lookahead doesn't reset the lookahead, so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?


For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?

Contributor Author

Okay, so there's an internal state, and a tree-sitter state.

advance now pushes the current character to the internal-state lookahead, and advances the tree-sitter lexer.

peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary.

So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?

If the question is whether the entire file is stored in that buffer at some point, then no – the buffer is discarded after each run.
Right now the array is reused across runs and only overwritten to achieve that, but like the global pointers this isn't thread safe, so I'll have to revert to the initial approach I used and allocate a new one in each run (unlike the state pointers, this would only be unsafe if multiple nodes in the same file were processed concurrently, so it might still be feasible if we knew that TS doesn't allow that).

It's probably useful to point out that none of this has any impact on performance.
I've never observed any differences in benchmarks when changing implementation details of the scanner on a "technical" level, only when modifying logic so that the number of times the scanner is called varies.
(On that note, moving stuff from the grammar to the scanner usually results in significant perf benefits 😅 e.g. when I moved pragmas in, HLS parse time reduced from 420ms to 360ms)

I've also looked at memory usage but the profiling capabilities are limited with simple tools.
I only measured RSS, which varied between 19MB and 21MB for ghc/compiler before and after rewrite.

By the looks of it reset_lookahead doesn't reset the lookahead, so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?

I don't have a clear idea of what "resetting lookahead" canonically means, so if you feel confident about an alternative I'm happy to change it 🙂
My interpretation is that peek0() is the analogue of env->lexer->lookahead, and resetting the offset makes the former return the same as the latter.

For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?

I think it's possible that some things, like newline lookahead, would need to be spread over two runs, but you could probably implement it in a convoluted way.
The main motivation is to be able to write the individual fragments independently though, without having to pay attention to their order.

@tek
Contributor Author

tek commented Mar 25, 2024

Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.

@wenkokke I fiddled with the test runner to get it to work for wasm, and it appears that everything is working – when I run test/parse-libs wasm, all libraries are parsed successfully!

@tek
Contributor Author

tek commented Mar 25, 2024

@414owen I'll add comments to the scanner later, so you won't have to keep guessing 😄

@tek tek force-pushed the rewrite-2 branch 5 times, most recently from 2396cc5 to 86fcdb1 on March 30, 2024 at 21:22
@tek tek changed the base branch from master to main on March 30, 2024 at 21:22
src/scanner.c Outdated
State *state = calloc(sizeof(State), 1);
VEC_RESIZE(&state->contexts, 8);
state->lookahead = calloc(sizeof(Lookahead), 1);
VEC_RESIZE(state->lookahead, 8);
Contributor

@414owen 414owen Mar 25, 2024

Do we need to allocate lookahead separately? Can we not have the Lookahead struct as a (non-pointer) member of the struct State (like contexts), to avoid the triple-dereferences when peeking?

Contributor Author

yep that works!
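
For illustration, a sketch of the embedded-member layout being suggested above; the field layouts here are simplified assumptions, not the repo's actual definitions:

```c
#include <stdint.h>
#include <stdlib.h>

// Simplified stand-ins for the scanner's vector types.
typedef struct { int32_t *data; uint32_t len; uint32_t offset; } Lookahead;
typedef struct { int32_t *data; uint32_t len; } Contexts;

// Lookahead embedded by value, like `contexts`, instead of a second calloc.
typedef struct {
  Contexts  contexts;
  Lookahead lookahead;
} State;

static State *state_new(void) {
  State *state = calloc(1, sizeof(State));
  // state->lookahead is zero-initialized along with the rest of State, and
  // peeking now goes through state->lookahead.data[i] – one dereference fewer.
  return state;
}
```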

@tek
Contributor Author

tek commented Mar 31, 2024

@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
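
For context, a sketch of the pattern being described – a file-level pointer stashed at the start of each scan so helper functions don't need a state parameter. The State contents are placeholders; only the entry-point signature is the standard external-scanner one:

```c
#include <stdbool.h>
#include "tree_sitter/parser.h"

typedef struct { int placeholder; /* contexts, lookahead, ... */ } State;

// Global pointer assigned on entry; as noted above, this is not thread safe
// if two parses ever run concurrently.
static State *state;

bool tree_sitter_haskell_external_scanner_scan(void *payload, TSLexer *lexer,
                                               const bool *valid_symbols) {
  state = (State *) payload;  // helpers can now use `state` without a parameter
  // ... scanning logic ...
  (void) lexer;
  (void) valid_symbols;
  return false;
}
```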

test/parse-lib Outdated
@@ -38,7 +38,7 @@ then
elif [[ "$mode" == "wasm" ]]
then
# Ensure tree-sitter-haskell.wasm was compiled
make tree-sitter-haskell.wasm -s
tree-sitter build-wasm

it's tree-sitter build --wasm -o ... now.

Contributor Author

thanks. 0.22 doesn't appear to be available from npm or nixpkgs yet though

@clason

clason commented May 3, 2024

Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.

@tek
Contributor Author

tek commented May 3, 2024

Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.

yep on my list 😅 I'm postponing this till the end to avoid cluttering my dev history

@tek
Contributor Author

tek commented May 4, 2024

alright, I think we can merge tomorrow if there's nothing else

* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, including in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and cpp works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
@tek
Contributor Author

tek commented May 4, 2024

here goes nothing

@tek tek merged commit 9b7b367 into main May 4, 2024
4 checks passed
@tek tek deleted the rewrite-2 branch May 4, 2024 15:39
@clason

clason commented May 4, 2024

So this was merged to the main branch -- will that become the default branch at some point?

@tek
Contributor Author

tek commented May 4, 2024

@clason not sure – for now I would leave master as default so that parser.c is present when the repo is cloned. If you have suggestions for a good system I'm all ears.

(in case it's unclear – the release workflow will generate parser.c and create a commit on master whenever a tag is pushed)

@clason

clason commented May 4, 2024

Ah, master is your "release branch". Some parsers do that, but I never understood the point. Isn't it easier to have a release workflow that makes an actual release with artifacts (tarball and wasm) for downloading, and keep the repo as "pure source"?

@tek
Contributor Author

tek commented May 4, 2024

I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.

(I've also created artifact release workflows btw.)

@clason

clason commented May 4, 2024

I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.
Yes, but does it have to be in a git branch? You can just generate the parser.c in-tree and then zip the whole thing up; the output will be identical from cloning the repo (ignoring the .git folder).

@tek
Contributor Author

tek commented May 4, 2024

my motivation was that if someone uses the plain github URL in their application to pull the repo, they'll get the generated parser, as used to be the case in the past.

When I researched this issue I got the impression that some projects depend on this, but if you have better insight into the ecosystem I'd be happy to scrap that workflow.

@clason

clason commented May 4, 2024

Well, I understand not wanting to make sudden breaking changes on master, that's completely fine. I am just honestly curious why people depend on this specific setup. If you are only committing the parser to master on manual tags/releases, then there should be zero difference from a downstream pov between an archive/<tag>.tar.gz link and a releases/download/<tag>/tree-sitter-haskell.tar.gz link (where the latter is created from the full repo download plus generated parser.c)?

(Unless I misunderstood and you are pushing on every commit to main -- but then you're back at square one with master bloating?)

@tek
Contributor Author

tek commented May 4, 2024

no, you understood that fine. I don't like it either, but that's what I learned 🤷🏻 I saw that tree-sitter-swift had multiple requests to include the generated parser, even though they have a separate release branch! so my reasoning was that keeping master as a legacy branch will avoid having to respond to one of those once a month 😅 at least until it has become the new standard.

Of course you could argue that not playing that game would accelerate adoption...

@amaanq
Member

amaanq commented May 4, 2024

You should just check parser.c into master for the time being, and not have a "main" branch + "release" branch. In the future, we will be getting rid of the need for this, but for now this is the way to go. I also wish you had waited till I checked this out - I'm not too familiar with Haskell, so I wouldn't be particularly useful in commenting on the grammar, but I would have liked to check the changes around everything else in this PR.

@clason

clason commented May 4, 2024

I could understand that if people wanted to pull in that branch as a git submodule (in which case 🤢)...

I do think the best strategy is "keep parser.c out of repo (unless it's small), and create release artefacts for 1. the full repo tree plus the generated parser.c and 2. the generated wasm". (You certainly shouldn't commit the generated wasm to the repo...) Downstream users who can't handle generating their own parser shouldn't YOLO track master, either.

@tek
Contributor Author

tek commented May 4, 2024

@amaanq no reason why you shouldn't be able to check the changes – nothing has been released yet.

@clason understandable or not, people will open issues without reading the docs. I'd rather avoid having to deal with it over and over again

@clason

clason commented May 4, 2024

People will open issues without reading the docs no matter what you do ;)

But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.

@tek
Contributor Author

tek commented May 4, 2024

People will open issues without reading the docs no matter what you do ;)

well they certainly won't open any about parser.c missing in the default branch if I keep it there!

But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.

I would have expected you to be eager to consume the tags or artifacts instead, irrespective of the existence of a legacy branch.

@clason

clason commented May 4, 2024

We are not set up for that yet (because the ecosystem isn't) but that is definitely where we want to be going.

(But that will require the release tarballs to have the exact same layout as the repo.)

@tek
Contributor Author

tek commented May 4, 2024

oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward

@tek
Contributor Author

tek commented May 4, 2024

hrm, somehow rebasing broke the working tree 🤔

@clason

clason commented May 4, 2024

oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward

Not really, time is finite and the number of parsers is seemingly infinite. The org was just offering a home to a number of parsers looking for help with automated maintenance.

But, yes, the upstream guidance is to omit parser.c (but not grammar.json) from the repo in favor of versioned release artifacts with generated files, and we are very much looking forward to being able to rely on the latter more widely. Upstream is actively working on adding tooling to make this easier.

For now, my curiosity is satisfied that there will not be a breaking update to the master branch (which we track and so any breaking changes there force us to act) anytime soon, so I can wait and see how things play out. If anybody jumps the gun and makes a PR switching the branch, I'll now know what's what.
