Rewrite the grammar once again #120

Merged
merged 1 commit into main from rewrite-2 on May 4, 2024

Conversation

tek
Contributor

@tek tek commented Mar 24, 2024

I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that
have accumulated over the years.
The new implementation provides a significant set of improvements, listed below.

Since the repo has become uncloneable in some
situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for
every commit, as some other grammars do.
However, since the current situation is already bad, it seems that the only way out would be to reset the history, which
would break consumers like nixpkgs that rely on being able to access older commits.
An alternative could be to use a new Github location, but that would also be awkward.
In any case, it's probably better to handle this later.

I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly
named.
For example, I added header / imports / declarations for the top-level structure.
Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the
Haskell discourse
to get some more feedback, so I can use the opportunity of introducing breaking changes to
improve the grammar for users.

Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?

I'd appreciate opinions and feedback!

Please take a look, @414owen @amaanq @wenkokke


  • Parses the GHC codebase!

    I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.

    This used to break horribly in many files because explicit brace layouts weren't supported very well.

  • Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:

    Old:

    effects: 32ms
    postgrest: 91ms
    ivory: 224ms
    polysemy: 84ms
    semantic: 1336ms
    haskell-language-server: 532ms
    flatparse: 45ms
    

    New:

    effects: 29ms
    postgrest: 64ms
    ivory: 178ms
    polysemy: 70ms
    semantic: 692ms
    haskell-language-server: 390ms
    flatparse: 36ms
    

    GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app – execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.

  • Smaller size of the shared object.

    tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

  • Smaller size of the generated parser.c: 24MB -> 14MB.

  • Significantly faster time to generate, and slightly faster build.

    On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

  • All terminals now have proper text nodes when possible, like the . in modules. Fixes Include . from qualified modules and variables #102, Include ! from strictness annotations #107, Qualified/unqualified module paths colored differently #115 (partially?).

  • Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes Instance with associated type, following TH top level splice, misparsed as function #89, Components parser as type when they are not #105, Incorrect parse due to top-level splices #111.

  • Comments aren't pulled into preceding layouts anymore. Fixes Comments following function included in function pattern #82, Incorrect parse for function with where-clause and comments  #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)

  • Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes Grammar defines trailing whitespace as part of lambda case statement #74.

  • Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes exp_section_right not parsed when containing a hash #108.

  • Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes Typed Template Haskell quotations / splices not handled correctly #116.

  • Explicit brace layouts are now handled correctly. Fixes Misparse of explicit-braced code #92.

  • Function application with multiple block arguments is handled correctly.
    Example:

    ```haskell
    a = do \ a -> a
        case a of
          a -> a
        do a + a
        if
       | a -> a
       | a -> a
      *
      do a
    ```
  • Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.

  • Haddock comments have dedicated nodes now.

  • Use named precedences instead of closely replicating the GHC parser's productions.

  • Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.

  • Fixed a CPP bug where a mid-line #endif would be a false positive.

  • CPP only matches legal directives now.

  • Generally more lenient parsing than GHC, including in the presence of errors:

    • Missing closing tokens at EOF are tolerated for:
      • CPP
      • Comment
      • TH Quotation
    • Multiple semicolons in some positions like if/then
    • Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions
  • List comprehensions can have multiple sets of qualifiers (ParallelListComp).

  • Deriving clauses after GADTs don't require layout anymore.

  • Newtype instance heads are working properly now.

  • Escaping newlines in CPP works now.

  • One remaining issue is that qualified left sections that contain infix ops are broken: (a + a A.+). I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. Update: solved this by implementing qualified operator lookahead.

  • Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
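
For readers unfamiliar with the technique, here is a minimal sketch of bitmap-based character classification in C. The table below is a hand-written example marking ASCII digits; the repo's generated tables cover Unicode category sets and are produced by the Haskell program mentioned above, so none of the names here are taken from the actual sources:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// One bit per code point: bit c is set when character c belongs to the set.
// Example table marking the ASCII digits '0'..'9' (code points 48..57).
static const uint32_t digit_bitmap[2] = { 0x00000000, 0x03FF0000 };

// Word index is c / 32, bit index is c % 32; out-of-range characters are rejected.
static bool bitmap_member(uint32_t c, const uint32_t *bitmap, uint32_t limit) {
  if (c >= limit) return false;
  return (bitmap[c >> 5] >> (c & 31)) & 1;
}

int main(void) {
  assert(bitmap_member('7', digit_bitmap, 64));
  assert(!bitmap_member('a', digit_bitmap, 64));
  return 0;
}
```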

@clason

clason commented Mar 24, 2024

I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for
every commit, as some other grammars do.

In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.

Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?

You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.

Important question: will queries remain compatible?

@tek
Contributor Author

tek commented Mar 24, 2024

In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.

Ah, didn't know that. How do you generate the parser from that file?

You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?

Important question: will queries remain compatible?

Some of them will, but it's gonna depend on the result of the survey how contained the breakage will be.
I think most expression queries will still work, but types and top level structures should be affected.

@clason

clason commented Mar 24, 2024

Ah, didn't know that. How do you generate the parser from that file?

Just tree-sitter generate src/grammar.json (little known fact).

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

Possibly because the output dir has changed? But that should be fixed upstream.

I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?

Yes, the Github release artifacts (only; it shouldn't be committed to the repo).

Some of them will, but it's gonna depend on the result of the survey how contained the breakage will be.
I think most expression queries will still work, but types and top level structures should be affected.

Thanks for the heads-up. I hope the queries here can serve as a guide for nvim-treesitter.

@414owen
Contributor

414owen commented Mar 24, 2024

Wow @tek this sounds incredible, clearly a labour of love, and I'm hugely grateful that you undertook such a huge amount of work for the community. The sheer quantity of listed improvements is astounding.

I'll try my best to look over it, but given the large number of grammar changes, and that I've mostly just dealt with the scanner in the past, you're clearly the one who needs to decide when this is ready :)

🚀 🚀 🚀

-- This should be an error, since a conid followed by two dots parses as the
-- qualified composition operator, but we allow it since there's no way to use
-- an operator here, like it would be in a section.
a = [A..a]
Contributor

Very nice.

Next step is to fix this in GHC's parser, yeah?

λ> [False..True]

<interactive>:1:2: error:
    Not in scope: ‘False..’

Contributor Author

ah 😅 GHC makes up for it by permitting .. to be referred to as an operator if you use a qualifying module, like a A... a 😁

@wenkokke
Contributor

That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.

The reason is that web-tree-sitter doesn't support all of libc, and there's a file in the web-tree-sitter source that determines exactly what symbols are supported.

Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.

@clason

clason commented Mar 24, 2024

That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?

@tek
Contributor Author

tek commented Mar 24, 2024

@414owen thanks so much! I definitely went a bit overboard with the obsession over the last few weeks 😁
I don't think it's necessary to review the scanner in detail – running that fuzzer should be enough to complement the tests.

@tek
Contributor Author

tek commented Mar 24, 2024

Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.

Ah, I'll make sure to bring the wasm build and release up to speed then

@tek
Contributor Author

tek commented Mar 24, 2024

FYI @wenkokke it appears that the file with the exports was refactored in tree-sitter/tree-sitter@d290051, now not containing standard library symbols anymore

edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?

examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/type-synonyms.B.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/tempate-haskell.A.hs
examples/semantic/semantic/test/fixtures/haskell/corpus/template-haskell.B.hs
Contributor

🎉 fantastic

"semi",
"cpp-else",
"cpp",
};
Contributor

Remind me to make a PR tying this together with the enum more tightly, using an X macro
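
For reference, a minimal sketch of the X-macro pattern being suggested, using the symbol names from the snippet above; the enum constant names are made up for illustration and are not taken from the scanner:

```c
// Single source of truth: each X(...) entry declares one symbol.
#define SYMBOLS(X) \
  X(SEMI, "semi") \
  X(CPP_ELSE, "cpp-else") \
  X(CPP, "cpp")

// Expand once into the enum ...
enum Symbol {
#define AS_ENUM(id, name) id,
  SYMBOLS(AS_ENUM)
#undef AS_ENUM
};

// ... and once into the parallel name table, so the two cannot drift apart.
static const char *symbol_names[] = {
#define AS_NAME(id, name) name,
  SYMBOLS(AS_NAME)
#undef AS_NAME
};
```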

src/scanner.c Outdated
static int32_t peek(uint32_t n) {
  if (state->lookahead->offset + n < state->lookahead->len) return unsafe_peek(n);
  else {
    advance_before(n);
Contributor

I'm a bit uneasy about advancing inside a function named peek. Is there a good reason for this?

Contributor Author

@tek tek Mar 24, 2024

What would you consider a better name?

It's returning the nth lookahead character without advancing over it, if it hasn't been advanced over before.
I think it's possible that I've rearranged the steps in a way that the second part never happens, by now.

So, the peek part applies to the n only, I guess. The functions peek{0,1,2} embody this concept a bit more clearly, maybe.

Contributor

Well, it looks like unsafe_peek_abs implements a pure peek with n lookahead, and I'm not entirely sure what peek does. It has a few layers, and I don't understand them. They seem to try to advance in relation to a parameter and the lookahead offset?

Contributor Author

@tek tek Mar 31, 2024

for the record:

peek(n) := return the lookahead character at offset + n without advancing over it, but advance over all preceding characters if necessary (offset + n may have been advanced over before, in which case it is returned directly from our buffer). Or alternatively: ensure that our buffer contains offset + n, advancing the tree-sitter lookahead as far as necessary if it doesn't, copying characters to our buffer.
peek_unsafe(n) := return the lookahead character at offset + n without advancing; return 0 if the lookahead buffer is too small
peek_unsafe_abs(n) := return the lookahead character at index n without advancing; return 0 if the lookahead buffer is too small
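
To make this concrete, here is a rough, self-contained sketch of that buffered-lookahead scheme in C; the type, field, and function shapes are assumptions for illustration and bounds checks are omitted, so this is not the actual scanner code:

```c
#include <stdint.h>

// Assumed, simplified shape of the lookahead buffer.
typedef struct {
  int32_t  data[512];  // characters that have already been advanced over
  uint32_t len;        // how many characters the buffer currently holds
  uint32_t offset;     // reference point: peek(n) refers to data[offset + n]
} Lookahead;

// Stand-in for advancing tree-sitter's lexer and returning the next character;
// the real scanner reads lexer->lookahead and calls lexer->advance instead.
static int32_t lexer_advance(void) { return 0; }

// Pure reads: return 0 when the requested index has not been buffered yet.
static int32_t peek_unsafe_abs(const Lookahead *la, uint32_t i) {
  return i < la->len ? la->data[i] : 0;
}

static int32_t peek_unsafe(const Lookahead *la, uint32_t n) {
  return peek_unsafe_abs(la, la->offset + n);
}

// Buffering read: ensure data[offset + n] has been consumed, advancing and
// copying characters into the buffer if necessary, then return it.
static int32_t peek(Lookahead *la, uint32_t n) {
  while (la->len <= la->offset + n) la->data[la->len++] = lexer_advance();
  return peek_unsafe(la, n);
}
```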

@wenkokke
Contributor

That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?

This isn't a WASM issue, but rather an emscripten feature that lets you not ship the entirety of libc and libc++ for your web app.

That means that tree-sitter, as the entry point, bundles libc and libc++ and must pick what parts they want. They generally pick only those parts they need for their library itself to run.

@wenkokke
Contributor

FYI @wenkokke it appears that the file with the exports was refactored in tree-sitter/tree-sitter@d290051, now not containing standard library symbols anymore

edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?

Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.

@clason

clason commented Mar 25, 2024

They generally pick only those parts they need for their library itself to run.

Not true; they also expose (more and more) for external scanners to use. Hence my recommendation to raise an upstream issue.

@mrcjkb

mrcjkb commented Mar 25, 2024

Important question: will queries remain compatible?

@clason Since I've worked on a lot of the queries in nvim-treesitter, and also maintain neotest-haskell, which also uses tree-sitter queries, I'll be taking part in the survey.
Personally, I'm quite excited about some of the proposed changes and wouldn't mind backward compatibility breaking. And of course, I'd be happy to adjust the queries in nvim-treesitter.

@clason

clason commented Mar 25, 2024

@mrcjkb Thank you, then I don't need to worry :) (To be clear, I didn't want to argue for not changing the queries; I just wanted a heads-up how much breakage I should expect so I can brace for it.)

src/scanner.c Outdated
// --------------------------------------------------------------------------------------------------------

static void advance_over(uint32_t n) {
  for (uint32_t i = state->lookahead->len; i <= state->lookahead->offset + n; i++) S_ADVANCE;
Contributor

@414owen 414owen Mar 25, 2024

I'm not sure why we're advancing with relation to the peek position, and I'm not convinced this for loop is right anyway. Shouldn't the length of the lookahead be a limit, not a starting state?

Contributor Author

The docs now explain it, but for the record:

lookahead is the range that's already been advanced over, so this starts at its end.
offset is the reference point for indexing lookahead that can be reset at any point for convenience, so the target index is offset + n.
If that is smaller than lookahead's current size, the target index has already been consumed and the loop immediately returns.

Contributor

@414owen 414owen Apr 3, 2024

Okay, so there's an internal state, and a tree-sitter state.

advance now pushes the current character to the internal-state lookahead, and advances the tree-sitter lexer.

peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary.

So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?

By the looks of it reset_lookahead doesn't reset the lookahead, so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?


For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?

Contributor Author

Okay, so there's an internal state, and a tree-sitter state.

advance now pushes the current character to the internal-state lookahead, and advances the tree-sitter lexer.

peek peeks into an internal lookahead buffer, and may advance the tree-sitter state to fill that buffer, if necessary.

So when parsing a file from scratch, does that mean that the entire file is pushed character-by-character into the lookahead buffer?

If the question is whether the entire file is stored in that buffer at some point, then no – the buffer is discarded after each run.
Right now the array is reused across runs and only overwritten to achieve that, but like the global pointers this isn't thread safe, so I'll have to revert to the initial approach I used and allocate a new one in each run (unlike the state pointers, this would only be unsafe if multiple nodes in the same file were processed concurrently, so it might still be feasible if we knew that TS doesn't allow that).

It's probably useful to point out that none of this has any impact on performance.
I've never observed any differences in benchmarks when changing implementation details of the scanner on a "technical" level, only when modifying logic so that the number of times the scanner is called varies.
(On that note, moving stuff from the grammar to the scanner usually results in significant perf benefits 😅 e.g. when I moved pragmas in, HLS parse time reduced from 420ms to 360ms)

I've also looked at memory usage but the profiling capabilities are limited with simple tools.
I only measured RSS, which varied between 19MB and 21MB for ghc/compiler before and after rewrite.

By the looks of it reset_lookahead doesn't reset the lookahead, so much as it advances the internal state to the end of the lookahead buffer. Maybe there's a better name for this operation?

I don't have a clear idea of what "resetting lookahead" canonically means, so if you feel confident about an alternative I'm happy to change it 🙂
My interpretation is that peek0() is the analogue of env->lexer->lookahead, and resetting the offset makes the former return the same as the latter.

For my own interest, do you think it's possible for the scanner to work with only one character of lookahead?

I think it's possible that some things, like newline lookahead, would need to be spread over two runs, but you could probably implement it in a convoluted way.
The main motivation is to be able to write the individual fragments independently though, without having to pay attention to their order.

@tek
Contributor Author

tek commented Mar 25, 2024

Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.

@wenkokke I fiddled with the test runner to get it to work for wasm, and it appears that everything is working – when I run test/parse-libs wasm, all libraries are parsed successfully!

@tek
Contributor Author

tek commented Mar 25, 2024

@414owen I'll add comments to the scanner later, so you won't have to keep guessing 😄

@tek tek force-pushed the rewrite-2 branch 5 times, most recently from 2396cc5 to 86fcdb1 on March 30, 2024 at 21:22
@tek tek changed the base branch from master to main on March 30, 2024 at 21:22
src/scanner.c Outdated
State *state = calloc(sizeof(State), 1);
VEC_RESIZE(&state->contexts, 8);
state->lookahead = calloc(sizeof(Lookahead), 1);
VEC_RESIZE(state->lookahead, 8);
Contributor

@414owen 414owen Mar 25, 2024

Do we need to allocate lookahead separately? Can we not have the Lookahead struct as a (non-pointer) member of the struct State (like contexts), to avoid the triple-dereferences when peeking?

Contributor Author

yep that works!
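
For illustration, a sketch of the embedded-member layout being suggested above; the field layouts here are simplified assumptions, not the repo's actual definitions:

```c
#include <stdint.h>
#include <stdlib.h>

// Simplified stand-ins for the scanner's vector types.
typedef struct { int32_t *data; uint32_t len; uint32_t offset; } Lookahead;
typedef struct { int32_t *data; uint32_t len; } Contexts;

// Lookahead embedded by value, like `contexts`, instead of a second calloc.
typedef struct {
  Contexts  contexts;
  Lookahead lookahead;
} State;

static State *state_new(void) {
  State *state = calloc(1, sizeof(State));
  // state->lookahead is zero-initialized along with the rest of State, and
  // peeking now goes through state->lookahead.data[i] – one dereference fewer.
  return state;
}
```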

@tek
Contributor Author

tek commented Mar 31, 2024

@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
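
For context, a sketch of the pattern being described – a file-level pointer stashed at the start of each scan so helper functions don't need a state parameter. The State contents are placeholders; only the entry-point signature is the standard external-scanner one:

```c
#include <stdbool.h>
#include "tree_sitter/parser.h"

typedef struct { int placeholder; /* contexts, lookahead, ... */ } State;

// Global pointer assigned on entry; as noted above, this is not thread safe
// if two parses ever run concurrently.
static State *state;

bool tree_sitter_haskell_external_scanner_scan(void *payload, TSLexer *lexer,
                                               const bool *valid_symbols) {
  state = (State *) payload;  // helpers can now use `state` without a parameter
  // ... scanning logic ...
  (void) lexer;
  (void) valid_symbols;
  return false;
}
```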

test/parse-lib Outdated
@@ -38,7 +38,7 @@ then
elif [[ "$mode" == "wasm" ]]
then
# Ensure tree-sitter-haskell.wasm was compiled
make tree-sitter-haskell.wasm -s
tree-sitter build-wasm

it's tree-sitter build --wasm -o ... now.

Contributor Author

thanks. 0.22 doesn't appear to be available from npm or nixpkgs yet though

@clason

clason commented May 3, 2024

Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.

@tek
Contributor Author

tek commented May 3, 2024

Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.

yep on my list 😅 I'm postponing this till the end to avoid cluttering my dev history

@tek
Contributor Author

tek commented May 4, 2024

alright, I think we can merge tomorrow if there's nothing else

* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, including in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and cpp works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
@tek
Contributor Author

tek commented May 4, 2024

here goes nothing

@tek tek merged commit 9b7b367 into main May 4, 2024
4 checks passed
@tek tek deleted the rewrite-2 branch May 4, 2024 15:39
@clason

clason commented May 4, 2024

So this was merged to the main branch -- will that become the default branch at some point?

@tek
Contributor Author

tek commented May 4, 2024

@clason not sure – for now I would leave master as default so that parser.c is present when the repo is cloned. If you have suggestions for a good system I'm all ears.

(in case it's unclear – the release workflow will generate parser.c and create a commit on master whenever a tag is pushed)

@clason

clason commented May 4, 2024

Ah, master is your "release branch". Some parsers do that, but I never understood the point. Isn't it easier to have a release workflow that makes an actual release with artifacts (tarball and wasm) for downloading, and keep the repo as "pure source"?

@tek
Contributor Author

tek commented May 4, 2024

I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.

(I've also created artifact release workflows btw.)

@clason

clason commented May 4, 2024

I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.
Yes, but does it have to be in a git branch? You can just generate the parser.c in-tree and then zip the whole thing up; the output will be identical from cloning the repo (ignoring the .git folder).

@tek
Contributor Author

tek commented May 4, 2024

my motivation was that if someone uses the plain github URL in their application to pull the repo, they'll get the generated parser, as used to be the case in the past.

When I researched this issue I got the impression that some projects depend on this, but if you have better insight into the ecosystem I'd be happy to scrap that workflow.

@clason

clason commented May 4, 2024

Well, I understand not wanting to make sudden breaking changes on master, that's completely fine. I am just honestly curious why people depend on this specific setup. If you are only committing the parser to master on manual tags/releases, then there should be zero difference from a downstream pov between an archive/<tag>.tar.gz link and a releases/download/<tag>/tree-sitter-haskell.tar.gz link (where the latter is created from the full repo download plus generated parser.c)?

(Unless I misunderstood and you are pushing on every commit to main -- but then you're back at square one with master bloating?)

@tek
Contributor Author

tek commented May 4, 2024

no, you understood that fine. I don't like it either, but that's what I learned 🤷🏻 I saw that tree-sitter-swift had multiple requests to include the generated parser, even though they have a separate release branch! so my reasoning was that keeping master as a legacy branch will avoid having to respond to one of those once a month 😅 at least until it has become the new standard.

Of course you could argue that not playing that game would accelerate adoption...

@amaanq
Member

amaanq commented May 4, 2024

You should just check parser.c into master for the time being, and not have a "main" branch + "release" branch. In the future, we will be getting rid of the need for this, but for now this is the way to go. I also wish you had waited till I checked this out - I'm not too familiar with Haskell, so I wouldn't be particularly useful in commenting on the grammar, but I would have liked to check the changes around everything else in this PR.

@clason

clason commented May 4, 2024

I could understand that if people wanted to pull in that branch as a git submodule (in which case 🤢)...

I do think the best strategy is "keep parser.c out of repo (unless it's small), and create release artefacts for 1. the full repo tree plus the generated parser.c and 2. the generated wasm". (You certainly shouldn't commit the generated wasm to the repo...) Downstream users who can't handle generating their own parser shouldn't YOLO track master, either.

@tek
Contributor Author

tek commented May 4, 2024

@amaanq no reason why you shouldn't be able to check the changes – nothing has been released yet.

@clason understandable or not, people will open issues without reading the docs. I'd rather avoid having to deal with it over and over again

@clason

clason commented May 4, 2024

People will open issues without reading the docs no matter what you do ;)

But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.

@tek
Contributor Author

tek commented May 4, 2024

People will open issues without reading the docs no matter what you do ;)

well they certainly won't open any about parser.c missing in the default branch if I keep it there!

But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.

I would have expected you to be eager to consume the tags or artifacts instead, irrespective of the existence of a legacy branch.

@clason

clason commented May 4, 2024

We are not set up for that yet (because the ecosystem isn't) but that is definitely where we want to be going.

(But that will require the release tarballs to have the exact same layout as the repo.)

@tek
Contributor Author

tek commented May 4, 2024

oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward

@tek
Contributor Author

tek commented May 4, 2024

hrm, somehow rebasing broke the working tree 🤔

@clason

clason commented May 4, 2024

oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward

Not really, time is finite and the number of parsers is seemingly infinite. The org was just offering a home to a number of parsers looking for help with automated maintenance.

But, yes, the upstream guidance is to omit parser.c (but not grammar.json) from the repo in favor of versioned release artifacts with generated files, and we are very much looking forward to being able to rely on the latter more widely. Upstream is actively working on adding tooling to make this easier.

For now, my curiosity is satisfied that there will not be a breaking update to the master branch (which we track and so any breaking changes there force us to act) anytime soon, so I can wait and see how things play out. If anybody jumps the gun and makes a PR switching the branch, I'll now know what's what.
