Shared Parser Library #10765

Open
16 of 31 tasks
matklad opened this issue Nov 14, 2021 · 0 comments
Labels
A-parser: parser issues; C-Architecture: Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ); E-hard; fun: A technically challenging issue with high impact; S-unactionable: Issue requires feedback, design decisions or is blocked on other work

Comments

Member

matklad commented Nov 14, 2021

So.... I've been trying to make a shared rustc/rust-analyzer parser library for the past four years (rust-analyzer was originally intended to be just a parser library), and the results have been meagre -- we share the lexer, and that's it. My theory is that this is due to org stuff -- the parser/AST has wide APIs, so extracting them is a whole lot of poorly factorable work. As such, other, more immediate things always tend to get higher priority. But today rust-analyzer feels like it is on a relatively stable footing, so it seems like a good opportunity to try to move the giant ship for real.

Let's see what we need to do to achieve that:

  1. Have an isolated, IDE-friendly Rust parsing library.
  • separate repo from ra? Or at least a /libs folder?
  • get rid of Source/Sink traits, use a more direct API
  2. Figure out the best way to integrate with rustc.
  • Tree -> Tree transformation (there was a PR to rustc proving feasibility)
    • need to stabilize & finalize rowan for that
    • perf?
  • Parser -> (Tree1, Tree2)
    • how to emit a typed AST out of an untyped parser?
      • ungrammar
  • Concede that sharing a "nice" library is infeasible, and just hack today's parser to emit a CST via cfg flags
  3. Do cleanups on the rustc side.
  • harmonize token tree model (always use split tokens)
  • reduce dependencies on global state (remove code-map from parser, allow for shared-nothing parallel parsing)
  • how to handle Interpolated tokens?
  4. Implement the merge
  • ??? and lots of work

Tasks:

  • switch from TokenSource to SOA tokens (internal: switch from trait-based TokenSource to simple struct of arrays #10995)
  • move text-based lexing to the parser as well.
  • switch from TreeSink to emitting a vec of events.
  • move trivia attachment logic to the parser
  • move tests to parser
    • move lexer tests
    • move parser tests
    • add dedicated tests for ws attachment
  • get rid of synthetic_root
  • remove parse_text_as
    • split parse into "parse top level" and "parse prefix"
    • remove extra argument from build_tree
  • audit TopEntryPoint / PrefixEntryPoint to make sure it doesn't have some leftovers
    • add tests for non-main prefix entry points
    • add tests for non-main top entry points
  • figure out invariant for parse -- it doesn't parse the whole file. Four cases:
    • Parse the whole input as SourceFile -- working as intended today
    • Parse prefix of the input as $expr, for MBE -- sorta-working
    • Parse the whole input as $expr, for MBE output -- broken, primarily because it isn't separated from the previous point
    • Parse ??? as a template for SSR -- was never properly considered as a design goal, creates a lot of pains for the interface.
  • prototype structured AST creation from structured tokens
  • fix FIXMEs in prefix entry point tests
    • ensure there are macro-expansion level tests
  • add ast::EmptyStmt
  • move to libs dir
  • pick name (robust_parser? rust is substring. sisyphus also is a fitting name)
  • publish to crates.io
  • document guidelines (data-based interface, simple, not necessarily minimal (hooks for recovery, etc), flexible).
  • figure out the story for incremental parsing
    • move incremental parsing to parser crate
  • figure out the story for macros
    • rustc-style tree captures via $expr
    • parsing prefixes without creating tokens for everything
    • TokenMap which works
  • structured lexer errors.
  • drop limit dependency.
  • restore {} invariant
    • strengthen the invariant to cover all kinds of parentheses for macros?
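
The split between "parse top level" and "parse prefix" from the task list can be illustrated with a toy grammar. This is only a sketch under assumptions: `Tok`, `parse_expr_prefix`, and `parse_expr_top` are hypothetical names, and the "expression" here is just `Num (Plus Num)*`, not Rust syntax. The point is the invariant: a prefix entry point may stop early and reports how much it consumed, while a top entry point must account for the whole input.

```rust
/// Hypothetical token type for a toy grammar (not rust-analyzer's SyntaxKind).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tok {
    Num,
    Plus,
    Comma,
}

/// Prefix entry point: parses the longest leading `Num (Plus Num)*`
/// and returns how many tokens it consumed (0 if the input does not
/// start with an expression). Leftover tokens are not an error.
pub fn parse_expr_prefix(tokens: &[Tok]) -> usize {
    if tokens.first() != Some(&Tok::Num) {
        return 0;
    }
    let mut pos = 1;
    // Only consume `+` when an operand follows, so the prefix stays well-formed.
    while tokens.get(pos) == Some(&Tok::Plus) && tokens.get(pos + 1) == Some(&Tok::Num) {
        pos += 2;
    }
    pos
}

/// Top entry point: the whole input must be a single expression.
/// On failure, `Err` carries the index of the first unconsumed token.
pub fn parse_expr_top(tokens: &[Tok]) -> Result<(), usize> {
    let consumed = parse_expr_prefix(tokens);
    if consumed > 0 && consumed == tokens.len() {
        Ok(())
    } else {
        Err(consumed)
    }
}
```

With input `Num Plus Num Comma Num`, the prefix variant consumes three tokens and stops at the comma, which is what matching `$e:expr` in a macro needs; the top variant rejects the same input. A real parser would more likely record an error and still produce a tree rather than return `Err`.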
matklad added the A-parser, C-Architecture, E-hard, fun, and S-unactionable labels Nov 14, 2021
bors bot added a commit that referenced this issue Dec 12, 2021
10995: internal: switch from trait-based TokenSource to simple struct of arrays r=matklad a=matklad

cc #10765 

The idea here is to try to simplify the interface as best as we can. The original trait-based approach is a bit over-engineered and hard to debug. Here, we replace callbacks with just data. The next PR in the series will replace the output `TreeSink` trait with data as well.


The biggest drawback here is that we now need to materialize all of the parser's input up front. This is a bad fit for macro by example: when you parse `$e:expr`, you might consume only part of the input. However, today's trait-based solution doesn't really help -- we were already materializing the whole thing! So, let's keep it simple!
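
To make the "callbacks replaced with data" idea concrete, here is a minimal sketch of a struct-of-arrays token input. The names `SyntaxKind`, `Input`, `push`, `kind`, and `is_joint` are assumptions chosen for illustration and are not taken from the PR itself; the token kinds are a tiny made-up subset.

```rust
/// Hypothetical token kinds; a real Rust parser has many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    Ident,
    Number,
    Plus,
    Eof,
}

/// Parser input as a struct of arrays: index `i` in every vector
/// describes the same token. There are no callbacks, only data.
#[derive(Debug, Default)]
pub struct Input {
    kinds: Vec<SyntaxKind>,
    /// `joint[i]` is true when token `i` touches the next token
    /// with no whitespace in between.
    joint: Vec<bool>,
}

impl Input {
    /// Appends one token; both vectors grow together.
    pub fn push(&mut self, kind: SyntaxKind, joint: bool) {
        self.kinds.push(kind);
        self.joint.push(joint);
    }

    pub fn len(&self) -> usize {
        self.kinds.len()
    }

    pub fn is_empty(&self) -> bool {
        self.kinds.is_empty()
    }

    /// Kind of token `idx`, or `Eof` past the end, so a parser can
    /// look ahead without separate bounds checks.
    pub fn kind(&self, idx: usize) -> SyntaxKind {
        self.kinds.get(idx).copied().unwrap_or(SyntaxKind::Eof)
    }

    pub fn is_joint(&self, idx: usize) -> bool {
        self.joint.get(idx).copied().unwrap_or(false)
    }
}

/// Materializes the whole input up front for the token sequence `a + 1`.
pub fn example_input() -> Input {
    let mut input = Input::default();
    input.push(SyntaxKind::Ident, false);
    input.push(SyntaxKind::Plus, false);
    input.push(SyntaxKind::Number, false);
    input
}
```

Because the input is plain data, it can be printed, compared, and constructed in tests without implementing a trait, which is the debugging benefit the description points to. The cost is the one already named: every token is built before parsing starts.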

Co-authored-by: Aleksey Kladov <aleksey.kladov@gmail.com>
bors bot added a commit that referenced this issue Dec 25, 2021
11117: internal: replace TreeSink with a data structure  r=matklad a=matklad

The general theme of this is to make parser a better independent
library.

The specific thing we do here is replacing callback based TreeSink with
a data structure. That is, rather than calling user-provided tree
construction methods, the parser now spits out a very bare-bones tree,
effectively a log of a DFS traversal.

This makes the parser usable without any *specific* tree sink, and allows
us to, eg, move tests into this crate.

Now, it's also true that this is a distinction without a difference, as
the old and the new interface are equivalent in expressiveness. Still,
this new thing seems somewhat simpler. But yeah, I admit I don't have a
suuper strong motivation here, just a hunch that this is better.
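
A rough sketch of what "a log of a DFS traversal" could look like, and how a consumer might rebuild a tree from it. `Step`, `Tree`, `build_tree`, and `example_steps` are hypothetical names for this illustration, and node kinds are plain strings here for brevity; the actual data structure in the PR may differ.

```rust
/// One entry in the parser's output log (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Step {
    /// Start of a node of the given kind.
    Enter { kind: &'static str },
    /// A leaf token inside the current node.
    Token { text: String },
    /// End of the most recently entered node.
    Exit,
}

/// A simple owned tree that a consumer might build from the log.
#[derive(Debug, PartialEq, Eq)]
pub enum Tree {
    Node { kind: &'static str, children: Vec<Tree> },
    Leaf(String),
}

/// Replays the log with an explicit stack. Returns `None` if the log
/// is unbalanced, has a token outside any node, or has several roots.
pub fn build_tree(steps: &[Step]) -> Option<Tree> {
    let mut stack: Vec<(&'static str, Vec<Tree>)> = Vec::new();
    let mut root = None;
    for step in steps {
        match step {
            Step::Enter { kind } => stack.push((*kind, Vec::new())),
            Step::Token { text } => stack.last_mut()?.1.push(Tree::Leaf(text.clone())),
            Step::Exit => {
                let (kind, children) = stack.pop()?;
                let node = Tree::Node { kind, children };
                match stack.last_mut() {
                    Some(parent) => parent.1.push(node),
                    None => {
                        if root.is_some() {
                            return None;
                        }
                        root = Some(node);
                    }
                }
            }
        }
    }
    if stack.is_empty() { root } else { None }
}

/// The log a parser might emit for `1 + 2`.
pub fn example_steps() -> Vec<Step> {
    vec![
        Step::Enter { kind: "BIN_EXPR" },
        Step::Enter { kind: "LITERAL" },
        Step::Token { text: "1".to_string() },
        Step::Exit,
        Step::Token { text: "+".to_string() },
        Step::Enter { kind: "LITERAL" },
        Step::Token { text: "2".to_string() },
        Step::Exit,
        Step::Exit,
    ]
}
```

The log and the callback interface carry the same information, which matches the "distinction without a difference" remark: `build_tree` is just one possible consumer, and a test could instead compare the `Vec<Step>` directly.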

cc #10765 

Co-authored-by: Aleksey Kladov <aleksey.kladov@gmail.com>
bors added a commit that referenced this issue Nov 9, 2023
…lnicola

Try to update parser/event doc

`TokenSource` and `TreeSink` have been refactored away as part of #10765; they no longer exist in the code repo. This PR tries to remove them from the event module-level comment to prevent confusion.