Move to a fully hand-written parser to improve compile / iteration times #173

domenicquirl · 2021-07-30T21:25:49Z

This is part of https://rust-lang.zulipchat.com/#narrow/stream/186049-t-compiler.2Fwg-polonius/topic/Polonius.20Hackathon.202021-07-30.

Preliminary results:

| ALL | OLD | NEW |
| --- | --- | --- |
| build --all | 28.874 | 8.38 |
| build --all --release | 38.162 | 14.15 |
| test | 42.74 | 16.569 |
| test --release | 53.709 | 23.505 |

| PARSER ONLY | OLD | NEW |
| --- | --- | --- |
| build | 26.367 | 1.427 |
| build --release | 31.56 | 2.84 |
| test | 28.244 | 2.1 |
| test --release | 34.41 | 2.945 |

replace `lalrpop` dependency with `logos` plus a hand-written parser

remove from .gitignore

lqd

Great job and great results! Thanks so much.

I left a couple of quick comments, but will try it out more later.

I would personally love to see a few more doc comments if you're up for it: it would really help understanding and maintaining this in the future (not that I expect the format to change any time soon).

For example, it would be great to have an extensibility example, either in the README or in the book, detailing the changes one would need to make to add, say, another effect emitting facts for a liveness relation.

It looks straightforward looking at the existing fact parsing in the PR, but since this is likely the most common operation we'll ever do on the parser, it seems like an interesting example to have.

polonius-parser/Cargo.toml

polonius-parser/src/lib.rs

lqd · 2021-08-01T20:49:52Z

polonius-parser/src/token.rs

+}
+
+#[macro_export]
+macro_rules! T {


Since this macro is used in a lot of places, a quick doc comment would be nice. I'm guessing it returns the interned token kind for a given token?

domenicquirl · 2021-08-06

`TokenKind` is just a `u16` and `Copy`, so there isn't really a need to intern it. The macro is more for convenience, and I also find it more readable, since you have a lot of checks against kinds in matches and calls like `self.consume(T!['('])?;` or lists like

`ParseError::UnexpectedToken { found, expected: vec![T![;], T![/]], position: self.position() }`

A macro like this is used by the folks over at rust-analyzer, and has come in handy for me in my own projects as well a few times. I've added some explanation in the code that reflects what I'm explaining here.
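For readers who have not seen this pattern, here is a minimal sketch of how such a `T!` macro can work. It is not the code from this PR: the variant names, the `Parser` struct, and the shape of `ParseError` are assumptions made up for illustration, and only a handful of tokens are covered. The thread only states that `TokenKind` is a `u16` and `Copy`, which the sketch approximates with a `#[repr(u16)]` enum.

```rust
/// Hypothetical token kinds for this sketch (not the PR's actual list).
#[repr(u16)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    ParenOpen,
    Semicolon,
    Slash,
    KwBlock,
    KwGoto,
}

/// Maps the literal text of a token to its `TokenKind` at compile time.
/// `(` has to be written as a char literal because a lone delimiter
/// cannot appear unbalanced inside a macro invocation.
macro_rules! T {
    ['('] => { TokenKind::ParenOpen };
    [;] => { TokenKind::Semicolon };
    [/] => { TokenKind::Slash };
    [block] => { TokenKind::KwBlock };
    [goto] => { TokenKind::KwGoto };
}

/// Hypothetical error type, loosely modelled on the snippet quoted above.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    UnexpectedToken {
        found: TokenKind,
        expected: Vec<TokenKind>,
        position: usize,
    },
    UnexpectedEof {
        expected: Vec<TokenKind>,
    },
}

/// Hypothetical parser over an already-lexed list of token kinds.
pub struct Parser {
    tokens: Vec<TokenKind>,
    position: usize,
}

impl Parser {
    pub fn new(tokens: Vec<TokenKind>) -> Self {
        Parser { tokens, position: 0 }
    }

    /// Advances past the next token if it has the expected kind.
    pub fn consume(&mut self, expected: TokenKind) -> Result<(), ParseError> {
        match self.tokens.get(self.position).copied() {
            Some(found) if found == expected => {
                self.position += 1;
                Ok(())
            }
            Some(found) => Err(ParseError::UnexpectedToken {
                found,
                expected: vec![expected],
                position: self.position,
            }),
            None => Err(ParseError::UnexpectedEof {
                expected: vec![expected],
            }),
        }
    }
}

/// `T!` also works in pattern position, which keeps matches readable.
pub fn is_keyword(kind: TokenKind) -> bool {
    match kind {
        T![block] | T![goto] => true,
        _ => false,
    }
}
```

Because every arm expands to a plain path, the macro adds no runtime cost; it only saves the reader from having to remember the variant name behind each piece of syntax.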

polonius-parser/src/token.rs

domenicquirl · 2021-08-06T15:23:30Z

@lqd I've made some changes addressing your comments.

During the sprint, I was very much focused on getting this off the ground and compile times as far down as possible, largely flying by the existing tests. Now with more time I've done some minor refactorings to clean up the implementation, add documentation to a lot of places and give the polonius_parser crate its own README. The latter contains some general description and instructions, plus the example you asked about.

I would have liked to put some doc tests on the actual parsing methods as well, but doc testing with internal items doesn't really work out that well.

Let me know in case you have further questions on this.

polonius-parser/README.MD

lqd · 2021-08-30T11:24:28Z

Thanks a ton!

KvanTTT · 2021-09-07T22:54:22Z

polonius-parser/README.md

+```
+
+## Usage
+The `polonius_parser` crate provides a single function `parse_input`, which takes a program description as its input string.


As a common practice, it's better to surround headers with blank lines. See MD022 (Headers should be surrounded by blank lines) and the other markdown lint rules.

lqd · 2021-09-08

True, good catch.

If you have time to open a PR, that would be great. Otherwise, I'll fix it soon.

KvanTTT · 2021-09-07T22:56:14Z

polonius-parser/README.md

+    ```rs
+    kw if kw.starts_with("loan_bazzles_var_at".as_bytes()) => {
+        ("loan_bazzles_var_at".len() as u32, T![loan_bazzles_var_at])
+    }
+    ```


Code blocks should also be surrounded by blank lines: MD031 (Fenced code blocks should be surrounded by blank lines).

KvanTTT · 2021-09-07T23:01:00Z

polonius-parser/src/lexer.rs

+            kw if kw.starts_with("use_of_var_derefs_origin".as_bytes()) => (
+                "use_of_var_derefs_origin".len() as u32,
+                T![use_of_var_derefs_origin],
+            ),
+            kw if kw.starts_with("drop_of_var_derefs_origin".as_bytes()) => (
+                "drop_of_var_derefs_origin".len() as u32,
+                T![drop_of_var_derefs_origin],
+            ),
+            kw if kw.starts_with("placeholders".as_bytes()) => {
+                ("placeholders".len() as u32, T![placeholders])
+            }
+            kw if kw.starts_with("known_subsets".as_bytes()) => {
+                ("known_subsets".len() as u32, T![known subsets])
+            }
+            // CFG keywords
+            kw if kw.starts_with("block".as_bytes()) => ("block".len() as u32, T![block]),
+            kw if kw.starts_with("goto".as_bytes()) => ("goto".len() as u32, T![goto]),
+            // effect keywords - facts
+            kw if kw.starts_with("outlives".as_bytes()) => ("outlives".len() as u32, T![outlives]),
+            kw if kw.starts_with("loan_issued_at".as_bytes()) => {
+                ("loan_issued_at".len() as u32, T![loan_issued_at])
+            }
+            kw if kw.starts_with("loan_invalidated_at".as_bytes()) => {
+                ("loan_invalidated_at".len() as u32, T![loan_invalidated_at])
+            }
+            kw if kw.starts_with("loan_killed_at".as_bytes()) => {
+                ("loan_killed_at".len() as u32, T![loan_killed_at])
+            }
+            kw if kw.starts_with("var_used_at".as_bytes()) => {
+                ("var_used_at".len() as u32, T![var_used_at])
+            }
+            kw if kw.starts_with("var_defined_at".as_bytes()) => {
+                ("var_defined_at".len() as u32, T![var_defined_at])
+            }
+            kw if kw.starts_with("origin_live_on_entry".as_bytes()) => (
+                "origin_live_on_entry".len() as u32,
+                T![origin_live_on_entry],
+            ),
+            kw if kw.starts_with("var_dropped_at".as_bytes()) => {
+                ("var_dropped_at".len() as u32, T![var_dropped_at])
+            }
+            // effect keywords - use
+            kw if kw.starts_with("use".as_bytes()) => ("use".len() as u32, T![use]),


Why isn't a prefix tree (trie) used here? It seems like it could provide additional performance, because it considers all keywords simultaneously during comparison.

lqd · 2021-09-08

Mostly because:

- The goal was to reduce compile times, and runtime is not especially important here: the tests are small and not yet numerous. A slight gain in parsing speed could be interesting in the future, but is not essential for this to land. If you're interested in benchmarking and improving this, by all means please do; we would love that.
- This whole PR was done during the latest Friday-afternoon sprint, and the lexer in particular was put together in a couple of hours at most. Impressive work in such a short time.
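For anyone who wants to pick up the benchmarking suggestion, here is a rough sketch of what a trie-based keyword lookup could look like. It is not part of this PR and has not been measured against the `starts_with` chain; the `Kw` enum, the `Trie` type, and `keyword_trie` are hypothetical names, and only four keywords are registered. One property worth noting: a longest-match lookup does not depend on the order in which keywords are registered, whereas in a chain of `starts_with` guards a short keyword such as `use` has to be tested after longer keywords that share its prefix.

```rust
use std::collections::HashMap;

/// Hypothetical keyword kinds for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kw {
    Use,
    UseOfVarDerefsOrigin,
    Block,
    Goto,
}

/// A byte-level trie node. `kind` is set when the path from the root
/// to this node spells a complete keyword.
#[derive(Default)]
pub struct Trie {
    children: HashMap<u8, Trie>,
    kind: Option<Kw>,
}

impl Trie {
    /// Registers `keyword` so that it resolves to `kind`.
    pub fn insert(&mut self, keyword: &str, kind: Kw) {
        let mut node = self;
        for &byte in keyword.as_bytes() {
            node = node.children.entry(byte).or_default();
        }
        node.kind = Some(kind);
    }

    /// Returns the length and kind of the longest registered keyword
    /// that is a prefix of `input`, walking the input only once.
    pub fn longest_match(&self, input: &[u8]) -> Option<(u32, Kw)> {
        let mut node = self;
        let mut best = None;
        for (index, byte) in input.iter().enumerate() {
            match node.children.get(byte) {
                Some(next) => node = next,
                None => break,
            }
            if let Some(kind) = node.kind {
                best = Some((index as u32 + 1, kind));
            }
        }
        best
    }
}

/// Builds a trie for a small, illustrative subset of the keywords.
pub fn keyword_trie() -> Trie {
    let mut trie = Trie::default();
    trie.insert("use", Kw::Use);
    trie.insert("use_of_var_derefs_origin", Kw::UseOfVarDerefsOrigin);
    trie.insert("block", Kw::Block);
    trie.insert("goto", Kw::Goto);
    trie
}
```

Like the guards in the diff above, this only checks prefixes and does not look for a word boundary after the keyword. A `HashMap` per node keeps the sketch short; whether this beats the current chain for inputs of this size would need to be benchmarked.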

domenicquirl added 3 commits July 30, 2021 21:08

parser: lalrpop -> hand-written

8a1c231

replace `lalrpop` dependency with `logos` plus a hand-written parser

parser: add new parser file

c9a448a

remove from .gitignore

parser: also remove logos and hand-write the lexer

c7cf86d

lqd reviewed Aug 1, 2021

domenicquirl added 6 commits August 6, 2021 13:41

Merge remote-tracking branch 'upstream/master' into parser

f25fe79

minor cleanups

c05239f

add rustdoc documentation

8a59678

factor out parameter parsing

002a07b

add README for parser with extension example

2fd73a3

use full relation names for T!

78c29ca

bjorn3 reviewed Aug 6, 2021

polonius-parser/README.MD

domenicquirl and others added 3 commits August 6, 2021 18:42

use lowercase .md extension

22ba020

fix typo in readme

c3e7a84

fix old rust-lang nursery reference

a7301ff

lqd merged commit 0cbbb7c into rust-lang:master Aug 30, 2021

KvanTTT reviewed Sep 7, 2021

domenicquirl commented Jul 30, 2021

lqd left a comment

lqd Aug 1, 2021

domenicquirl Aug 6, 2021

domenicquirl commented Aug 6, 2021

lqd commented Aug 30, 2021

KvanTTT Sep 7, 2021

lqd Sep 8, 2021

KvanTTT Sep 7, 2021

KvanTTT Sep 7, 2021

lqd Sep 8, 2021
