New Parser #262

AntonLydike · 2022-12-09T11:39:54Z

First of all, I want to note that none of this is final, and that I welcome any criticism and discussion on all parts of this! This is my "getting to know xDSL" project, so my understanding of the underlying principles is still basic at best. I hope the following makes sense.

The new parser is aimed at making it easier to:

Change existing syntax (as Mathieu isn't that happy with the current xDSL syntax (or MLIR's, for that matter))
Reason about the parsing code, fix bugs, introduce features, etc.
Implement custom parsing/printing for Attributes/Types
Move closer to 100% MLIR<->xDSL compat
Produce very nice error messages

For that a couple of things were implemented:

The parser now operates on Spans over the input, meaning we have location information attached to our string snippets we move around
Backtracking built into the parser from the ground up, using:
```
with self.tokenizer.backtracking():
    # do stuff
```
A BNF-like meta-programming layer was introduced
1. Complex structures can be written in a BNF-like notation in the parser to make analysis/changes easier
2. The BNF trees should be able to handle parsing and printing
3. In the future we want to automatically generate Attribute and Operation Parsing/Printing either from user-supplied BNF notation or from the tablegen spec
4. Will allow custom parsers to drop into their own parsing routines whenever they want
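
A minimal sketch of how Spans and a backtracking context manager could fit together. The names `Span`, `Tokenizer`, `ParseError` and `next_token` are illustrative assumptions, not the classes from this PR; only the `with ... backtracking():` usage pattern comes from the description above.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A slice of the input that remembers where it came from."""
    start: int
    end: int
    source: str

    @property
    def text(self) -> str:
        return self.source[self.start:self.end]


class ParseError(Exception):
    pass


class Tokenizer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0

    def next_token(self) -> Span:
        # Skip whitespace, then take a run of non-whitespace characters.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        start = self.pos
        while self.pos < len(self.source) and not self.source[self.pos].isspace():
            self.pos += 1
        if start == self.pos:
            raise ParseError("unexpected end of input")
        return Span(start, self.pos, self.source)

    @contextmanager
    def backtracking(self):
        # Remember the position; on a ParseError, rewind and swallow the error.
        saved = self.pos
        try:
            yield
        except ParseError:
            self.pos = saved


tok = Tokenizer("foo bar")
with tok.backtracking():
    first = tok.next_token()
    if first.text != "baz":
        raise ParseError("expected 'baz'")

# The failed attempt was rewound, so "foo" can be read again.
retry = tok.next_token()
```

Because every token is a `Span`, an error message can point at `retry.start` and `retry.end` in the original input rather than at a detached string.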

The BNF stuff:

There are a couple different motivating factors, and it wasn't easy to find something that satisfied them all:

Make it easy to reason about parsing correctness
Make it easy to implement custom parsers (and printers?)
Make it easy to convert from tablegen format

I also had the weird dream of generating the printer and the parser from a single spec.

So let's begin by addressing the first three points. For that I sketched out the parsing of a generic operation:

generic_operation = BNF.Group([  
    BNF.Nonterminal('string-literal', bind="name"),  
    BNF.Literal('('),  
    BNF.ListOf(BNF.Nonterminal('value-id'), bind='args'),  
    BNF.Literal(')'),  
    BNF.OptionalGroup([  
        BNF.Literal('['),  
        BNF.ListOf(BNF.Nonterminal('block-id'), allow_empty=False, bind='blocks'),  
        # TODO: allow for block args here?! (according to spec)
        BNF.Literal(']')  
    ]),  
    BNF.OptionalGroup([  
        BNF.Literal('('),  
        BNF.ListOf(BNF.Nonterminal('region'), bind='regions', allow_empty=False),  
        BNF.Literal(')')  
    ]),  
    BNF.Nonterminal('attr-dict', bind='attributes'),  
    BNF.Literal(':'),  
    BNF.Nonterminal('function-type', bind='type_signature')  
])

Note that a BNF.Literal represents a fixed string in the input, and is not to be confused with the parsing of e.g. a string-literal (so "some arbitrary string here").

Each Nonterminal calls an underlying function in the parser.

This is not exactly pure BNF. I instead provided something I feel is easier to use. For example, Optional and Group are often combined, so there is a wrapper for that. Parsing lists is also much more comfortable now with ListOf, which takes a containing token and a regex separator, and can be configured to either allow or disallow empty lists.

There are also no plans to provide an OR (so basically ( something | something-else )), as this explodes the parsing complexity.

Extracting parsed fields is done through the bind=<name> attributes on the nodes. After parsing is complete, you get a dictionary where the fields <name> are populated with the parsing results of that parser.

Note: There are constraints on what makes "sense" to bind to. If you, for example, bind inside of a ListOf, you will only have the last element in the output dictionary. Instead you should probably bind the ListOf itself? I am not sure though, because you can, theoretically, have arbitrary BNF inside the ListOf, which would take away all the simplicity we gained from bind. This is not a solved problem yet.
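
To make the bind semantics concrete, here is a rough, self-contained sketch of such a combinator layer. It is not the code from this PR: the `RULES` table of regexes stands in for the parser functions a Nonterminal would call, and `Input`, `Node`, `Pattern` and `store` are made-up helpers. It only mirrors the node names and the `bind=` / `allow_empty=` parameters used above.

```python
import re
from typing import Optional

# Hypothetical stand-ins for the parser functions a Nonterminal would call.
RULES = {
    "string-literal": r'"[^"]*"',
    "value-id": r"%\w+",
    "block-id": r"\^\w+",
}


class Input:
    def __init__(self, text: str):
        self.text, self.pos = text, 0

    def match(self, pattern: str) -> Optional[str]:
        # Skip whitespace, then match `pattern` at the current position.
        m = re.compile(r"\s*(" + pattern + ")").match(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.group(1)


class Node:
    def __init__(self, bind=None):
        self.bind = bind

    def store(self, out, value):
        if self.bind is not None:
            out[self.bind] = value
        return value


class Pattern(Node):
    """Shared base: match one regex and optionally bind the matched text."""

    def __init__(self, pattern, bind=None):
        super().__init__(bind)
        self.pattern = pattern

    def parse(self, inp, out):
        got = inp.match(self.pattern)
        return None if got is None else self.store(out, got)


class Literal(Pattern):
    def __init__(self, string, bind=None):
        super().__init__(re.escape(string), bind)


class Nonterminal(Pattern):
    def __init__(self, name, bind=None):
        super().__init__(RULES[name], bind)


class ListOf(Node):
    def __init__(self, element, separator=",", allow_empty=True, bind=None):
        super().__init__(bind)
        self.element, self.separator = element, separator
        self.allow_empty = allow_empty

    def parse(self, inp, out):
        items = []
        item = self.element.parse(inp, out)
        while item is not None:
            items.append(item)
            saved = inp.pos
            if inp.match(self.separator) is None:
                break
            item = self.element.parse(inp, out)
            if item is None:
                inp.pos = saved  # Un-consume a trailing separator.
        if not items and not self.allow_empty:
            return None
        return self.store(out, items)


class Group(Node):
    def __init__(self, children, bind=None):
        super().__init__(bind)
        self.children = children

    def parse(self, inp, out):
        saved, scratch = inp.pos, {}
        for child in self.children:
            if child.parse(inp, scratch) is None:
                inp.pos = saved  # Backtrack and discard partial bindings.
                return None
        out.update(scratch)
        return True


class OptionalGroup(Group):
    def parse(self, inp, out):
        super().parse(inp, out)
        return True  # An absent optional group is still a success.


generic_op = Group([
    Nonterminal("string-literal", bind="name"),
    Literal("("),
    ListOf(Nonterminal("value-id"), bind="args"),
    Literal(")"),
    OptionalGroup([
        Literal("["),
        ListOf(Nonterminal("block-id"), allow_empty=False, bind="blocks"),
        Literal("]"),
    ]),
])
fields = {}
generic_op.parse(Input('"cf.br"(%a, %b) [^bb1]'), fields)

# Binding inside the ListOf instead: each element overwrites the previous one.
last_only = {}
ListOf(Nonterminal("value-id", bind="arg")).parse(Input("%a, %b"), last_only)
```

In this sketch `fields` ends up with `name`, `args` and `blocks` populated, while `last_only` holds just the final `%b`, which is the pitfall described in the note above.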

Printing with the Parser?

This whole bind stuff gave me the idea that it might now be possible to go from dict[<name>, <value>] and the BNF tree back to the source code, and so implement parsing and printing in one!

On the surface it seems possible, but sadly it is a lot more complicated than that. How do we decide when to print an OptionalGroup? What do we do with a ListOf that contains a Group? Or an OptionalGroup? I have some ideas, but it's not very straightforward.

The best thing might be to restrict the nesting complexity of the BNF to allow for good ergonomics there. Let's see. I only just got "here" yesterday evening.
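
One possible answer to the OptionalGroup question, sketched under a loud assumption: an optional group is printed only if every name bound inside it has a non-empty value. This rule is my illustration, not something decided in this PR, and the node classes here (`Bound`, `BoundList`, a `Group` with an `optional` flag) are simplified stand-ins for the BNF tree.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Literal:
    text: str


@dataclass
class Bound:
    name: str  # A bound nonterminal whose value is already a string.


@dataclass
class BoundList:
    name: str
    separator: str = ", "


@dataclass
class Group:
    children: List["Node"]
    optional: bool = False


Node = Union[Literal, Bound, BoundList, Group]


def bound_names(node: Node) -> List[str]:
    """Collect every name bound somewhere inside `node`."""
    if isinstance(node, (Bound, BoundList)):
        return [node.name]
    if isinstance(node, Group):
        return [n for child in node.children for n in bound_names(child)]
    return []


def emit(node: Node, values: Dict[str, Any]) -> str:
    if isinstance(node, Literal):
        return node.text
    if isinstance(node, Bound):
        return str(values[node.name])
    if isinstance(node, BoundList):
        return node.separator.join(values[node.name])
    # Assumed rule: skip an optional group when any of its fields is empty.
    if node.optional and not all(values.get(n) for n in bound_names(node)):
        return ""
    return "".join(emit(child, values) for child in node.children)


tree = Group([
    Bound("name"), Literal("("), BoundList("args"), Literal(")"),
    Group([Literal(" ["), BoundList("blocks"), Literal("]")], optional=True),
])

with_blocks = emit(tree, {"name": '"cf.br"', "args": ["%a", "%b"], "blocks": ["^bb1"]})
without_blocks = emit(tree, {"name": '"test.op"', "args": [], "blocks": []})
```

The rule breaks down quickly: an optional group with no bound names is always printed, and a ListOf containing a Group has no single value to join, which is exactly why restricting nesting looks attractive.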

On Error Messages:

They currently don't fulfill their promise. At all. Sorry about that. It's all still very wonky right now!

codecov · 2022-12-09T11:42:29Z

Codecov Report

Base: 88.73% // Head: 88.53% // Decreases project coverage by -0.20% ⚠️

Coverage data is based on head (3c4349a) compared to base (b6678ee).
Patch coverage: 84.79% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
- Coverage   88.73%   88.53%   -0.20%     
==========================================
  Files          64       65       +1     
  Lines        7864     8017     +153     
  Branches     1286     1270      -16     
==========================================
+ Hits         6978     7098     +120     
- Misses        631      660      +29     
- Partials      255      259       +4

Impacted Files	Coverage Δ
xdsl/parser.py	`83.60% <ø> (-0.47%)`	⬇️
tests/test_attribute_definition.py	`95.36% <61.53%> (ø)`
xdsl/utils/exceptions.py	`68.62% <61.53%> (-23.69%)`	⬇️
xdsl/dialects/builtin.py	`82.57% <73.07%> (+1.19%)`	⬆️
xdsl/dialects/llvm.py	`91.48% <80.00%> (ø)`
tests/test_parser_error.py	`93.02% <83.33%> (-6.98%)`	⬇️
xdsl/ir.py	`84.69% <93.33%> (-0.06%)`	⬇️
tests/test_printer.py	`99.07% <96.66%> (ø)`
tests/test_attribute_builder.py	`99.22% <100.00%> (ø)`
tests/test_ir.py	`100.00% <100.00%> (ø)`
... and 13 more

superlopuh · 2022-12-09T14:05:49Z

xdsl/parser_ng.py

+        ])
+
+
+class MlirParser:


Suggested change

class MlirParser:

class MLIRParser:

superlopuh · 2022-12-09T14:08:06Z

xdsl/parser_ng.py

+    methods marked try_... will attempt to parse, and return None if they failed. If they return None
+    they must make sure to restore all state.
+
+    methods marked must_... will do greedy parsing, meaning they consume as much as they can. They will


"must parse" reads like a boolean expression to me; it took a little while to understand that it was actually parsing the text. Maybe replace it with either parse or parse_greedy to turn it into an active verb phrase?
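
For reference, a small sketch of the try_... contract from the docstring above: return None on failure and leave all state untouched. `MiniParser` and `ParseError` are hypothetical, and the second method's behaviour (raise when nothing can be parsed) is my assumption, since the quoted docstring is cut off. It is named with a plain `parse_` prefix, as suggested in this comment.

```python
from typing import Optional

DIGITS = "0123456789"


class ParseError(Exception):
    pass


class MiniParser:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def try_parse_integer(self) -> Optional[int]:
        """Return the parsed integer, or None with self.pos left untouched."""
        end = self.pos
        while end < len(self.text) and self.text[end] in DIGITS:
            end += 1
        if end == self.pos:
            return None  # Nothing consumed, so there is no state to restore.
        value = int(self.text[self.pos:end])
        self.pos = end
        return value

    def parse_integer(self) -> int:
        """Greedy variant: consume as many digits as possible or raise."""
        value = self.try_parse_integer()
        if value is None:
            raise ParseError(f"expected an integer at position {self.pos}")
        return value


parser = MiniParser("42abc")
number = parser.parse_integer()
missing = parser.try_parse_integer()
```

After this runs, `missing` is None and `parser.pos` still points at the `a`, so a caller can try a different alternative from the same position.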

webmiche · 2022-12-09T14:47:56Z

Phew, honestly this PR is a mouthful, so let me start at the top (or the thing I feel like I understand somewhat): the BNF description. I am reading this as the BNF description of the xDSL representation, not the MLIR one.

I am personally more familiar with EBNF which, AFAIK, is the same, but with some "syntactic sugar". Therefore, I will argue with EBNF. If something is unclear, please ask!

AFAIU, Optional and Group correspond to a choice and a list. In EBNF, we use { } in this case, so just the list, as a not-taken choice is the same as repeating 0 times in a way. Therefore, I feel like you might not even need to combine into an OptionalGroup, a Group should be enough (if it allows to be taken 0 times).

Next, I am confused that blocks are not nested into regions, and that they get the [ ] symbols. These symbols are used for the attributes in the current representation. Are you suggesting to change the representation of operations or is this an oversight?

And why is the entire thing wrapped inside a group?

Fundamentally, I feel that it really should be possible to generate parsers and printers from this, if we are a little bit careful about how we write the grammars. Not sure whether you are familiar with the concept of left/right-recursive grammars and their implications for parsing such grammars.

Anyway, I will jump into the code now, but I don't believe I will be able to fully review/understand this today. I will probably revisit it on Monday :)

AntonLydike · 2022-12-09T14:51:51Z

Hey, thanks @webmiche, I don't think you want to/should jump into the code right now. This PR is very WIP. It's more meant to be a "Hey, I'm working on this feature right now and am thinking about these concepts", and to get some feedback on the concepts.

Edit: Had the wrong mention here

math-fehr · 2022-12-09T15:18:50Z

Next, I am confused that blocks are not nested into regions, and that they get the [ ] symbols. These symbols are used for the attributes in the current representation. Are you suggesting to change the representation of operations or is this an oversight?

Just to quickly respond to that, I think that this is the syntax for the successors, not the blocks themselves!

webmiche · 2022-12-09T15:43:28Z

Ah yes. But aren't we currently printing successors wrapped into ( )?

math-fehr · 2022-12-09T15:51:42Z

I think we still wrap them in [].
Though I plan on drastically changing the IR syntax, to match almost the one from MLIR (besides 1-2 changes).
The only change I would like to keep is that the attributes and the operation type are written before the regions, so you don't have to jump to the other side of the IR to see them.
The other thing is maybe changing the region syntax, since I feel like the MLIR one (({}, {}, {})) is confusing, especially for users used to using {}, which will be parsed as an attribute dictionary.
The hope is that this PR will make it obvious how to change it, and make it less error prone!

webmiche · 2022-12-09T15:54:47Z

Looking at https://github.com/xdslproject/xdsl/blob/main/tests/filecheck/cf_ops.xdsl we use ( ), right? Am I confused? xD

Sure, I agree with you. I also don't like that regions and attributes are wrapped into { } in MLIR. So maybe we could keep the change to [ ] for attrs?

math-fehr · 2022-12-09T16:33:21Z

Looking at https://github.com/xdslproject/xdsl/blob/main/tests/filecheck/cf_ops.xdsl we use ( ), right? Am I confused? xD

Sure, I agree with you. I also don't like that regions and attributes are wrapped into { } in MLIR. So maybe we could keep the change to [ ] for attrs?

Okay, forget my comments, I'm probably the most confused xD I think [] is for successors in MLIR (but is often changed for the custom constraints).
I'm still not sure what is the best syntax for attributes, since people from the Python world would prefer {}, though [] is removing the ambiguous syntax that MLIR has.

webmiche · 2022-12-09T16:45:20Z

Well, at the end of the day, aren't we all confused? xD

Okay, yes, I agree that { } makes sense in the Python world, as this basically is a dictionary attached to an op...
On the other hand, people from Java/C/C++ feel like { } implies something like a function/nesting, which is pretty much a region...

I guess people from the Python world can also look at it as a list of tuples, so there is a weak argument for [ ] 😅

wence- · 2023-01-11T10:48:11Z

A flyby comment (I don't think I'm going to make it up the hill to the hackathon this week, sorry). Is there a reason that you are not using an existing package (for example lark) for the parsing infrastructure? It seems to me that would help quite a bit, because you just need to define the grammars and the translation of parsed trees into xDSL (using their existing tree-visitor infrastructure).

math-fehr · 2023-01-11T10:55:46Z

So the reason we cannot use most parser/printer generators is that we need to use arbitrary Python for the grammar (for attributes and operations).
If Lark allows executing arbitrary Python (without too much of a hassle), then I would say it's worth looking at!

AntonLydike · 2023-01-13T16:17:53Z

Current state:

Filecheck:

Testing Time: 3.92s
  Unsupported:  3
  Passed     : 19
  Failed     : 25

Pytest:
4 failed, 322 passed, 1 skipped

AntonLydike · 2023-01-13T16:27:43Z

I removed the BNF stuff from this PR as well; the design was not ready, and the PR is big enough as is.

webmiche

Could this be split into multiple PRs? I just scrolled through the text in the parser file and added the most obvious comments, but I cannot review the actual functionality like this; it is just too much.

webmiche · 2023-01-19T07:43:46Z

tests/test_ir.py

-    parser = Parser(ctx, program_func)
-    program: ModuleOp = parser.parse_op()
+    parser = XDSLParser(ctx, program_func)
+    program: ModuleOp = parser.must_parse_operation()


I feel that must_parse_operation sounds like an awful name for an API you want to expose. Maybe rename or wrap into a properly named function?

I think you are correct. There already is a function called begin_parse which is meant to be called from outside to parse a file. I'll get to it!

I just realized that this breaks some tests. Specifically, begin_parse makes sure that the operation parsed is a module_op. Some tests don't wrap their input in a builtin.module, and are therefore not "valid" xdsl/mlir programs.

I changed back to using must_parse_operation, as we want to parse just a single operation here. I don't think we should expose an interface like parse_op.

I think that must_parse_op is not an API the parser wants to expose. It's only meant to be used internally by the parser. The test just has to use it as it isn't using the "whole" parser. If that makes sense?

If it's not an exposed API, prefix it with _ please. (Python coding standards)

tests/test_mlir_printer.py

webmiche · 2023-01-19T07:47:22Z

tests/test_parser.py

+    attr = DictionaryAttr.from_dict(data)
+
+    with StringIO() as io:
+        Printer(io).print(attr)


I don't think this makes sense as this makes the Parser tests depend on the Printer. I feel that parser tests should really take strings and the data structure that is expected in order to test.

Valid point. I'll change that

I just noticed that most of the printer tests rely on the parser as well. What's up with that? Why is that more okay?

tests/test_printer.py

xdsl/ir.py

xdsl/xdsl_opt_main.py

webmiche · 2023-01-19T07:55:37Z

xdsl/parser.py

+import itertools
+import re
+import sys
+import traceback


Are any of these new dependencies?

These are all builtin python modules

I am surprised the parser needs the Python ast?

We use it to evaluate string literals. That is actually quite tricky, so instead of re-inventing the wheel I looked for a stdlib function that handles it for us. That's why I used ast.literal_eval. We could also have used json.loads, or something else. I just found literal_eval first while searching/thinking about it.
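
A short sketch of what decoding a quoted string token with `ast.literal_eval` could look like, next to `json.loads` for comparison. The helper name `decode_string_literal` and the type check are my additions, not code from this PR. Note that `literal_eval` accepts any Python literal, so the result should be checked, and Python's escape rules may not coincide exactly with those of the IR syntax.

```python
import ast
import json


def decode_string_literal(token: str) -> str:
    """Turn a quoted, escaped token into the string it denotes."""
    value = ast.literal_eval(token)
    # literal_eval also accepts numbers, tuples, etc., so guard the type.
    if not isinstance(value, str):
        raise ValueError(f"not a string literal: {token!r}")
    return value


# The token as it appears in the source text, backslashes included.
raw = r'"line one\nline \"two\""'

decoded = decode_string_literal(raw)
decoded_json = json.loads(raw)

# json.loads only accepts double-quoted strings; literal_eval takes both.
single_quoted = decode_string_literal("'single'")
```

For this input both functions agree; they differ on inputs such as single-quoted strings, which `json.loads` rejects.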

xdsl/parser.py

AntonLydike · 2023-01-19T10:50:57Z

@webmiche How would you go about splitting something like this into smaller PRs? I'm genuinely curious, I can't think of a sensible way.

I could sit down with you and give you a high-level overview over the concepts, if that would help? The plan was to do the review during the Hackathon, sadly I couldn't get it done in time :(

webmiche · 2023-01-19T11:02:45Z

So I guess the issue is that a lot of tests need a full parser in order to run, right?

I could imagine having a branch without the old parser and with all tests marked as UNSUPPORTED and then basically upstream into that. So start out with a xDSLParser that can just about parse an empty module and then develop that by adding more functionality, enabling tests along the way. And once you pass a good amount of tests, we can merge that parser into the main branch and continue upstreaming there.

Or just remove everything that is around the MLIRParser. That should be relatively little anyhow. Then we can let that still flow through the old infrastructure, or maybe not support it at all. I think this would already cut down the number of lines by a lot.

Also notice that you are pretty much removing the old parser, so github diffs look extremely bad. Simply renaming the parser file might already be quite an improvement on that side.

AntonLydike · 2023-01-19T11:10:53Z

We can move the parser back to parser_ng.py (I originally developed it there), which would make the diff much more readable. The problem with that would be that we then have two completely different parsers in the same codebase. This might be a worthy tradeoff for git diff visibility though, as the old one can be removed relatively easily in a follow-up PR. I can make the change if you want @webmiche

math-fehr

Nice! I think we can merge it now!

superlopuh reviewed Dec 9, 2022

superlopuh reviewed Dec 9, 2022

georgebisbas assigned AntonLydike Jan 4, 2023

georgebisbas added the dialects Changes on the dialects label Jan 4, 2023

tobiasgrosser added the hackathon To be tackled at the hackathon label Jan 7, 2023

webmiche added xdsl xdsl framework specific changes and removed dialects Changes on the dialects labels Jan 10, 2023

AntonLydike marked this pull request as draft January 11, 2023 16:43

AntonLydike force-pushed the anton/parser-printer-rework branch from 726dba6 to a69de90 Compare January 13, 2023 16:13

AntonLydike force-pushed the anton/parser-printer-rework branch 3 times, most recently from 77ec5ca to 2fcfedb Compare January 18, 2023 17:19

AntonLydike marked this pull request as ready for review January 18, 2023 18:17

webmiche requested changes Jan 19, 2023

AntonLydike and others added 23 commits January 23, 2023 16:38

parser: fixed special attribute-entry parsing for UnitAttr

b58143a

xdsl: fix how the parser is used in tests

58339bc

xdsl: fix typo in Block.delcared_at

d0b5a5b

xdsl: fixed errorneous type hints on DictionaryAttr

41049fe

tests: fix tests that don't wrap their input in a builtin.module

7fda7e0

parser: fix a typo

70ecada

Co-authored-by: Fehr Mathieu <mathieu.fehr@gmail.com>

parser: add docstring to tokenizer

8e5ce88

parser: remove BacktrackingAbort

6968619

parser: add docstring for BacktrackingHistory

ebbc7bc

parser: renamed begin_parse to parse_module

56490fd

parser: added return type to get_block_from_name

932e403

parser: fix typos and alignement issues

4e985cd

parser: removed get_nth_line_bounds - unused function

ea9a75b

parser: fix minor nitpicks in tests

db7c4cb

xdsl: revert back to a callable interface for xdsl-opt frontends

4e27ede

tests: removed a bunch of unneeded arguments for the parser

7a40bf2

parser: move stuff around, fix formatting

46fee68

parser: uppercase comments

2f09883

parser: make a bunch of methods private

91a8d12

tests: clean up imports and unused vars in test_parse_error

85810fc

tools: removed assertion text - off topic

54a7788

tests: fix docstring

9db3804

xdsl: removed must_ prefix from parser methods

868e509

AntonLydike force-pushed the anton/parser-printer-rework branch from c3b2915 to 868e509 Compare January 23, 2023 16:38

AntonLydike added 3 commits January 23, 2023 16:39

formatting: run yapf

de5fa12

formatting: fixed typos and other minor issues

cc67615

formatting: yapf run

3c4349a

math-fehr approved these changes Jan 23, 2023

webmiche merged commit b09e94e into main Jan 23, 2023

tobiasgrosser deleted the anton/parser-printer-rework branch January 24, 2023 06:58
