Tidy up preprocessor and parser #12

tomstuart · 2023-07-16T15:08:37Z

The preprocessor and parser (and their tests) were becoming a little messy and inconsistent. This PR cleans them up to fix many minor annoyances in preparation for implementing the type use abbreviation.

The problems addressed are mostly related to naming and code organisation. The most significant change is that the instructions inside a global definition’s initialiser expression and the expanded result of an inline import are now being preprocessed correctly; I’d previously overlooked the need to do this because neither case is exercised by the WebAssembly test suite for any of the abbreviations implemented so far.

These don’t really affect the preprocessor because it doesn’t care about the details, but it’s less confusing if the modules in the tests are at least syntactically valid.

Since almost all of the preprocessor tests are for fields within a single module body, we can reduce boilerplate (and indentation levels) by introducing an assertion which automatically wraps its arguments in `(module …)`.

Most of the preprocessor tests use syntax which relies upon the ability to omit an explicit type use [0] and have it automatically inserted when abbreviations are expanded. To prevent these tests from breaking when we implement desugaring of type use abbreviations we need to update them to contain full type uses which are unaffected by the expansion of that abbreviation. When adding an unabbreviated type use, I’ve removed the corresponding inline declarations unless they’re the focus of the test. [0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations

This makes it easier to see what’s gone wrong.

As with `offset` and element expressions, the test is demonstrating desugaring of instructions in an invalid program because that’s the correct behaviour for the parsing phase regardless of validation. In practice only constant instructions [0] may appear inside a global definition [1] but we have no way to demonstrate desugaring of constant instructions yet. [0] https://webassembly.github.io/spec/core/valid/instructions.html#valid-constant [1] https://webassembly.github.io/spec/core/valid/modules.html#globals

We need to desugar an inline import after expanding it because its description can contain other abbreviations (e.g. multiple anonymous parameters and results in the case of function imports). I overlooked this earlier because I didn’t write a test to cover it and presumably the official WebAssembly test suite doesn’t cover it either, so no tests failed when I removed inline import support from the parser.

These are the only two remaining kinds of field and neither supports any abbreviations, so we can list them as an intentional no-op case and have all field kinds explicitly represented.

This separation exists in most places in the preprocessor, so for the sake of stylistic consistency we should do it everywhere.

…cessor There’s no reason for these to be different, and the longer names are clearer.

It’s still the wrong way of doing it, but at least now it’s only wrong once.

This logic is repeated in a lot of places, so let’s extract and reuse it. IDs are essentially always optional so I’ve chosen to write a single method which unconditionally tries to read an ID (and may return nil) rather than a pair of #can_read_id?/#read_id methods. That does mean we have to check for `id.nil?` in a couple of places rather than asking a dedicated predicate, but it’s rare enough that I think this trade-off is the right one.

…uction This is how we do it everywhere else, and there’s no reason for it to be different here. The pattern match also has the advantage of tolerating non-string values (e.g. arrays), which makes the #can_read_list? call unnecessary. Removing the numbered parameter changes the block’s arity, so we need to add an explicit dummy parameter to keep #repeatedly happy. In hindsight I don’t think the `until:` API is a good idea so I’m going to change that soon.

This provides a nice expressive interface for dealing with indexes, moves a complicated regexp out of the parser, and puts the implementation in a helper where the preprocessor will be able to share it in future.

In 8ea2aa0 I removed unnecessary `?:` from regular expressions which already didn’t capture unnamed groups (because they also contained named ones), but I neglected to check whether other regular expressions were unintentionally capturing unnamed groups, so I’m fixing that now. Although this adds more syntactic noise to the regular expressions, I think our intent is clearer if we only capture groups which we intend to use, so I’m prepared to pay the price.

The pattern match will raise an exception if it fails, so there’s no need to do it ourselves.

The grammar says a type use appears here [0] so we should be reusing all of the appropriate parsing machinery. This creates the opportunity to check that no named parameters appear in the type use, which we’ll do in the next commit. [0] https://webassembly.github.io/spec/core/text/instructions.html#text-call-indirect

We already do this in #parse_blocktype because of its side condition which requires a type use to contain only unnamed entries [0]. The same side condition applies to `call_indirect`’s type use [1] so we should perform the same check. [0] https://webassembly.github.io/spec/core/text/instructions.html#text-instr-block [1] https://webassembly.github.io/spec/core/text/instructions.html#text-call-indirect

The grammar says zero or more results appear here [0] so we should be reusing the appropriate parsing machinery. [0] https://webassembly.github.io/spec/core/text/instructions.html#parametric-instructions

These are indexes [0], so we should be reusing the index-parsing machinery. [0] https://webassembly.github.io/spec/core/text/instructions.html#table-instructions

This’ll allow the loop to be terminated by the `StopIteration` exception instead of needing to support custom conditions through arguments.

It’s admittedly a bit “exceptions for control flow”, but this way of using `StopIteration` is idiomatic in Ruby so I think giving the caller arbitrary control over loop termination is easier to understand than relying upon a mystery `until:` argument. I’ve resisted the temptation to push `raise StopIteration` further down the call stack (into e.g. #read_list, or even #read) to avoid spooky action at a distance; I think the logic is easiest to understand when `raise StopIteration` appears syntactically within the #repeatedly block. I’ve also chosen a guard-style raise (`raise … if …`) rather than putting the whole body of the block inside a conditional (`if … else raise … end`) because the former puts the `raise` lexically adjacent to the #repeatedly call instead of pushing it further away (e.g. inside the `else` clause) which should improve clarity. This choice also helps to keep the indentation levels under control.

We always use `StopIteration` now.

… parser This avoids us having to do the concatenating ourselves. Again I’ve chosen guard-style `raise StopIteration if …` clauses to keep them adjacent to the #repeatedly call for clarity’s sake.

I’d like to use destructuring with optional block parameters in the implementation of #unzip_pairs, but there’s a bug in YJIT [0] which prevents this from working. The easiest solution for now is to turn YJIT off. [0] Shopify/yjit#313

When `pairs` is empty then `pairs.transpose` is `[]`, which gets destructured into a pair of nils for the two block parameters. We can replace the nils in this case by defaulting each parameter to the empty array.

The only uses of #unzip_pairs are to unzip the results of these methods, so we might as well have them return an unzipped result.

This was the only caller of #parse_declarations which didn’t unzip its result with #unzip_pairs. By doing it consistently across all callers we create the opportunity to have #parse_declarations return unzipped results directly, which we’ll do in the next commit.

The only uses of #unzip_pairs are to unzip the result of the #parse_declarations method, so we might as well have it return an unzipped result.

This is now the only caller of #unzip_pairs, so we can safely build it into #parse_declarations directly.

Validation constrains modules to at most one memory [0] but the parser doesn’t need to worry about that. It creates more regularity in the parser to support arbitrarily many memories (just like functions, tables, globals etc) and let validation deal with any problems. [0] https://webassembly.github.io/spec/core/valid/modules.html#valid-module

Now that these are all arrays (except for `start`) we can consolidate the setup of their initial values.

I’d like to support multiple memories here to match the AST parser, and that change will be easier if we assign `current_module.memory` to a local variable instead of repeatedly using it inline.

Validation won’t actually allow multiple memories [0] but it makes the implementation more regular, and more like the formal WebAssembly language specification, if we assume they’re possible. All of the memory instructions implicitly operate on memory index zero [1] so we’ve now made that explicit in their implementation. The #initialise_memories method no longer raises an error if the memory index is non-zero; this check should’ve been happening during validation anyway (which we’re not doing yet) so losing it doesn’t matter. [0] https://webassembly.github.io/spec/core/valid/modules.html#valid-module [1] https://webassembly.github.io/spec/core/syntax/instructions.html#memory-instructions

Now that modules have multiple memories, we can use the same approach for all relevant fields during allocation [0]. This makes it clear that exports are different and shouldn’t be “built” in the same way; we could’ve already seen this from the definition of module allocation [1]. I’m not going to get distracted by making any more changes to the interpreter at the moment, but we should come back to this in future. [0] https://webassembly.github.io/spec/core/exec/modules.html#allocation [1] https://webassembly.github.io/spec/core/exec/modules.html#alloc-module

To desugar the type use abbreviation [0] we’ll need access to a list of all the module’s type definitions [1] when preprocessing a type use [2], but the full list of type definitions will only be available once the complete module has been processed. This creates a chicken-and-egg situation: we can’t desugar a type use until we’ve processed all of the enclosing module’s fields, but many of those fields will contain type uses which need to be desugared during processing. This commit breaks the deadlock by refactoring the preprocessor to allow some field-processing methods to return a “deferred result”, namely a proc which must be called with a list of type definitions in order to produce the final S-expression. Deferred results give us the ability to process a module’s fields first and then provide type definitions later; the next step will be to generate the type definitions during processing so we can make use of that ability. This continues the preparatory work from #12 and #13. We’re still in the “make the change easy” [3] phase, so there’s no externally visible difference in the preprocessor’s behaviour yet. [0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations [1] https://webassembly.github.io/spec/core/text/modules.html#types [2] https://webassembly.github.io/spec/core/text/modules.html#type-uses [3] https://twitter.com/KentBeck/status/250733358307500032

) The #process_module method is ready to provide type definitions to the deferred results produced by preprocessing the module’s fields so that the type use abbreviation [0] can be desugared, but we don’t currently have any way to communicate those type definitions to #process_module. Although the processed fields do include type definition fields, they’re all deferred until we’re ready to provide the type definitions — another chicken-and-egg situation. This commit adds a “list of type definitions” return value to #process_type_definition and refactors other preprocessor methods to allow that value to propagate up to #process_module. The next (and final!) step will be to populate it with real type definitions and use those to desugar the type use abbreviation. This concludes the preparatory work begun in #12, #13 and #14, and therefore still doesn’t introduce any externally visible changes to the preprocessor’s behaviour. [0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations

…tions (#17) Despite #12, the preprocessor, parser and interpreter (and their tests) were still a little messy and inconsistent. This commit cleans them up to fix many minor annoyances in preparation for implementing instruction abbreviations. The problems addressed are mostly related to naming and code organisation. The largest change is that all of the preprocessor tests have been updated to no longer depend upon the remaining instruction abbreviations (i.e. folded instructions and missing `call_index` index) which we intend to desugar.

tomstuart added 30 commits July 2, 2023 22:45

Fix typos in preprocessor tests

b6f09f5

These don’t really affect the preprocessor because it doesn’t care about the details, but it’s less confusing if the modules in the tests are at least syntactically valid.

Extract #assert_preprocess_module_fields test helper

3cf39b6

Since almost all of the preprocessor tests are for fields within a single module body, we can reduce boilerplate (and indentation levels) by introducing an assertion which automatically wraps its arguments in `(module …)`.

Pretty print S-expressions when a preprocessor test fails

27c94c4

This makes it easier to see what’s gone wrong.

Make preprocessing for export and start fields explicit

41f065c

These are the only two remaining kinds of field and neither supports any abbreviations, so we can list them as an intentional no-op case and have all field kinds explicitly represented.

Decouple reading from result generation everywhere in preprocessor

02663dc

This separation exists in most places in the preprocessor, so for the sake of stylistic consistency we should do it everywhere.

Rename module field parsing methods to match language spec and prepro…

7ed0b57

…cessor There’s no reason for these to be different, and the longer names are clearer.

Extract #fresh_id helper in preprocessor

b2d3ea8

It’s still the wrong way of doing it, but at least now it’s only wrong once.

Extract ReadIndex helper module from AST parser

adf38a6

This provides a nice expressive interface for dealing with indexes, moves a complicated regexp out of the parser, and puts the implementation in a helper where the preprocessor will be able to share it in future.

Clean up action read in #parse_assert_return and #parse_assert_trap

9483690

The pattern match will raise an exception if it fails, so there’s no need to do it ourselves.

Use #parse_results when parsing select instruction

499b1e1

The grammar says zero or more results appear here [0] so we should be reusing the appropriate parsing machinery. [0] https://webassembly.github.io/spec/core/text/instructions.html#parametric-instructions

Use #parse_index when parsing table.init instruction

af92d91

These are indexes [0], so we should be reusing the index-parsing machinery. [0] https://webassembly.github.io/spec/core/text/instructions.html#table-instructions

Use Kernel#loop in #repeatedly helper

f3681f3

This’ll allow the loop to be terminated by the `StopIteration` exception instead of needing to support custom conditions through arguments.

Remove unused until: argument from #repeatedly helper

347c635

We always use `StopIteration` now.

Use #repeatedly helper instead of while/until in preprocessor and…

bff8044

… parser This avoids us having to do the concatenating ourselves. Again I’ve chosen guard-style `raise StopIteration if …` clauses to keep them adjacent to the #repeatedly call for clarity’s sake.

Disable YJIT when running tests

192014b

I’d like to use destructuring with optional block parameters in the implementation of #unzip_pairs, but there’s a bug in YJIT [0] which prevents this from working. The easiest solution for now is to turn YJIT off. [0] Shopify/yjit#313

Use optional block parameters in #unzip_pairs

dd5488c

When `pairs` is empty then `pairs.transpose` is `[]`, which gets destructured into a pair of nils for the two block parameters. We can replace the nils in this case by defaulting each parameter to the empty array.

Push #unzip_pairs into #parse_parameters and #parse_locals

505caa5

The only uses of #unzip_pairs are to unzip the results of these methods, so we might as well have them return an unzipped result.

Push #unzip_pairs into #parse_declarations

c98e965

The only uses of #unzip_pairs are to unzip the result of the #parse_declarations method, so we might as well have it return an unzipped result.

Inline #unzip_pairs into #parse_declarations

b951678

This is now the only caller of #unzip_pairs, so we can safely build it into #parse_declarations directly.

tomstuart added 4 commits July 16, 2023 09:41

Initialise all module field vectors together in AST parser

6df4e7a

Now that these are all arrays (except for `start`) we can consolidate the setup of their initial values.

Consolidate references to singleton memory in interpreter

17157a4

I’d like to support multiple memories here to match the AST parser, and that change will be easier if we assign `current_module.memory` to a local variable instead of repeatedly using it inline.

tomstuart merged commit a0534d5 into main Jul 16, 2023
2 checks passed

tomstuart deleted the tidy-up-preprocessor-and-parser branch July 16, 2023 15:12

tomstuart mentioned this pull request Jul 19, 2023

Allow type definitions to be provided after preprocessing a module #14

Merged

tomstuart mentioned this pull request Jul 20, 2023

Allow type definitions to be generated while preprocessing a module #15

Merged

tomstuart mentioned this pull request Aug 20, 2023

Tidy up preprocessor, parser and interpreter for instruction abbreviations #17

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tidy up preprocessor and parser #12

Tidy up preprocessor and parser #12

tomstuart commented Jul 16, 2023

Tidy up preprocessor and parser #12

Tidy up preprocessor and parser #12

Conversation

tomstuart commented Jul 16, 2023