Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy up preprocessor and parser #12

Merged
merged 34 commits into from Jul 16, 2023
Merged

Conversation

tomstuart
Copy link
Owner

The preprocessor and parser (and their tests) were becoming a little messy and inconsistent. This PR cleans them up to fix many minor annoyances in preparation for implementing the type use abbreviation.

The problems addressed are mostly related to naming and code organisation. The most significant change is that the instructions inside a global definition’s initialiser expression and the expanded result of an inline import are now being preprocessed correctly; I’d previously overlooked the need to do this because neither case is exercised by the WebAssembly test suite for any of the abbreviations implemented so far.

These don’t really affect the preprocessor because it doesn’t care about
the details, but it’s less confusing if the modules in the tests are at
least syntactically valid.
Since almost all of the preprocessor tests are for fields within a
single module body, we can reduce boilerplate (and indentation levels)
by introducing an assertion which automatically wraps its arguments in
`(module …)`.
Most of the preprocessor tests use syntax which relies upon the ability
to omit an explicit type use [0] and have it automatically inserted when
abbreviations are expanded. To prevent these tests from breaking when we
implement desugaring of type use abbreviations we need to update them to
contain full type uses which are unaffected by the expansion of that
abbreviation.

When adding an unabbreviated type use, I’ve removed the corresponding
inline declarations unless they’re the focus of the test.

[0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations
This makes it easier to see what’s gone wrong.
As with `offset` and element expressions, the test is demonstrating
desugaring of instructions in an invalid program because that’s the
correct behaviour for the parsing phase regardless of validation. In
practice only constant instructions [0] may appear inside a global
definition [1] but we have no way to demonstrate desugaring of constant
instructions yet.

[0] https://webassembly.github.io/spec/core/valid/instructions.html#valid-constant
[1] https://webassembly.github.io/spec/core/valid/modules.html#globals
We need to desugar an inline import after expanding it because its
description can contain other abbreviations (e.g. multiple anonymous
parameters and results in the case of function imports).

I overlooked this earlier because I didn’t write a test to cover it and
presumably the official WebAssembly test suite doesn’t cover it either,
so no tests failed when I removed inline import support from the parser.
These are the only two remaining kinds of field and neither supports any
abbreviations, so we can list them as an intentional no-op case and have
all field kinds explicitly represented.
This separation exists in most places in the preprocessor, so for the
sake of stylistic consistency we should do it everywhere.
…cessor

There’s no reason for these to be different, and the longer names are
clearer.
It’s still the wrong way of doing it, but at least now it’s only wrong
once.
This logic is repeated in a lot of places, so let’s extract and
reuse it.

IDs are essentially always optional so I’ve chosen to write a single
method which unconditionally tries to read an ID (and may return nil)
rather than a pair of #can_read_id?/#read_id methods. That does mean we
have to check for `id.nil?` in a couple of places rather than asking a
dedicated predicate, but it’s rare enough that I think this trade-off is
the right one.
…uction

This is how we do it everywhere else, and there’s no reason for it to be
different here. The pattern match also has the advantage of tolerating
non-string values (e.g. arrays), which makes the #can_read_list? call
unnecessary.

Removing the numbered parameter changes the block’s arity, so we need to
add an explicit dummy parameter to keep #repeatedly happy. In hindsight
I don’t think the `until:` API is a good idea so I’m going to change
that soon.
This provides a nice expressive interface for dealing with indexes,
moves a complicated regexp out of the parser, and puts the
implementation in a helper where the preprocessor will be able to share
it in future.
In 8ea2aa0 I removed unnecessary `?:`
from regular expressions which already didn’t capture unnamed groups
(because they also contained named ones), but I neglected to check
whether other regular expressions were unintentionally capturing unnamed
groups, so I’m fixing that now.

Although this adds more syntactic noise to the regular expressions, I
think our intent is clearer if we only capture groups which we intend to
use, so I’m prepared to pay the price.
The pattern match will raise an exception if it fails, so there’s no
need to do it ourselves.
The grammar says a type use appears here [0] so we should be reusing all
of the appropriate parsing machinery.

This creates the opportunity to check that no named parameters appear in
the type use, which we’ll do in the next commit.

[0] https://webassembly.github.io/spec/core/text/instructions.html#text-call-indirect
We already do this in #parse_blocktype because of its side condition
which requires a type use to contain only unnamed entries [0]. The same
side condition applies to `call_indirect`’s type use [1] so we should
perform the same check.

[0] https://webassembly.github.io/spec/core/text/instructions.html#text-instr-block
[1] https://webassembly.github.io/spec/core/text/instructions.html#text-call-indirect
The grammar says zero or more results appear here [0] so we should be
reusing the appropriate parsing machinery.

[0] https://webassembly.github.io/spec/core/text/instructions.html#parametric-instructions
This’ll allow the loop to be terminated by the `StopIteration` exception
instead of needing to support custom conditions through arguments.
It’s admittedly a bit “exceptions for control flow”, but this way of
using `StopIteration` is idiomatic in Ruby so I think giving the caller
arbitrary control over loop termination is easier to understand than
relying upon a mystery `until:` argument.

I’ve resisted the temptation to push `raise StopIteration` further down
the call stack (into e.g. #read_list, or even #read) to avoid spooky
action at a distance; I think the logic is easiest to understand when
`raise StopIteration` appears syntactically within the #repeatedly
block.

I’ve also chosen a guard-style raise (`raise … if …`) rather than
putting the whole body of the block inside a conditional (`if … else
raise … end`) because the former puts the `raise` lexically adjacent to
the #repeatedly call instead of pushing it further away (e.g. inside the
`else` clause) which should improve clarity. This choice also helps to
keep the indentation levels under control.
We always use `StopIteration` now.
… parser

This avoids us having to do the concatenating ourselves.

Again I’ve chosen guard-style `raise StopIteration if …` clauses to keep
them adjacent to the #repeatedly call for clarity’s sake.
I’d like to use destructuring with optional block parameters in the
implementation of #unzip_pairs, but there’s a bug in YJIT [0] which
prevents this from working. The easiest solution for now is to turn YJIT
off.

[0] Shopify/yjit#313
When `pairs` is empty then `pairs.transpose` is `[]`, which gets
destructured into a pair of nils for the two block parameters. We can
replace the nils in this case by defaulting each parameter to the empty
array.
The only uses of #unzip_pairs are to unzip the results of these methods,
so we might as well have them return an unzipped result.
This was the only caller of #parse_declarations which didn’t unzip its
result with #unzip_pairs. By doing it consistently across all callers we
create the opportunity to have #parse_declarations return unzipped
results directly, which we’ll do in the next commit.
The only uses of #unzip_pairs are to unzip the result of
the #parse_declarations method, so we might as well have it return an
unzipped result.
This is now the only caller of #unzip_pairs, so we can safely build it
into #parse_declarations directly.
Validation constrains modules to at most one memory [0] but the parser
doesn’t need to worry about that. It creates more regularity in the
parser to support arbitrarily many memories (just like functions,
tables, globals etc) and let validation deal with any problems.

[0] https://webassembly.github.io/spec/core/valid/modules.html#valid-module
Now that these are all arrays (except for `start`) we can consolidate
the setup of their initial values.
I’d like to support multiple memories here to match the AST parser, and
that change will be easier if we assign `current_module.memory` to a
local variable instead of repeatedly using it inline.
Validation won’t actually allow multiple memories [0] but it makes the
implementation more regular, and more like the formal WebAssembly
language specification, if we assume they’re possible. All of the memory
instructions implicitly operate on memory index zero [1] so we’ve now
made that explicit in their implementation.

The #initialise_memories method no longer raises an error if the memory
index is non-zero; this check should’ve been happening during validation
anyway (which we’re not doing yet) so losing it doesn’t matter.

[0] https://webassembly.github.io/spec/core/valid/modules.html#valid-module
[1] https://webassembly.github.io/spec/core/syntax/instructions.html#memory-instructions
Now that modules have multiple memories, we can use the same approach
for all relevant fields during allocation [0].

This makes it clear that exports are different and shouldn’t be “built”
in the same way; we could’ve already seen this from the definition of
module allocation [1]. I’m not going to get distracted by making any
more changes to the interpreter at the moment, but we should come back
to this in future.

[0] https://webassembly.github.io/spec/core/exec/modules.html#allocation
[1] https://webassembly.github.io/spec/core/exec/modules.html#alloc-module
@tomstuart tomstuart merged commit a0534d5 into main Jul 16, 2023
2 checks passed
@tomstuart tomstuart deleted the tidy-up-preprocessor-and-parser branch July 16, 2023 15:12
tomstuart added a commit that referenced this pull request Jul 19, 2023
To desugar the type use abbreviation [0] we’ll need access to a list of
all the module’s type definitions [1] when preprocessing a type use [2],
but the full list of type definitions will only be available once the
complete module has been processed. This creates a chicken-and-egg
situation: we can’t desugar a type use until we’ve processed all of the
enclosing module’s fields, but many of those fields will contain type
uses which need to be desugared during processing.

This commit breaks the deadlock by refactoring the preprocessor to allow
some field-processing methods to return a “deferred result”, namely a
proc which must be called with a list of type definitions in order to
produce the final S-expression. Deferred results give us the ability to
process a module’s fields first and then provide type definitions later;
the next step will be to generate the type definitions during processing
so we can make use of that ability.

This continues the preparatory work from #12 and #13. We’re still in the
“make the change easy” [3] phase, so there’s no externally visible
difference in the preprocessor’s behaviour yet.

[0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations
[1] https://webassembly.github.io/spec/core/text/modules.html#types
[2] https://webassembly.github.io/spec/core/text/modules.html#type-uses
[3] https://twitter.com/KentBeck/status/250733358307500032
tomstuart added a commit that referenced this pull request Jul 20, 2023
)

The #process_module method is ready to provide type definitions to the
deferred results produced by preprocessing the module’s fields so that
the type use abbreviation [0] can be desugared, but we don’t currently
have any way to communicate those type definitions to #process_module.
Although the processed fields do include type definition fields, they’re
all deferred until we’re ready to provide the type definitions — another
chicken-and-egg situation.

This commit adds a “list of type definitions” return value
to #process_type_definition and refactors other preprocessor methods to
allow that value to propagate up to #process_module. The next (and
final!) step will be to populate it with real type definitions and use
those to desugar the type use abbreviation.

This concludes the preparatory work begun in #12, #13 and #14, and
therefore still doesn’t introduce any externally visible changes to the
preprocessor’s behaviour.

[0] https://webassembly.github.io/spec/core/text/modules.html#abbreviations
tomstuart added a commit that referenced this pull request Aug 20, 2023
…tions (#17)

Despite #12, the preprocessor, parser and interpreter (and their tests)
were still a little messy and inconsistent. This commit cleans them up
to fix many minor annoyances in preparation for implementing instruction
abbreviations.

The problems addressed are mostly related to naming and code
organisation. The largest change is that all of the preprocessor tests
have been updated to no longer depend upon the remaining instruction
abbreviations (i.e. folded instructions and missing `call_index` index)
which we intend to desugar.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant