Syntax Tree Transforms and Compiler Design #1739

yassere · 2021-11-01T16:51:12Z

These are some notes and open questions regarding syntax tree transforms and designing the infrastructure equivalent to the original Rome compiler. In building the new higher-performance version of Rome, we may choose to make different tradeoffs compared to the original version (referred to below as RomeJS). For context, here are some (simplified) characteristics of how syntax tree transformations work in RomeJS:

Transforms operate on an abstract syntax tree (AST) that contains no whitespace, punctuation, or formatting information.
When transforming an AST using multiple visitors (e.g. performing auto-fixes for all lint rules), the compiler recursively traverses the AST, calling the visitors sequentially for each node.
In order to reduce the significance of visitor ordering, modifying a node will cause all visitors to be run again on the modified AST.
As an optimization to prevent full re-traversal on every transformation, replacing a node doesn't immediately cause all of its ancestors to be updated. Instead, the modified node can continue being visited at that level of recursion. The new tree will be rebuilt correctly as the recursion unwinds.
AST nodes don't have parent pointers. Visitors of a node can access the up-to-date parent from the path as path.parent.
Visitors have access to ancestryPaths but the nodes in those paths might be stale and are not necessarily ancestors of the current node. In other words, path.node might not belong to the same tree as path.ancestryPaths[1].node. This is a drawback of the prior optimization.
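The traversal behaviour described in the list above can be sketched roughly as follows. This is only an illustration in Rust, not RomeJS code: the `Expr` type, the `Visitor` alias, and both rules are hypothetical, and the sketch assumes the rules eventually stop producing replacements.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Neg(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// A visitor returns `Some(replacement)` when it wants to rewrite the node.
type Visitor = fn(&Expr) -> Option<Expr>;

/// Hypothetical rule: `--x` becomes `x`.
fn remove_double_negation(node: &Expr) -> Option<Expr> {
    match node {
        Expr::Neg(inner) => match inner.as_ref() {
            Expr::Neg(x) => Some(x.as_ref().clone()),
            _ => None,
        },
        _ => None,
    }
}

/// Hypothetical rule: an addition of two literals becomes their sum.
fn fold_constants(node: &Expr) -> Option<Expr> {
    match node {
        Expr::Add(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Num(a), Expr::Num(b)) => Some(Expr::Num(a + b)),
            _ => None,
        },
        _ => None,
    }
}

fn transform(node: Expr, visitors: &[Visitor]) -> Expr {
    // Children are rebuilt first; the parent is reassembled as the
    // recursion unwinds, so ancestors are never updated eagerly.
    let mut current = match node {
        Expr::Num(n) => Expr::Num(n),
        Expr::Neg(x) => Expr::Neg(Box::new(transform(*x, visitors))),
        Expr::Add(l, r) => Expr::Add(
            Box::new(transform(*l, visitors)),
            Box::new(transform(*r, visitors)),
        ),
    };
    // Call the visitors in order. After any replacement, run all of them
    // again on the new node at this same level of recursion.
    loop {
        let replacement = visitors.iter().find_map(|visit| visit(&current));
        match replacement {
            Some(next) => current = next,
            None => return current,
        }
    }
}

fn example() -> Expr {
    // -(-1) + 2
    let tree = Expr::Add(
        Box::new(Expr::Neg(Box::new(Expr::Neg(Box::new(Expr::Num(1)))))),
        Box::new(Expr::Num(2)),
    );
    let visitors: [Visitor; 2] = [fold_constants, remove_double_negation];
    transform(tree, &visitors)
}
```

Because the visitors are re-run after each replacement, `fold_constants` still gets to see the addition after the double negation has been removed, regardless of the order in which the two rules are listed.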

For the new version of Rome, we've chosen to use a concrete syntax tree (CST) which losslessly represents source code and allows for fine-grained control of syntax transformations. For operations where a CST would be cumbersome, the syntax tree could be lowered into a more suitable intermediate representation (IR).

The CST is built using a forked version of rowan which has a convenient traversal API with methods like parent(), ancestors(), siblings(), and children() available directly on syntax nodes. The entire CST is reachable from any node. This property relies on having parent pointers on syntax nodes.
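To make the role of parent pointers concrete, here is a minimal sketch using only the standard library. It is not rowan's implementation; `Node`, `new_node`, and this `ancestors` function are hypothetical stand-ins that only show why upward navigation needs a pointer from child to parent.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Node {
    kind: &'static str,
    // Weak, so that a child does not keep its parent alive in a cycle.
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

/// Creates a node and points each child's parent pointer at it.
fn new_node(kind: &'static str, children: Vec<Rc<Node>>) -> Rc<Node> {
    let node = Rc::new(Node {
        kind,
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    });
    for child in &children {
        *child.parent.borrow_mut() = Rc::downgrade(&node);
    }
    *node.children.borrow_mut() = children;
    node
}

/// Walks parent pointers from `node` up to the root, nearest ancestor first.
fn ancestors(node: &Rc<Node>) -> Vec<Rc<Node>> {
    let mut result = Vec::new();
    let mut current = node.parent.borrow().upgrade();
    while let Some(parent) = current {
        current = parent.parent.borrow().upgrade();
        result.push(parent);
    }
    result
}

fn example() -> Vec<&'static str> {
    let ret = new_node("ReturnStmt", vec![]);
    let block = new_node("BlockStmt", vec![Rc::clone(&ret)]);
    let func = new_node("FunctionDecl", vec![block]);
    // The root must stay alive while we navigate upwards.
    let _root = new_node("Module", vec![func]);
    ancestors(&ret).iter().map(|n| n.kind).collect()
}
```

Starting from the innermost `ReturnStmt`, `example` reaches the root without being handed any path or context, which is the property the RomeJS AST lacks.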

Questions

What sort of transforms should operate on a CST?
- A full-fidelity CST seems like a great choice for syntactic analysis, code migrations, and editor-related transformations where the output is intended to be human-readable.
- Compiler transforms and bundling may be easier on a lowered IR.
- If we have multiple transformation APIs, they should all feel familiar.
- Should an analysis API be distinct from a transform API?
When should the compiler perform additional traversals?
- The approach used by RomeJS trades some performance in exchange for reducing the significance of visitor ordering.
- For our built-in transformations, it may be worth manually sequencing them to minimize the number of traversals.
Should the compiler call multiple independent visitors during a single traversal?
- This can amortize costs of ahead-of-time computations (like scope evaluation in RomeJS) but might be less useful if most costs are lazily computed and cached (using salsa, for example).
- There's no major downside when visitors are only performing analysis.
- If multiple visitors perform transformations during the same traversal, it adds complications.
During traversal, should visitors receive a live view of the syntax tree?
- In RomeJS, when a visitor modifies a node, the new node (but not necessarily an up-to-date ancestryPaths) is passed to subsequent visitors during the same traversal.
- A mismatch between the current node and its visible ancestry could cause bugs.
- It's helpful for visitors to see the changes made by other visitors when they're running together in a single traversal, but a live view seems less important if transforms traverse independently. A lone visitor could visit each node of the original syntax tree with the understanding that its own transformations are not reflected in the syntax nodes that it receives.
How should visitors report diagnostics and perform transformations?
- In RomeJS, visitors can imperatively attach diagnostics and suggested auto-fixes to the compiler context. They can also return a signal with a single "safe" auto-fix.
- Should all diagnostics and auto-fixes be part of a visitor's return value?
Should lint rules have a way to avoid computing auto-fixes when they aren't necessary?
- They could receive a flag when they only need to produce diagnostics.
- Alternately, an auto-fix could comprise a target node and a callback that computes its replacement.
Should it be possible for a single transform to replace multiple independent nodes?
- In RomeJS, a transform that wants to replace multiple nodes would need to find the common ancestor of all those nodes to replace that single ancestor.
- It could be useful to signal a list of multiple independent replacements as part of a single transformation action that only rebuilds the syntax tree a single time.
- All of the nodes targeted for replacement must have non-overlapping subtrees.
- Would it help with source maps if transforms could improve the granularity of their replacements?
Do CST transforms need to retain source mapping information?
- This partially depends on which transformations operate on a CST.
Should there be a way for a visitor to signal that it doesn't care about certain subtrees?
- If visitors traverse independently, this is very simple.
- When multiple visitors run together, we would have to maintain independent state for each visitor so that a "skip" signal doesn't affect other visitors in the shared traversal.
Should there be a lint rule API that isn't based on visiting nodes?
- Rules would receive a context that contains the whole syntax tree and cursor information for cursor-dependent assists/completions.
- Is this different from a visitor that matches on the root node and has a way to signal that it shouldn't be called again for any children?
If, when, and how to validate transformed syntax trees?
- What gets validated? Syntax (a statement is not an expression, which seems easy), semantics (referencing a binding before it's instantiated, which seems hard), etc.
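As one way to explore the question about replacing multiple independent nodes, the sketch below identifies each target node by its text range and applies all replacements in a single rebuild, rejecting overlapping targets. The `Replacement` type and `apply_all` function are hypothetical, and the sketch works on source text rather than a real syntax tree, assuming ranges fall on character boundaries.

```rust
#[derive(Debug, Clone)]
struct Replacement {
    start: usize, // byte offset where the target node begins
    end: usize,   // byte offset just past the target node
    text: String, // replacement source text
}

/// Applies all replacements in one pass, or fails if any two targets overlap.
fn apply_all(source: &str, mut edits: Vec<Replacement>) -> Result<String, String> {
    for edit in &edits {
        if edit.start > edit.end || edit.end > source.len() {
            return Err(format!("range {}..{} is out of bounds", edit.start, edit.end));
        }
    }
    edits.sort_by_key(|edit| edit.start);
    // After sorting, overlapping subtrees show up as adjacent overlapping ranges.
    for pair in edits.windows(2) {
        if pair[1].start < pair[0].end {
            return Err(format!(
                "replacements at {} and {} overlap",
                pair[0].start, pair[1].start
            ));
        }
    }
    // Rebuild the output once, copying untouched text between targets.
    let mut output = String::new();
    let mut cursor = 0;
    for edit in &edits {
        output.push_str(&source[cursor..edit.start]);
        output.push_str(&edit.text);
        cursor = edit.end;
    }
    output.push_str(&source[cursor..]);
    Ok(output)
}
```

For instance, replacing both `var` keywords in `var a = 1; var b = 2;` needs two small, independent replacements instead of one replacement of their common ancestor, which is also the kind of finer granularity that the source map question asks about.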

This is not a comprehensive document. There are certainly more questions to discuss, and it's likely we'll want to use code examples to explore these questions.
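As a first example of that kind, here is a rough sketch of the "target plus callback" idea for auto-fixes, where a rule always produces diagnostics but only computes a fix when someone asks for it. Everything here is hypothetical (the `Diagnostic` type, the `lint_loose_equality` rule, and its line-based matching); the counter exists only to make the laziness observable.

```rust
use std::cell::Cell;
use std::rc::Rc;

struct Diagnostic {
    message: String,
    /// Computes the replacement text only when it is actually requested.
    fix: Box<dyn Fn() -> String>,
}

/// Hypothetical rule: flags lines that use `==` and offers a `===` fix.
fn lint_loose_equality(source: &str, fixes_computed: Rc<Cell<u32>>) -> Vec<Diagnostic> {
    source
        .lines()
        .filter(|line| line.contains("==") && !line.contains("==="))
        .map(|line| {
            let owned = line.to_string();
            let counter = Rc::clone(&fixes_computed);
            Diagnostic {
                message: format!("prefer `===` in: {}", line.trim()),
                fix: Box::new(move || {
                    counter.set(counter.get() + 1);
                    owned.replace("==", "===")
                }),
            }
        })
        .collect()
}
```

A caller that only reports diagnostics never invokes `fix`, so the rule needs no separate "diagnostics only" flag. The trade-off is that the callback has to capture whatever state it needs to compute the fix later.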

MichaReiser · 2021-11-01T17:11:40Z

Thanks, @yassere for writing all these questions down.

I've mainly taken a look at Roslyn and Swift and want to give examples of their APIs. This might become useful when discussing the different approaches.

Roslyn / Swift APIs

Query based

private static void AnalyzeTree(SyntaxTreeAnalysisContext context)
{
    var tree = context.Tree;  
    var root = tree.GetRoot(context.CancellationToken);

    foreach (var ifStmt in root.DescendantNodes().OfType<IfStatementSyntax>())
    {
        if (ifStmt.Statement is EmptyStatementSyntax)
        {
            context.ReportDiagnostic(Diagnostic.Create(Rule, ifStmt.GetLocation()));
        }
    }
}

Can run in parallel because the analyzer isn't mutating the tree. (The implementation can call Replace methods, but the result would just be thrown away.)
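A comparable query-style analyzer could look roughly like this in Rust. The `Node` and `Kind` types and the `descendants` helper are hypothetical stand-ins for a real syntax tree API, and the sketch assumes that the last child of an `IfStmt` is its body.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Root,
    IfStmt,
    EmptyStmt,
    BlockStmt,
}

struct Node {
    kind: Kind,
    offset: usize,
    children: Vec<Node>,
}

fn node(kind: Kind, offset: usize, children: Vec<Node>) -> Node {
    Node { kind, offset, children }
}

impl Node {
    /// Pre-order list of this node and everything below it.
    fn descendants(&self) -> Vec<&Node> {
        let mut out = vec![self];
        for child in &self.children {
            out.extend(child.descendants());
        }
        out
    }
}

/// Reports every `if` statement whose body is an empty statement.
fn analyze(root: &Node) -> Vec<String> {
    root.descendants()
        .into_iter()
        .filter(|n| n.kind == Kind::IfStmt)
        .filter(|n| n.children.last().map_or(false, |body| body.kind == Kind::EmptyStmt))
        .map(|n| format!("empty if body at offset {}", n.offset))
        .collect()
}
```

All state lives inside `analyze`, and the function only borrows the tree, so several such analyzers could run over the same tree at once.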

Visitor

Read-only tree visitor

public class Test: CSharpSyntaxVisitor<Signal>
{
    public override Signal VisitIfStatement(IfStatementSyntax node)
    {
        base.VisitIfStatement(node);

        if (node.Statement is EmptyStatementSyntax)
        {
            return Signal.REPORT;
        }

        return Signal.CONTINUE;
    }
}

Can run in parallel because the visitor isn't mutating the underlying tree.

Open Questions

Visit function per AST node?
- visit_node(node: SyntaxNode)
  - Can be hand written
  - Doesn't require double-dispatch (node.accept(visitor))
- visit_if_stmt(stmt: IfStmt), visit_block_stmt(block: BlockStmt), ...
  - Correctly resolves lists to SeparatedSyntaxList or SyntaxList depending on the type of the field.
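The typed variant could be sketched in Rust as a trait with one overridable method per node type, where the default implementations just keep traversing. The `Stmt` enum and the method names below are hypothetical and far smaller than a real AST.

```rust
enum Stmt {
    If { body: Box<Stmt> },
    Block(Vec<Stmt>),
    Empty,
}

trait Visitor {
    /// Dispatches once on the node type; implementors rarely override this.
    fn visit_stmt(&mut self, stmt: &Stmt) {
        match stmt {
            Stmt::If { body } => self.visit_if_stmt(body),
            Stmt::Block(stmts) => self.visit_block_stmt(stmts),
            Stmt::Empty => self.visit_empty_stmt(),
        }
    }

    // Default implementations only continue the traversal.
    fn visit_if_stmt(&mut self, body: &Stmt) {
        self.visit_stmt(body);
    }

    fn visit_block_stmt(&mut self, stmts: &[Stmt]) {
        for stmt in stmts {
            self.visit_stmt(stmt);
        }
    }

    fn visit_empty_stmt(&mut self) {}
}

/// Counts `if` statements with an empty body; state lives on the visitor.
struct EmptyIfCounter {
    count: usize,
}

impl Visitor for EmptyIfCounter {
    fn visit_if_stmt(&mut self, body: &Stmt) {
        if matches!(body, Stmt::Empty) {
            self.count += 1;
        }
        // Keep descending so nested `if` statements are seen too.
        self.visit_stmt(body);
    }
}
```

Only the method for the node type of interest is overridden, but the result has to be stored on the visitor (`count`) rather than in a local variable.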

Rewriter

Creates a new tree

private class Rewriter : CSharpSyntaxRewriter
{
    private readonly SemanticDocument _document;
    private readonly Func<SyntaxNode, bool> _predicate;
    private readonly CancellationToken _cancellationToken;

    public Rewriter(
        SemanticDocument document,
        Func<SyntaxNode, bool> predicate,
        CancellationToken cancellationToken)
    {
        _document = document;
        _predicate = predicate;
        _cancellationToken = cancellationToken;
    }

    private ExpressionSyntax SimplifyInvocation(InvocationExpressionSyntax invocation)
    {
        var expression = invocation.Expression;
        if (expression is MemberAccessExpressionSyntax memberAccess)
        {
            var symbolMap = SemanticMap.From(_document.SemanticModel, memberAccess.Expression, _cancellationToken);
            var anySideEffects = symbolMap.AllReferencedSymbols.Any(s =>
                s.Kind is SymbolKind.Method or SymbolKind.Property);

            if (anySideEffects)
            {
                var annotation = WarningAnnotation.Create("Warning: Expression may have side effects. Code meaning may change.");
                expression = expression.ReplaceNode(memberAccess.Expression, memberAccess.Expression.WithAdditionalAnnotations(annotation));
            }
        }

        return expression.Parenthesize()
            .WithAdditionalAnnotations(Formatter.Annotation);
    }

    public override SyntaxNode VisitSimpleLambdaExpression(SimpleLambdaExpressionSyntax node)
    {
        if (CanSimplify(_document, node, _cancellationToken))
        {
            var invocation = TryGetInvocationExpression(node.Body);
            if (invocation != null)
            {
                return SimplifyInvocation(invocation);
            }
        }

        return base.VisitSimpleLambdaExpression(node);
    }
}

May run in parallel, but raises questions about which changes are visible.
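The "creates a new tree" behaviour can be sketched in Rust with reference-counted nodes: the rewriter returns the existing node when nothing below it changed and allocates new nodes only along the changed path. The `Expr` type and the parenthesis-removal rule are hypothetical; this is not how Roslyn implements its rewriter.

```rust
use std::rc::Rc;

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Paren(Rc<Expr>),
    Add(Rc<Expr>, Rc<Expr>),
}

/// Removes parentheses around literals without touching the input tree.
fn rewrite(node: &Rc<Expr>) -> Rc<Expr> {
    match node.as_ref() {
        Expr::Num(_) => Rc::clone(node),
        Expr::Paren(inner) => {
            let new_inner = rewrite(inner);
            if matches!(new_inner.as_ref(), Expr::Num(_)) {
                // `(1)` becomes `1`.
                new_inner
            } else if Rc::ptr_eq(&new_inner, inner) {
                // Nothing changed below: reuse the existing node.
                Rc::clone(node)
            } else {
                Rc::new(Expr::Paren(new_inner))
            }
        }
        Expr::Add(left, right) => {
            let new_left = rewrite(left);
            let new_right = rewrite(right);
            if Rc::ptr_eq(&new_left, left) && Rc::ptr_eq(&new_right, right) {
                Rc::clone(node)
            } else {
                Rc::new(Expr::Add(new_left, new_right))
            }
        }
    }
}
```

Because children are rewritten before their parent is rebuilt, the rewriter always sees the already-rewritten children, and the original tree stays valid for anyone still holding it.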

Personal Preference

I personally very much like Roslyn's "query based" API because I'm not very used to writing and thinking in visitors, and the visitor pattern often requires storing state on the visitor instead of letting you keep all state local to your function.

I also very much like that the Rewriter isn't changing the tree in place but instead produces a new one. It avoids many of the visibility issues (partly because it's visiting depth-first).

5 replies

MichaReiser Nov 2, 2021

To expand on this a bit: I suggest that we outline the benefits and difficulties of each of these approaches (and maybe propose new ones?). I mainly want to emphasise that we should focus on building an API that allows for as few footguns as possible. For example, storing nodes or paths as state inside of a babel visitor is an anti-pattern but nothing prevents authors from doing so (and it probably only goes wrong with a specific plugin combination and execution order).

My suggestion would be to support the query-based API as well as a visitor based API and let pass authors decide what works best for them. For example, implementing a scope analysis seems easier to me using the visitor API because I want to traverse many different nodes. But an analysis like this that retrieves all declarations in the current activation frame seems easier to build with the query-based API. I also believe that authors are more familiar with query-based APIs and having the state scoped to a function also sets clear boundaries for how long an analysis can hold on to state (ignoring globals for now). You can also see in this pass that the consumer tells Roslyn in which nodes it is interested and Roslyn only executes the analyzer if the semantic analysis completes for the listed nodes.

Visitors do have a slight edge performance-wise, assuming they're used correctly, because they can be run in groups, especially visitors that don't mutate the tree. However, I don't know how much this will show in numbers. Something we should consider, though, is that authors might be inclined to call descendants inside an untyped visitor method (one that receives Syntax* types instead of having a dedicated method per AST type), which voids many of the benefits of grouping visitors together because each visitor then traverses the tree individually (while we still pay the overhead of running multiple visitors at once). My main question here is whether we already need to consider performance in full detail, since my understanding is that running multiple mutating transforms at once is mainly a problem for non-lint transforms.

I'm also somewhat concerned that using untyped visitors results in pass authors repeatedly having to implement a switch on the node type, adding even more boilerplate to an already "bulky" pattern.

pub fn visit_node(&mut self, node: &SyntaxNode) -> SyntaxResult<Signal> {
  // `cast` takes an owned node, so every test needs its own clone.
  if let Some(parameter_list) = ParameterList::cast(node.clone()) {
    self.declare_parameters(parameter_list);
  }
  if let Some(_block) = BlockStmt::cast(node.clone()) {
    self.begin_scope();
  }

  ... 
}

This is less efficient than using dynamic dispatch or a jump table because the visitor must repeat all its cast tests for every node in the tree, especially since calling cast requires an owned node (something we could change in the cast implementation so that it only clones when the cast is successful).
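For comparison, an untyped visitor could dispatch with a single match on the node kind and only build a typed wrapper for the kind that matched, borrowing the node instead of cloning it. This is a sketch with hypothetical types (`SyntaxKind`, `SyntaxNode`, `TypedNode`, and `classify` are all made up here), not the rowan or rslint API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    ParameterList,
    BlockStmt,
    Identifier,
}

struct SyntaxNode {
    kind: SyntaxKind,
    children: Vec<SyntaxNode>,
}

fn node(kind: SyntaxKind, children: Vec<SyntaxNode>) -> SyntaxNode {
    SyntaxNode { kind, children }
}

// Typed wrappers borrow the untyped node, so casting never clones.
struct ParameterList<'a>(&'a SyntaxNode);
struct BlockStmt<'a>(&'a SyntaxNode);

enum TypedNode<'a> {
    ParameterList(ParameterList<'a>),
    BlockStmt(BlockStmt<'a>),
    Other,
}

/// One match on the kind replaces a chain of failed cast attempts.
fn classify(node: &SyntaxNode) -> TypedNode<'_> {
    match node.kind {
        SyntaxKind::ParameterList => TypedNode::ParameterList(ParameterList(node)),
        SyntaxKind::BlockStmt => TypedNode::BlockStmt(BlockStmt(node)),
        _ => TypedNode::Other,
    }
}

#[derive(Default)]
struct ScopeVisitor {
    declared_parameters: usize,
    scopes: usize,
}

impl ScopeVisitor {
    fn visit_node(&mut self, node: &SyntaxNode) {
        match classify(node) {
            TypedNode::ParameterList(list) => self.declared_parameters += list.0.children.len(),
            TypedNode::BlockStmt(_) => self.scopes += 1,
            TypedNode::Other => {}
        }
        for child in &node.children {
            self.visit_node(child);
        }
    }
}
```

Whether this is measurably faster than a chain of cast tests is not something the sketch can show; it only illustrates that the dispatch can be written once per node rather than once per candidate type.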

ematipico Nov 2, 2021

I'm also somewhat concerned that using untyped visitors results in pass authors repeatedly having to implement a switch on the node type, adding even more boilerplate to an already "bulky" pattern.

I would stay away from non-typed nodes too, if possible. If we have to implement the visitor pattern, it should be implemented using typed nodes. Consumers should not have to bother with casting; that's something that we (Rome) should be responsible for.

yassere Nov 2, 2021
Author

I like the Roslyn query-based API too, and I think it's a good fit for many use cases. I think we could also have some sort of visitor API, but I don't know yet about strongly-typed visit methods.

If we do have a typed visitor API, that brings us back to the question of node granularity. If we have lots of fine-grained nodes, then the overridable visit_specific_node methods might be inconvenient in some cases.

Also, would we want to have both enter and exit methods rather than a single visit method per node?

I'm also somewhat concerned that using untyped visitors results in pass authors repeatedly having to implement a switch on the node type, adding even more boilerplate to an already "bulky" pattern.

I think the match_ast macro (or something like it) can make this feel a lot more ergonomic. I'm not sure it would necessarily feel more bulky than dividing up a single analyzer into several overridden visit methods, especially when there's shared logic that's applicable to multiple node types.

This is less efficient than using dynamic dispatch or a jump table because the visitor must repeat all its cast tests for every node in the tree

I think we'd want to get a better sense of the actual performance impact of this if we think it's a determining factor.

MichaReiser Nov 3, 2021

If we do have a typed visitor API, that brings us back to the question of node granularity. If we have lots of fine-grained nodes, then the overridable visit_specific_node methods might be inconvenient in some cases.

I agree. I'm currently leaning towards keeping the granularity as is and changing it if we have specific use cases where more fine-grained nodes are beneficial.

I think the match_ast macro (or something like it) can make this feel a lot more ergonomic. I'm not sure it would necessarily feel more bulky than dividing up a single analyzer into several overridden visit methods, especially when there's shared logic that's applicable to multiple node types.

I think we'd want to get a better sense of the actual performance impact of this if we think it's a determining factor.

That could solve some of it indeed. And yes, some evaluation could make sense. Maybe my concern about matching on the node type, which is a linear search that tries each cast in turn (and casting a union type is probably a binary search because SyntaxKind isn't dense), is unfounded.

Trying to implement the "validation" approach of Roslyn is probably easier when having "typed-visitors". It's possible to do it with untyped-visitors but it means that the rewrite then must match on the kind to call the right Update method.

Not saying that I don't like the untyped visitor approach. I actually like its simplicity a lot. No need for a fancy codegen.

Overall: It seems most are liking the query-based API quite a bit and there are many unanswered questions around visitors. Would it be an option to go with the query-based API for now and see how far we get (I do believe that we need visitors, but maybe not right away)? I'm saying this because there's no need for any infrastructure (rowan/rslint provides all the fundamentals). It would allow us to build some first analyzers, build out the diagnostics before we then tackle mutation, which likely boils down to a mutating visitor.

yassere Nov 3, 2021
Author

Would it be an option to go with the query-based API for now and see how far we get

Yeah, this is my suggestion too. I can work on prototyping this and then we can see what sort of limitations we encounter that we'd want to solve with an alternate API. I think the query-based API is really going to shine for LSP-related use cases.

MichaReiser · 2021-11-01T17:16:04Z

Validation

Roslyn uses a generated CSharpSyntaxRewriter that has a typed method for each syntax kind. That method first calls the corresponding Visit method for each child and then passes the results to the node's Update function, but before doing so it validates what the visitor returned by:

Casting to the expected node type
Making sure mandatory nodes aren't null

public override SyntaxNode? VisitAliasQualifiedName(AliasQualifiedNameSyntax node)
    => node.Update(
        (IdentifierNameSyntax?)Visit(node.Alias) ?? throw new ArgumentNullException("alias"),
        VisitToken(node.ColonColonToken),
        (SimpleNameSyntax?)Visit(node.Name) ?? throw new ArgumentNullException("name"));
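In Rust, the same two checks (the child has the expected type and a mandatory child isn't missing) could be expressed with `Result` instead of exceptions. The `Node` enum, `UpdateError`, and `rewrite_qualified` below are hypothetical and only mirror the shape of the generated Roslyn method.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Identifier(String),
    Number(i64),
    Qualified { alias: Box<Node>, name: Box<Node> },
}

#[derive(Debug, PartialEq)]
enum UpdateError {
    /// The visitor removed a mandatory child.
    Missing(&'static str),
    /// The visitor returned a node of an unexpected kind.
    WrongKind(&'static str),
}

/// Validates what a visitor returned for a slot that must hold an identifier.
fn expect_identifier(slot: &'static str, node: Option<Node>) -> Result<Node, UpdateError> {
    match node {
        None => Err(UpdateError::Missing(slot)),
        Some(n @ Node::Identifier(_)) => Ok(n),
        Some(_) => Err(UpdateError::WrongKind(slot)),
    }
}

/// Visits both children, validates the results, then builds the updated node.
fn rewrite_qualified(
    alias: &Node,
    name: &Node,
    visit: impl Fn(&Node) -> Option<Node>,
) -> Result<Node, UpdateError> {
    let alias = expect_identifier("alias", visit(alias))?;
    let name = expect_identifier("name", visit(name))?;
    Ok(Node::Qualified {
        alias: Box::new(alias),
        name: Box::new(name),
    })
}
```

A visitor that returns `None` or a `Number` for the alias is rejected before the new node is built, so an invalid tree is never constructed.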

5 replies

ematipico Nov 1, 2021

I like this approach, and we should move this way:

it can be auto-generated each time we create new node types
it would hide a "cast" call for potential consumers. Ideally consumers (such as visitors) should not cast anything. It would be great if we can pass the "casted" node straight to the consumers.

yassere Nov 2, 2021
Author

I'm not sure yet if we want to retain the ability for multiple transforms to operate during the same traversal, but would this approach be viable in that case?

MichaReiser Nov 3, 2021

I think it should be possible, depending on the functionality we want to offer. Rome would need to update the Node before calling the next visitor. Updating the node is a no-op if the node didn't change. Overall, calling Update doesn't do more than create a new SyntaxNode under the hood or return the existing one if nothing changed. So it's about the same as updating the syntax nodes directly.

It may be difficult (or costly) to correctly reflect the parent, though. But that's something we're already tracking.

yassere Nov 3, 2021
Author

I'm probably not understanding exactly how this class works, so I'll need to take a closer look. I was thinking that a single visitor would override the methods it cares about and that nodes are traversed through recursive visit calls. I'm not seeing where multiple visitors can fit into this model if they're intended to traverse simultaneously (not that I'm necessarily in favor of that).

I certainly want to have some sort of syntax tree validation for transforms, but I'm not sure yet where I'd want to put it.

MichaReiser Nov 3, 2021

That's correct. I don't think Roslyn's specific visitor supports running multiple visitors, but I do believe that we should be able to work around that if we see a need for it.

yassere · 2021-11-02T18:44:23Z

If we're considering not using the mutable API of rowan, we may want to have something like SyntaxAnnotation and TrackNodes from Roslyn so we can have a way to keep track of specific nodes across transformations.

2 replies

MichaReiser Nov 3, 2021

I had a quick glance at the documentation, but it isn't immediately clear what each of those is used for:

SyntaxAnnotation: Add messages/warnings to nodes?
TrackNodes: ???

Would you mind giving a short example? (Maybe there are existing ones available online.)

yassere Nov 3, 2021
Author

Here's an example of TrackNodes with some useful comments: AbstractUseCollectionInitializerCodeFixProvider
I'm not sure what the full capabilities of SyntaxAnnotation are, but I think the main point is that annotations are attached directly to nodes, so you can annotate a node and find it later in a transformed version of the tree. TrackNodes uses SyntaxAnnotation under the hood.
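To illustrate the idea of finding an annotated node again after a transformation, here is a small Rust sketch over an immutable toy tree. It does not reflect how Roslyn implements SyntaxAnnotation or TrackNodes; the `Node` type, the numeric annotation ids, and all function names are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Node {
    text: String,
    annotations: Vec<u32>, // hypothetical annotation ids carried by the node
    children: Vec<Node>,
}

fn leaf(text: &str) -> Node {
    Node { text: text.to_string(), annotations: vec![], children: vec![] }
}

/// Returns a copy of the tree with `id` attached to the node at `path`
/// (a list of child indices; panics if an index is out of range).
fn annotate(node: &Node, path: &[usize], id: u32) -> Node {
    let mut copy = node.clone();
    match path.split_first() {
        None => copy.annotations.push(id),
        Some((&index, rest)) => copy.children[index] = annotate(&node.children[index], rest, id),
    }
    copy
}

/// Finds the first node carrying `id`, wherever it ended up in the tree.
fn find_annotated(node: &Node, id: u32) -> Option<&Node> {
    if node.annotations.contains(&id) {
        return Some(node);
    }
    node.children.iter().find_map(|child| find_annotated(child, id))
}

/// A transformation that changes the tree's shape but keeps existing nodes.
fn wrap(root: Node) -> Node {
    Node {
        text: "wrapper".to_string(),
        annotations: vec![],
        children: vec![leaf("prologue"), root],
    }
}

fn example() -> Option<String> {
    let original = Node {
        text: "root".to_string(),
        annotations: vec![],
        children: vec![leaf("a"), leaf("b")],
    };
    let tracked = annotate(&original, &[1], 1);
    let transformed = wrap(tracked);
    // The old path `[1]` now points at `root`, but the annotation still finds `b`.
    find_annotated(&transformed, 1).map(|n| n.text.clone())
}
```

The annotation travels with the node, so it survives transformations that would invalidate a stored path or a reference into the old tree.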
