Skip to content
Sam Harwell edited this page Mar 13, 2014 · 7 revisions

Rule Versioning in ANTLR 4

1. Introduction

Rule versioning refers to a feature in ANTLR 4 where Java code may declare explicit dependencies on grammar rules. Each declared dependency is annotated with a version number; when a version number does not match the actual version of a rule in the grammar then the Java code depending on the rule may not work. The feature is currently implemented in pull request #490: Rule versioning artifact and in the unofficial sharwell/optimized fork of ANTLR 4.

Note by Sam Harwell: Rule versioning has proven to be an essential productivity aid during the development of ANTLRWorks 2 and GoWorks. I highly recommend all projects considering the use of ANTLR 4 take full advantage of this feature from the start of the project to reduce the risk of regression bugs and improve long-term maintainability of a code base.

2. Motivation

Large applications using complicated grammars, such as compilers or IDEs, frequently face productivity bottlenecks when changes to the core grammars are required.

  • Changes to the language specification must be implemented
  • An opportunity to alter the grammar for performance or clarity purposes is identified
  • Evaluation of proposed changes to the language within a fully-functional environment

When a rule in the grammar is altered, all code affected by the change must be updated to reflect the new form. In some applications, the Find Usages feature may be used to identify code either calling the rule method or using the return value of the method (rule foo could returns an instance of foo_return in ANTLR 3 or FooContext in ANTLR 4). However, this is not sufficient for identifying all dependent code. Consider the following code block from the code completion feature in nFringe (using the CSharp3 target for ANTLR 3).

if (tree.ChildCount == 2) {
    CommonTree decltree = tree.GetChild(1) as CommonTree;
    if (decltree != null) {
        CommonToken token = decltree.Token as CommonToken;
        if (token != null && !string.IsNullOrEmpty(token.Text))
            return declarations.Where(decl => decl.InsertText.Equals(token.Text, StringComparison.OrdinalIgnoreCase));
    }
}

In this code, tree is a CommonTree created for a token of type DEFAULT. The code handles the tree produced by the following rule, which handles UnrealScript expressions like ClassName.default.Property.

postfix_expression
    :   ( root_expression
            -> root_expression
        )
        ( DOT DEFAULT DOT primary_expression
            -> ^(DEFAULT $postfix_expression primary_expression)
        )*
    ;

The nFringe code does not use a variable of type postfix_expression_return, yet makes several assumptions about the tree produced by this rule.

  • The tree contains exactly two children.
  • The Token property of the second child contains the name of the referenced default property, i.e. "Property" for the above UnrealScript expression.

If the grammar were modified to include the DOT tokens in the AST (perhaps for assistance in some form of error reporting), the code completion feature would stop working for these expressions until the code was updated. Over many years of work on nFringe, these implicit dependencies have resulted in many problems.

  • Uses of properties (e.g. ChildCount) and methods (e.g. GetChild) from the ANTLR 3 trees API:
    • Appear frequently,
    • Are difficult to trace back to particular locations in the grammar,
    • Successfully compile whether or not the implied dependencies are valid.
  • Broken features rarely result in build failures, and often appear only in infrequent "edge-cases".
  • Tracing a broken feature back to the original source is difficult.
  • Fixing code affecting a broken feature often has other implicit dependencies, and any alteration often results in a different edge-case breaking.

3. Approach

The approach to rule versioning chosen for this implementation involves the following changes to ANTLR. The features are disabled by default, and included by specifying the -rule-versioning command line argument to the ANTLR Tool when generating parsers from a grammar.

  • Each generated rule method is annotated @RuleVersion(number). The version number defaults to 0 when not explicitly stated in the grammar.
  • The @RuleDependency annotation can be added to types, methods, and/or fields in Java code to declare dependencies. The @RuleDependencies annotation provides a way to include multiple dependency annotations on a single code element.
  • The RuleDependencyProcessor annotation processor validates declared rule dependencies at compile time. The processor registered in the classpath by adding a file to the META-INF/services folder in antlr4-dependency-analysis.jar, allowing Javac to automatically enable the validation process during compilation.

In its simplest form, declared dependencies specify a rule and a version number.

Note: Due to an implementation limitation in RuleDependencyProcessor, dependencies must currently specify a recognizer as well. If the related StackOverflow question produces a usable answer, this requirement may be removed in a future release. For the purposes of this article, the recognizer is always omitted from the RuleDependency annotations for brevity, since it is always just the declaring type of the rule constant.

fieldDef    : typeRef id;
primaryExpr : id | INT;
typeRef     : id;
id          : IDENTIFIER;

One Java method depending on the id rule might look like the following.

@RuleDependencies({
    @RuleDependency(rule=RULE_id, version=0),
    @RuleDependency(rule=RULE_fieldDef, version=0),
})
boolean isDefinition(IdContext id) {
    return id.getParent() instanceof FieldDefContext;
}

The isDefinition method makes the following assumptions about the parse tree produced by the grammar rules.

  • The id rule in the grammar is part of a definition if the parent context is a certain type of context. If a new rule were added to the grammar in the future, e.g. a methodDef rule, the change to the parents of the id rule could result in isDefinition returning an incorrect result.
  • The fieldDef rule in the grammar invokes the id rule. If the fieldDef rule is modified in the future to not directly call the id rule, the isDefinition rule may produce an incorrect result.

While reviewing the XPath Axes, it became apparent that many complicated rule dependencies could easily be expressed by providing support for specifying dependency type(s) in the @RuleDependency annotation. The most frequently encountered dependency types can be expressed as follows.

  • Self: the code depends on the specified rule.
  • Parents: the code depends on the parents of the specified rule. This can also be seen as code depending on the rule in one level of context.
  • Descendants: the code depends on all potential descendants of the specified rule.
  • Ancestors: the code depends on all potential ancestors of the specified rule. This can also be seen as code depending on the rule with its complete context.

The default dependency type is Parents, matching the most frequent case encountered in code using the initial implementation. Note that the Self dependency type is always implied, whether or not the particular @RuleDependency annotation specifies a set of dependency types. The Dependents.SELF enumeration value is provided so @RuleDependency annotations which clearly depend only on the rule itself can be documented as such. Consider the following grammar rule.

id : Identifier ('.' Identifier)*;

The following code shows an example of proper use of Dependents.SELF.

@RuleDependency(rule=RULE_id, dependents=Dependents.SELF)
boolean isQualified(IdContext ctx) {
    return ctx.Identifier().size() > 1;
}

4. Effective Use

4.1. Linked Dependency Validation

All dependents other than Self are considered linked dependencies. Proper validation of linked dependencies requires rule version numbers to satisfy particular conditions. The following rules for managing version numbers in a grammar ensure that linked dependencies will properly validate at compile time.

  • When a rule is added to the grammar, it is given a version number 1 greater than the maximum version number of any rule previously in the grammar.

  • When a non-trivial change is made to an existing rule, it is assigned a new version number 1 greater than the maximum version number of any rule previously in the grammar. If the changed rule no longer invokes a rule which was previously invoked by the rule, the update should be performed in two steps to ensure all linked dependencies are validated. For example,

    qualifiedName : id ('.' id)*;

    First becomes the following. Note that the guard clause only includes id once, even if id can appear multiple times in the previous version of the rule.

    qualifiedName
    @version{1}
        : Identifier ('.' Identifier)*
        | // guard clause for dependencies linked to qualifiedName through id.
            {false}? id
        ;

    After examining the affected rule dependencies, the guard clause can be removed. If the two steps are performed in succession, the rule version number does not need to be updated a second time. In addition, removing the guard clause may invalidate some rule dependency annotations in code, but if these two steps are performed in succession then the affected dependency annotations can be updated to match the versions reported by the annotation processor without rechecking the code again. After this step, the qualifiedName rule would be reduced to the following.

    qualifiedName
    @version{1}
        : Identifier ('.' Identifier)*
        ;
  • Before removing a rule, the name of the rule should be changed and the renamed rule should be given a version number 1 greater than the maximum version number of any rule previously in the grammar (e.g. foo could be renamed to foo_removed before deleting it). This two-step process provides validation for certain linked dependencies which may not be detected when a rule is simply deleted.

4.2. Choosing Dependents

  • Methods taking a context as input (e.g. as a method argument) should include PARENTS at a minimum unless there is absolutely no way the behavior can be affected by later reusing the rule in another context. When no dependents are explicitly specified for a @RuleDependency annotation, PARENTS is included by default.

  • Code which enumerates the ancestors of a rule context (or calls a findAncestor method, etc.) should include ANCESTORS in the dependents for that rule dependency.

  • Code which enumerates the descendents of a rule context (or calls a findDescendant method, etc.) should include DESCENDANTS in the dependents for that rule dependency.

  • Code which accesses tokens indirectly matched by a rule (e.g. by calling the getStart() on the context for a grammar rule which contains at least one outer alternative with a non-terminal on the left edge) should include DESCENDANTS.

  • Any rule context accessed by the method directly (e.g. via a generated accessor method) or indirectly (e.g. via the getChild(n) method) should declare SELF at a minimum, but the default of PARENTS is generally a safe bet.

4.3. Specify All Dependencies

The rule versioning feature assumes that all code which directly depends on the grammar properly declares explicit rule dependencies. Any implicit dependencies which are not declared may result in program failures if the grammar is modified.

4.4. Increment the Version Numbers Globally

Rule version numbers should always monotonically increase across the entire grammar. For example, suppose a change was made to the bar rule of the following grammar.

foo
@version{2}
    : 'x';
bar
@version{1}
    : 'y';

After the change, the bar rule should be marked at version 3, which is greater than the highest version number previously appearing in the grammar. If the bar rule was instead only incremented to version 2 (greater than the previous version of the bar rule, but not greater than the highest version appearing in the grammar), linked dependencies in code may not be properly invalidated.

5. Initial Approach

The initial implementation of rule versioning provided a minimal set of features. While the implementation was simpler and the API smaller, it soon became clear that this approach did not meet the needs of large applications. In this implementation, each rule was given a version number and a dependency consisted of a link from a code element to a particular version of a single rule.

This approach required particular attention when changes were made to the grammar. For example, consider the following grammar segment (repeated from above).

fieldDef    : typeRef id;
primaryExpr : id | INT;
typeRef     : id;
id          : IDENTIFIER;

One Java method depending on the id rule might look like the following.

@RuleDependencies({
    @RuleDependency(rule=RULE_id, version=0),
    @RuleDependency(rule=RULE_fieldDef, version=0),
})
boolean isDefinition(IdContext id) {
    return id.getParent() instanceof FieldDefContext;
}

Now suppose the grammar is modified by the addition of a methodDef rule.

methodDef   : typeRef id '(' ')';
fieldDef    : typeRef id;
primaryExpr : id | INT;
typeRef     : id;
id          : IDENTIFIER;

In this case, no change was made to the fieldDef or id rules in the grammar, yet the isDefinition method will no longer return the correct result for identifiers within a methodDef. The following rule-of-thumb became apparent.

In general, code which depends on a grammar rule tends to make assumptions about the immediate context of the rule.

The following practices provided a manual solution to this problem.

  • When a rule is added to the grammar,
    • Leave the new rule at the default version 0
    • Increment the version numbers of each rule invoked by the new rule, e.g. adding methodDef above requires incrementing the version number of the typeRef and id rules.
  • When a rule is removed from the grammar, increment the version numbers of each rule invoked by the rule being removed, e.g. removing typeRef above would require incrementing the version number of the id rule.
  • When a rule is changed,
    • Increment the version number of the changed rule
    • Increment the version numbers of each rule which is invoked by the updated rule but was not previously invoked by that rule
    • Increment the version numbers of each rule which was previously invoked by the rule but is no longer invoked by the updated rule
    • Optionally increment the version numbers of rules previously invoked by the rule and still invoked by the updated rule (doing so appeared safer yet led to more false-positives during the compilation process)

When the dependency validator reported compilation errors due to a change in a rule version, the code marked with the @RuleDependency annotation should be reviewed and updated when necessary to behave correctly with the updated grammar.

According to these rules, the updated grammar after adding the methodDef rule might look like the following.

methodDef               : typeRef id '(' ')';
fieldDef                : typeRef id;
primaryExpr             : id | INT;
typeRef     @version{1} : id;
id          @version{1} : IDENTIFIER;

The dependency validator would then report an error on the isDefinition method, which could be updated as follows.

@RuleDependencies({
    @RuleDependency(rule=RULE_id, version=1),        // this version was incremented
    @RuleDependency(rule=RULE_fieldDef, version=0),
    @RuleDependency(rule=RULE_methodDef, version=0), // this dependency was added
})
boolean isDefinition(IdContext id) {
    // the expression now handles both definition kinds
    return id.getParent() instanceof FieldDefContext
        || id.getParent() instanceof MethodDefContext;
}

At this point, most code dependencies could be declared in a straightforward manner, although some implicit dependencies still could not be expressed. For example, statements like the following implicitly depend on a chain of rule invocations. In particular, this code depends on the ancestors of the some rule in the grammar since a change to any rule between some and other could result in some no longer appearing in the context of other.

void checkContext(SomeContext ctx) {
    assert ctx.getAncestor(OtherContext.class) != null;
}

Code which directly references the tokens matched by a rule implicitly depends on a chain of rule invocations in the other direction. In particular, the following code depends on the descendants of the some rule in the grammar since direct access to the start and stop tokens makes assumptions about any non-terminals invoked between some and the terminals matched while parsing the rule.

void checkContext(SomeContext ctx) {
    assert ctx.getStart() == ctx.getStop();
}