
The library's main purpose is to provide an easy, straightforward and flexible way to map a source text to a desired form, with a variety of options and customizations available. The basic components involved in the workflow, and the relationships between them, are outlined below:

  • Each terminal is defined by a regular expression and represents a basic source recognition unit (a lexeme).
  • Non-terminals are used to reduce a list of terminals and non-terminals (a production) to a higher-level entity.
  • Terminals and non-terminals, together with additional configuration, define a target grammar.
  • The grammar and custom reduction code are then used to compile the final parser.
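
For instance, using the grammar APIs described later on this page, a tiny arithmetic grammar could be assembled as follows. This is only a sketch: the way the Grammar instance itself is created, and how the final parser is compiled from it, are assumptions not covered by this snippet.

var grammar = new Grammar(); // assumed construction; see the Grammar section for the available options
grammar.CreateTerminals("+", "(", ")");
grammar.CreateTerminal("num", Rex.Char("0-9").OneOrMore());
grammar.CreateWhitespace("ws", Rex.Char(' '));
grammar.CreateNonTerminals("expr");
grammar.AddRule("expr:num", "num");
grammar.AddRule("expr:add", "expr + expr");
grammar.AddRule("expr:group", "( expr )");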

Regular Expressions

Regular Expressions are constructed from nodes of the following types:

  • Text nodes match a single position within a source and support any combination of Unicode character codes, ranges and Unicode categories:
Rex.Char('r');
Rex.Char(@"a-z0-9");
Rex.Char(@"\u{0|10-ff|ccc|10000-10FFFF}");
Rex.Char(@"\p{Cc|Cf|Cn|Co|Cs|C|Ll|Lm|Lo|Lt|Lu|L|Mc|Me|Mn|M|Nd|Nl|No|N|Pc|Pd|Pe|Po|Ps|Pf|Pi|P|Sc|Sk|Sm|So|S|Zl|Zp|Zs|Z}");
Rex.Char(@"\a\b\t\r\n\v\f\\");

There is also a convenient way to configure a match for any character (or any set in general) excluding specific characters, ranges or categories:

Rex.Char(@"0-10ffff-[\p{L}]");
Rex.Char(@"0-10ffff-[a]");
Rex.Except(@"\p{L}");
Rex.Except('a');
  • Text nodes are then combined into an expression by OR, AND (concatenation) and REPEAT nodes:
Rex.Char('a').Then(Rex.Char('b')).Then(Rex.Char('b'));
Rex.Text("abb");
Rex.Or(Rex.Char('a'), Rex.Char('b')).NoneOrMore().Then("abb"); // (a|b)*abb

var a = Rex.Char('a');
var b = Rex.Char('b');
a.Or(b).NoneOrMore().Then(a).Then(b).Then(b); // also (a|b)*abb
  • A special kind of conditional expression is also available. The idea is to define a sub-expression which is checked at some position in the pattern; evaluation is transferred to the following state only if the sub-expression has succeeded (positive lookahead) or failed (negative lookahead). Every time a conditional expression evaluation completes, the current text position is reset to where it started:
Rex.IfNot("-->").Then(Rex.AnyChar).NoneOrMore();
Rex.Char('a').NoneOrMore().FollowedBy('b');

Compiled Expression

An individual regular expression can be compiled into a delegate of the int RexEvaluator(string content, int offset, int length) type. Given a text input and boundaries, the dynamic method finds and returns the length of the longest matched sub-string, or -1 if no match succeeds:

var digit  = Rex.Char("0-9");
var number = digit.OneOrMore();
var eval   = Rex.Compile(number);
var input  = "x123y";

Console.WriteLine(eval(input, 0, input.Length));     // outputs -1
Console.WriteLine(eval(input, 1, input.Length - 1)); // outputs  3
Console.WriteLine(eval(input, 1, 2));                // outputs  2

Lexical Analysis

Multiple regular expressions (each associated with an appropriate token ID) are then converted into a DFA by a lexical builder. The DFA is represented by lexical states and the transitions between them. Each lexical state corresponds to one or more nodes from the source expressions. When a state contains a regular expression accept node (added automatically), it is considered a final state. If there are several accept nodes, the one with the lowest token ID is selected as final for the state. When a source text (represented by a sequence of UTF-16 code units) is processed, the following workflow is applied to choose the next state:

  • Each state contains an ordered list of disjoint Unicode ranges.
  • If some range contains the pending char code, the Unicode category of that char code is considered.
  • If there is a mapping between some category and the char code, the transition defined by that category is selected.
  • If no category matches, the default transition for the range is selected.
  • If no range contains the char code, the default categories are tested.
  • If neither a range nor a default category corresponds to the char, the default state transition is selected.
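
One practical consequence of the lowest-token-ID rule above is that a terminal declared earlier in the grammar wins when several expressions accept the same lexeme. For example, using the terminal API covered below:

grammar.CreateTerminal("if", Rex.Text("if"));               // lower token ID, declared first
grammar.CreateTerminal("id", Rex.Char("a-z").OneOrMore());  // would also match "if"
// The input "if" is recognized as the "if" terminal, not as "id".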

Surrogates

The lexical analyzer provides natural support for code points from the extended Unicode range (0x10000-0x10FFFF). To achieve this, a supplementary surrogate state is generated when necessary. All codes which do not fit into a single UTF-16 code unit are then moved to this "extra" state. Finally, the surrogate state is connected to the root by introducing a special high-surrogate (0xD800-0xDBFF) range transition.
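
For reference, a supplementary code point maps onto a UTF-16 surrogate pair with the standard Unicode arithmetic below (independent of the library); the high unit is what the root-level high-surrogate transition reacts to:

int codePoint = 0x1F600;                             // any code point in 0x10000-0x10FFFF
int high = 0xD800 + ((codePoint - 0x10000) >> 10);   // 0xD83D, taken by the high-surrogate transition
int low  = 0xDC00 + ((codePoint - 0x10000) & 0x3FF); // 0xDE00, consumed inside the surrogate state
Console.WriteLine($"{high:X4} {low:X4}");            // outputs D83D DE00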

Lookaheads

To support conditional regular expression nodes, a lexical state is extended to have a tree-like structure: each state contains False and True child state nodes. When both are null, we are dealing with a regular lexical state; when either or both are set, the state is the initial state of a lookahead. In that case the lookahead state is processed individually: if a final state is encountered, the current position is reset and the True state becomes the current state; otherwise the False state is chosen.
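
The negative-lookahead pattern from the Regular Expressions section produces exactly such a tree: at every position the "-->" sub-expression is evaluated, the position is reset, and matching continues through the True branch only while that check fails. A hypothetical comment-like terminal built this way:

var commentBody = Rex.IfNot("-->").Then(Rex.AnyChar).NoneOrMore(); // True branch while "-->" is absent
grammar.CreateTerminal("comment", Rex.Text("<!--").Then(commentBody).Then("-->"));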

Grammar

The grammar allows you to configure the terminals, whitespaces and non-terminals which are later used for parser state generation. There are also several properties specified when the grammar is created:

  • Ignore Case: when specified, each char-set in the grammar is extended with to-lower and to-upper variants, and letter Unicode categories are updated accordingly. False by default.

  • Conflict Resolver: when specified, overrides the default conflict resolver with a custom one.

Terminals

Terminals are defined by a name, a regular expression and an incrementally generated token ID (so a terminal defined earlier in the grammar has priority over later ones). The name can then be used to reference the terminal in a production body or in a reducer. Examples:

grammar.CreateTerminals("+", "-", "(", ")");
grammar.CreateTerminal("num", Rex.Char("0-9").OneOrMore());
grammar.CreateTerminal("expr:str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
  • CreateTerminals defines a list of static terminals where the expression for each item is generated from its name.

  • An expr:str-like construct defines a complex name where the terminal is automatically reduced to the 'expr' non-terminal (which should be defined before this line). Equivalent to:

grammar.CreateTerminal("str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
grammar.AddRule("expr:str", "str");
  • lazy: true prevents any transition from an expression final state once it's reached.

Whitespaces

Whitespaces are a special kind of terminal which can appear at any position within a production, between other symbols:

grammar.CreateWhitespace("ws", Rex.Char(' '));
grammar.CreateWhitespace("lb", Rex.Char('\n'), isLineBreak: true);

isLineBreak: true tags the whitespace with a LineBreak attribute. In that case, when the whitespace is recognized it will deny/allow productions which are constrained by the [NoLB]/[LB] modifiers.

Non-Terminals

Non-terminals represent a compound type of symbol. A non-terminal is defined by a name and a list of reduction rules (productions). Use the CreateNonTerminals method to declare a list of non-terminals for the grammar:

grammar.CreateNonTerminals("S", "L", "R");

Productions

Productions describe how a sequence of symbols should be reduced to the owning non-terminal. A rule's name must be unique within the grammar, and the name's prefix (separated by the ':' character) defines the name of the parent non-terminal. If there is only one rule defined for a non-terminal, the ':' can be skipped and the production name will be equal to its parent's name. Note that the rule is just added to the owner's rules collection; no auxiliary symbols are created in this case.

grammar.AddRule("S:0", "L = R");
grammar.AddRule("S:1", "R");
grammar.AddRule("L:0", "id");
grammar.AddRule("L:1", "* R");
grammar.AddRule("R", "L");

Two more special symbols are allowed in a production body: [LB] and [NoLB]:

grammar.AddRule("S:0", "A [NoLB] B");
grammar.AddRule("A:0", "a [LB]");
grammar.AddRule("A:1", "[LB]");
  • [LB] requires that at least one whitespace marked as LineBreak is encountered before the production can continue.
  • In contrast, the [NoLB] modifier cancels the production in case a LineBreak is recognized at that position.

There are several more options available to configure:

  • ReduceOn and ShiftOn define a default conflict resolver behavior:
grammar.AddRule("expr:add", "expr + expr")
    .ReduceOn("+", "-")
    .ShiftOn("*", "/");

The code above forces the grammar to choose the Reduce action when the expr + expr state is reached and either the + or - terminal is encountered as a lookahead. Conversely, if a * or / lookahead appears next, the Shift action is preferred. Note that the same result can be achieved with a custom conflict resolver, as sketched below.
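
A sketch of how the same precedence could be expressed with a custom resolver, assuming the ParserAction type exposes Shift and Reduce values (the interface itself is covered in the Conflict Resolver section):

public override ParserAction ResolveShiftConflict(Symbol symbol, Production production)
{
    if (production.Name == "expr:add")
    {
        return symbol.Name == "*" || symbol.Name == "/"
            ? ParserAction.Shift    // higher-precedence operator follows: keep shifting
            : ParserAction.Reduce;  // "+" or "-" follows: reduce expr + expr first
    }

    return base.ResolveShiftConflict(symbol, production);
}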

  • OverrideLookaheads restricts the lookahead set allowed for a production. Consider the example:
grammar.AddRule("stmnt:expr", "expr ;");
grammar.AddRule("stmnt:expr_auto", "expr")
    .OverrideLookaheads("}", Symbol.EndOfSource, Symbol.LineBreak);

Here we have a declaration of an ECMA-like automatic semicolon insertion rule where stmnt:expr_auto is allowed to reduce only when a } or end-of-source lookahead is encountered next. The third symbol (Symbol.LineBreak, i.e. [LB]) indicates that if a line break appeared, any grammar-valid lookahead will be accepted.

Conflict Resolver

A conflict resolver should implement the following interface:

ParserAction ResolveShiftConflict(Symbol symbol, Production production)

Handles a shift-reduce conflict. The method accepts a production which is ready to be reduced and a symbol representing the lookahead that follows. It returns the action to be selected by the parser.

Production ResolveReduceConflict(Symbol symbol, Production first, Production second)

Handles a reduce-reduce conflict. Given the two specified productions and a lookahead symbol, the resolver needs to decide which one to reduce.
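
A minimal sketch of such a resolver, with illustrative production names:

public override Production ResolveReduceConflict(Symbol symbol, Production first, Production second)
{
    // Illustrative policy: when the hypothetical "stmnt:expr" production competes, prefer it.
    if (first.Name == "stmnt:expr" || second.Name == "stmnt:expr")
        return first.Name == "stmnt:expr" ? first : second;

    return base.ResolveReduceConflict(symbol, first, second);
}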

ParserItem[] ResolveCoreConflicts(Symbol symbol, ParserItem[] core);

This one doesn't really represent a conflict, but rather a convenient way to modify a parser state's core items when required. It is executed every time a new state is created from a core. Consider the following case:

grammar.AddRule("expr:obj", "{ obj_init }");
grammar.AddRule("stmnt:expr", "expr ;");
grammar.AddRule("stmnt:block", "{ stmnt_list }");

Let's say we would like to exclude the expr:obj production from a core where stmnt:block is also available. The following code does the job:

public override ParserItem[] ResolveCoreConflicts(Symbol symbol, ParserItem[] core)
{
    if (symbol.Name == "{" && core.Length == 2)
    {
        return core[0].Production.Name == "stmnt:block"
            ? new ParserItem[] { core[0] }
            : new ParserItem[] { core[1] };
    }

    return base.ResolveCoreConflicts(symbol, core);
}

Hence, the stmnt:block production prevails over expr:obj.

Runtime

A user-defined parser is responsible for handling the tokens recognized during text processing and building an output of the appropriate type. The parser should inherit from one of the base types provided by the library. The base type defines an abstract interface which is implemented during a compilation phase. Token and production handlers must be marked with the corresponding attributes, which are covered later in this section.

String Parser

A custom string parser should inherit the StringParser class from the library. The base class must be initialized with a string whose content will be used as the source.

Text Parser

A custom text parser should inherit the TextParser class from the library. The base class must be initialized with a TextReader instance and represents the default implementation of a sequential parser that reads the source in chunks. It also supports asynchronous processing of the source.
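
A minimal skeleton, assuming the base constructor simply takes the TextReader described above (handler attributes are covered in the Reducers section):

public class MyTextParser : TextParser
{
    public MyTextParser(TextReader reader)
        : base(reader) // assumed constructor shape: the reader is consumed in chunks by the base class
    {
    }

    // [CompleteToken]/[Reduce]/[Handle] reducers go here, see below.
}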

Reducers

There are three types of reducers available:

[CompleteToken("num")]
protected int CompleteNumber() => Int32.Parse(GetLexeme());

A token handler is defined by a symbol name and is called each time the terminal is recognized. It must be parameterless and relies on the parser's StartPosition and CurrentPosition properties (both StringParser and TextParser implement a GetLexeme() method for simplicity). It may return a value, which will then be placed on an evaluation stack. If the handler has a void return type, or there is no handler defined for the terminal at all (keywords, brackets, etc.), the stack remains intact.

[Reduce("expr:add")]
protected int Add(int x, int y) => x + y;

A production handler is defined by a production name and is executed when a reduce grammar action is performed (after the appropriate lookahead is reached). The method's parameters are resolved from the stack and removed afterwards. The method may return a value, which will be placed on top of the stack. Note that there is no compile-time validation of parameter counts and types across reducers, so design them carefully: a mismatch will lead to a runtime exception.

[Handle("<tag attrs />")]
[Handle("<script attrs />")]
[Handle("<script attrs > %script% </script >")]
protected void AppendElement()

A prefix handler is defined by a sequence of symbols formed on the top of the stack and is called immediately after the last symbol is pushed. Similar to a production handler, it may read parameters from the stack, but the stack itself is not altered, so no return value is allowed. There is an important restriction: the symbols must represent some valid production prefix. So, for example, you can handle the expr, expr + or expr + expr prefixes, but not expr + expr *. The [LB] and [NoLB] symbols are also not taken into account when prefixes are compared.
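
Putting the pieces together, a small parser over the expr:add production shown earlier might look like the sketch below. The constructor shape follows the StringParser description above and is an assumption, and how the grammar is bound to this type during the compilation phase is not covered on this page.

public class CalcParser : StringParser
{
    public CalcParser(string source)
        : base(source) // assumed: the base class is initialized with the source string
    {
    }

    [CompleteToken("num")]
    protected int CompleteNumber() => Int32.Parse(GetLexeme()); // pushes the parsed value onto the stack

    [Reduce("expr:num")]
    protected int Number(int value) => value;                   // expr -> num

    [Reduce("expr:add")]
    protected int Add(int x, int y) => x + y;                   // expr -> expr + expr
}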