Proposal: Error recovery, partial connections, and format detection

Use Cases

Our goal is for Spicy to better handle four situations that currently just abort parsing altogether:

Recovery from parse errors. When running into a parsing problem, we want the parser to attempt recovery by finding a point in the subsequent input data where it can continue parsing. This will necessarily remain heuristic because once we fail to parse a part of the input, we can generally no longer be certain about the semantics of the remaining data. However, often we should be able to guess quite well what to do.
Gap recovery. A special-case of error recovery is recovering from gaps in the input stream, such as after packet loss: we want the parser to attempt finding a place after the gap where it can continue. What makes this slightly different from standard recovery is that we may have information about the size of the gap (e.g., through TCP sequence numbers), which can give us additional leverage for finding the right spot (example: if a gap is located fully inside an HTTP body, the Content-Length header will tell us exactly where to continue parsing the next request).
Partial connections. If the parser misses the beginning of the input stream for some reason—e.g., because Zeek didn't see a connection's initial packets—we want it to still try to parse the input. Similar to error recovery, this requires heuristically finding a suitable point in the input data where to start parsing.
Lack of knowledge which protocol/format is in use. It may not always be known upfront what protocol/format an input stream is actually using (e.g., connections on non-standard ports; files lacking a MIME type specification). In such cases, we want Spicy to try parsing the input in parallel with any relevant grammars to see which one can process it successfully (similar to Zeek's Dynamic Protocol Detection feature).

These four situations share some similarities: (1)-(3) all need to guess a location where to start/continue parsing. They may then all still fail if the location wasn’t guessed correctly. If so, they should continue searching for further starting points to use instead. While (4) doesn’t need to find a starting point, it may similarly end up running into failures when it’s not parsing the right input. In that case, we want Spicy to turn off that specific parser altogether.

Proposed Language support

We’ll start by looking at error recovery and then re-use what we develop here for the other two uses cases. The plan is to implement these capabilities in that order, too.

Error recovery

The basic idea is that once we encounter a parsing error, we stop normal processing and start searching the subsequent input for a point where we can resume parsing, skipping any data in between. We call this process synchronization.

During synchronization, to find a spot where to continue, we identify possible synchronization points inside the parser's Spicy grammar: unit fields following the failure point for which it's possible to (somewhat) reliably search for them, even if we have to skip other fields in between.

Here's an example:

type Foo = { ... }

type Bar = unit {
	magic: b"bar";
   ...
};

type Test = unit {
   a: Foo;
   b: Bar;
   c: uint32;
};

If there’s an error while parsing Test::a, we’d normally just abort processing. However, looking at the following field Test::b, we see that, due to the definition of its type Bar, it must always start with the literal bar. So, instead of aborting after the failure with Test::a, we can instead start searching subsequent input for bar. Once we find that, we resume processing there by directly jumping to parsing b: Bar from that location. We can then proceed normally from there on.

This is the basic approach: once an error occurs, we (1) identify the closest possible synchronization point (which may be further up in the parse tree, outside of the current unit), (2) start searching for it inside the incoming input stream, and (3) continue parsing where we find a match.

In practice, there are a couple additional twists to this basic scheme to make this viable:

We don’t want to consider just any possible synchronization point; otherwise, for example, any bytes literal would become a candidate. That's not desirable because: (1) depending on a format's semantics, some literals will be more characteristic of a good starting point than others; and (2) if recovery could continue at more or less arbitrary locations, parsing semantics might easily get into an ill-defined state, with some information present and others missing.

Therefore, we require the grammar writer to explicitly mark fields that they want consider as synchronization points through an attribute &synchronized. In the example above, we’ll hence actually have to write b: Bar &synchronized to get the desired recovery when parsing of Test::a fails.
Once we have located a synchronization point and started parsing there, we need a notion of success: at what point can we assume the parsing to indeed be back on track? We employ two criteria here:
1. For explicit confirmation, units may call a built-in method confirm() to signal success. They can likewise call reject() to flag that the parser is not doing well.
2. For implicit confirmation, TODO: we need to figure something out, like at the end of the current unit? Or do not want implicit at all?
If another parse error occurs after resuming parsing, but before confirmation, we switch back to looking for another synchronization point (without reporting a failure to the host application). We say that the parser is in “trial mode” during this period between the startif synchronization and eventual confirmation.

Note that “trial mode” is a global state of the parsing process: as a Spicy grammar is working its way through the various units, it will normally be in “standard mode”. Once a parsing error happens at a location where we can identify a synchronization point, we flip parsing into “trial mode” until synchronization concludes successfully.
A Spicy unit may be interested in tracking the state of the synchronization process. Therefore, we provide a couple of additional unit hooks as callbacks:
1. on %error() { … } : Executes for the initial parse error. (This hook already exists. It will not execute for errors during trial mode.)
2. on %synced() { … }: Executes when a synchronization point has been found and parsing resumes thre, just before the parser begins processing the corresponding field (with the parsing state, like current position, set up accordingly already).
3. on %confirmed() { … }: Executes when trial mode ends with success, through either implicit or explicit confirmation.
4. on %rejected() { … }: Executes when trial mode ends with failure, either through explicit rejection or another parser error.
A unit’s %error hook executes whenever one of its fields (or sub-fields) runs into trouble, as usual. %synced() executes for the unit that defines the synchronization point. %confirmed and %rejected execute for the unit where confirmation or rejections is triggered.

This leaves the question of what fields can actually act as synchronization points. In the example above, we leveraged a literal that the parser can search for. There are more options, though, per the following list.

Possible synchronization points are:

Fields whose parsing begins with a literal of a type that supports look-ahead matching: bytes, regexp, uint{8|16|32|64}, int{8|16|32|64}. The synchronization process will search for the literal.
Fields located a fixed amount of bytes after the beginning of the field triggering the parse error. Example:

type A = unit { ... }
type B = unit { ... }

type X = unit {
	a: A &size=10;
	b: B &synchronize;
};

Here, b is a synchronization point for a because we know that b must begin 10 bytes after the first byte of a. The synchronization process will simply skip over any bytes that have not been consumed yet.

The field’s type is a unit type that explicitly defines synchronization patterns to search for. For that, we add two new unit attributes: %synchronize-at = <LITERAL> and %synchronize-after = <LITERAL>. In both cases, LITERAL must be literal of a type that support look-ahead, similar to above. The synchronization process will search for this literal. Once found, it will resume parsing the unit itself either at the literal’s position (%synchronize-at) or right afterwards (%synchronize-after). Example:

type Foo = unit {};

type Bar = unit {
       %synchronize-after = /[dD][aA][tT][aA]-coming/;
       data: bytes &size=42;
};

type Test = unit {
	a: Foo;
	b: Bar &synchronize;
};

Here, Test::b is a synchronization point for Test::a If the latter fails parsing, Spicy will start searching for a match against the regular expression [dD][aA][tT][aA]-coming, skip ahead to the end of the match, and then parse Bar::data there.

Vectors of non-fixed size provide synchronization points if the type of the contained elements is itself a synchronization point. However, vector synchronization has slightly different semantics than other cases: if one of the vector elements fails parsing, synchronization will search for the start of the next element, and the resume there with trial mode. Example:

type Foo = unit {
    x: Bar[] &while(not_done) &synchronize;
};

type Bar = unit {
       : b"abc";
       x: bytes &length=2;
};

Here, the Foo::x is a vector of Bar and marked as a synchronization point. We know that parsing of each Bar must always start with abc . So if we encounter an error during parsing one Bar, we'll start searching for the next abc to resume there.

TODO: Are there more cases where fields can support synchronization?

Additional notes

Only parse errors trigger recovery, not logic errors (such as out-of-bounds array accesses, etc.). The latter will continue to abort processing altogether if not caught otherwise.
We should add a manual way of triggering an error-plus-resynchronization sequence, too.
There are more hints we could provide to the synchronization process about where to restart parsing. For example, it would be helpful to know where packet boundaries are inside the input stream. The old Spicy prototype had the concept for “marks” inside the bytes type for that. Marks would tag specific offsets that synchronization can then search for. We can look into this later.

Gap recovery

A gap in the input stream is a special-case of error recovery: while parsing breaks when we hit it, we may well be able to resynchronize on input following the gap by kicking off the synchronization process.

There are a couple of additional twists here:

We may handle a gap better than standard error recovery if we know its exact size (which, e.g., TCP may give us through sequence numbers). In that case, we may be able to identify further synchronization points based on absolute offsets (rather than fixed distance to a predecessor field).
We currently don’t have a way to pass gap information into a Spicy parser. The key here is that that information needs it be injected into the input stream so that the parser can work its way up to the gap, skip over it, and then continue. To support that, we can extend the runtime library’s implementation of the bytes type to record gap information at the corresponding offset through a special chunk.

Partial connections

We can handle partial connections with our error recovery machinery as well. For that, we need to extend the parser’s external API so that host applications can flag to the parser that it is missing initial data. The parser will then kick off synchronization right away, using trial mode and confirmation just as during error recovery.

Format Detection

The error recovery machinery can likewise support format detection. We can extend the API for generated parsers with a flag indicating that it’s uncertain if the input data actually matches the format the parser expects. If so, the parser will start out directly in trial mode (without any prior synchronization). If the input gets confirmed, we can continue parsing normally. If the input gets rejected, we abort the parser completely (rather than resume trial mode).

In addition, we will need an API to start multiple parsers in parallel on the same input stream. They’d all start in trial mode, with one hopefully succeeding to parse the input. The old prototype had try_connect* functions for this.

Implementation

Synchronization

Synchronization may be triggered through different means (a parser error; directly in case partial connections, etc), but then always proceeds in the same way:

If there’s a synchronization point available, we switch to a different code path that advances the input position based on the field type’s synchronization strategy. If we need to move to a specific offset, we just do that; if we need to search for a literal, we start scanning the input for it; etc. Then:
1. If we do not find what we’re looking for, we report a parse error to the host application.
2. If we find what we're looking for, we switch the parser into “trial mode” (see below for more on its implementation) and then go back to the standard parsing code path.
If there’s no synchronization point available, we abort just as we’ve always done for parse errors.

Note that the condition “is there a synchronization point available?” depends on the current field, but can be answered statically at the time we generate the parser. We do not need to evaluate this at runtime. That also implies that “synchronization” is not a mode we need to track at runtime; we activate the synchronization process simply by jumping to a different code path during parsing.

Trial Mode

The parsing code for a generated Spicy parser tracks a new boolean flag around as part of the runtime parsing state: trial_mode. If trial mode is off, all processing happens normally. If trial mode is on, that indicates that we are currently not certain if we're parsing the input with the correct part of the grammar (or if the input matches our grammar at all).

In trial mode, parsing proceeds just as normal except for two things:

We change what happens when we encounter a parse error: instead of bailing out, we switch into synchronization mode (see above).
If a unit calls confirm() , we declare success by: switching off trial mode, executing the %confirmed hook, and then proceeding parsing normally.
If a unit calls reject(), we declare failure by: executing the %reject hook and then either enabling synchronization mode or turning off the parser altogether, depending which use case we're in. (Note: Not quite sure, but maybe trial_mode is actually not a boolean, but a three-state flag to track whether we want synchronize or abort on rejection.)

Pointers

Tests from old test suite here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly