Parse Error Handling
Clone this wiki locally
The proper handling of illegal input (with regard to the language defined by your grammar) is a key feature of any parser that is to be used in real-world projects, and it’s one of the big drawbacks of regular expressions. For example, if you have a user provide input in a custom DSL you can be sure he or she will make syntactic and/or semantic mistakes at some point. The latter ones will have to be caught and reported by higher levels of your application, however, the syntactic ones can and should be caught, reported and dealt with directly in the parser.
parboiled gives you a choice of how you would like your parser to deal with parse errors by offering 4 different standard ParseRunner implementations.
The BasicParseRunner is the simplest option. It does not perform any error handling and simply causes the parsing run to mismatch if the input is invalid with regard to the given grammar rule. In that regard it behaves just like regular expression engines. It performs exactly one parsing run and is the fastest way to determine whether a given input conforms to the language defined by the parser grammar.
The RecordingParseRunner adds only one little feature to the BasicParseRunner: It keeps track of the furthest input location successfully matched in the stream of input characters. The location immediately following it must be an error location if the given grammar root rule did not match. Just like the BasicParseRunner the RecordingParseRunner always performs exactly one parsing run.
The ReportingParseRunner does not offer any error recovery functionality but properly reports the first parse error found in the given input. It performs exactly one parsing run on error free input but internally triggers two more runs if the input contains a parse error. During the second run the ReportingParseRunner records the error location of the first parse error is and during the third run it “watches” the behavior of the parser as it tries to match the erroneous character in order to create a meaningful error message for the user. It then instantiates an InvalidInputError and adds it to the list of parse errors returned in the ParsingResult.
The RecoveringParseRunner is the most complex of the 4 standard ParseRunner implementations as it provides for automatic, intelligent parse error recovery and enables the complete parsing of the input text even in the presence of parse errors. The strategy that parboiled uses is similar (though somewhat superior) to the one employed by ANTLR, which itself is based on the thoughts of many smart people having thought about parsers already a long time ago. Essentially, parboiled tries to perform single-character deletion, insertion or replacement upon having recognized an unexpected input, if possible. If not, parboiled finds a suitable resynchronization rule in the current rule stack and consumes all illegal characters up to the point where the parser can be resynchronized to continue parsing.
Parse Error Recovery in Detail
Just like the ReportingParseRunner the RecoveringParseRunner first tries a fast basic run to determine, if the input is error free. If the input does not contain any invalid input all is well and the RecoveringParseRunner finishes immediately.
If the first run yields the presence of errors the RecoveringParseRunner executes the following algorithm to overcome the error:
- First a recording run is performed to determine the location of the current error.
- Secondly a reporting run is performed and a meaningful InvalidInputError is added to the list of parse errors.
- Then the character at the current error location is temporarily removed and another recording run performed in order to determine the next error location, i.e. the one after the current one. If there is no further error this means that the error was successfully overcome and the RecoveringParseRunner can finish immediately.
- During the reporting run the runner collected all rules that expected a certain (set of) character(s) at the error location and failed. For each of these rules the runner now tries to temporarily insert a fitting “virtual” character into the input stream and reruns a recording run to determine the next error location. If one of these insertions manages to render the complete input error-free the runner can finish immediately.
- The runner tries to replace the error char with a fitting “virtual” char (just like in the previous step). If this makes all input errors disappear the runner finishes right away.
- At this step the runner knows that no single character recovery is able to fix the input completely. However it now knows whether at least one action (single char deletion, insertion or replacement) allows the parser to continue parsing beyond the current error location. If this is the case the error can be overcome with a single char recovery and the runner applies the fix that allows the parser to continue parsing the furthest into the input stream. So, the runner always selects the best single char fix if one is available.
- If no single char fix was able to push the next error location beyond the current one the runner is forced to resynchronize and re-perform a recording run to determine the next error location. In order to do that the runner first identifies a resynchronization rule, which is the first parent rule that is a Sequence and has already matched at least one character. The runner determines, which characters are allowed to follow this resynchronization rule in a legal parse and skips over all characters that do not qualify.
- Now that the current parse error has been overcome by either the best single character fix or a resynchronization the runner continues parsing up to the next parse error and begins recovering from that one with step 1.
Consequences for Action Design
Since the parse error recovery strategies outlined above allow parsing even in the presence of parse errors your parser actions should be be capable of dealing with certain unexpected situations if you decide to use the RecoveringParseRunner.
Single character errors usually do not have any consequences for your actions since the matched input text seen by your parser actions already include the error correction, i.e. illegal characters are excluded and characters, which have been virtually inserted during error recovery, are included in their results. However, error recovery by resynchronization leads to the case that an unmatched rule sequence is “magically” matched, even when not all sequence sub rules matched or were even run. Since this can throw off the expected setup of the value stack parboiled executes all minimally required parser actions in and underneath the resynchronization sequence during resynchronization. All these actions will see an empty match and can therefore supply meaningful default values.