Skip to content

Latest commit

 

History

History
96 lines (64 loc) · 4.79 KB

README.md

File metadata and controls

96 lines (64 loc) · 4.79 KB

Project stage: Experimental Build

License Github Tag Maven Central

fastparse-memoize - A speedup for Li Haoyi's fastparse

The library fastparse-memoize adds the memoize method to fastparse's parsers.

When fastparse processes grammars that require backtracking, parsing rules may be applied repeatedly at the same place in the input. In most cases, the parsing results will also be the same.

Memoizing means that a parser will not be tried again at the same place; instead, the previous parsing result is fetched from the cache and is immediately available.

Usage

  • Import the Memoize symbol and use Memoize.parse instead of fastparse.parse. Use Memoize.parseInputRaw instead of fastparse.parseInputRaw.
  • Import the Memoize.MemoizeParser symbol and use .memoize on selected parsing rules.

Example

Consider a simple language consisting of the digit 1, the + operation, and parentheses. Example strings in that language are 1+1 and 1+(1+1+(1))+(1+1).

This language is parsed by the following straightforwardly written grammar:

import fastparse._

def program1[$: P]: P[String] = P(expr1 ~ End).!
def expr1[$: P]               = P(plus1 | other1)
def plus1[$: P]: P[_]         = P(other1 ~ "+" ~ expr1)
def other1[$: P]: P[_]        = P("1" | ("(" ~ expr1 ~ ")"))

val n = 30
val input = "(" * n + "1" + ")" * n  // The string ((( ... (((1))) ... ))) is the input for the parser.
fastparse.parse(input, program1(_)) // Very slow.

This program works but is exponentially slow on certain valid expressions, such as ((((1)))) with many parentheses.

The reason for slowness is that the parser tries plus1 before other1. For the string ((((1)))), the plus1 rule will first try applying other1 but will eventually fail because it will not find the symbol +. When plus1 fails, it backtracks one symbol and retries other1 again. So, the amount of work for other1 is doubled on every backtracking attempt. This leads to $2^n$ attempts to parse with other1. The exponential slowness is already apparent with $n=25$ or so.

JVM warmup does not lead to any speedup of the parsing, because the slowness is algorithmic.

The repeated parsing with the rule other1 can be avoided if the resuls of the parsing are memoized. The revised code attaches a .memoize call to other1 but leaves all other parsing rules unchanged:

import io.chymyst.fastparse.Memoize
import io.chymyst.fastparse.Memoize.MemoizeParser

def program2[$: P]: P[String] = P(expr2 ~ End).!
def expr2[$: P]               = P(plus2 | other2)
def plus2[$: P]: P[_]         = P(other2 ~ "+" ~ expr2)
def other2[$: P]: P[_]        = P("1" | ("(" ~ expr2 ~ ")")).memoize

val n = 30
val input = "(" * n + "1" + ")" * n  // The string ((( ... (((1))) ... ))) is the input for the parser.
Memoize.parse(input, program1(_)) // Very fast.

How it works

The parsing rule such as other1 is a function of type P[_] => P[_]. When that parsing rule is tried, its argument of type P[_] contains the current parsing context, including the current position in the input text. The result is an updated parsing context, including the information about success or failure. The entire updated parsing context is cached. Whenever the same parsing rule is tried again at the same position in the input text, the cached result is returned immediately.

This works for most rules that do not have user-visible side effects.

Memoization should be applied carefully and tested. In some cases, memoization leads to incorrect parsing results.