The library fastparse-memoize adds the memoize method to fastparse's parsers.
When fastparse processes grammars that require backtracking, parsing rules may be applied repeatedly at the same place in the input.
In most cases, the parsing results will also be the same.
Memoizing means that a parser will not be tried again at the same place; instead, the previous parsing result is fetched from the cache and is immediately available.
- Import the Memoize symbol and use Memoize.parse instead of fastparse.parse. Use Memoize.parseInputRaw instead of fastparse.parseInputRaw.
- Import the Memoize.MemoizeParser symbol and use .memoize on selected parsing rules.
Consider a simple language consisting of the digit 1, the + operation, and parentheses.
Example strings in that language are 1+1 and 1+(1+1+(1))+(1+1).
This language is parsed by the following straightforwardly written grammar:
import fastparse._
def program1[$: P]: P[String] = P(expr1 ~ End).!
def expr1[$: P] = P(plus1 | other1)
def plus1[$: P]: P[_] = P(other1 ~ "+" ~ expr1)
def other1[$: P]: P[_] = P("1" | ("(" ~ expr1 ~ ")"))
val n = 30
val input = "(" * n + "1" + ")" * n // The string ((( ... (((1))) ... ))) is the input for the parser.
fastparse.parse(input, program1(_)) // Very slow.
This program works but is exponentially slow on certain valid expressions, such as ((((1)))) with many parentheses.
The reason for slowness is that the parser tries plus1 before other1.
For the string ((((1)))), the plus1 rule will first try applying other1 but will eventually fail because it will not find the symbol +.
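When plus1 fails, the parser backtracks to the position where plus1 started and tries other1 there again.
So, the amount of work for other1 is doubled at every level of nested parentheses.
This leads to a number of other1 applications that grows exponentially with the nesting depth.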
The exponential slowness is already apparent with
JVM warmup does not lead to any speedup of the parsing, because the slowness is algorithmic.
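The doubling can be observed without fastparse. The following is a minimal sketch, not the library's code: a hand-written backtracking recognizer for the same grammar that counts how often the other rule is applied to the nested-parentheses input. The names BacktrackingCount, otherCalls, and countCalls are made up for this illustration.

```scala
object BacktrackingCount {
  // Each rule returns Some(next position) on success and None on failure.
  var otherCalls = 0L

  // expr = plus | other: try plus first, then backtrack and try other.
  def expr(s: String, i: Int): Option[Int] =
    plus(s, i).orElse(other(s, i))

  // plus = other "+" expr
  def plus(s: String, i: Int): Option[Int] =
    other(s, i).flatMap { j =>
      if (j < s.length && s(j) == '+') expr(s, j + 1) else None
    }

  // other = "1" | "(" expr ")"
  def other(s: String, i: Int): Option[Int] = {
    otherCalls += 1
    if (i < s.length && s(i) == '1') Some(i + 1)
    else if (i < s.length && s(i) == '(')
      expr(s, i + 1).flatMap { j =>
        if (j < s.length && s(j) == ')') Some(j + 1) else None
      }
    else None
  }

  // Parses "(" * n + "1" + ")" * n and reports how many times other was applied.
  def countCalls(n: Int): Long = {
    otherCalls = 0L
    val input = "(" * n + "1" + ")" * n
    require(expr(input, 0).contains(input.length), "input should parse completely")
    otherCalls
  }
}
```

In this sketch the count C(n) satisfies C(n) = 2 * C(n - 1) + 2 with C(0) = 2, so countCalls(10) returns 4094 and each additional level of parentheses roughly doubles the work.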
The repeated parsing with the rule other1 can be avoided if the results of the parsing are memoized.
The revised code attaches a .memoize call to other1 (renamed other2) but leaves all other parsing rules unchanged:
import fastparse._
import io.chymyst.fastparse.Memoize
import io.chymyst.fastparse.Memoize.MemoizeParser
def program2[$: P]: P[String] = P(expr2 ~ End).!
def expr2[$: P] = P(plus2 | other2)
def plus2[$: P]: P[_] = P(other2 ~ "+" ~ expr2)
def other2[$: P]: P[_] = P("1" | ("(" ~ expr2 ~ ")")).memoize
val n = 30
val input = "(" * n + "1" + ")" * n // The string ((( ... (((1))) ... ))) is the input for the parser.
Memoize.parse(input, program2(_)) // Very fast.
A parsing rule such as other1 is a function of type P[_] => P[_].
When that parsing rule is tried, its argument of type P[_] contains the current parsing context, including the current position in the input text.
The result is an updated parsing context, including the information about success or failure.
The entire updated parsing context is cached.
Whenever the same parsing rule is tried again at the same position in the input text, the cached result is returned immediately.
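The caching idea can be sketched in plain Scala, independently of fastparse. This is an illustration under simplifying assumptions and not the library's implementation: the hand-written recognizer below caches only the resulting position of the other rule, keyed by the input position, whereas the library caches the entire updated parsing context. The names MemoizedCount, cache, and countCalls are hypothetical.

```scala
import scala.collection.mutable

object MemoizedCount {
  // Counts only the applications of other that were actually computed.
  var otherCalls = 0L

  // Cache for the other rule: input position -> result obtained at that position.
  private val cache = mutable.Map.empty[Int, Option[Int]]

  def expr(s: String, i: Int): Option[Int] =
    plus(s, i).orElse(other(s, i))

  def plus(s: String, i: Int): Option[Int] =
    other(s, i).flatMap { j =>
      if (j < s.length && s(j) == '+') expr(s, j + 1) else None
    }

  // Memoized rule: return the cached result if this position was tried before.
  def other(s: String, i: Int): Option[Int] =
    cache.get(i) match {
      case Some(hit) => hit
      case None =>
        val result = otherUncached(s, i)
        cache(i) = result
        result
    }

  private def otherUncached(s: String, i: Int): Option[Int] = {
    otherCalls += 1
    if (i < s.length && s(i) == '1') Some(i + 1)
    else if (i < s.length && s(i) == '(')
      expr(s, i + 1).flatMap { j =>
        if (j < s.length && s(j) == ')') Some(j + 1) else None
      }
    else None
  }

  // The cache is keyed by position only, so it must be cleared for each new input.
  def countCalls(n: Int): Long = {
    otherCalls = 0L
    cache.clear()
    val input = "(" * n + "1" + ")" * n
    require(expr(input, 0).contains(input.length), "input should parse completely")
    otherCalls
  }
}
```

With the cache in place, the rule is computed once per position at which it is tried, so countCalls(30) returns 31 in this sketch instead of growing exponentially.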
This works for most rules that do not have user-visible side effects.
Memoization should be applied carefully and tested. In some cases, memoization leads to incorrect parsing results.
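One way memoization can change behavior is by skipping a side effect. The sketch below does not use fastparse and does not claim to reproduce any specific failure of the library; it only shows, with a hypothetical position-keyed cache (memoized) and a hypothetical rule (digit) that logs its matches, that a cached rule runs its side effect once even when it is tried twice.

```scala
import scala.collection.mutable

object SideEffectDemo {
  // Positions where the rule matched; appending here is a user-visible side effect.
  val matchedAt = mutable.ListBuffer.empty[Int]

  // A rule for the digit 1 that logs each successful match.
  def digit(s: String, i: Int): Option[Int] =
    if (i < s.length && s(i) == '1') { matchedAt += i; Some(i + 1) } else None

  // Hypothetical helper: caches a rule's result per input position.
  // Assumes the returned function is used with a single input string.
  def memoized(rule: (String, Int) => Option[Int]): (String, Int) => Option[Int] = {
    val cache = mutable.Map.empty[Int, Option[Int]]
    (s, i) =>
      cache.get(i) match {
        case Some(hit) => hit
        case None =>
          val result = rule(s, i)
          cache(i) = result
          result
      }
  }

  // Applies the rule twice at position 0 and returns the positions that were logged.
  def logAfterTwoTries(rule: (String, Int) => Option[Int]): List[Int] = {
    matchedAt.clear()
    rule("1", 0)
    rule("1", 0)
    matchedAt.toList
  }
}
```

Here logAfterTwoTries(SideEffectDemo.digit) yields List(0, 0), while logAfterTwoTries(SideEffectDemo.memoized(SideEffectDemo.digit)) yields List(0): the parsing result is the same, but code that relies on the log sees different behavior.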