automata: fix incorrect offsets reported by reverse inner optimization
Sadly it seems that my days of squashing optimization bugs are still
before me. In this particular case, the reverse inner literal
optimization (which is a new optimization introduced in regex 1.9)
resulted in reporting incorrect match offsets in some cases. The
offending case here is:
$ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
0:1:9:888:77:66
The above reports a match at 1..9, but the correct match is 0..9. The
problem here is that the reverse inner literal optimization is being
applied, which splits the regex into three (conceptual) pieces:
1. `(?:(\d+)[:.])?(\d{1,2})`
2. `[:.]`
3. `(\d{2})`
The reverse inner optimization works by looking for occurrences of (2)
first, then matching (1) in reverse to find the start position of the
match and then searching for (3) in the forward direction to find the
end of the match.
The problem in this particular case is that (2) matches at position `3`
in the `888:77:66` haystack. Since the first section of numbers is
optional, the reverse inner optimization believes a match exists at
offset `1` by virtue of matching (1) in reverse. That is, the
`(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at
all. The reverse search here is correct in isolation, but it leads to an
overall incorrect result by stopping the search early. The issue is that
the true leftmost match requires (2) to match at 6..7, but the occurrence
at 3..4 is found first, so it is tried first and yields an incorrect
overall match.
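The failure can be reproduced with a small toy model of the three pieces.
This is only a sketch: it hard-codes this one pattern, models only the
start offset (not the end), and every function name in it is hypothetical
rather than taken from regex-automata.

```rust
// Toy model, not regex-automata code. All names here are hypothetical.

fn is_sep(b: u8) -> bool {
    b == b':' || b == b'.'
}

/// Piece (1) in reverse: the smallest `start` such that `h[start..end]`
/// matches `(?:(\d+)[:.])?(\d{1,2})`, or `None` if there is no such start.
fn rev_prefix_start(h: &[u8], end: usize) -> Option<usize> {
    let mut best: Option<usize> = None;
    // `(\d{1,2})` takes the 1 or 2 bytes immediately before `end`.
    for n in 1..=2 {
        if end < n || !h[end - n..end].iter().all(u8::is_ascii_digit) {
            continue;
        }
        let s = end - n;
        let mut cand = s;
        // Optional `(\d+)[:.]`: a separator right before `s`, preceded by
        // at least one digit; extend left over the whole digit run.
        if s >= 2 && is_sep(h[s - 1]) && h[s - 2].is_ascii_digit() {
            let mut t = s - 2;
            while t > 0 && h[t - 1].is_ascii_digit() {
                t -= 1;
            }
            cand = t;
        }
        best = Some(best.map_or(cand, |b| b.min(cand)));
    }
    best
}

/// Piece (3) forward: does `(\d{2})` match starting at `at`?
fn suffix_matches(h: &[u8], at: usize) -> bool {
    at + 2 <= h.len() && h[at..at + 2].iter().all(u8::is_ascii_digit)
}

/// Start offsets of every way to split a match around an occurrence of (2),
/// in the order the occurrences of (2) appear in the haystack.
fn candidate_starts(h: &[u8]) -> Vec<usize> {
    (0..h.len())
        .filter(|&i| is_sep(h[i]) && suffix_matches(h, i + 1))
        .filter_map(|i| rev_prefix_start(h, i))
        .collect()
}

/// Naive strategy: trust the first occurrence of (2) that works.
fn naive_start(h: &[u8]) -> Option<usize> {
    candidate_starts(h).first().copied()
}

/// Reference answer: the smallest start over all occurrences of (2).
fn leftmost_start(h: &[u8]) -> Option<usize> {
    candidate_starts(h).into_iter().min()
}
```

On `888:77:66`, `naive_start` returns `Some(1)` because it stops at the
separator at 3..4, while `leftmost_start` returns `Some(0)` because it also
considers the separator at 6..7.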
To fix this, we add another "trip wire" to the reverse inner
optimization (of which there are already several) that tries to detect
cases where it cannot prove that the match it found is actually the
leftmost match. Namely, if it reports a match offset greater than the
start of the search and otherwise *could* have kept searching, then we
don't know whether we have the true leftmost match. In that case, we
bail on the optimization and let a slower path take over.
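The shape of that check can be sketched as below. This is an illustration
of the condition as described here, not the actual patch: the type and
function names are hypothetical, and `could_continue` is a plain boolean
standing in for however the real search decides that it "could have kept
searching".

```rust
// Hypothetical names; a sketch of the trip wire condition only.

#[derive(Debug, PartialEq)]
enum FastPath {
    /// The reported start offset can be trusted as leftmost.
    Start(usize),
    /// Leftmost-ness cannot be proven; let a slower path take over.
    Bail,
}

/// `search_start` is where the overall search began, `found_start` is the
/// start offset reported by the reverse search, and `could_continue` says
/// whether the reverse search could still have made progress.
fn check_trip_wire(
    search_start: usize,
    found_start: usize,
    could_continue: bool,
) -> FastPath {
    if found_start > search_start && could_continue {
        FastPath::Bail
    } else {
        FastPath::Start(found_start)
    }
}
```

For the `888:77:66` case the search starts at 0 and the reverse search
reports 1, so, assuming `could_continue` is true there,
`check_trip_wire(0, 1, true)` yields `FastPath::Bail`. A reported start
equal to the search start is always accepted, since nothing can be further
left.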
This is yet another example of how the nature of regex searching, and in
particular leftmost searching, inhibits the composition of different
regex strategies. Or at least, makes them incredibly subtle.
Fixes rust-lang#1060