Commit 73f7889

automata: fix incorrect offsets reported by reverse inner optimization

Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is:

    $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
    0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9.

The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces:

1. `(?:(\d+)[:.])?(\d{1,2})`
2. `[:.]`
3. `(\d{2})`

The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match.

The problem in this particular case is that (2) matches at position `3` in the `888:77:66` haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset `1` by virtue of matching (1) in reverse. That is, the `(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match.

To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise *could* have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle.

Fixes rust-lang#1060
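
The failure mode described in the message can be reproduced in miniature without the regex crate. What follows is a minimal sketch, assuming a hand-written matcher for this one pattern that only recognizes ASCII digits (the real `\d` is Unicode-aware by default). The helpers `tail_at`, `match_at`, `leftmost` and `inner_first` are hypothetical and are not part of regex-automata. `inner_first` only imitates the shape of the optimization: take the first separator, walk backwards for a start offset, then run the pattern forward from that start, with no trip wire.

```rust
/// Returns true for the separator class `[:.]`.
fn is_sep(b: u8) -> bool {
    b == b':' || b == b'.'
}

/// Matches `(\d{1,2})[:.](\d{2})` anchored at `i` and returns the end offset.
/// Greedy: two leading digits are tried before one.
fn tail_at(h: &[u8], i: usize) -> Option<usize> {
    for n in [2usize, 1] {
        let sep = i + n;
        let digits_ok = sep <= h.len() && h[i..sep].iter().all(u8::is_ascii_digit);
        if digits_ok
            && sep + 3 <= h.len()
            && is_sep(h[sep])
            && h[sep + 1].is_ascii_digit()
            && h[sep + 2].is_ascii_digit()
        {
            return Some(sep + 3);
        }
    }
    None
}

/// Matches the whole pattern `(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})` anchored at
/// `start`, trying the optional prefix (longest digit run first) before
/// skipping it.
fn match_at(h: &[u8], start: usize) -> Option<usize> {
    let run = h[start..].iter().take_while(|b| b.is_ascii_digit()).count();
    for n in (1..=run).rev() {
        let sep = start + n;
        if sep < h.len() && is_sep(h[sep]) {
            if let Some(end) = tail_at(h, sep + 1) {
                return Some(end);
            }
        }
    }
    tail_at(h, start)
}

/// Reference answer: try every start offset from left to right.
fn leftmost(h: &[u8]) -> Option<(usize, usize)> {
    (0..=h.len()).find_map(|s| match_at(h, s).map(|e| (s, e)))
}

/// Imitation of the inner-literal-first strategy, without a trip wire.
/// Because `sep` is the first separator, only `\d{1,2}` can end at it, so
/// the reverse step reduces to "one or two digits right before `sep`".
/// (A real implementation would move on to later candidates; omitted here.)
fn inner_first(h: &[u8]) -> Option<(usize, usize)> {
    let sep = h.iter().position(|&b| is_sep(b))?;
    let start = (sep.saturating_sub(2)..sep)
        .find(|&s| h[s..sep].iter().all(u8::is_ascii_digit))?;
    match_at(h, start).map(|e| (start, e))
}

fn main() {
    let h = b"888:77:66";
    println!("leftmost:    {:?}", leftmost(h));
    println!("inner first: {:?}", inner_first(h));
}
```

On `888:77:66` the exhaustive scan yields `(0, 9)` while the imitation yields `(1, 9)`, the same pair of offsets quoted in the message. The start found by walking backwards is valid on its own, yet it is not the leftmost one.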
1 parent bbf0b38 commit 73f7889

2 files changed: 64 additions, 0 deletions

regex-automata/src/meta/limited.rs

Lines changed: 47 additions & 0 deletions
@@ -88,7 +88,41 @@ pub(crate) fn dfa_try_search_half_rev(
             return Err(RetryError::Quadratic(RetryQuadraticError::new()));
         }
     }
+    let was_dead = dfa.is_dead_state(sid);
     dfa_eoi_rev(dfa, input, &mut sid, &mut mat)?;
+    // If we reach the beginning of the search and we could otherwise still
+    // potentially keep matching if there was more to match, then we actually
+    // return an error to indicate giving up on this optimization. Why? Because
+    // we can't prove that the real match begins at where we would report it.
+    //
+    // This only happens when all of the following are true:
+    //
+    // 1) We reach the starting point of our search span.
+    // 2) The match we found is before the starting point.
+    // 3) The FSM reports we could possibly find a longer match.
+    //
+    // We need (1) because otherwise the search stopped before the starting
+    // point and there is no possible way to find a more leftmost position.
+    //
+    // We need (2) because if the match found has an offset equal to the minimum
+    // possible offset, then there is no possible more leftmost match.
+    //
+    // We need (3) because if the FSM couldn't continue anyway (i.e., it's in
+    // a dead state), then we know we couldn't find anything more leftmost
+    // than what we have. (We have to check the state we were in prior to the
+    // EOI transition since the EOI transition will usually bring us to a dead
+    // state by virtue of it represents the end-of-input.)
+    if at == input.start()
+        && mat.map_or(false, |m| m.offset() > input.start())
+        && !was_dead
+    {
+        trace!(
+            "reached beginning of search at offset {} without hitting \
+             a dead state, quitting to avoid potential false positive match",
+            at,
+        );
+        return Err(RetryError::Quadratic(RetryQuadraticError::new()));
+    }
     Ok(mat)
 }
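
The three conditions in the comment above can be restated as a small predicate. This is only a sketch: `should_bail` and its plain `usize`/`bool` parameters are hypothetical stand-ins for the `at`, `input.start()`, `mat` and `was_dead` values in the diff, not an API of regex-automata. Note that the code expresses condition (2) as the match offset being greater than the start of the span.

```rust
/// Returns true when a reverse search should give up because it cannot prove
/// that the start offset it found is the leftmost one.
///
/// All parameters are hypothetical stand-ins for values in the real routine:
/// `at` is where the reverse scan stopped, `span_start` is the start of the
/// search span, `match_start` is the start offset found so far (if any), and
/// `was_dead` says whether the automaton was in a dead state before the
/// end-of-input transition.
fn should_bail(
    at: usize,
    span_start: usize,
    match_start: Option<usize>,
    was_dead: bool,
) -> bool {
    at == span_start
        && match_start.map_or(false, |m| m > span_start)
        && !was_dead
}

fn main() {
    // Modeled on the commit message's example: the scan reached offset 0,
    // the start found was offset 1, and the automaton was still alive.
    assert!(should_bail(0, 0, Some(1), false));
    // A match that starts at the span start cannot be beaten: keep it.
    assert!(!should_bail(0, 0, Some(0), false));
    // A dead automaton cannot extend the match further left: keep it.
    assert!(!should_bail(0, 0, Some(1), true));
    // No match at all: there is nothing to second-guess.
    assert!(!should_bail(0, 0, None, false));
    // The scan stopped before reaching the span start: keep the result.
    assert!(!should_bail(2, 0, Some(3), false));
}
```

Each of the last four assertions drops one of the conditions and shows that the result is then trusted. Only the combination of all three forces the fallback to the slower path.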

@@ -140,7 +174,20 @@ pub(crate) fn hybrid_try_search_half_rev(
             return Err(RetryError::Quadratic(RetryQuadraticError::new()));
         }
     }
+    let was_dead = sid.is_dead();
     hybrid_eoi_rev(dfa, cache, input, &mut sid, &mut mat)?;
+    // See the comments in the full DFA routine above for why we need this.
+    if at == input.start()
+        && mat.map_or(false, |m| m.offset() > input.start())
+        && !was_dead
+    {
+        trace!(
+            "reached beginning of search at offset {} without hitting \
+             a dead state, quitting to avoid potential false positive match",
+            at,
+        );
+        return Err(RetryError::Quadratic(RetryQuadraticError::new()));
+    }
     Ok(mat)
 }

testdata/regression.toml

Lines changed: 17 additions & 0 deletions
@@ -739,3 +739,20 @@ matches = [[0, 9]]
 utf8 = false
 match-kind = "all"
 search-kind = "overlapping"
+
+# See: https://github.com/rust-lang/regex/issues/1060
+[[test]]
+name = "reverse-inner-plus-shorter-than-expected"
+regex = '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})'
+haystack = '102:12:39'
+matches = [[[0, 9], [0, 3], [4, 6], [7, 9]]]
+
+# Like reverse-inner-plus-shorter-than-expected, but using a far simpler regex
+# to demonstrate the extent of the rot. Sigh.
+#
+# See: https://github.com/rust-lang/regex/issues/1060
+[[test]]
+name = "reverse-inner-short"
+regex = '(?:([0-9][0-9][0-9]):)?([0-9][0-9]):([0-9][0-9])'
+haystack = '102:12:39'
+matches = [[[0, 9], [0, 3], [4, 6], [7, 9]]]
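
The `matches` entries above list the overall span followed by one span per capture group. As a cross-check of the expected values for the `reverse-inner-short` case, here is a minimal sketch, assuming a hand-written matcher for that one fixed-width pattern. `digits`, `tail` and `find` are hypothetical helpers, not the test harness or the regex crate.

```rust
type Span = (usize, usize);

/// True when `h[i..i + n]` exists and consists of ASCII digits only.
fn digits(h: &[u8], i: usize, n: usize) -> bool {
    i + n <= h.len() && h[i..i + n].iter().all(u8::is_ascii_digit)
}

/// Matches `([0-9][0-9]):([0-9][0-9])` anchored at `i` and returns the spans
/// of its two groups.
fn tail(h: &[u8], i: usize) -> Option<(Span, Span)> {
    if digits(h, i, 2) && h.get(i + 2) == Some(&b':') && digits(h, i + 3, 2) {
        Some(((i, i + 2), (i + 3, i + 5)))
    } else {
        None
    }
}

/// Leftmost match of `(?:([0-9][0-9][0-9]):)?([0-9][0-9]):([0-9][0-9])`:
/// the overall span plus the three group spans (group 1 is optional).
fn find(h: &[u8]) -> Option<(Span, Option<Span>, Span, Span)> {
    for s in 0..=h.len() {
        // Prefer the optional `([0-9][0-9][0-9]):` prefix when it matches.
        if digits(h, s, 3) && h.get(s + 3) == Some(&b':') {
            if let Some((g2, g3)) = tail(h, s + 4) {
                return Some(((s, g3.1), Some((s, s + 3)), g2, g3));
            }
        }
        // Otherwise try the pattern with the optional prefix skipped.
        if let Some((g2, g3)) = tail(h, s) {
            return Some(((s, g3.1), None, g2, g3));
        }
    }
    None
}

fn main() {
    println!("{:?}", find(b"102:12:39"));
}
```

For `102:12:39` this yields the overall span `(0, 9)` with groups `(0, 3)`, `(4, 6)` and `(7, 9)`, which agrees with the `matches` value in both test cases.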
