Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upimplement backwards regex matching #96
Comments
BurntSushi
self-assigned this
Jun 25, 2015
BurntSushi
added
the
enhancement
label
Jun 25, 2015
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Jul 1, 2015
Member
I've been noodling on this for the past week. Supporting suffixes seems somewhat straight-forward assuming we get backwards search working. However, supporting the string-in-the-middle optimization is harder. In particular, we want to discover the set of all literal substrings such that at least one of them must appear in the search text in order for the regex to match. rsc gives an algebraic formulation here. But we need to go further.
Say we have our set of literal substrings and we build an automaton to match them (via AC). So we run it on the search text and find a match. Now what? In grep, it does line-based matching, so it just needs to extend in both directions until it finds a new line character or beginning/end of input. But in the general case, this can't be done. However, if we knew where the required literal started in the regex instruction sequence, then we could start the machine at the substring match and run it both forwards and backwards (ignoring capture groups). This would give us the start & end locations, after which, we could run the full NFA on just that substring.
rsc's algebraic formulation doesn't give us this information because it is algebraic. We instead need to perform the same operation on the regex instruction sequence, which forms a graph. Doing it this way seems a little simpler because we just need to find all paths from the start of the graph to the match state, then for each path, concatenate the literals. Unioning all of the concatenations should yield the set of required literals. (This process will of course need to be bounded, particularly around character classes!)
|
I've been noodling on this for the past week. Supporting suffixes seems somewhat straight-forward assuming we get backwards search working. However, supporting the string-in-the-middle optimization is harder. In particular, we want to discover the set of all literal substrings such that at least one of them must appear in the search text in order for the regex to match. rsc gives an algebraic formulation here. But we need to go further. Say we have our set of literal substrings and we build an automaton to match them (via AC). So we run it on the search text and find a match. Now what? In rsc's algebraic formulation doesn't give us this information because it is algebraic. We instead need to perform the same operation on the regex instruction sequence, which forms a graph. Doing it this way seems a little simpler because we just need to find all paths from the start of the graph to the match state, then for each path, concatenate the literals. Unioning all of the concatenations should yield the set of required literals. (This process will of course need to be bounded, particularly around character classes!) |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
WillEngler
Jul 1, 2015
@BurntSushi Sounds promising! I haven't read the codesearch code or rsc's Trigram Index post deeply enough to follow why the algebraic formulation can't give start and end locations, but it does seem like the solution you're outlining is simpler.
One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.
WillEngler
commented
Jul 1, 2015
|
@BurntSushi Sounds promising! I haven't read the codesearch code or rsc's Trigram Index post deeply enough to follow why the algebraic formulation can't give start and end locations, but it does seem like the solution you're outlining is simpler. One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Jul 1, 2015
Member
The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).
One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.
Yeah, that's right. The key is that the Ranges instruction is superfluous in the current instruction set. It could technically be implemented in terms of Split instructions, but it is much more efficient to define a separate instruction for binary search. So in my formulation here, I have to treat ranges as splits in the graph. e.g., a[cd]z should produce the same literals as a(c|d)z: acz and adz.
The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n). The idea I have handles this just fine, but obviously it needs to be bounded! I think this is where byte ranges might need to be handle specially, but it shouldn't be too bad. (i.e., You probably shouldn't cut the literal in the middle of an encoded codepoint!)
|
The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).
Yeah, that's right. The key is that the The part I still have to iron out is how to bound it. e.g., |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Jul 1, 2015
Member
(When I understand my approach better, I'll write out some simpler rules. But I think I have to play with code first.)
|
(When I understand my approach better, I'll write out some simpler rules. But I think I have to play with code first.) |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
WillEngler
Jul 1, 2015
The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).
Thanks, that clears things up.
The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n).
Ah, interesting! I'm curious to see how you'll handle it.
WillEngler
commented
Jul 1, 2015
Thanks, that clears things up.
Ah, interesting! I'm curious to see how you'll handle it. |
BurntSushi
removed their assignment
Sep 21, 2015
added a commit
that referenced
this issue
Mar 27, 2016
added a commit
that referenced
this issue
Mar 27, 2016
added a commit
that referenced
this issue
Mar 28, 2016
BurntSushi
closed this
in
#190
Mar 28, 2016
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 28, 2016
Member
This was done in 31a317e. It isn't quite ambitious as what I had in mind, but it does indeed scan for suffix literals, and if one is found, runs the DFA in reverse to find the start location of the match.
|
This was done in 31a317e. It isn't quite ambitious as what I had in mind, but it does indeed scan for suffix literals, and if one is found, runs the DFA in reverse to find the start location of the match. |
BurntSushi commentedJun 25, 2015
With @WillEngler working on the DFA, I'm itching to do more. One idea I got from
grep(and RE2) is the capability of running automata backwards over the input.For example, today, we are very good at literal prefix matching because the prefixes get compiled down to a DFA. But what if the regex has a literal suffix? We don't do anything clever. What if we had the ability to run the engines backwards? We could identify suffixes just like prefixes, and then run the engine backwards from the suffix.
The possibilities get even better. What if the regex has neither a literal prefix nor a literal suffix, but has a string somewhere in the middle that must be matched? e.g.,
\w+foo\w+. We could use a fast substring search to findfooand then, in theory, run the engine both backwards and forwards to find the start and ending locations.Finally, we'll want machine reversal to make the most out of a DFA.