Add more backtracking control verbs to regex engine (?CUT), (?ERROR)
Message-ID: <9b18b3110611020335h7ea469a8g28ca483f6832816d@mail.gmail.com>

p4raw-id: //depot/perl@29189
demerphq authored and rgs committed Nov 2, 2006
1 parent 68d3ba5 commit 24b23f3
Showing 13 changed files with 352 additions and 154 deletions.
6 changes: 3 additions & 3 deletions embed.fnc
@@ -1357,9 +1357,9 @@ Es |U8 |regtail_study |NN struct RExC_state_t *state|NN regnode *p|NN const regn
#endif

#if defined(PERL_IN_REGEXEC_C) || defined(PERL_DECL_PROT)
-ERs |I32 |regmatch |NN const regmatch_info *reginfo|NN regnode *prog
+ERs |I32 |regmatch |NN regmatch_info *reginfo|NN regnode *prog
ERs |I32 |regrepeat |NN const regexp *prog|NN const regnode *p|I32 max
-ERs |I32 |regtry |NN const regmatch_info *reginfo|NN char *startpos
+ERs |I32 |regtry |NN regmatch_info *reginfo|NN char **startpos
ERs |bool |reginclass |NULLOK const regexp *prog|NN const regnode *n|NN const U8 *p|NULLOK STRLEN *lenp\
|bool do_utf8sv_is_utf8
Es |CHECKPOINT|regcppush |I32 parenfloor
@@ -1369,7 +1369,7 @@ ERsn |U8* |reghop3 |NN U8 *pos|I32 off|NN const U8 *lim
ERsn |U8* |reghop4 |NN U8 *pos|I32 off|NN const U8 *llim|NN const U8 *rlim
#endif
ERsn |U8* |reghopmaybe3 |NN U8 *pos|I32 off|NN const U8 *lim
-ERs |char* |find_byclass |NN regexp * prog|NN const regnode *c|NN char *s|NN const char *strend|NULLOK const regmatch_info *reginfo
+ERs |char* |find_byclass |NN regexp * prog|NN const regnode *c|NN char *s|NN const char *strend|NULLOK regmatch_info *reginfo
Es |void |to_utf8_substr |NN regexp * prog
Es |void |to_byte_substr |NN regexp * prog
ERs |I32 |reg_check_named_buff_matched |NN const regexp *rex|NN const regnode *prog
51 changes: 34 additions & 17 deletions ext/re/re.pm
@@ -95,24 +95,14 @@ Turns on debug output related to the process of parsing the pattern.
Enables output related to the optimisation phase of compilation.
-=item TRIE_COMPILE
+=item TRIEC
Detailed info about trie compilation.
=item DUMP
Dump the final program out after it is compiled and optimised.
-=item OFFSETS
-Dump offset information. This can be used to see how regops correlate
-to the pattern. Output format is
-NODENUM:POSITION[LENGTH]
-Where 1 is the position of the first char in the string. Note that position
-can be 0, or larger than the actual length of the pattern, likewise length
-can be zero.
=back
@@ -128,7 +118,7 @@ Turns on all execute related debug options.
Turns on debugging of the main matching loop.
-=item TRIE_EXECUTE
+=item TRIEE
Extra debugging of how tries execute.
@@ -146,12 +136,38 @@ Enable debugging of start point optimisations.
Turns on all "extra" debugging options.
-=item TRIE_MORE
+=item TRIEM
+Enable enhanced TRIE debugging. Enhances both TRIEE
+and TRIEC.
+=item STATE
+Enable debugging of states in the engine.
+=item STACK
-Enable enhanced TRIE debugging. Enhances both TRIE_EXECUTE
-and TRIE_COMPILE.
+Enable debugging of the recursion stack in the engine. Enabling
+or disabling this option automatically does the same for debugging
+states as well. The output from this can be quite large.
+=item OPTIMISEM
+Enable enhanced optimisation debugging and start point optimisations.
+Probably not useful except when debugging the regex engine itself.
+=item OFFSETS
+Dump offset information. This can be used to see how regops correlate
+to the pattern. Output format is
+NODENUM:POSITION[LENGTH]
+Where 1 is the position of the first char in the string. Note that position
+can be 0, or larger than the actual length of the pattern, likewise length
+can be zero.
-=item OFFSETS_DEBUG
+=item OFFSETSDBG
Enable debugging of offsets information. This emits copious
amounts of trace information and doesn't mesh well with other
@@ -182,7 +198,7 @@ Enable DUMP and all execute options. Equivalent to:
=item More
-Enable TRIE_MORE and all execute compile and execute options.
+Enable TRIEM and all compile and execute options.
=back
@@ -239,6 +255,7 @@ my %flags = (
OFFSETSDBG => 0x040000,
STATE => 0x080000,
OPTIMISEM => 0x100000,
+STACK => 0x280000,
);
$flags{ALL} = -1;
$flags{All} = $flags{all} = $flags{DUMP} | $flags{EXECUTE};
6 changes: 6 additions & 0 deletions pod/perl595delta.pod
@@ -104,6 +104,12 @@ similar to non-greedy matching, except instead of using a '?' as the modifier
the '+' is used. Thus C<?+>, C<*+>, C<++>, C<{min,max}+> are now legal
quantifiers. (Yves Orton)

+=item Backtracking control verbs
+
+The regex engine now supports a number of special-purpose backtracking
+control verbs: (?COMMIT), (?CUT), (?ERROR) and (?FAIL). See L<perlre>
+for their descriptions.
+
=back

=head2 The C<_> prototype
42 changes: 42 additions & 0 deletions pod/perlre.pod
@@ -1094,6 +1094,48 @@ Any number of C<(?COMMIT)> assertions may be used in a pattern.
See also C<< (?>pattern) >> and possessive quantifiers for other
ways to control backtracking.

+=item C<(?CUT)>
+X<(?CUT)>
+
+This zero-width pattern is similar to C<(?COMMIT)>, except that on
+failure it also signifies that whatever text was matched leading
+up to the C<(?CUT)> pattern cannot match, I<even from another
+starting point>.
+
+Compare the following to the examples in C<(?COMMIT)>; note that the
+string is twice as long:
+
+'aaabaaab'=~/a+b?(?CUT)(?{print "$&\n"; $count++})(?FAIL)/;
+print "Count=$count\n";
+
+outputs
+
+aaab
+aaab
+Count=2
+
+Once the 'aaab' at the start of the string has matched and the C<(?CUT)>
+has executed, the next start point will be where the cursor was when the
+C<(?CUT)> was executed.
+
+=item C<(?ERROR)>
+X<(?ERROR)>
+
+This zero-width pattern is similar to C<(?CUT)> except that it causes
+the match to fail outright. No further attempts to match will be made.
+
+'aaabaaab'=~/a+b?(?ERROR)(?{print "$&\n"; $count++})(?FAIL)/;
+print "Count=$count\n";
+
+outputs
+
+aaab
+Count=1
+
+In other words, once the C<(?ERROR)> has been entered and the pattern
+then fails to match, the regex engine will not attempt any further
+matching on the rest of the string.
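
Reviewer annotation (not part of the commit): the three verbs documented in this hunk differ only in what happens to the next start position once backtracking reaches the verb. The C sketch below is not the engine's code. It hard-wires the example pattern (a+b? followed by a verb, a recording code block and (?FAIL)) and models the outer start-position loop. The names `verb`, `VERB_*` and `simulate` are hypothetical, and the COMMIT branch assumes that only the current start position is abandoned.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for this sketch only; they are not the engine's regops. */
enum verb { VERB_COMMIT, VERB_CUT, VERB_ERROR };

/*
 * Simulate  /a+b?(?VERB)(?{ record $& })(?FAIL)/  against s.
 * Each time the verb is reached, the text matched so far is appended to
 * out, one entry per line. Returns how many times the verb was reached.
 */
static int simulate(const char *s, enum verb v, char *out, size_t outsz)
{
    size_t len, start = 0;
    int count = 0;

    assert(s != NULL && out != NULL && outsz > 0);
    len = strlen(s);
    out[0] = '\0';

    while (start < len) {
        size_t pos = start;
        size_t used;

        /* a+ : greedy, needs at least one 'a' */
        while (pos < len && s[pos] == 'a')
            pos++;
        if (pos == start) {      /* a+ failed here: ordinary bump-along */
            start++;
            continue;
        }
        /* b? : greedy, optional */
        if (pos < len && s[pos] == 'b')
            pos++;

        /* The verb has been passed; the code block records $& */
        used = strlen(out);
        snprintf(out + used, outsz - used, "%.*s\n",
                 (int)(pos - start), s + start);
        count++;

        /* (?FAIL) now forces backtracking into the verb. */
        if (v == VERB_ERROR)
            break;               /* give up on the whole string */
        else if (v == VERB_CUT)
            start = pos;         /* restart where the cursor was */
        else
            start++;             /* assumed COMMIT: drop this start only */
    }
    return count;
}
```

Called with "aaabaaab", the sketch records aaab twice under `VERB_CUT` and once under `VERB_ERROR`, which agrees with the Count=2 and Count=1 outputs quoted in the documentation above.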

=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

6 changes: 3 additions & 3 deletions proto.h
@@ -3691,7 +3691,7 @@ STATIC U8 S_regtail_study(pTHX_ struct RExC_state_t *state, regnode *p, const re
#endif

#if defined(PERL_IN_REGEXEC_C) || defined(PERL_DECL_PROT)
-STATIC I32 S_regmatch(pTHX_ const regmatch_info *reginfo, regnode *prog)
+STATIC I32 S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
__attribute__warn_unused_result__
__attribute__nonnull__(pTHX_1)
__attribute__nonnull__(pTHX_2);
@@ -3701,7 +3701,7 @@ STATIC I32 S_regrepeat(pTHX_ const regexp *prog, const regnode *p, I32 max)
__attribute__nonnull__(pTHX_1)
__attribute__nonnull__(pTHX_2);

-STATIC I32 S_regtry(pTHX_ const regmatch_info *reginfo, char *startpos)
+STATIC I32 S_regtry(pTHX_ regmatch_info *reginfo, char **startpos)
__attribute__warn_unused_result__
__attribute__nonnull__(pTHX_1)
__attribute__nonnull__(pTHX_2);
@@ -3733,7 +3733,7 @@ STATIC U8* S_reghopmaybe3(U8 *pos, I32 off, const U8 *lim)
__attribute__nonnull__(1)
__attribute__nonnull__(3);

-STATIC char* S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, const char *strend, const regmatch_info *reginfo)
+STATIC char* S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, const char *strend, regmatch_info *reginfo)
__attribute__warn_unused_result__
__attribute__nonnull__(pTHX_1)
__attribute__nonnull__(pTHX_2)
26 changes: 24 additions & 2 deletions regcomp.c
@@ -4717,7 +4717,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
case ':': /* (?:...) */
case '>': /* (?>...) */
break;
-case 'C':
+case 'C': /* (?CUT) and (?COMMIT) */
if (RExC_parse[0] == 'O' &&
RExC_parse[1] == 'M' &&
RExC_parse[2] == 'M' &&
@@ -4727,12 +4727,34 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
{
RExC_parse+=5;
ret = reg_node(pRExC_state, COMMIT);
+} else if (
+RExC_parse[0] == 'U' &&
+RExC_parse[1] == 'T' &&
+RExC_parse[2] == ')')
+{
+RExC_parse+=2;
+ret = reg_node(pRExC_state, CUT);
} else {
vFAIL("Sequence (?C... not terminated");
}
nextchar(pRExC_state);
return ret;
break;
+case 'E': /* (?ERROR) */
+if (RExC_parse[0] == 'R' &&
+RExC_parse[1] == 'R' &&
+RExC_parse[2] == 'O' &&
+RExC_parse[3] == 'R' &&
+RExC_parse[4] == ')')
+{
+RExC_parse+=4;
+ret = reg_node(pRExC_state, OPERROR);
+} else {
+vFAIL("Sequence (?E... not terminated");
+}
+nextchar(pRExC_state);
+return ret;
+break;
case 'F':
if (RExC_parse[0] == 'A' &&
RExC_parse[1] == 'I' &&
@@ -8669,7 +8691,7 @@ S_dumpuntil(pTHX_ const regexp *r, const regnode *start, const regnode *node,
(dist ? this_trie + dist : next) - start);
if (dist) {
if (!nextbranch)
-nextbranch = this_trie + trie->jump[0];
+nextbranch= this_trie + trie->jump[0];
DUMPUNTIL(this_trie + dist, nextbranch);
}
if (nextbranch && PL_regkind[OP(nextbranch)]==BRANCH)
4 changes: 4 additions & 0 deletions regcomp.h
@@ -624,6 +624,8 @@ re.pm, especially to the documentation.
#define RE_DEBUG_EXTRA_OFFDEBUG 0x040000
#define RE_DEBUG_EXTRA_STATE 0x080000
#define RE_DEBUG_EXTRA_OPTIMISE 0x100000
+/* combined */
+#define RE_DEBUG_EXTRA_STACK 0x280000

#define RE_DEBUG_FLAG(x) (re_debug_flags & x)
/* Compile */
@@ -657,6 +659,8 @@ re.pm, especially to the documentation.
if (re_debug_flags & RE_DEBUG_EXTRA_OFFSETS) x )
#define DEBUG_STATE_r(x) DEBUG_r( \
if (re_debug_flags & RE_DEBUG_EXTRA_STATE) x )
+#define DEBUG_STACK_r(x) DEBUG_r( \
+if (re_debug_flags & RE_DEBUG_EXTRA_STACK) x )
#define DEBUG_OPTIMISE_MORE_r(x) DEBUG_r( \
if ((RE_DEBUG_EXTRA_OPTIMISE|RE_DEBUG_COMPILE_OPTIMISE) == \
(re_debug_flags & (RE_DEBUG_EXTRA_OPTIMISE|RE_DEBUG_COMPILE_OPTIMISE)) ) x )
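
Reviewer annotation (not part of the commit): RE_DEBUG_EXTRA_STACK is marked "combined" because 0x280000 is the STATE bit (0x080000) plus one further bit (0x200000). The C sketch below illustrates what that implies for bitwise enabling, disabling and testing. The helper names are hypothetical, and it is an assumption that flags are switched on with OR and off with AND-NOT; only the two constants and the "any bit set" test come from the diff.

```c
#include <assert.h>

/* Same values as in the regcomp.h hunk; the UL suffix is added here. */
#define RE_DEBUG_EXTRA_STATE 0x080000UL
#define RE_DEBUG_EXTRA_STACK 0x280000UL

/* Hypothetical helpers: assumed OR to enable, AND-NOT to disable. */
static unsigned long flag_enable(unsigned long flags, unsigned long mask)
{
    return flags | mask;
}

static unsigned long flag_disable(unsigned long flags, unsigned long mask)
{
    return flags & ~mask;
}

/* Same shape as the DEBUG_*_r tests: true if any bit of mask is set. */
static int flag_any(unsigned long flags, unsigned long mask)
{
    return (flags & mask) != 0;
}

/* Small self-check that doubles as a usage example. */
static void flags_self_check(void)
{
    unsigned long f = flag_enable(0UL, RE_DEBUG_EXTRA_STACK);

    /* Enabling STACK also enables STATE. */
    assert(flag_any(f, RE_DEBUG_EXTRA_STATE));

    /* Disabling STACK also clears a previously set STATE bit. */
    f = flag_disable(flag_enable(0UL, RE_DEBUG_EXTRA_STATE),
                     RE_DEBUG_EXTRA_STACK);
    assert(!flag_any(f, RE_DEBUG_EXTRA_STATE));
}
```

One consequence worth checking in review: because the test is "any bit set", a flags value with only STATE enabled also satisfies the STACK test under this model.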
27 changes: 18 additions & 9 deletions regcomp.pl
@@ -48,7 +48,7 @@ BEGIN
 $ind++;
 $name[$ind]="$real$suffix";
 $type[$ind]=$type;
-$rest[$ind]="Regmatch state for $type";
+$rest[$ind]="state for $type";
 }
 }
 }
@@ -92,13 +92,16 @@ BEGIN
 -$width, REGMATCH_STATE_MAX => $tot - 1
 ;

-$ind = 0;
-while (++$ind <= $tot) {

+for ($ind=1; $ind <= $lastregop ; $ind++) {
+my $oind = $ind - 1;
 printf OUT "#define\t%*s\t%d\t/* %#04x %s */\n",
 -$width, $name[$ind], $ind-1, $ind-1, $rest[$ind];
-print OUT "\n\t/* ------------ States ------------- */\n\n"
-if $ind == $lastregop and $lastregop != $tot;
 }
+print OUT "\t/* ------------ States ------------- */\n";
+for ( ; $ind <= $tot ; $ind++) {
+printf OUT "#define\t%*s\t(REGNODE_MAX + %d)\t/* %s */\n",
+-$width, $name[$ind], $ind - $lastregop, $rest[$ind];
+}

 print OUT <<EOP;
@@ -164,13 +167,19 @@ BEGIN
 EOP

 $ind = 0;
+my $ofs = 1;
+my $sym = "";
 while (++$ind <= $tot) {
 my $size = $longj[$ind] || 0;

-printf OUT "\t%*s\t/* %#04x */\n",
--3-$width,qq("$name[$ind]",),$ind-1;
-print OUT "\t/* ------------ States ------------- */\n"
-if $ind == $lastregop and $lastregop != $tot;
+printf OUT "\t%*s\t/* $sym%#04x */\n",
+-3-$width,qq("$name[$ind]",), $ind - $ofs;
+if ($ind == $lastregop and $lastregop != $tot) {
+print OUT "\t/* ------------ States ------------- */\n";
+$ofs = $lastregop;
+$sym = 'REGNODE_MAX +';
+}
+
 }

 print OUT <<EOP;
7 changes: 5 additions & 2 deletions regcomp.sym
@@ -170,7 +170,9 @@ DEFINEP DEFINEP, none 1 Never execute directly.

#*Bactracking
OPFAIL OPFAIL, none Same as (?!)
-COMMIT COMMIT, node Pattern fails if backtracking through this
+COMMIT COMMIT, none Pattern fails if backtracking through this
+CUT COMMIT, none ... and restarts at the cursor point
+OPERROR OPERROR,none Pattern fails outright if backtracking through this

# NEW STUFF ABOVE THIS LINE -- Please update counts below.

@@ -207,4 +209,5 @@ BRANCH next:FAIL
CURLYM A,B:FAIL
IFMATCH A:FAIL
CURLY B_min_known,B_min,B_max:FAIL
-COMMIT next:FAIL
+COMMIT next:FAIL
