# regcomp.sym
#
# File has two sections, divided by a line of dashes '-'.
#
# Lines beginning with # are ignored, except for those that start with #*
# which are included in pod/perldebguts.pod. # within a line may be part
# of a description.
#
# First section is for regops, second section is for regmatch-states
#
# Note that the order in this file is important.
#
# Format for first section:
# NAME \s+ TYPE, arg-description [num-args] [flags] [longjump-len] ; DESCRIPTION
# flag <S> means the regop is REGNODE_SIMPLE; flag <V> means it is REGNODE_VARIES
#
#
# run perl regen.pl after editing this file
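#
# The first-section format described above can be illustrated with a small
# parser. This is only a sketch in Python, not the logic regen.pl actually
# uses: the names Regop and parse_regop_line are hypothetical, and the
# optional fields are kept as raw strings because values such as "2L" or
# "charclass" also appear in those positions.

```python
import re
from typing import NamedTuple, Optional


class Regop(NamedTuple):
    """One first-section entry; optional fields are None when absent."""
    name: str
    type: Optional[str]
    arg_desc: Optional[str]
    num_args: Optional[str]
    flags: Optional[str]
    longjump_len: Optional[str]
    description: str


def parse_regop_line(line: str) -> Optional[Regop]:
    """Parse one line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line.strip() or line.lstrip().startswith("#"):
        return None
    # Everything after the first ';' is the free-text description.
    head, sep, description = line.partition(";")
    if not sep:
        raise ValueError(f"no ';' separator in: {line!r}")
    parts = head.split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"missing TYPE field in: {line!r}")
    name, rest = parts
    # TYPE is followed by a comma; the remaining fields are space separated.
    fields = [f for f in re.split(r"[,\s]+", rest.strip()) if f]
    fields += [None] * (5 - len(fields))
    type_, arg_desc, num_args, flags, longjump_len = fields[:5]
    return Regop(name, type_, arg_desc, num_args, flags, longjump_len,
                 description.strip())


op = parse_regop_line(
    "IFMATCH BRANCHJ, off 1 . 2 ; Succeeds if the following matches.")
print(op.name, op.type, op.arg_desc, op.num_args, op.flags, op.longjump_len)
```

# Entries that omit the trailing fields, such as END, come back with None
# in those positions.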



#* Exit points

END END, no ; End of program.
SUCCEED END, no ; Return from a subroutine, basically.

#* Anchors:

BOL BOL, no ; Match "" at beginning of line.
MBOL BOL, no ; Same, assuming multiline.
SBOL BOL, no ; Same, assuming singleline.
EOS EOL, no ; Match "" at end of string.
EOL EOL, no ; Match "" at end of line.
MEOL EOL, no ; Same, assuming multiline.
SEOL EOL, no ; Same, assuming singleline.
# The regops that have varieties that vary depending on the character set regex
# modifiers have to be ordered thusly: /d, /l, /u, /a, /aa. This is because code
# in regcomp.c uses the enum value of the modifier as an offset from the /d
# version. The complements must come after the non-complements.
# BOUND, POSIX and their complements are affected, as well as EXACTF.
BOUND BOUND, no ; Match "" at any word boundary using native charset semantics for non-utf8
BOUNDL BOUND, no ; Match "" at any locale word boundary
BOUNDU BOUND, no ; Match "" at any word boundary using Unicode semantics
BOUNDA BOUND, no ; Match "" at any word boundary using ASCII semantics
# All NBOUND nodes are required by code in regexec.c to be greater than all BOUND ones
NBOUND NBOUND, no ; Match "" at any word non-boundary using native charset semantics for non-utf8
NBOUNDL NBOUND, no ; Match "" at any locale word non-boundary
NBOUNDU NBOUND, no ; Match "" at any word non-boundary using Unicode semantics
NBOUNDA NBOUND, no ; Match "" at any word non-boundary using ASCII semantics
GPOS GPOS, no ; Matches where last m//g left off.
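
# The ordering comment above says a variant is found by adding the
# modifier's enum value to the /d version. A minimal Python sketch of that
# arithmetic follows. The opcode numbers and the MODIFIER_OFFSET table are
# hypothetical stand-ins (real opcodes are assigned by the regen script),
# and /aa is left out because this file lists no separate /aa variant for
# BOUND.

```python
# Hypothetical offsets, following the /d, /l, /u, /a order stated above.
MODIFIER_OFFSET = {"d": 0, "l": 1, "u": 2, "a": 3}

# The BOUND family in the order it appears in this file.
REGOPS_IN_FILE_ORDER = [
    "BOUND", "BOUNDL", "BOUNDU", "BOUNDA",
    "NBOUND", "NBOUNDL", "NBOUNDU", "NBOUNDA",
]

# Illustrative opcode numbers: simply the position in the list.
OPCODE = {name: i for i, name in enumerate(REGOPS_IN_FILE_ORDER)}


def variant(base: str, modifier: str) -> str:
    """Return the regop reached by offsetting from the /d version."""
    return REGOPS_IN_FILE_ORDER[OPCODE[base] + MODIFIER_OFFSET[modifier]]


print(variant("BOUND", "u"))    # BOUNDU
print(variant("NBOUND", "l"))   # NBOUNDL

# Every complement must sort after every non-complement.
bounds = [n for n in REGOPS_IN_FILE_ORDER if not n.startswith("N")]
nbounds = [n for n in REGOPS_IN_FILE_ORDER if n.startswith("N")]
print(min(OPCODE[n] for n in nbounds) > max(OPCODE[b] for b in bounds))  # True
```

# Reordering the entries would silently change which regop each offset
# selects, which is why the order in this file must be preserved.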

#* [Special] alternatives:

REG_ANY REG_ANY, no 0 S ; Match any one character (except newline).
SANY REG_ANY, no 0 S ; Match any one character.
CANY REG_ANY, no 0 S ; Match any one byte.
ANYOF ANYOF, sv 0 S ; Match character in (or not in) this class, single char match only
ANYOF_WARN_SUPER ANYOF, sv 0 S ; Match character in (or not in) this class, warn (if enabled) upon matching a char above Unicode max;
ANYOF_SYNTHETIC ANYOF, sv 0 S ; Synthetic start class

# Order of the below is important. See ordering comment above.
POSIXD POSIXD, none 0 S ; Some [[:class:]] under /d; the FLAGS field gives which one
POSIXL POSIXD, none 0 S ; Some [[:class:]] under /l; the FLAGS field gives which one
POSIXU POSIXD, none 0 S ; Some [[:class:]] under /u; the FLAGS field gives which one
POSIXA POSIXD, none 0 S ; Some [[:class:]] under /a; the FLAGS field gives which one
NPOSIXD NPOSIXD, none 0 S ; complement of POSIXD, [[:^class:]]
NPOSIXL NPOSIXD, none 0 S ; complement of POSIXL, [[:^class:]]
NPOSIXU NPOSIXD, none 0 S ; complement of POSIXU, [[:^class:]]
NPOSIXA NPOSIXD, none 0 S ; complement of POSIXA, [[:^class:]]
# End of order is important

CLUMP CLUMP, no 0 V ; Match any extended grapheme cluster sequence

#* Alternation

#* BRANCH The set of branches constituting a single choice are
#* hooked together with their "next" pointers, since
#* precedence prevents anything being concatenated to
#* any individual branch. The "next" pointer of the last
#* BRANCH in a choice points to the thing following the
#* whole choice. This is also where the final "next"
#* pointer of each individual branch points; each branch
#* starts with the operand node of a BRANCH node.
#*
BRANCH BRANCH, node 0 V ; Match this alternative, or the next...

#*Back pointer

#* BACK Normal "next" pointers all implicitly point forward;
#* BACK exists to make loop structures possible.
#* not used
BACK BACK, no 0 V ; Match "", "next" ptr points backward.

#*Literals
# NOTE: the relative ordering of these types is important; do not change it.

EXACT EXACT, str ; Match this string (preceded by length).
EXACTF EXACT, str ; Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len).
EXACTFL EXACT, str ; Match this string (not guaranteed to be folded) using /il rules (w/len).
EXACTFU EXACT, str ; Match this string (folded iff in UTF-8, length in folding doesn't change if not in UTF-8) using /iu rules (w/len).
EXACTFA EXACT, str ; Match this string (not guaranteed to be folded) using /iaa rules (w/len).
EXACTFU_SS EXACT, str ; Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len).
EXACTFA_NO_TRIE EXACT, str ; Match this string (which is not trie-able; not guaranteed to be folded) using /iaa rules (w/len).

#*Do nothing types

NOTHING NOTHING, no ; Match empty string.
#*A variant of above which delimits a group, thus stops optimizations
TAIL NOTHING, no ; Match empty string. Can jump here from outside.

#*Loops

#* STAR,PLUS '?', and complex '*' and '+', are implemented as
#* circular BRANCH structures using BACK. Simple cases
#* (one character per match) are implemented with STAR
#* and PLUS for speed and to minimize recursive plunges.
#*
STAR STAR, node 0 V ; Match this (simple) thing 0 or more times.
PLUS PLUS, node 0 V ; Match this (simple) thing 1 or more times.

CURLY CURLY, sv 2 V ; Match this simple thing {n,m} times.
CURLYN CURLY, no 2 V ; Capture next-after-this simple thing
CURLYM CURLY, no 2 V ; Capture this medium-complex thing {n,m} times.
CURLYX CURLY, sv 2 V ; Match this complex thing {n,m} times.

#*This terminator creates a loop structure for CURLYX
WHILEM WHILEM, no 0 V ; Do curly processing and see if rest matches.

#*Buffer related

#*OPEN,CLOSE,GROUPP ...are numbered at compile time.
OPEN OPEN, num 1 ; Mark this point in input as start of #n.
CLOSE CLOSE, num 1 ; Analogous to OPEN.

REF REF, num 1 V ; Match some already matched string
REFF REF, num 1 V ; Match already matched string, folded using native charset semantics for non-utf8
REFFL REF, num 1 V ; Match already matched string, folded in loc.
# N?REFF[AU] could have been implemented using the FLAGS field of the
# regnode, but by having a separate node type, we can use the existing switch
# statement to avoid some tests
REFFU REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8
REFFA REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8, no mixing ASCII, non-ASCII

#*Named references. Code in regcomp.c assumes that these all are after
#*the numbered references
NREF REF, no-sv 1 V ; Match some already matched string
NREFF REF, no-sv 1 V ; Match already matched string, folded using native charset semantics for non-utf8
NREFFL REF, no-sv 1 V ; Match already matched string, folded in loc.
NREFFU REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8
NREFFA REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8, no mixing ASCII, non-ASCII

IFMATCH BRANCHJ, off 1 . 2 ; Succeeds if the following matches.
UNLESSM BRANCHJ, off 1 . 2 ; Fails if the following matches.
SUSPEND BRANCHJ, off 1 V 1 ; "Independent" sub-RE.
IFTHEN BRANCHJ, off 1 V 1 ; Switch, should be preceded by switcher.
GROUPP GROUPP, num 1 ; Whether the group matched.

#*Support for long RE

LONGJMP LONGJMP, off 1 . 1 ; Jump far away.
BRANCHJ BRANCHJ, off 1 V 1 ; BRANCH with long offset.

#*The heavy worker

EVAL EVAL, evl 1 ; Execute some Perl code.

#*Modifiers

MINMOD MINMOD, no ; Next operator is not greedy.
LOGICAL LOGICAL, no ; Next opcode should set the flag only.

#*This is not used yet
RENUM BRANCHJ, off 1 . 1 ; Group with independently numbered parens.

#*Trie Related

#* Behave the same as A|LIST|OF|WORDS would. The '..C' variants
#* have inline charclass data (ascii only), the 'C' store it in the
#* structure.
# NOTE: the relative order of the TRIE-like regops is significant

TRIE TRIE, trie 1 ; Match many EXACT(F[ALU]?)? at once. flags==type
TRIEC TRIE,trie charclass ; Same as TRIE, but with embedded charclass data

# For start classes, contains an added fail table.
AHOCORASICK TRIE, trie 1 ; Aho Corasick stclass. flags==type
AHOCORASICKC TRIE,trie charclass ; Same as AHOCORASICK, but with embedded charclass data

#*Regex Subroutines
GOSUB GOSUB, num/ofs 2L ; recurse to paren arg1 at (signed) ofs arg2
GOSTART GOSTART, no ; recurse to start of pattern

#*Special conditionals
NGROUPP NGROUPP, no-sv 1 ; Whether the group matched.
INSUBP INSUBP, num 1 ; Whether we are in a specific recurse.
DEFINEP DEFINEP, none 1 ; Never execute directly.

#*Backtracking Verbs
ENDLIKE ENDLIKE, none ; Used only for the type field of verbs
OPFAIL ENDLIKE, none ; Same as (?!)
ACCEPT ENDLIKE, parno 1 ; Accepts the current matched string.


#*Verbs With Arguments
VERB VERB, no-sv 1 ; Used only for the type field of verbs
PRUNE VERB, no-sv 1 ; Pattern fails at this startpoint if no-backtracking through this
MARKPOINT VERB, no-sv 1 ; Push the current location for rollback by cut.
SKIP VERB, no-sv 1 ; On failure skip forward (to the mark) before retrying
COMMIT VERB, no-sv 1 ; Pattern fails outright if backtracking through this
CUTGROUP VERB, no-sv 1 ; On failure go to the next alternation in the group

#*Control what to keep in $&.
KEEPS KEEPS, no ; $& begins here.

#*New charclass like patterns
LNBREAK LNBREAK, none ; generic newline pattern

# NEW STUFF SOMEWHERE ABOVE THIS LINE

################################################################################

#*SPECIAL REGOPS

#* This is not really a node, but an optimized away piece of a "long"
#* node. To simplify debugging output, we mark it as if it were a node
OPTIMIZED NOTHING, off ; Placeholder for dump.

#* Special opcode with the property that no opcode in a compiled program
#* will ever be of this type. Thus it can be used as a flag value that
#* no other opcode has been seen. END is used similarly, in that an END
#* node can't be optimized. So END implies "unoptimizable" and PSEUDO
#* means "not seen anything to optimize yet".
PSEUDO PSEUDO, off ; Pseudo opcode for internal use.

-------------------------------------------------------------------------------
# Format for second section:
# REGOP \t typelist [ \t typelist]
# typelist= namelist
# = namelist:FAIL
# = name:count

# Anything below is a state
#
#
TRIE next:FAIL
EVAL AB:FAIL
CURLYX end:FAIL
WHILEM A_pre,A_min,A_max,B_min,B_max:FAIL
BRANCH next:FAIL
CURLYM A,B:FAIL
IFMATCH A:FAIL
CURLY B_min_known,B_min,B_max:FAIL
COMMIT next:FAIL
MARKPOINT next:FAIL
SKIP next:FAIL
CUTGROUP next:FAIL
KEEPS next:FAIL
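
# The second-section format above can be read as a recipe for expanding
# each line into a list of state names. The Python sketch below shows one
# possible expansion. The naming scheme (REGOP_name, plus a _fail twin for
# :FAIL, plus numbered suffixes for name:count) is an assumption made for
# illustration, and expand_states is a hypothetical helper, not part of the
# regen tooling.

```python
from typing import List


def expand_states(line: str) -> List[str]:
    """Expand 'REGOP typelist [typelist]' into state names."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected a regop and at least one typelist: {line!r}")
    regop, typelists = parts[0], parts[1:]
    states: List[str] = []
    for typelist in typelists:
        # A typelist is 'namelist', 'namelist:FAIL' or 'name:count'.
        names, _, special = typelist.partition(":")
        if special == "":
            suffixes = [""]
        elif special == "FAIL":
            suffixes = ["", "_fail"]          # assumed naming of the fail twin
        elif special.isdigit():
            suffixes = [str(i) for i in range(1, int(special) + 1)]
        else:
            raise ValueError(f"unknown typelist suffix: {special!r}")
        for name in names.split(","):
            states.extend(f"{regop}_{name}{suffix}" for suffix in suffixes)
    return states


print(expand_states("CURLYM A,B:FAIL"))
# ['CURLYM_A', 'CURLYM_A_fail', 'CURLYM_B', 'CURLYM_B_fail']
```

# Under these assumptions, "TRIE next:FAIL" yields TRIE_next and
# TRIE_next_fail.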