=head1 NAME
X<regular expression> X<regex> X<regexp>
perlre - Perl regular expressions
=head1 DESCRIPTION
This page describes the syntax of regular expressions in Perl.
If you haven't used regular expressions before, a tutorial introduction
is available in L<perlretut>. If you know just a little about them,
a quick-start introduction is available in L<perlrequick>.
Except for the L</The Basics> section, this page assumes you are familiar
with regular expression basics, such as what a "pattern" is, what it
looks like, and how it is basically used. For a reference on how they
are used, plus various examples of the same, see discussions of C<m//>,
C<s///>, C<qr//> and C<"??"> in L<perlop/"Regexp Quote-Like Operators">.
New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter
rules than otherwise when compiling regular expression patterns. It can
find things that, while legal, may not be what you intended.
=head2 The Basics
X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
Regular expressions are strings with the very particular syntax and
meaning described in this document and auxiliary documents referred to
by this one. The strings are called "patterns". Patterns are used to
determine if some other string, called the "target", has (or doesn't
have) the characteristics specified by the pattern. We call this
"matching" the target string against the pattern. Usually the match is
done by having the target be the first operand, and the pattern be the
second operand, of one of the two binary operators C<=~> and C<!~>,
listed in L<perlop/Binding Operators>; and the pattern will have been
converted from an ordinary string by one of the operators in
L<perlop/"Regexp Quote-Like Operators">, like so:
$foo =~ m/abc/
This evaluates to true if and only if the string in the variable C<$foo>
contains somewhere in it, the sequence of characters "a", "b", then "c".
(The C<=~ m>, or match operator, is described in
L<perlop/m/PATTERN/msixpodualngc>.)
Patterns that aren't already stored in some variable must be delimited,
at both ends, by delimiter characters. These are often, as in the
example above, forward slashes, and the typical way a pattern is written
in documentation is with those slashes. In most cases, the delimiter
is the same character, fore and aft, but there are a few cases where a
character looks like it has a mirror-image mate, where the opening
version is the beginning delimiter, and the closing one is the ending
delimiter, like
$foo =~ m<abc>
Most times, the pattern is evaluated in double-quotish context, but it
is possible to choose delimiters to force single-quotish, like
$foo =~ m'abc'
If the pattern contains its delimiter within it, that delimiter must be
escaped. Prefixing it with a backslash (I<e.g.>, C<"/foo\/bar/">)
serves this purpose.
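The delimiter rules above can be sketched as follows. This is only a minimal illustration; C<$path> and the C<$with_*> variables are names invented for the example.

```perl
use strict;
use warnings;

# $path and the $with_* names are hypothetical, chosen for this sketch.
my $path = '/usr/bin/perl';

# Slash delimiters: every "/" inside the pattern has to be escaped.
my $with_slashes = $path =~ m/\/usr\/bin/ ? 1 : 0;

# Mirror-image delimiters: no escaping of "/" is needed.
my $with_braces = $path =~ m{/usr/bin} ? 1 : 0;

# Another punctuation character, the same fore and aft.
my $with_bang = $path =~ m!/bin/perl! ? 1 : 0;

print "$with_slashes $with_braces $with_bang\n";
```

Choosing a delimiter that does not occur in the pattern is usually easier to read than escaping each occurrence.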
Any single character in a pattern matches that same character in the
target string, unless the character is a I<metacharacter> with a special
meaning described in this document. A sequence of non-metacharacters
matches the same sequence in the target string, as we saw above with
C<m/abc/>.
Only a few characters (all of them being ASCII punctuation characters)
are metacharacters. The most commonly used one is a dot C<".">, which
normally matches almost any character (including a dot itself).
You can cause characters that normally function as metacharacters to be
interpreted literally by prefixing them with a C<"\">, just like the
pattern's delimiter must be escaped if it also occurs within the
pattern. Thus, C<"\."> matches just a literal dot, C<"."> instead of
its normal meaning. This means that the backslash is also a
metacharacter, so C<"\\"> matches a single C<"\">. And a sequence that
contains an escaped metacharacter matches the same sequence (but without
the escape) in the target string. So, the pattern C</blur\\fl/> would
match any target string that contains the sequence C<"blur\fl">.
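As a small sketch of the difference between a metacharacter and its escaped form (the target strings and variable names here are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data for this sketch.
my @targets = ('a.c', 'abc');

# An unescaped dot matches almost any character, so both strings match.
my @any_char = grep { /a.c/ } @targets;

# An escaped dot matches only a literal dot.
my @literal_dot = grep { /a\.c/ } @targets;

# The single-quoted string holds one backslash; the pattern needs "\\".
my $has_backslash = 'blur\fl' =~ /blur\\fl/ ? 1 : 0;

print "@any_char | @literal_dot | $has_backslash\n";
```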
The metacharacter C<"|"> is used to match one thing or another. Thus
$foo =~ m/this|that/
is TRUE if and only if C<$foo> contains either the sequence C<"this"> or
the sequence C<"that">. Like all metacharacters, prefixing the C<"|">
with a backslash makes it match the plain punctuation character; in its
case, the VERTICAL LINE.
$foo =~ m/this\|that/
is TRUE if and only if C<$foo> contains the sequence C<"this|that">.
You aren't limited to just a single C<"|">.
$foo =~ m/fee|fie|foe|fum/
is TRUE if and only if C<$foo> contains any of those 4 sequences from
the children's story "Jack and the Beanstalk".
As you can see, the C<"|"> binds less tightly than a sequence of
ordinary characters. We can override this by using the grouping
metacharacters, the parentheses C<"("> and C<")">.
$foo =~ m/th(is|at) thing/
is TRUE if and only if C<$foo> contains either the sequence S<C<"this
thing">> or the sequence S<C<"that thing">>. The portions of the string
that match the portions of the pattern enclosed in parentheses are
normally made available separately for use later in the pattern,
substitution, or program. This is called "capturing", and it can get
complicated. See L</Capture groups>.
The first alternative includes everything from the last pattern
delimiter (C<"(">, C<"(?:"> (described later), I<etc>. or the beginning
of the pattern) up to the first C<"|">, and the last alternative
contains everything from the last C<"|"> to the next closing pattern
delimiter. That's why it's common practice to include alternatives in
parentheses: to minimize confusion about where they start and end.
Alternatives are tried from left to right, so the first alternative
for which the entire expression matches is the one that is chosen. This
means that alternatives are not necessarily greedy. For
example: when matching C<foo|foot> against C<"barefoot">, only the C<"foo">
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)
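The left-to-right rule can be seen by capturing what each ordering matches. The following is a minimal sketch with invented variable names; the third pattern adds a C<"$"> anchor to show that the alternative chosen is the first one that lets the I<entire> expression match.

```perl
use strict;
use warnings;

# Variable names are hypothetical, for illustration only.
my ($first_wins) = 'barefoot' =~ /(foo|foot)/;    # "foo" is tried first and succeeds
my ($reordered)  = 'barefoot' =~ /(foot|foo)/;    # now "foot" is tried first
my ($anchored)   = 'barefoot' =~ /(foo|foot)$/;   # "foo" cannot be followed by the end

print "$first_wins $reordered $anchored\n";
```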
Besides taking away the special meaning of a metacharacter, a prefixed
backslash changes some letter and digit characters away from matching
just themselves to instead have special meaning. These are called
"escape sequences", and all such are described in L<perlrebackslash>. A
backslash sequence (of a letter or digit) that doesn't currently have
special meaning to Perl will raise a warning if warnings are enabled,
as those are reserved for potential future use.
One such sequence is C<\b>, which matches a boundary of some sort.
C<\b{wb}> and a few others give specialized types of boundaries.
(They are all described in detail starting at
L<perlrebackslash/\b{}, \b, \B{}, \B>.) Note that these don't match
characters, but the zero-width spaces between characters. They are an
example of a L<zero-width assertion|/Assertions>. Consider again,
$foo =~ m/fee|fie|foe|fum/
It evaluates to TRUE if, besides those 4 words, any of the sequences
"feed", "field", "Defoe", "fume", and many others are in C<$foo>. By
judicious use of C<\b> (or better (because it is designed to handle
natural language) C<\b{wb}>), we can make sure that only the Giant's
words are matched:
$foo =~ m/\b(fee|fie|foe|fum)\b/
$foo =~ m/\b{wb}(fee|fie|foe|fum)\b{wb}/
The final example shows that the characters C<"{"> and C<"}"> are
metacharacters.
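A short sketch of the effect of C<\b> on the Giant's words follows. The word list is made up for the example, and only plain C<\b> is used so that the code does not depend on the C<\b{wb}> form.

```perl
use strict;
use warnings;

# Hypothetical sample words for this sketch.
my @words = ('fee', 'feed', 'Defoe', 'fum!');

# Without boundaries, any string containing one of the sequences matches.
my @loose = grep { /fee|fie|foe|fum/ } @words;

# With \b on both sides, the sequence must stand alone as a word.
my @bounded = grep { /\b(fee|fie|foe|fum)\b/ } @words;

print "@loose | @bounded\n";
```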
Another use for escape sequences is to specify characters that cannot
(or which you prefer not to) be written literally. These are described
in detail in L<perlrebackslash/Character Escapes>, but the next three
paragraphs briefly describe some of them.
Various control characters can be written in C language style: C<"\n">
matches a newline, C<"\t"> a tab, C<"\r"> a carriage return, C<"\f"> a
form feed, I<etc>.
More generally, C<\I<nnn>>, where I<nnn> is a string of three octal
digits, matches the character whose native code point is I<nnn>. You
can easily run into trouble if you don't have exactly three digits. So
always use three, or since Perl 5.14, you can use C<\o{...}> to specify
any number of octal digits.
Similarly, C<\xI<nn>>, where I<nn> are hexadecimal digits, matches the
character whose native ordinal is I<nn>. Again, not using exactly two
digits is a recipe for disaster, but you can use C<\x{...}> to specify
any number of hex digits.
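The octal and hexadecimal forms described above can be sketched like this. The example assumes an ASCII platform (where "A" is code point 0x41 and tab is octal 011) and Perl 5.14 or later for C<\o{...}>; the variable names are invented.

```perl
use strict;
use warnings;

# All names are hypothetical; code points assume an ASCII platform.
my $tab_octal      = "\t" =~ /\011/     ? 1 : 0;   # exactly three octal digits
my $a_hex          = 'A'  =~ /\x41/     ? 1 : 0;   # exactly two hex digits
my $a_hex_braced   = 'A'  =~ /\x{41}/   ? 1 : 0;   # braces allow any number of digits
my $a_octal_braced = 'A'  =~ /\o{101}/  ? 1 : 0;   # needs Perl 5.14+
my $snowman        = "\x{2603}" =~ /\x{2603}/ ? 1 : 0;

print "$tab_octal $a_hex $a_hex_braced $a_octal_braced $snowman\n";
```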
Besides being a metacharacter, the C<"."> is an example of a "character
class", something that can match any single character of a given set of
them. In its case, the set is just about all possible characters. Perl
predefines several character classes besides the C<".">; there is a
separate reference page about just these, L<perlrecharclass>.
You can define your own custom character classes, by putting into your
pattern in the appropriate place(s), a list of all the characters you
want in the set. You do this by enclosing the list within C<[]> bracket
characters. These are called "bracketed character classes" when we are
being precise, but often the word "bracketed" is dropped. (Dropping it
usually doesn't cause confusion.) This means that the C<"["> character
is another metacharacter. It doesn't match anything just by itself; it
is used only to tell Perl that what follows it is a bracketed character
class. If you want to match a literal left square bracket, you must
escape it, like C<"\[">. The matching C<"]"> is also a metacharacter;
again it doesn't match anything by itself, but just marks the end of
your custom class to Perl. It is an example of a "sometimes
metacharacter". It isn't a metacharacter if there is no corresponding
C<"[">, and matches its literal self:
print "]" =~ /]/; # prints 1
The list of characters within the character class gives the set of
characters matched by the class. C<"[abc]"> matches a single "a" or "b"
or "c". But if the first character after the C<"["> is C<"^">, the
class instead matches any character not in the list. Within a list, the
C<"-"> character specifies a range of characters, so that C<a-z>
represents all characters between "a" and "z", inclusive. If you want
either C<"-"> or C<"]"> itself to be a member of a class, put it at the
start of the list (possibly after a C<"^">), or escape it with a
backslash. C<"-"> is also taken literally when it is at the end of the
list, just before the closing C<"]">. (The following all specify the
same class of three characters: C<[-az]>, C<[az-]>, and C<[a\-z]>. All
are different from C<[a-z]>, which specifies a class containing
twenty-six characters, even on EBCDIC-based character sets.)
There is lots more to bracketed character classes; full details are in
L<perlrecharclass/Bracketed Character Classes>.
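The placement rules for C<"-">, C<"]"> and C<"^"> can be sketched as follows. The list of sample characters and the variable names are made up for the example; each pattern is anchored so that it tests exactly one character.

```perl
use strict;
use warnings;

# Hypothetical sample characters for this sketch.
my @chars = ('a', 'm', 'z', '-', ']');

my @three   = grep { /^[-az]$/  } @chars;   # "-" first: literal, three members
my @range   = grep { /^[a-z]$/  } @chars;   # "-" between: a range
my @negated = grep { /^[^a-z]$/ } @chars;   # leading "^": everything else
my @bracket = grep { /^[]a]$/   } @chars;   # "]" first: literal "]"

print "@three | @range | @negated | @bracket\n";
```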
=head3 Metacharacters
X<metacharacter>
X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
L</The Basics> introduced some of the metacharacters. This section
gives them all. Most of them have the same meaning as in the I<egrep>
command.
Only the C<"\"> is always a metacharacter. The others are metacharacters
just sometimes. The following table lists all of them, summarizes
their use, and gives the contexts where they are metacharacters.
Outside those contexts or if prefixed by a C<"\">, they match their
corresponding punctuation character. In some cases, their meaning
varies depending on various pattern modifiers that alter the default
behaviors. See L</Modifiers>.
            PURPOSE                                  WHERE
 \   Escape the next character                    Always, except when
                                                  escaped by another \
 ^   Match the beginning of the string            Not in []
       (or line, if /m is used)
 ^   Complement the [] class                      At the beginning of []
 .   Match any single character except newline    Not in []
       (under /s, includes newline)
 $   Match the end of the string                  Not in [], but can
       (or before newline at the end of the       mean interpolate a
       string; or before any newline if /m is     scalar
       used)
 |   Alternation                                  Not in []
 ()  Grouping                                     Not in []
 [   Start Bracketed Character class              Not in []
 ]   End Bracketed Character class                Only in [], and
                                                    not first
 *   Matches the preceding element 0 or more      Not in []
       times
 +   Matches the preceding element 1 or more      Not in []
       times
 ?   Matches the preceding element 0 or 1         Not in []
       times
 {   Starts a sequence that gives number(s)       Not in []
       of times the preceding element can be
       matched
 {   when following certain escape sequences
       starts a modifier to the meaning of the
       sequence
 }   End sequence started by {
 -   Indicates a range                            Only in [] interior
 #   Beginning of comment, extends to line end    Only with /x modifier
Notice that most of the metacharacters lose their special meaning when
they occur in a bracketed character class, except C<"^"> has a different
meaning when it is at the beginning of such a class. And C<"-"> and C<"]">
are metacharacters only at restricted positions within bracketed
character classes; while C<"}"> is a metacharacter only when closing a
special construct started by C<"{">.
In double-quotish context, as is usually the case, you need to be
careful about C<"$"> and the non-metacharacter C<"@">. Those could
interpolate variables, which may or may not be what you intended.
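A minimal sketch of how C<"$"> and C<"@"> behave in the usual double-quotish context follows. The variables C<$word> and C<@pair> are invented for the example, and the array case relies on the default list separator (a single space).

```perl
use strict;
use warnings;

# $word and @pair are hypothetical variables for this sketch.
my $word = 'cat';
my @pair = ('x', 'y');

# $word is interpolated, so the pattern is really /cat/.
my $scalar_interp = 'concatenate' =~ /$word/ ? 1 : 0;

# @pair is interpolated too, joined by a space: the pattern is /ax y/.
my $array_interp = 'ax y' =~ /a@pair/ ? 1 : 0;

# Backslashes keep "@" and "$" literal.
my $literal_at     = 'a@pair'   =~ /a\@pair/ ? 1 : 0;
my $literal_dollar = 'cost: $5' =~ /\$5/     ? 1 : 0;

print "$scalar_interp $array_interp $literal_at $literal_dollar\n";
```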
These rules were designed for compactness of expression, rather than
legibility and maintainability. The L</E<sol>x and E<sol>xx> pattern
modifiers allow you to insert white space to improve readability. And
use of S<C<L<re 'strict'|re/'strict' mode>>> adds extra checking to
catch some typos that might silently compile into something unintended.
By default, the C<"^"> character is guaranteed to match only the
beginning of the string, the C<"$"> character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line. Embedded newlines
will not be matched by C<"^"> or C<"$">. You may, however, wish to treat a
string as a multi-line buffer, such that the C<"^"> will match after any
newline within the string (except if the newline is the last character in
the string), and C<"$"> will match before any newline. At the
cost of a little more overhead, you can do this by using the
L</C<E<sol>m>> modifier on the pattern match operator. (Older programs
did this by setting C<$*>, but this option was removed in perl 5.10.)
X<^> X<$> X</m>
To simplify multi-line substitutions, the C<"."> character never matches a
newline unless you use the L<C<E<sol>s>|/s> modifier, which in effect tells
Perl to pretend the string is a single line--even if it isn't.
X<.> X</s>
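The default behavior of C<"^">, C<"$"> and C<".">, and the effect of C</m> and C</s>, can be sketched as follows. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-line buffer for this sketch.
my $text = "first\nsecond\n";

# Default: "^" and "$" see one line, so no whole-line match is found.
my @default_lines = $text =~ /^(\w+)$/g;

# With /m, "^" and "$" match at every line within the string.
my @multi_lines = $text =~ /^(\w+)$/mg;

# Even by default, "$" matches before a newline at the very end.
my $final_newline = "line\n" =~ /line$/ ? 1 : 0;

# "." skips newlines unless /s is given.
my $dot_default = "a\nb" =~ /a.b/  ? 1 : 0;
my $dot_single  = "a\nb" =~ /a.b/s ? 1 : 0;

print scalar(@default_lines), " | @multi_lines | ",
      "$final_newline $dot_default $dot_single\n";
```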
=head2 Modifiers
=head3 Overview
The default behavior for matching can be changed, using various
modifiers. Modifiers that relate to the interpretation of the pattern
are listed just below. Modifiers that alter the way a pattern is used
by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.
=over 4
=item B<C<m>>
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat the string being matched against as multiple lines. That is, change C<"^"> and C<"$"> from matching
the start of the string's first line and the end of its last line to
matching the start and end of each line within the string.
=item B<C<s>>
X</s> X<regex, single-line> X<regexp, single-line>
X<regular expression, single-line>
Treat the string as single line. That is, change C<"."> to match any character
whatsoever, even a newline, which normally it would not match.
Used together, as C</ms>, they let the C<"."> match any character whatsoever,
while still allowing C<"^"> and C<"$"> to match, respectively, just after
and just before newlines within the string.
=item B<C<i>>
X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
X<regular expression, case-insensitive>
Do case-insensitive pattern matching. For example, "A" will match "a"
under C</i>.
If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 255, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed, unless
the locale is a UTF-8 one. See L<perllocale>.
There are a number of Unicode characters that match a sequence of
multiple characters under C</i>. For example,
C<LATIN SMALL LIGATURE FI> should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
# The below doesn't match, and it isn't clear what $1 and $2 would
# be even if it did!!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing. See
L<perlrecharclass/Bracketed Character Classes>, and
L<perlrecharclass/Negation>.
=item B<C<x>> and B<C<xx>>
X</x>
Extend your pattern's legibility by permitting whitespace and comments.
Details in L</E<sol>x and E<sol>xx>.
=item B<C<p>>
X</p> X<regex, preserve> X<regexp, preserve>
Preserve the string matched such that C<${^PREMATCH}>, C<${^MATCH}>, and
C<${^POSTMATCH}> are available for use after matching.
In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
mechanism, C<${^PREMATCH}>, C<${^MATCH}>, and C<${^POSTMATCH}> will be available
after the match regardless of the modifier.
=item B<C<a>>, B<C<d>>, B<C<l>>, and B<C<u>>
X</a> X</d> X</l> X</u>
These modifiers, all new in 5.14, affect which character-set rules
(Unicode, I<etc>.) are used, as described below in
L</Character set modifiers>.
=item B<C<n>>
X</n> X<regex, non-capture> X<regexp, non-capture>
X<regular expression, non-capture>
Prevent the grouping metacharacters C<()> from capturing. This modifier,
new in 5.22, will stop C<$1>, C<$2>, I<etc>... from being filled in.
"hello" =~ /(hi|hello)/; # $1 is "hello"
"hello" =~ /(hi|hello)/n; # $1 is undef
This is equivalent to putting C<?:> at the beginning of every capturing group:
"hello" =~ /(?:hi|hello)/; # $1 is undef
C</n> can be negated on a per-group basis. Alternatively, named captures
may still be used.
"hello" =~ /(?-n:(hi|hello))/n; # $1 is "hello"
"hello" =~ /(?<greet>hi|hello)/n; # $1 is "hello", $+{greet} is
# "hello"
=item Other Modifiers
There are a number of flags that can be found at the end of regular
expression constructs that are I<not> generic regular expression flags, but
apply to the operation being performed, like matching or substitution (C<m//>
or C<s///> respectively).
Flags described further in
L<perlretut/"Using regular expressions in Perl"> are:
c - keep the current position during repeated matching
g - globally match the pattern repeatedly in the string
Substitution-specific modifiers described in
L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are:
e - evaluate the right-hand side as an expression
ee - evaluate the right side as a string then eval the result
o - pretend to optimize your code, but actually introduce bugs
r - perform non-destructive substitution and return the new value
=back
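A few of the modifiers listed above can be sketched together. This is only an illustration with invented variable names; it assumes Perl 5.14 or later for C</r>.

```perl
use strict;
use warnings;

# $str and the other names are hypothetical, for this sketch only.
my $str = 'hello world';

# /p: the three ${^...} variables describe the text around the match.
my ($pre, $match, $post);
if ($str =~ /o w/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

# /g in list context returns every match.
my @numbers = 'a1b22c333' =~ /\d+/g;

# /r returns the changed copy and leaves $str alone.
my $copy = $str =~ s/world/Perl/r;

print "$pre|$match|$post @numbers $copy\n";
```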
Regular expression modifiers are usually written in documentation
as I<e.g.>, "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imnsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct, see L</Extended Patterns> below.
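As a brief sketch of embedding a modifier in the pattern itself, the following compares a trailing C</i> with the C<(?i)> and C<(?i:...)> forms. The target string and variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical target string for this sketch.
my $target = 'HELLO world';

# Trailing /i applies to the whole pattern.
my $whole = $target =~ /hello WORLD/i ? 1 : 0;

# (?i) turns case-insensitivity on, (?-i) turns it off again.
my $switched = $target =~ /(?i)hello(?-i) world/ ? 1 : 0;

# (?i:...) applies only inside the group, so "WORLD" stays case-sensitive.
my $scoped = $target =~ /(?i:hello) WORLD/ ? 1 : 0;

print "$whole $switched $scoped\n";
```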
=head3 Details on some modifiers
Some of the modifiers require more explanation than given in the
L</Overview> above.
=head4 C</x> and C</xx>
A single C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a bracketed character class. You can use this to
break up your regular expression into more readable parts.
Also, the C<"#"> character is treated as a metacharacter introducing a
comment that runs up to the pattern's closing delimiter, or to the end
of the current line if the pattern extends onto the next line. Hence,
this is very much like an ordinary Perl code comment. (You can include
the closing delimiter within the comment only if you precede it with a
backslash, so be careful!)
Use of C</x> means that if you want real
whitespace or C<"#"> characters in the pattern (outside a bracketed character
class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the C<\n> with a backslash or C<\Q>.
You can use L</(?#text)> to create a comment that ends earlier than the
end of the current line, but C<text> also can't contain the closing
delimiter unless escaped with a backslash.
A common pitfall is to forget that C<"#"> characters begin a comment under
C</x> and are not matched literally. Just keep that in mind when trying
to puzzle out why a particular C</x> pattern isn't working as expected.
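The following sketch shows one way to keep a literal space and a literal C<"#"> in a C</x> pattern: a bracketed character class for the space, and a backslash for the C<"#">. The pattern and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# $re, $ok and $no are hypothetical names for this sketch.
my $re = qr/
    \d+      # one or more digits
    [ ]      # a literal space, protected by the brackets
    \#       # a literal "#", escaped so it does not start a comment
    \d+      # more digits
/x;

my $ok = 'item 12 #34' =~ $re ? 1 : 0;   # has the space before "#"
my $no = 'item 12#34'  =~ $re ? 1 : 0;   # lacks the space

print "$ok $no\n";
```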
Starting in Perl v5.26, if the modifier has a second C<"x"> within it,
it does everything that a single C</x> does, but additionally
non-backslashed SPACE and TAB characters within bracketed character
classes are also generally ignored, and hence can be added to make the
classes more readable.
/ [d-e g-i 3-7]/xx
/[ ! @ " # $ % ^ & * () = ? <> ' ]/xx
may be easier to grasp than the squashed equivalents
/[d-eg-i3-7]/
/[!@"#$%^&*()=?<>']/
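A minimal sketch comparing the spaced and squashed forms follows. It assumes Perl v5.26 or later, since C</xx> is not available before that; the variable names are invented.

```perl
use v5.26;
use warnings;

# Hypothetical names; both patterns should accept the same strings.
my $squashed = qr/^[d-eg-i3-7]+$/;
my $spaced   = qr/^ [ d-e g-i 3-7 ]+ $/xx;

my $squashed_ok = 'dig5' =~ $squashed ? 1 : 0;
my $spaced_ok   = 'dig5' =~ $spaced   ? 1 : 0;

# The blanks inside the class are ignored, so a space is not a member.
my $spaced_space = 'd g' =~ $spaced ? 1 : 0;

print "$squashed_ok $spaced_ok $spaced_space\n";
```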
Taken together, these features go a long way towards
making Perl's regular expressions more readable. Here's an example:
    # Delete (most) C comments.
    $program =~ s {
        /\*     # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        \*/     # Match the closing delimiter.
    } []gsx;
Note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<"(">,
C<"?">, and C<":">. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>
The set of characters that are deemed whitespace are those that Unicode
calls "Pattern White Space", namely:
U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0020 SPACE
U+0085 NEXT LINE
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
=head4 Character set modifiers
C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set rules
used for the regular expression.
The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.
The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.
Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.
C</u> sets the character set to B<U>nicode.
C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.
C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.
At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed. And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.
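The way a compiled pattern keeps its own rules when interpolated can be sketched as follows. The names C<$ascii_word> and C<$greek> are invented; the Greek letters are simply word characters above code point 255, so the enclosing patterns match them under Unicode rules.

```perl
use strict;
use warnings;

# Hypothetical names for this sketch.
my $ascii_word = qr/\w+/a;              # compiled with /a
my $greek      = "\x{3b1}\x{3b2}";      # GREEK SMALL LETTER ALPHA, BETA

# A plain pattern treats the Greek letters as word characters.
my $outer_rules = $greek =~ /^\w+$/ ? 1 : 0;

# The interpolated piece keeps its /a rules, so it rejects them.
my $inner_rules = $greek =~ /^$ascii_word$/ ? 1 : 0;

# Both rule sets coexist: the /a piece takes "ab", the outer \w+ the rest.
my $combined = "ab$greek" =~ /^$ascii_word\w+$/ ? 1 : 0;

print "$outer_rules $inner_rules $combined\n";
```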
The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend to even any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,
s/foo/\Ubar/il
will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>. This both
implicitly adds the C</l>, and applies locale rules to the C<\U>. The
lesson is to C<use locale>, and not C</l> explicitly.
Similarly, it would be better to use C<use feature 'unicode_strings'>
instead of,
s/foo/\Lbar/iu
to get Unicode rules, as the C<\L> in the former (but not necessarily
the latter) would also use Unicode rules.
More detail on each of the modifiers follows. Most likely you don't
need to know this detail for C</l>, C</u>, and C</d>, and can skip ahead
to L<E<sol>a|/E<sol>a (and E<sol>aa)>.
=head4 /l
means to use the current locale's rules (see L<perllocale>) when pattern
matching. For example, C<\w> will match the "word" characters of that
locale, and C<"/i"> case-insensitive matching will match according to
the locale's case folding rules. The locale used will be the one in
effect at the time of execution of the pattern match. This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
L<setlocale() function|perllocale/The setlocale function>.
Prior to v5.20, Perl did not support multi-byte locales. Starting then,
UTF-8 locales are supported. No other multi-byte locales are ever
likely to be supported. However, in all locales, one can have code
points above 255 and these will always be treated as Unicode no matter
what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary. Except for UTF-8 locales in Perls v5.20 and
later, these are disallowed under C</l>. For example, 0xFF (on ASCII
platforms) does not caselessly match the character at 0x178, C<LATIN
CAPITAL LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL
LETTER Y WITH DIAERESIS> in the current locale, and Perl has no way of
knowing if that character even exists in the locale, much less what code
point it is.
In a UTF-8 locale in v5.20 and later, the only visible difference
between locale and non-locale in regular expressions should be tainting
(see L<perlsec>).
This modifier may be specified to be the default by C<use locale>, but
see L</Which character set modifier is in effect?>.
X</l>
=head4 /u
means to use Unicode rules when pattern matching. On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.) Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, C<\w> will match any of the more than
100_000 word characters in Unicode.
Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters I<somewhere> in
the world as
C<\w>. For example, your locale might not think that C<LATIN SMALL
LETTER ETH> is a letter (unless you happen to speak Icelandic), but
Unicode does. Similarly, all the characters that are decimal digits
somewhere in the world will match C<\d>; this is hundreds, not 10,
possible matches. And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is. For example,
C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
C<ASCII DIGIT EIGHT> (U+0038). And C<\d+> may match strings of digits
that are a mixture from different writing systems, creating a security
issue. L<Unicode::UCD/num()> can be used to sort
this out. Or the C</a> modifier can be used to force C<\d> to match
just the ASCII 0 through 9.
Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters. The C<KELVIN SIGN>, for example, matches the letters "k" and
"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue. See
L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
security issues.
This modifier may be specified to be the default by C<use feature
'unicode_strings'>, C<use locale ':not_characters'>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher),
but see L</Which character set modifier is in effect?>.
X</u>
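As a rough illustration of the points above (a sketch only; the variable names are made up for this example, and it assumes Perl v5.14 or later for the explicit C</u> and C</a> flags), the following compares C<\d> and case-insensitive matching under the two modifiers, and uses L<Unicode::UCD/num()> to reject a digit string that mixes scripts:

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Illustrative names; \N{U+...} avoids needing "use utf8" in the source.
my $bengali_four = "\N{U+09EA}";       # BENGALI DIGIT FOUR
my $kelvin       = "\N{U+212A}";       # KELVIN SIGN
my $mixed        = "1$bengali_four";   # ASCII digit followed by a Bengali digit

my $digit_under_u = $bengali_four =~ /^\d$/u ? 1 : 0;   # Unicode rules: a digit
my $digit_under_a = $bengali_four =~ /^\d$/a ? 1 : 0;   # ASCII-only \d: no match
my $mixed_under_u = $mixed        =~ /^\d+$/u ? 1 : 0;  # mixed scripts still match
my $kelvin_is_k   = $kelvin       =~ /^k$/iu ? 1 : 0;   # full Unicode case folding

# num() is documented to return undef for a digit string that mixes scripts.
my $mixed_value  = num($mixed);
my $single_value = num($bengali_four);

print "digit under /u: $digit_under_u, under /a: $digit_under_a\n";
print "mixed-script \\d+ under /u: $mixed_under_u\n";
print "KELVIN SIGN matches k under /iu: $kelvin_is_k\n";
print "num(mixed) is ", (defined $mixed_value ? $mixed_value : "undef"), "\n";
```

Checking the result of C<num()> for definedness, rather than trusting C<\d+> alone, is one way to sort out the mixed-script case described above.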
=head4 /d
This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:
=over 4
=item 1
the target string is encoded in UTF-8; or
=item 2
the pattern is encoded in UTF-8; or
=item 3
the pattern explicitly mentions a code point that is above 255 (say by
C<\x{100}>); or
=item 4
the pattern uses a Unicode name (C<\N{...}>); or
=item 5
the pattern uses a Unicode property (C<\p{...}> or C<\P{...}>); or
=item 6
the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or
=item 7
the pattern uses L</C<(?[ ])>>; or
=item 8
the pattern uses L<C<(*script_run: ...)>|/Script Runs>
=back
Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".
Unless the pattern or string is encoded in UTF-8, only ASCII characters
can match positively.
Here are some examples of how that works on an ASCII platform:
    $str =  "\xDF";        # $str is not in UTF-8 format.
    $str =~ /^\w/;         # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";    # Now $str is in UTF-8 format.
    $str =~ /^\w/;         # Match! $str is now in UTF-8 format.
    chop $str;
    $str =~ /^\w/;         # Still a match! $str remains in UTF-8 format.
This modifier is automatically selected by default when none of the
others are, so yet another name for it is "Default".
Because of the unexpected behaviors associated with this modifier, you
probably should only explicitly use it to maintain weird backward
compatibilities.
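The dependence on the internal encoding can also be seen without changing the string's contents. The following is a minimal sketch for an ASCII platform, assuming Perl v5.14 or later for the explicit flags; the variable names are illustrative. It uses C<utf8::upgrade> to switch a copy of the string to UTF-8 storage and compares the results under C</d> and C</u>:

```perl
use strict;
use warnings;

# No "use feature 'unicode_strings'", "use locale", or "use v5.12" here,
# so /d would be the default anyway; it is spelled out for clarity.
my $str = "\xE9";                            # e-acute, stored as one native byte

my $native_d = $str =~ /^\w$/d ? 1 : 0;      # native rules: not a word character
my $native_u = $str =~ /^\w$/u ? 1 : 0;      # /u: the Latin-1 meaning applies

my $copy = $str;
utf8::upgrade($copy);                        # same character, UTF-8 internally
my $same_string = $copy eq $str ? 1 : 0;     # the two strings still compare equal
my $upgraded_d  = $copy =~ /^\w$/d ? 1 : 0;  # Unicode rules now apply under /d

print "native /d: $native_d, native /u: $native_u\n";
print "upgraded /d: $upgraded_d (strings equal: $same_string)\n";
```

Two strings that compare equal give different answers under C</d>, whereas C</u> gives the same answer regardless of how the string is stored.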
=head4 /a (and /aa)
This modifier stands for ASCII-restrict (or ASCII-safe). This modifier
may be doubled-up to increase its effect.
When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and
the Posix character classes to match only in the ASCII range. They thus
revert to their pre-5.6, pre-Unicode meanings. Under C</a>, C<\d>
always means precisely the digits C<"0"> to C<"9">; C<\s> means the five
characters C<[ \f\n\r\t]>, and starting in Perl v5.18, the vertical tab;
C<\w> means the 63 characters
C<[A-Za-z0-9_]>; and likewise, all the Posix classes such as
C<[[:print:]]> match only the appropriate ASCII-range characters.
This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.
With C</a>, one can write C<\d> with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can instead use C<\p{Digit}> (or C<\p{Word}> for C<\w>). There are
similar C<\p{...}> constructs that can match beyond ASCII both white
space (see L<perlrecharclass/Whitespace>), and Posix classes (see
L<perlrecharclass/POSIX Character Classes>). Thus, this modifier
doesn't mean you can't use Unicode, it means that to get Unicode
matching you must explicitly use a construct (C<\p{}>, C<\P{}>) that
signals Unicode.
As you would expect, this modifier causes, for example, C<\D> to mean
the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
between C<\w> and C<\W>, using the C</a> definitions of them (similarly
for C<\B>).
Otherwise, C</a> behaves like the C</u> modifier, in that
case-insensitive matching uses Unicode rules; for example, "k" will
match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
points in the Latin1 range, above ASCII, will have Unicode rules when it
comes to case-insensitive matching.
To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
specify the C<"a"> twice, for example C</aai> or C</aia>. (The first
occurrence of C<"a"> restricts the C<\d>, I<etc>., and the second occurrence
adds the C</i> restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for C</i> matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
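The difference between the single and doubled forms can be sketched as follows. This is an illustrative example only, assuming an ASCII platform and Perl v5.14 or later; the variable names are not from the original text:

```perl
use strict;
use warnings;

my $kelvin = "\N{U+212A}";   # KELVIN SIGN, folds to "k" under Unicode rules
my $long_s = "\N{U+017F}";   # LATIN SMALL LETTER LONG S, folds to "s"

my $kelvin_a  = $kelvin =~ /^k$/ia  ? 1 : 0;   # /a alone still allows the fold
my $kelvin_aa = $kelvin =~ /^k$/iaa ? 1 : 0;   # /aa forbids ASCII/non-ASCII pairs
my $long_s_a  = $long_s =~ /^s$/ia  ? 1 : 0;
my $long_s_aa = $long_s =~ /^s$/iaa ? 1 : 0;

# Both sides are non-ASCII here, so Unicode case folding still applies under /aa.
my $latin1_aa = "\xE9" =~ /^\xC9$/iaa ? 1 : 0;         # e-acute vs. E-acute

# \d is ASCII-only under either form.
my $arabic_digit_a = "\N{U+0663}" =~ /^\d$/a ? 1 : 0;  # ARABIC-INDIC DIGIT THREE

print "KELVIN vs k: /ia=$kelvin_a /iaa=$kelvin_aa\n";
print "LONG S vs s: /ia=$long_s_a /iaa=$long_s_aa\n";
print "e-acute vs E-acute under /iaa: $latin1_aa\n";
print "Arabic-Indic digit as \\d under /a: $arabic_digit_a\n";
```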
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice
gives added protection.
This modifier may be specified to be the default by C<use re '/a'>
or C<use re '/aa'>. If you do so, you may actually have occasion to use
the C</u> modifier explicitly if there are a few regular expressions
where you do want full Unicode rules (but even here, it's best if
everything were under feature C<"unicode_strings">, along with the
C<use re '/aa'>). Also see L</Which character set modifier is in
effect?>.
X</a>
X</aa>
=head4 Which character set modifier is in effect?
Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions. These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details. As
explained below in L</Extended Patterns> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.
The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that also change the defaults.
Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings'|feature>>, or
C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
C</u> when not in the same scope as either C<L<use locale|perllocale>>
or C<L<use bytes|bytes>>.
(C<L<use locale ':not_characters'|perllocale/Unicode and UTF-8>> also
sets the default to C</u>, overriding any plain C<use locale>.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
C<\U>, C<\l>, I<etc>. in substitution replacements.
If none of the above apply, for backwards compatibility reasons, the
C</d> modifier is the one in effect by default. As this can lead to
unexpected results, it is best to specify which other rule set should be
used.
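The precedence rules in this section can be sketched as follows. This is a minimal example, assuming Perl v5.14 or later; the variable names are illustrative. A pragma-supplied default is overridden by a modifier on the whole pattern, which in turn is overridden by an inline modifier on part of it:

```perl
use strict;
use warnings;

my $bengali_four = "\N{U+09EA}";   # BENGALI DIGIT FOUR: matches \d only under Unicode rules

my ($pragma_default, $explicit_u, $inline_u, $inline_a);
{
    use re '/a';   # default character set modifier for this lexical scope
    $pragma_default = $bengali_four =~ /^\d$/      ? 1 : 0;  # /a from the pragma
    $explicit_u     = $bengali_four =~ /^\d$/u     ? 1 : 0;  # whole-pattern /u wins
    $inline_u       = $bengali_four =~ /^(?u:\d)$/ ? 1 : 0;  # innermost (?u:...) wins
}
# Outside the pragma's scope: the inline (?a:...) overrides the outer /u.
$inline_a = $bengali_four =~ /^(?a:\d)$/u ? 1 : 0;

print "pragma default: $pragma_default\n";
print "explicit /u:    $explicit_u\n";
print "inline (?u:):   $inline_u\n";
print "inline (?a:):   $inline_a\n";
```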
=head4 Character set modifier behavior prior to Perl 5.14
Prior to 5.14, there were no explicit modifiers, but C</l> was implied
for regexes compiled within the scope of C<use locale>, and C</d> was
implied otherwise. However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation. There were a number of
inconsistencies (bugs) with the C</d> modifier, where Unicode rules
would be used when inappropriate, and vice versa. C<\p{}> did not imply
Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.
=head2 Regular Expressions
=head3 Quantifiers
Quantifiers are used when a particular portion of a pattern needs to
match a certain number (or numbers) of times. If there isn't a
quantifier the number of times to match is exactly one. The following
standard quantifiers are recognized:
X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
    *           Match 0 or more times
    +           Match 1 or more times
    ?           Match 1 or 0 times
    {n}         Match exactly n times
    {n,}        Match at least n times
    {n,m}       Match at least n but not more than m times
(If a non-escaped curly bracket occurs in a context other than one of
the quantifiers listed above, where it does not form part of a
backslashed sequence like C<\x{...}>, it is either a fatal syntax error,
or treated as a regular character, generally with a deprecation warning
raised. To escape it, you can precede it with a backslash (C<"\{">) or
enclose it within square brackets (C<"[{]">).
This change will allow for future syntax extensions (like making the
lower bound of a quantifier optional), and better error checking of
quantifiers).
The C<"*"> quantifier is equivalent to C<{0,}>, the C<"+">
quantifier to C<{1,}>, and the C<"?"> quantifier to C<{0,1}>. I<n> and I<m> are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:
    $_ **= $_ , / {$_} / for 2 .. 42;
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a C<"?">. Note
that the meanings don't change, just the "greediness":
X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
    *?        Match 0 or more times, not greedily
    +?        Match 1 or more times, not greedily
    ??        Match 0 or 1 time, not greedily
    {n}?      Match exactly n times, not greedily (redundant)
    {n,}?     Match at least n times, not greedily
    {n,m}?    Match at least n but not more than m times, not greedily
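A short sketch of the difference, with illustrative variable names: the same pattern is run with a greedy and a non-greedy quantifier against the same input, and only the amount of text consumed changes.

```perl
use strict;
use warnings;

my $text = "<b>bold</b>";

my ($greedy) = $text =~ /<(.+)>/;    # .+ takes as much as it can
my ($lazy)   = $text =~ /<(.+?)>/;   # .+? stops at the first ">"

my ($up_to_three) = "aaaaa" =~ /^(a{2,3})/;    # greedy: takes three
my ($only_two)    = "aaaaa" =~ /^(a{2,3}?)/;   # non-greedy: takes two

print "greedy:  $greedy\n";        # greedy:  b>bold</b
print "lazy:    $lazy\n";          # lazy:    b
print "{2,3}:   $up_to_three\n";   # {2,3}:   aaa
print "{2,3}?:  $only_two\n";      # {2,3}?:  aa
```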
Normally when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.
    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back
For instance,
    'aaaa' =~ /a++a/
will never match, as the C<a++> will gobble up all the C<"a">'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:
    /"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
    /"(?>(?:(?>[^"\\]+)|\\.)*)"/
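The possessive forms can be tried out as follows. This is a sketch with illustrative variable names; the capture group around the double-quoted-string pattern is added here for demonstration and is not part of the pattern given above.

```perl
use strict;
use warnings;

my $plain      = 'aaaa' =~ /a+a/  ? 1 : 0;   # a+ gives one "a" back: matches
my $possessive = 'aaaa' =~ /a++a/ ? 1 : 0;   # a++ keeps every "a": no match

# The double-quoted-string pattern from the text, with a capture group added.
my $dq = qr/("(?:[^"\\]++|\\.)*+")/;

my ($quoted)     = 'say "foo \"bar\" baz" now' =~ $dq;
my $unterminated = '"no closing quote' =~ $dq ? 1 : 0;   # fails without backtracking

print "a+a: $plain, a++a: $possessive\n";
print "quoted: $quoted\n";
print "unterminated matches: $unterminated\n";
```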
Note that the possessive quantifier modifier can not be combined
with the non-greedy modifier. This is because it would make no sense.
Consider the following equivalency table:
    Illegal         Legal
    ------------    ------
    X??+            X{0}
    X+?+            X{1}
    X{min,max}?+    X{min}
=head3 Escape sequences
Because patterns are processed as double-quoted strings, the following
also work:
    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \cK         control char          (example: VT)
    \x{}, \x00  character whose ordinal is the given hexadecimal number
    \N{name}    named Unicode character or character sequence
    \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
    \o{}, \000  character whose ordinal is the given octal number
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase until \E (think vi)
    \U          uppercase until \E (think vi)
    \Q          quote (disable) pattern metacharacters until \E
    \E          end either case modification or quoted section, think vi
Details are in L<perlop/Quote and Quote-like Operators>.
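A few of these escapes in use, as a minimal sketch with illustrative variable names. It shows C<\Q...\E> turning an interpolated string into a literal, C<\U> and C<\u> changing the case of interpolated text, and two of the character escapes:

```perl
use strict;
use warnings;

my $literal = 'a.b*c';   # contains the metacharacters "." and "*"

my $as_pattern = 'aXbbbc' =~ /^$literal$/     ? 1 : 0;  # "." and "*" are live
my $quoted_no  = 'aXbbbc' =~ /^\Q$literal\E$/ ? 1 : 0;  # now literal: no match
my $quoted_yes = 'a.b*c'  =~ /^\Q$literal\E$/ ? 1 : 0;  # matches only itself

my $word  = 'hello';
my $upper = 'HELLO' =~ /^\U$word\E$/ ? 1 : 0;   # \U uppercases until \E
my $first = 'Hello' =~ /^\u$word$/   ? 1 : 0;   # \u uppercases one character

my $vt   = "\x0B"     =~ /^\cK$/        ? 1 : 0;   # \cK is the vertical tab
my $moon = "\x{263D}" =~ /^\N{U+263D}$/ ? 1 : 0;   # same code point, two spellings

print "unquoted: $as_pattern, quoted: $quoted_no / $quoted_yes\n";
print "\\U: $upper, \\u: $first, \\cK: $vt, \\N{U+263D}: $moon\n";
```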
=head3 Character Classes and other Special Escapes
In addition, Perl defines the following:
X<\g> X<\k> X<\K> X<backreference>
    Sequence   Note    Description
     [...]     [1]  Match a character according to the rules of the
                      bracketed character class defined by the "...".
                      Example: [a-z] matches "a" or "b" or "c" ... or "z"
     [[:...:]] [2]  Match a character according to the rules of the POSIX
                      character class "..." within the outer bracketed
                      character class. Example: [[:upper:]] matches any
                      uppercase character.
     (?[...])  [8]  Extended bracketed character class
     \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                      other connector punctuation chars plus Unicode
                      marks)
     \W        [3]  Match a non-"word" character
     \s        [3]  Match a whitespace character
     \S        [3]  Match a non-whitespace character
     \d        [3]  Match a decimal digit character
     \D        [3]  Match a non-digit character
     \pP       [3]  Match P, named property. Use \p{Prop} for longer names
     \PP       [3]  Match non-P
     \X        [4]  Match Unicode "eXtended grapheme cluster"
     \1        [5]  Backreference to a specific capture group or buffer.
                      '1' may actually be any positive integer.
     \g1       [5]  Backreference to a specific or previous group,
     \g{-1}    [5]  The number may be negative indicating a relative
                      previous group and may optionally be wrapped in
                      curly brackets for safer parsing.
     \g{name}  [5]  Named backreference
     \k<name>  [5]  Named backreference
     \K        [6]  Keep the stuff left of the \K, don't include it in $&
     \N        [7]  Any character but \n. Not affected by /s modifier
     \v        [3]  Vertical whitespace
     \V        [3]  Not vertical whitespace