\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@setfilename flex.info
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "User Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex). Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header
@copying
The flex manual is placed under the same licensing conditions as the
rest of flex:
Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.
Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.
This code is derived from software contributed to Berkeley by
Vern Paxson.
The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
@enumerate
@item
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate
Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying
@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex
This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.
This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.
This manual was written by @value{authors}.
@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Controls::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::
@detailmenu
--- The Detailed Node Listing ---
Format of the Input File
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
Scanner Options
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
Reentrant C Scanners
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
The Reentrant API in Detail
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
Memory Management
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
Serialized Tables
* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::
FAQ
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make yyreject() cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NUL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* yyunput() messes up yyatbol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
Appendices
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
* Adding More Target Languages::
Indices
* Concept Index::
* Index of Functions::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright
@cindex copyright of flex
@cindex distributing flex
@insertcopying
@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs
@cindex bugs, reporting
@cindex reporting bugs
If you find a bug in @code{flex}, please report it using
GitHub's issue tracking facility at @url{https://github.com/westes/flex/issues/}.
@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction
@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
program which recognizes lexical patterns in text. The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate. The
description is in the form of pairs of regular expressions and
fragments of source code
called @dfn{rules}. @code{flex} generates as output a source file
in your target language which defines a routine @code{yylex()}.
This file can be compiled and (if you are using the C/C++ back end)
optionally linked with the flex runtime library to
produce an executable. When the executable is run, it analyzes its
input for occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding rule code.
When your target language is C, the name of the generated scanner is
@file{lex.yy.c} by default. Other languages will glue the suffix they
normally use for source-code files to the prefix @file{lex.yy}.
The examples in this manual are in C, which is Flex's default target
language; until release 2.6.4 it was the only one.
@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples
First some simple examples to get the flavor of how one uses
@code{flex}.
@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username} will replace it with the user's
login name:
@example
@verbatim
%%
username printf( "%s", getlogin() );
@end verbatim
@end example
@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded. In this
input, there is just one rule. @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
beginning of the rules.
Here's another simple example:
@cindex counting characters and lines; reentrant
@example
@verbatiminclude example_r.lex
@end example
If you have looked at older versions of the Flex manual, you might
have seen a version of the above example that looked more like this:
@cindex counting characters and lines; non-reentrant
@example
@verbatiminclude example_nr.lex
@end example
Both versions count the number of characters and the number of lines in
their input. Neither produces any output other than the final report on the
character and line counts. The first code line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}. There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).
The difference between these two variants is that the first uses
Flex's @emph{reentrant} interface, which bundles the scanner state
into a @code{yyscan_t} structure; the second uses the @emph{non-reentrant}
interface, in which the scanner's state is exposed through global
variables.
The non-reentrant interface is a relic from the mid-1970s, when Lex,
the ancestor of Flex, was designed. Modern programming practice frowns
on hidden global variables; thus when Flex generates a scanner in any
language other than the original C/C++, non-reentrancy is not even an
option. Most likely it will make you some kind of scanner class
that you instantiate, with methods and fields rather than exposed globals.
Thus it's a good idea to get used to not relying on the exposed
globals of the original interface from the beginning of your Flex
programming. This is so even though the reentrant example above is a
rather poor one; it avoids exposing the scanner state in globals but
creates globals of its own. There is a mechanism for including
user-defined fields in the scanner structure which will be explained
in detail in @ref{Extra Data}. For now, consider this:
@example
@verbatiminclude example_er.lex
@end example
While it requires a bit more ceremony, several instances of this
scanner can be run concurrently without stepping on each other's
storage.
(The @code{%option noyywrap} in these examples is helpful in
making them run standalone, but does not change the behavior of the scanner.)
A somewhat more complicated example:
@cindex Pascal-like language
@example
@verbatim
/* scanner for a toy Pascal-like language */
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf( "An integer: %s (%d)\n", yytext,
atoi( yytext ) );
}
{DIGIT}+"."{DIGIT}* {
printf( "A float: %s (%g)\n", yytext,
atof( yytext ) );
}
if|then|begin|end|procedure|function {
printf( "A keyword: %s\n", yytext );
}
{ID} printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
"{"[^{}\n]*"}" /* eat up one-line comments */
[ \t\n]+ /* eat up whitespace */
. printf( "Unrecognized character: %s\n", yytext );
%%
int main( int argc, char **argv ) {
++argv, --argc; /* skip over program name */
if ( argc > 0 ) {
yyin = fopen( argv[0], "r" );
} else {
yyin = stdin;
}
yylex();
}
@end verbatim
@end example
This is the beginnings of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.
The details of this example will be explained in the following
sections.
@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File
@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input
The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.
@cindex format of input file
@example
@verbatim
definitions
%%
rules
%%
user code
@end verbatim
@end example
@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu
@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section
@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.
@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:
@example
@verbatim
name definition
@end verbatim
@end example
The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash). The definition is taken to begin at the first
non-whitespace character following the name and continuing to the end of
the line. The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}. For example,
@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
DIGIT [0-9]
ID [a-z][a-z0-9]*
@end verbatim
@end example
defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to
@cindex pattern aliases, use of
@example
@verbatim
{DIGIT}+"."{DIGIT}*
@end verbatim
@end example
is identical to
@example
@verbatim
([0-9])+"."([0-9])*
@end verbatim
@end example
and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.
@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.
@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves.
@cindex %top
A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually, in the
C/C++ back end,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want definitions to be
evaluated or certain files to be included before the generated code.
The single characters, @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:
@example
@verbatim
%top{
/* This code goes at the "top" of the generated file. */
#include <stdint.h>
#include <inttypes.h>
}
@end verbatim
@end example
Multiple @code{%top} blocks are allowed, and their order is preserved.
@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section
@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:
@example
@verbatim
pattern action
@end verbatim
@end example
where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.
In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
Posix}, for other such features).
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves. Because whitespace is easy to mangle without noticing,
it's good style to use the explicit %@{ and %@} delimiters.
@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section
@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
@node Comments in the Input, , User Code Section, Format
@section Comments in the Input
@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment in the parts of the file Flex
interprets. Whenever flex encounters a comment, it copies the entire
comment verbatim to the generated source code. Comments may appear
just about anywhere, but with the following exceptions:
@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize
If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.
All the comments in the following example are valid:
@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* C code block - other target languages might have different comment syntax */
%}
/* Definitions Section */
%x STATE_X
%%
/* Rules Section */
ruleA /* after regex */ { /* C code block */ } /* after code block */
/* Rules Section (indented) */
<STATE_X>{
ruleC yyecho();
ruleD yyecho();
%{
/* C code block */
%}
}
%%
/* User C Code Section */
@end verbatim
@end example
If the target language is something other than C/C++, you will need to use
its normal comment syntax in actions and code blocks. Note that the
optional @{ and @} delimiters around actions are Flex syntax, not C
syntax; you will be able to use those even if, e.g., your target
language is Pascal-like and delimits blocks with begin/end.
@node Patterns, Matching, Format, Top
@chapter Patterns
@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions. These are:
@cindex patterns, syntax
@table @samp
@item x
match the character 'x'
@item .
any character (byte) except newline
@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'
@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'
@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.
@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline
@item [a-z]@{-@}[aeiou]
the lowercase consonants
@item r*
zero or more r's, where r is any regular expression
@item r+
one or more r's
@item r?
zero or one r's (that is, ``an optional r'')
@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's
@item r@{2,@}
two or more r's
@item r@{4@}
exactly 4 r's
@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).
@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}
@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\X}. Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})
@cindex NUL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)
@cindex octal characters in patterns
@item \123
the character with octal value 123
@item \x2a
the character with hexadecimal value 2a
@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)
@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.
@samp{i} means case-insensitive. @samp{-i} means case-sensitive.
@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.
@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.
The following are all valid:
@verbatim
(?:foo) same as (foo)
(?i:ab7) same as ([aA][bB]7)
(?-i:ab) same as (ab)
(?s:.) same as [\x00-\xFF]
(?-s:.) same as [^\n]
(?ix-s: a . b) same as ([Aa][^\n][bB])
(?x:a b) same as ("ab")
(?x:a\ b) same as ("a b")
(?x:a" "b) same as ("a b")
(?x:a[ ]b) same as ("a b")
(?x:a
/* comment */
b
c) same as (abc)
@end verbatim
@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.
@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}
@item r|s
either an @samp{r} or an @samp{s}
@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed. So the action
only sees the text matched by @samp{r}. This type of pattern is called
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)
@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline). Equivalent to @samp{r/\n}.
@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.
@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).
@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.
@item <*>r
an @samp{r} in any start condition, even an exclusive one.
@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.
@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table
Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.
Additionally, @samp{-} and @samp{]} lose their special meaning if they
immediately follow the @samp{[} or @samp{[^} that start the class. Finally,
@samp{-} loses its special meaning if it immediately precedes the @samp{]}
that ends the class.
@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option). For example,
@cindex patterns, grouping and precedence
@example
@verbatim
foo|bar*
@end verbatim
@end example
is the same as
@example
@verbatim
(foo)|(ba(r*))
@end verbatim
@end example
since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}). This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:
@example
@verbatim
foo|(bar)*
@end verbatim
@end example
And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:
@cindex patterns, repetitions with grouping
@example
@verbatim
(foo|bar)*
@end verbatim
@end example
@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}. These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class. Other elements may occur inside the character class,
too). The valid expressions are:
@cindex patterns, valid character classes
@example
@verbatim
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
@end verbatim
@end example
These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function. For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character. Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.
For example, the following character classes are all equivalent:
@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
@end verbatim
@end example
A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.
@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.
@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding). When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled: