mismatched YYBACKUP and YYRESTORE #284

fanf2 opened this issue May 30, 2020 · 6 comments

fanf2 commented May 30, 2020

Hello! I'm trying to use re2c to improve the lexer in unifdef. I
really like the way re2c works, but I have run into a problem, and I'm
not sure if I have made a mistake or misunderstood something or if it
is a bug in re2c.

When I run the program below, the lexer generated by re2c calls
YYRESTORE() without previously calling YYBACKUP(). I thought that
this should not happen, because all the examples in the documentation
start the lexer with YYMARKER uninitialized. I cautiously initialize
YYMARKER to NULL, which means that after a mismatched YYRESTORE()
the lexer tries to dereference a NULL pointer.
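
One way to make a mismatched restore visible without dereferencing NULL is to route the hooks through small checked helpers. The sketch below is an assumption for illustration, not code from the reproducer: `struct lexer_state`, `lex_backup` and `lex_restore` are hypothetical names that could sit behind `YYBACKUP()` and `YYRESTORE()` when building with `--input custom`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lexer state; the names are illustrative, not part of re2c. */
struct lexer_state {
    const char *cursor;  /* plays the role of YYCURSOR */
    const char *marker;  /* plays the role of YYMARKER; NULL means "no backup yet" */
    int mismatches;      /* number of restores seen without a prior backup */
};

/* Intended to be called from YYBACKUP(): remember the current position. */
static void lex_backup(struct lexer_state *s) {
    assert(s->cursor != NULL);
    s->marker = s->cursor;
}

/* Intended to be called from YYRESTORE(): go back to the saved position.
 * If no backup was made, count the mismatch and leave the cursor alone
 * instead of copying a NULL marker into it. */
static void lex_restore(struct lexer_state *s) {
    if (s->marker == NULL) {
        s->mismatches++;
        return;
    }
    s->cursor = s->marker;
}
```

With a state variable `st` in scope, the hooks could then be defined as `#define YYBACKUP() lex_backup(&st)` and `#define YYRESTORE() lex_restore(&st)`. A non-zero `mismatches` count after lexing would flag the failure described above without crashing.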

The same thing happens with both re2c 1.2 and 1.3.

I have tried this program with and without --input custom and it
seems to behave the same in both cases. However, I'm not confident
that it is compiled correctly when it is built without --input custom,
because in that case I can't protect against undefined behaviour from
dereferencing NULL.

I tried to initialize YYMARKER to the same as YYCURSOR instead of NULL,
and that causes the complete version of my lexer to go into a loop.

I found this problem by testing my rather large lexer with LLVM
libfuzzer. I have tried to find something close to a minimal example
that reproduces the failure mode.

Here is the source code (called

#include <stdio.h>

#define YYCTYPE		char

#define myreturn()	return(YYDEBUG(6666,0), YYCURSOR == 0)

#define V const void *
#define YYDEBUG(st, ch)							\
	printf("re2c base %p cursor %p limit %p "			\
	       "accept %d state %d char %c \n",				\
	       (V)ptr, (V)YYCURSOR, (V)YYLIMIT,				\
	       yyaccept, st, ch)

#define yyaccept 0

int main(void) {
	const char *ptr = "/*";
	const char *YYCURSOR = ptr;
	const char *YYLIMIT  = ptr + 2;
	const char *YYMARKER = NULL;

/*!re2c	re2c:flags:input = custom;
        re2c:yyfill:enable = 0;
        re2c:eof = 0;

$	{ myreturn(); }

*	{ myreturn(); }

"%:%:"	{ myreturn(); }

"/*"([^*]*[*]+[^/*])*[^*]*[*]+[/]	{ myreturn(); }


And here is the script I use to build and run it:

#!/bin/sh -x

re2c -W -Werror --debug-output -o 2020-05-30.c &&
cc -Wall -Wextra -Werror -O2 -g -o 2020-05-30 2020-05-30.c &&
./2020-05-30 ||

The output from the debug printf()s does not contain 8888 (YYBACKUP) before 7777 (YYRESTORE):

re2c base 0x10bc48f72 cursor 0x10bc48f72 limit 0x10bc48f74 accept 0 state 0 char / 
re2c base 0x10bc48f72 cursor 0x10bc48f72 limit 0x10bc48f74 accept 0 state 5 char / 
re2c base 0x10bc48f72 cursor 0x10bc48f73 limit 0x10bc48f74 accept 0 state 8 char * 
re2c base 0x10bc48f72 cursor 0x10bc48f74 limit 0x10bc48f74 accept 0 state 9 char  
re2c base 0x10bc48f72 cursor 0x10bc48f74 limit 0x10bc48f74 accept 0 state 9999 char  
re2c base 0x10bc48f72 cursor 0x10bc48f74 limit 0x10bc48f74 accept 0 state 7 char  
re2c base 0x10bc48f72 cursor 0x10bc48f74 limit 0x10bc48f74 accept 0 state 7777 char  
re2c base 0x10bc48f72 cursor 0x0 limit 0x10bc48f74 accept 0 state 3 char  
re2c base 0x10bc48f72 cursor 0x0 limit 0x10bc48f74 accept 0 state 6666 char  

skvadrik commented May 30, 2020

Hi! This is a bug that has been fixed by commit 9bb515e. Thanks for reporting. Your understanding that YYRESTORE should not happen without a previous YYBACKUP is correct. And you don't need to initialize YYMARKER; a re2c-generated lexer should always take care of that.


fanf2 commented May 30, 2020

Wow, thanks for the super fast response!

I've tried building re2c from the tip of the git master branch, and I'm running libfuzzer against my program, and it is not crashing or hanging. Much better, thanks!

Is there a particular git revision of re2c that I should use (or avoid)? And is there likely to be a new release soon?


skvadrik commented May 31, 2020

The current git state should be ok (no particular revision). The release is likely to be within a month. Code-wise it is almost ready, but there is a lot of documentation work to do (this release adds a new language backend for golang).


skvadrik commented May 31, 2020

By the way, is your program open-source? Would it make sense to add a test for it in re2c? It is always good to have real-world tests, even if they get out of sync with the original program.


fanf2 commented May 31, 2020

Awesome, thanks, I'll continue using a version from git and I'll keep an eye out for the new release. Sounds like it might be popular :-)

Please feel free to use the code in this issue under CC0

The program it came from will also be CC0 but is currently unfinished work in progress. I think the re2c part is mostly done - it has about 120-130 lines of patterns. I'm using the YYPEEK / YYSKIP hooks to handle backslash-newline escapes (they can happen in the middle of any token which is a massive pain but re2c has made it much easier to handle). I'll see if I can extract the relevant parts to make a pull request under re2c/test/real_world/.
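
To illustrate the backslash-newline idea, here is a minimal sketch of peek and skip helpers that treat a backslash immediately followed by a newline as invisible. It is an assumption about how such hooks could look, not the actual unifdef code: `skip_splices`, `peek_logical` and `skip_logical` are hypothetical names, and only the two-byte `\` LF form is handled (not `\` CR LF).

```c
#include <assert.h>
#include <stddef.h>

/* Step over any run of backslash-newline pairs starting at p and return
 * the first position that is not the start of such a pair. */
static const char *skip_splices(const char *p, const char *limit) {
    assert(p <= limit);
    while (limit - p >= 2 && p[0] == '\\' && p[1] == '\n')
        p += 2;
    return p;
}

/* YYPEEK-like helper: the next logical character, or 0 at end of input. */
static char peek_logical(const char *cursor, const char *limit) {
    const char *p = skip_splices(cursor, limit);
    return p < limit ? *p : 0;
}

/* YYSKIP-like helper: advance past one logical character, including any
 * backslash-newline pairs in front of it. Does not move past limit. */
static const char *skip_logical(const char *cursor, const char *limit) {
    const char *p = skip_splices(cursor, limit);
    return p < limit ? p + 1 : p;
}
```

With `--input custom`, these could be wired in along the lines of `#define YYPEEK() peek_logical(YYCURSOR, YYLIMIT)` and `#define YYSKIP() (YYCURSOR = skip_logical(YYCURSOR, YYLIMIT))`, so that a comment opener split as `/`, backslash, newline, `*` still reaches the lexer as `/*`.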

The libfuzzer hook is almost trivial: it just tries to lex the raw junk from the fuzzer :-)
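
A hook of that kind might look roughly like the sketch below. `LLVMFuzzerTestOneInput` is the entry point libFuzzer calls; `lex_one_token` is a hypothetical stand-in for the re2c-generated lexer (here it just consumes one byte), so treat the whole thing as an assumption about the shape rather than the real harness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the generated lexer: consumes one token and
 * returns its length in bytes. This toy version treats every byte as a
 * token; a real lexer would run the re2c-generated code here. */
static size_t lex_one_token(const uint8_t *cursor, const uint8_t *limit) {
    (void)cursor;
    (void)limit;
    return 1;
}

/* libFuzzer entry point: lex whatever bytes the fuzzer hands over. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size == 0)
        return 0;
    const uint8_t *cursor = data;
    const uint8_t *limit = data + size;
    while (cursor < limit) {
        size_t n = lex_one_token(cursor, limit);
        /* Turn "no progress" or an overrun into a crash the fuzzer can
         * report, instead of an endless loop or a silent out-of-bounds read. */
        assert(n > 0 && n <= (size_t)(limit - cursor));
        cursor += n;
    }
    return 0;
}
```

Built with something like `clang -g -fsanitize=fuzzer,address`, a harness of this shape would report both crashes (such as the NULL dereference) and, through the progress assertion, lexers that stop advancing.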


skvadrik commented May 31, 2020

Thank you!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Sep 20, 2020
2.0.3 (2020-08-22)

- Fix issues when building re2c as a CMake subproject
  (`#302 <>`_:

- Final corrections in the SIMPA article "RE2C: A lexer generator based on

2.0.2 (2020-08-08)

- Enable re2go building by default.

- Package CMake files into release tarball.

2.0.1 (2020-07-29)

- Updated version for CMake build system (forgotten in release 2.0).

- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)

- Added new code generation backend for Go and a new ``re2go`` program
  (`#272 <>`_: Go support).
  Added option ``--lang <c | go>``.

- Added CMake build system as an alternative to Autotools
  (`#275 <>`_:
  Add a CMake build system (thanks to ligfx),
  `#244 <>`_: Switching to CMake).

- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG``
    that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block
    (`#291 <>`_:
    Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode
    (`#263 <>`_:
    allow mixing conditional and non-conditional blocks with -c,
    `#296 <>`_:
    Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block
    (`#295 <>`_:
    Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.

- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration:
  filename is now relative to the output file directory.

- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.

- Extended fixed tags optimization for the case of fixed-counter repetition.

- Fixed bugs related to EOF rule:

  + `#276 <>`_:
    Example in docs is broken
  + `#280 <>`_:
    EOF rules with multiple blocks
  + `#284 <>`_:
    mismatched YYBACKUP and YYRESTORE
    (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <>`_:
    Incorrect submatch values with fixed-length trailing context.
  + `#297 <>`_:
    configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to
  re2c executable to regenerate the lexers).

- Added internal options ``--posix-prectable <naive | complex>``.

- Added debug option ``--dump-dfa-tree``.

- Major revision of the paper "Efficient POSIX submatch extraction on NFA".


1.3 (2019-12-14)

- Added option: ``--stadfa``.

- Added warning: ``-Wsentinel-in-midrule``.

- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds
  (`#258 <>`_:
  Make the build reproducible).


1.2.1 (2019-08-11)

- Fixed bug `#253 <>`_:
  re2c should install somewhere.

- Fixed bug `#254 <>`_:
  Turn off re2c:eof = 0.

1.2 (2019-08-02)

- Added EOF rule ``$`` and configuration ``re2c:eof``.

- Added ``/*!include:re2c ... */`` directive and ``-I`` option.

- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.

- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <>`_:
    Handle non-ASCII encoded characters in regular expressions
  + `#250 <>`_
    UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <>`_:
    Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <>`_:
    Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits
  without errors.

- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.

- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``,
  ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <>`_:
    allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <>`_:
    re2c option -F (flex syntax) broken
  + `#242 <>`_:
    Operator precedence with --flex-syntax is broken

- Changed difference operator ``/`` to apply before encoding expansion of

  + `#236 <>`_:
    Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <>`_:
    re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA"
  together with Dr Angelo Borsotti.

- Added experimental libre2c library (``--enable-libs`` configure option) with
  the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms
    used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <>`_,
    `#227 <>`_,
    `#228 <>`_,
    `#231 <>`_,
    `#232 <>`_,
    `#233 <>`_,
    `#234 <>`_,
    `#238 <>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <>`_:
    big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <>`_: abort
  + `#6 <>`_: segfault
  + `#10 <>`_:
    lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <>`_:
    Access violation when translating the attached file
  + `#49 <>`_:
    wildcard state \000 rules makes lexer behave weard
  + `#98 <>`_:
    Transparent handling of #line directives in input files
  + `#104 <>`_:
    Improve const-correctness
  + `#105 <>`_:
    Conversion of pointer parameters into references
  + `#114 <>`_:
    Possibility of fixing bug 2535084
  + `#120 <>`_:
    condition consisting of default rule only is ignored
  + `#167 <>`_:
    Add word boundary support
  + `#168 <>`_:
    Wikipedia's article on re2c
  + `#180 <>`_:
    Comment syntax?
  + `#182 <>`_:
    yych being set by YYPEEK () and then not used
  + `#196 <>`_:
    Implicit type conversion warnings
  + `#198 <>`_:
    no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <>`_:
    How to build re2c in windows?
  + `#215 <>`_:
    A memory read overrun issue in
  + `#220 <>`_:
    src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <>`_:
    Fix typo
  + `#224 <>`_:
    src/dfa/ pack() tweaks
  + `#225 <>`_:
    Documentation link is broken in libre2c/README
  + `#230 <>`_:
    Changes for upcoming Travis' infra migration
  + `#239 <>`_:
    Push model example has wrong re2c invocation, breaks guide
  + `#241 <>`_:
    Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <>`_:
    A code generated for period (.) requires 4 bytes
  + `#246 <>`_:
    Please add a license to this repo
  + `#247 <>`_:
    Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <>`_:
    distcheck still looks for README
  + `#251 <>`_:
    Including what you use is find, but not without inclusion guards

- Updated documentation and website.