Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle non-ASCII encoded characters in regular expressions. #237

Closed
terpstra opened this issue Dec 25, 2018 · 5 comments
Closed

Handle non-ASCII encoded characters in regular expressions. #237

terpstra opened this issue Dec 25, 2018 · 5 comments

Comments

@terpstra
Copy link

For example,
[*\xD7] { return op_type(10, 1); } // multiply
is not the same as
[*×] { return op_type(10, 1); }

The later is clearly more reader-friendly.

@terpstra terpstra changed the title Unicode characters in classes does not work as expected Unicode characters in classes do not work as expected Dec 25, 2018
@skvadrik
Copy link
Owner

Hmm... at first I confused Unicode × with ASCII x.

As of now, re2c expects input file in ASCII. It will parse and compile non-ASCII input as well (e.g. UTF-8), but it will treat it a a stream of bytes, so [×] will be processed as a 2-charachter range [\xC3\x97]. If the same character occurred in the part of the code not transformed by re2c, it would be pasted verbatim into the output file, so that the next tool in the pipeline (presumably C/C++ compiler) would see it undamaged.

The problem with handling non-ASCII input is not in decoding it (re2c supports that already), but to decide on the following questions: (1) how to guess the input encoding, and (2) which parts of the syntax should allow non-ASCII characters (it probably doesn't make sense to allow Unicode spaces between rules, for example).

@skvadrik skvadrik changed the title Unicode characters in classes do not work as expected Handle non-ASCII encoded characters in regular expressions. Dec 25, 2018
@terpstra
Copy link
Author

(1) Why do you need to guess the input encoding? Users would be happy to tell re2c on the command-line.
(2) I can't imagine wanting this for anything other than inside ""s or []s.

@terpstra
Copy link
Author

terpstra commented Dec 26, 2018

If I understand correctly, when I put "×"as a string into a re2c input file right now, it is only working because both my re2c input file encoding is utf-8 and my generated parser is utf-8. If they did not match, re2c would do the wrong thing? ie: strings are correct by coincidence, while []s are always broken (because the code point is treated as it's elements).

@skvadrik
Copy link
Owner

(1) Why do you need to guess the input encoding? Users would be happy to tell re2c on the command-line.
(2) I can't imagine wanting this for anything other than inside ""s or []s.

Ok, makes sense.

If I understand correctly, when I put "×"as a string into a re2c input file right now, it is only working because both my re2c input file encoding is utf-8 and my generated parser is utf-8. If they did not match, re2c would do the wrong thing? ie: strings are correct by coincidence, while []s are always broken (because the code point is treated as it's elements).

No. Both strings and classes "don't work", meaning that they are not handled like you expect. Let's see in detail what happens with character classes, and then with strings. First, re2c has to parse input. Parser is encoding-insensitive: it handles everything as raw bytes (ASCII extended to 8-bit range). When you write [*×], parser sees this byte sequence (assuming that your input file is UTF-8 encoded, as on my system):

$ echo -n [*×] | hexdump -C
00000000  5b 2a c3 97 5d                                    |[*..]|

So, the parser sees opening bracket 5b and understands that this is the beginning of a character class. In this mode it can recognize certain escape sequences starting with \. Everything else is treated as a standalone code point, so we get a range consisting of three code points: 2a, c3 and 97. Next comes the closing bracket 5d.

After parsing, re2c performs encoding-sensitive transformation of code points into sequences of code units. This is where -8 option first comes into play. Without it, re2c would assume 8-bit ASCII and transform code points 2a, c3 and 97 into code units 2a, c3 and 97, producing this lexer:

$ echo '/*!re2c [*×] {} */' | ./re2c -i -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:39:01 2018 */

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':
        case 0x97:
        case 0xC3:      goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
}

Now that you have set -8, re2c treats each code point as a Unicode symbol, so 2a is +U002a (*, star), c3 is +U00c3 (Ã, Latin capital letter A with tilde) and 97 is +U0097 (control). Next, -8 option tells re2c to encode these symbols in UTF-8, so 2a becomes 2a, c3 becomes c3 83 and 97 becomes c2 97. The resulting lexer is:

$ echo '/*!re2c [*×] {} */' | ./re2c -i8 -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:38:25 2018 */

{
        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':       goto yy3;
        case 0xC2:      goto yy5;
        case 0xC3:      goto yy6;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
yy5:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x97:      goto yy3;
        default:        goto yy2;
        }
yy6:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x83:      goto yy3;
        default:        goto yy2;
        }
}

This is totally different from what you meant with [*×], namely a range consisting of code points 2a and d7 (multiplication sign), corresponding to this lexer:

$ echo '/*!re2c [*\xd7] {} */' | ./re2c -i8 -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:37:29 2018 */

{
        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':       goto yy3;
        case 0xC3:      goto yy5;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
yy5:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x97:      goto yy3;
        default:        goto yy2;
        }
}

A similar thing happens with strings: if you write "×", it will be transformed to code point sequence c3 2a, which will be further UTF-8 encoded to 4-byte string c3 83 c2 97, while you expected code point sequence d7 and 2-byte string c3 97.

Neither classes not strings are broken: they behave as they are supposed to, but I surely agree that the default behaviour is confusing.

@skvadrik skvadrik mentioned this issue May 22, 2019
@skvadrik
Copy link
Owner

@terpstra Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with option --input-encoding utf8. See #250.

netbsd-srcmastr referenced this issue in NetBSD/pkgsrc Sep 20, 2020
2.0.3 (2020-08-22)
~~~~~~~~~~~~~~~~~~

- Fix issues when building re2c as a CMake subproject
  (`#302 <https://github.com/skvadrik/re2c/pull/302>`_:

- Final corrections in the SIMPA article "RE2C: A lexer generator based on
  lookahead-TDFA", https://doi.org/10.1016/j.simpa.2020.100027

2.0.2 (2020-08-08)
~~~~~~~~~~~~~~~~~~

- Enable re2go building by default.

- Package CMake files into release tarball.

2.0.1 (2020-07-29)
~~~~~~~~~~~~~~~~~~

- Updated version for CMake build system (forgotten in release 2.0).

- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)
~~~~~~~~~~~~~~~~

- Added new code generation backend for Go and a new ``re2go`` program
  (`#272 <https://github.com/skvadrik/re2c/issues/272>`_: Go support).
  Added option ``--lang <c | go>``.

- Added CMake build system as an alternative to Autotools
  (`#275 <https://github.com/skvadrik/re2c/pull/275>`_:
  Add a CMake build system (thanks to ligfx),
  `#244 <https://github.com/skvadrik/re2c/issues/244>`_: Switching to CMake).

- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG``
    that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block
    (`#291 <https://github.com/skvadrik/re2c/issues/291>`_:
    Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode
    (`#263 <https://github.com/skvadrik/re2c/issues/263>`_:
    allow mixing conditional and non-conditional blocks with -c,
    `#296 <https://github.com/skvadrik/re2c/issues/296>`_:
    Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block
    (`#295 <https://github.com/skvadrik/re2c/issues/295>`_:
    Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.

- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration:
  filename is now relative to the output file directory.

- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.

- Extended fixed tags optimization for the case of fixed-counter repetition.

- Fixed bugs related to EOF rule:

  + `#276 <https://github.com/skvadrik/re2c/issues/276>`_:
    Example 01_fill.re in docs is broken
  + `#280 <https://github.com/skvadrik/re2c/issues/280>`_:
    EOF rules with multiple blocks
  + `#284 <https://github.com/skvadrik/re2c/issues/284>`_:
    mismatched YYBACKUP and YYRESTORE
    (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <https://github.com/skvadrik/re2c/issues/286>`_:
    Incorrect submatch values with fixed-length trailing context.
  + `#297 <https://github.com/skvadrik/re2c/issues/297>`_:
    configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to
  re2c executable to regenerate the lexers).

- Added internal options ``--posix-prectable <naive | complex>``.

- Added debug option ``--dump-dfa-tree``.

- Major revision of the paper "Efficient POSIX submatch extraction on NFA".

----
1.3x
----

1.3 (2019-12-14)
~~~~~~~~~~~~~~~~

- Added option: ``--stadfa``.

- Added warning: ``-Wsentinel-in-midrule``.

- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds
  (`#258 <https://github.com/skvadrik/re2c/pull/258>`_:
  Make the build reproducible).

----
1.2x
----

1.2.1 (2019-08-11)
~~~~~~~~~~~~~~~~~~

- Fixed bug `#253 <https://github.com/skvadrik/re2c/issues/253>`_:
  re2c should install unicode_categories.re somewhere.

- Fixed bug `#254 <https://github.com/skvadrik/re2c/issues/254>`_:
  Turn off re2c:eof = 0.

1.2 (2019-08-02)
~~~~~~~~~~~~~~~~

- Added EOF rule ``$`` and configuration ``re2c:eof``.

- Added ``/*!include:re2c ... */`` directive and ``-I`` option.

- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.

- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <https://github.com/skvadrik/re2c/issues/237>`_:
    Handle non-ASCII encoded characters in regular expressions
  + `#250 <https://github.com/skvadrik/re2c/issues/250>`_
    UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <https://github.com/skvadrik/re2c/issues/235>`_:
    Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <https://github.com/skvadrik/re2c/issues/195>`_:
    Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits
  without errors.

- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.

- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``,
  ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <https://github.com/skvadrik/re2c/issues/55>`_:
    allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <https://github.com/skvadrik/re2c/issues/229>`_:
    re2c option -F (flex syntax) broken
  + `#242 <https://github.com/skvadrik/re2c/issues/242>`_:
    Operator precedence with --flex-syntax is broken

- Changed difference operator ``/`` to apply before encoding expansion of
  operands.

  + `#236 <https://github.com/skvadrik/re2c/issues/236>`_:
    Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <https://github.com/skvadrik/re2c/issues/245>`_:
    re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA"
  together with Dr Angelo Borsotti.

- Added experimental libre2c library (``--enable-libs`` configure option) with
  the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug
  options:

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms
    used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <https://github.com/skvadrik/re2c/issues/226>`_,
    `#227 <https://github.com/skvadrik/re2c/issues/227>`_,
    `#228 <https://github.com/skvadrik/re2c/issues/228>`_,
    `#231 <https://github.com/skvadrik/re2c/issues/231>`_,
    `#232 <https://github.com/skvadrik/re2c/issues/232>`_,
    `#233 <https://github.com/skvadrik/re2c/issues/233>`_,
    `#234 <https://github.com/skvadrik/re2c/issues/234>`_,
    `#238 <https://github.com/skvadrik/re2c/issues/238>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <https://github.com/skvadrik/re2c/issues/221>`_:
    big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <https://github.com/skvadrik/re2c/issues/2>`_: abort
  + `#6 <https://github.com/skvadrik/re2c/issues/6>`_: segfault
  + `#10 <https://github.com/skvadrik/re2c/issues/10>`_:
    lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <https://github.com/skvadrik/re2c/issues/44>`_:
    Access violation when translating the attached file
  + `#49 <https://github.com/skvadrik/re2c/issues/49>`_:
    wildcard state \000 rules makes lexer behave weard
  + `#98 <https://github.com/skvadrik/re2c/issues/98>`_:
    Transparent handling of #line directives in input files
  + `#104 <https://github.com/skvadrik/re2c/issues/104>`_:
    Improve const-correctness
  + `#105 <https://github.com/skvadrik/re2c/issues/105>`_:
    Conversion of pointer parameters into references
  + `#114 <https://github.com/skvadrik/re2c/issues/114>`_:
    Possibility of fixing bug 2535084
  + `#120 <https://github.com/skvadrik/re2c/issues/120>`_:
    condition consisting of default rule only is ignored
  + `#167 <https://github.com/skvadrik/re2c/issues/167>`_:
    Add word boundary support
  + `#168 <https://github.com/skvadrik/re2c/issues/168>`_:
    Wikipedia's article on re2c
  + `#180 <https://github.com/skvadrik/re2c/issues/180>`_:
    Comment syntax?
  + `#182 <https://github.com/skvadrik/re2c/issues/182>`_:
    yych being set by YYPEEK () and then not used
  + `#196 <https://github.com/skvadrik/re2c/issues/196>`_:
    Implicit type conversion warnings
  + `#198 <https://github.com/skvadrik/re2c/issues/198>`_:
    no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <https://github.com/skvadrik/re2c/issues/210>`_:
    How to build re2c in windows?
  + `#215 <https://github.com/skvadrik/re2c/issues/215>`_:
    A memory read overrun issue in s_to_n32_unsafe.cc
  + `#220 <https://github.com/skvadrik/re2c/issues/220>`_:
    src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <https://github.com/skvadrik/re2c/issues/223>`_:
    Fix typo
  + `#224 <https://github.com/skvadrik/re2c/issues/224>`_:
    src/dfa/closure_posix.cc: pack() tweaks
  + `#225 <https://github.com/skvadrik/re2c/issues/225>`_:
    Documentation link is broken in libre2c/README
  + `#230 <https://github.com/skvadrik/re2c/issues/230>`_:
    Changes for upcoming Travis' infra migration
  + `#239 <https://github.com/skvadrik/re2c/issues/239>`_:
    Push model example has wrong re2c invocation, breaks guide
  + `#241 <https://github.com/skvadrik/re2c/issues/241>`_:
    Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <https://github.com/skvadrik/re2c/issues/243>`_:
    A code generated for period (.) requires 4 bytes
  + `#246 <https://github.com/skvadrik/re2c/issues/246>`_:
    Please add a license to this repo
  + `#247 <https://github.com/skvadrik/re2c/issues/247>`_:
    Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <https://github.com/skvadrik/re2c/issues/248>`_:
    distcheck still looks for README
  + `#251 <https://github.com/skvadrik/re2c/issues/251>`_:
    Including what you use is find, but not without inclusion guards

- Updated documentation and website.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants