Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UTF-8 decode for 4bytes rune #307

Merged
merged 2 commits into from Sep 14, 2020
Merged

Fix UTF-8 decode for 4bytes rune #307

merged 2 commits into from Sep 14, 2020

Conversation

s-yasu
Copy link
Contributor

@s-yasu s-yasu commented Sep 13, 2020

Masked a character with a bit at 0x0F0000 position with MAX_4BYTE_RUNE (0x10FFFF) that bit was missing.
Such as 𤰖 (0x024C16).

@skvadrik
Copy link
Owner

Hi @s-yasu, thanks for finding this nasty bug!

I think your patch can be simplified a bit, in fact all that is needed is to change the value of MAX_4BYTE_RUNE from 0x10FFFFu to 0x1FFFFFu. The algorithm in re2c is an adaptation of re2 decoder with some validity checks thrown away (as re2c validates the input when lexing it), so I'd rather minimize the changes to it.

-const utf8::rune utf8::MAX_4BYTE_RUNE = 0x10FFFFu;
+const utf8::rune utf8::MAX_4BYTE_RUNE = 0x1FFFFFu;

After this change, the test passes with your newly added testcases.

Also, please can you can you return the assert in the test, as it is easy to overlook a non-zero return code, and much harder to overlook an assertion failure. Thanks again for the fix!

@s-yasu
Copy link
Contributor Author

s-yasu commented Sep 13, 2020

I think so that if MAX_4BYTE_RUNE change to 0x1FFFFFu, MAX_RUNE will has same value, and maximum value check in rune_to_bytes() may not be expected.

const utf8::rune utf8::MAX_4BYTE_RUNE = 0x10FFFFu;

const utf8::rune utf8::MAX_4BYTE_RUNE = 0x1FFFFFu;
const utf8::rune utf8::MAX_RUNE       = utf8::MAX_4BYTE_RUNE;

if (c > MAX_RUNE)

uint32_t utf8::rune_to_bytes(uint32_t *str, rune c)
{
       :
    if (c > MAX_RUNE)
        c = ERROR;

I understand to return the test to be the assert().

@skvadrik
Copy link
Owner

I think so that if MAX_4BYTE_RUNE change to 0x1FFFFFu, MAX_RUNE will has same value, and maximum value check in rune_to_bytes() may not be expected.

This explains the origin of the error, I must have thought that the maximum Unicode value 0x10FFFF and the 4-byte mask is the same value. MAX_RUNE should remain 0x10FFFF, but MAX_4BYTE_RUNE should be changed to 0x1FFFFF, because it is a bit mask that should match all code points up to 0x10FFFF (for example, 0xFFFFF is in this range, but it's not covered by bit mask 0x10FFFF).

@s-yasu
Copy link
Contributor Author

s-yasu commented Sep 14, 2020

OK.
I've fixed the branch so check it out.

@s-yasu
Copy link
Contributor Author

s-yasu commented Sep 14, 2020

Oops.
It's the same in the end, but a little fix.
(First commit's utf8_literals.c)

@skvadrik skvadrik merged commit 90e5a23 into skvadrik:master Sep 14, 2020
@skvadrik
Copy link
Owner

Great, thank you!

@s-yasu s-yasu deleted the fix-utf8-decode branch September 16, 2020 15:47
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Apr 1, 2021
2.1.1 (2021-03-27)
~~~~~~~~~~~~~~~~~~

- Added missing CMakeLists.txt to release tarballs
  (`#346 <https://github.com/skvadrik/re2c/issues/346>`_).

2.1 (2021-03-26)
~~~~~~~~~~~~~~~~

- Added GitHub Actions CI for Linux, macOS and Windows and fixed numerous build
  issues on those platforms (thanks to
  `Serghei Iakovlev <https://github.com/sergeyklay>`_).

- Added benchmarks for submatch extraction in lexer generators (ragel vs.
  kleenex vs. re2c with TDFA(0), TDFA(1) or sta-DFA algorithms).

  + New Autotools (configure) options: ``--enable-benchmarks``,
    ``--enable-benchmarks-regenerate``

  + New CMake options: ``-DRE2C_BUILD_BENCHMARKS``, ``-DRE2C_REGEN_BENCHMARKS``

  + New `json2pgfplot.py
    <https://github.com/skvadrik/re2c/blob/master/benchmarks/json2pgfplot.py>`_
    script that converts benchmark results in JSON to a PDF with bar charts

- Added option ``--depfile <filename>`` to generate build dependency files
  (allows to track ``/*!include:re2c*/`` dependencies in the build system).

- Added option ``--fixed-tags <none | all | toplevel>`` and improved fixed-tag
  optimization to work with nested tags.

- Added lzip to the distribution tarballs.

- Added registerless-TDFA algorithm in the experimental libre2c library.

- Explicitly disallowed invalid configuration when ``-f``, ``--storable-state``
  option is used, but ``YYFILL`` is disabled
  (`#306 <https://github.com/skvadrik/re2c/issues/306>`_).

- Fixed bug in UTF-8 decode for 4-bytes rune
  (`#307 <https://github.com/skvadrik/re2c/pull/307>`_, thanks to
  `Satoshi Yasushima <https://github.com/s-yasu>`_).

- Fixed bugs in rare cases of the end-of-input rule ``$`` usage
  (`277f0295 <https://github.com/skvadrik/re2c/commit/277f0295fc77a2dad3b9838e45f787319b54a25f>`_,
  `68611a57 <https://github.com/skvadrik/re2c/commit/68611a57a9683c05801255b35ba6217b91391dd8>`_
  and `a9d582f9 <https://github.com/skvadrik/re2c/commit/a9d582f9d2a6d123aa55f3b8b73076aae7cb5616>`_).

- Optimized ``--skeleton`` generation time.

- Renamed internal option ``--dfa`` to ``--nested-negative-tags``.

- Updated documentation for end of input handling and submatch extraction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants