Fix UTF-8 decode for 4bytes rune #307
Hi @s-yasu, thanks for finding this nasty bug! I think your patch can be simplified a bit: in fact, all that is needed is to change the value of MAX_4BYTE_RUNE:

-const utf8::rune utf8::MAX_4BYTE_RUNE = 0x10FFFFu;
+const utf8::rune utf8::MAX_4BYTE_RUNE = 0x1FFFFFu;

After this change, the test passes with your newly added test cases. Also, please can you return the |
I think so that if

re2c/src/encoding/utf8/utf8.cc, line 11 in ee125da:

const utf8::rune utf8::MAX_4BYTE_RUNE = 0x1FFFFFu;
const utf8::rune utf8::MAX_RUNE = utf8::MAX_4BYTE_RUNE;

re2c/src/encoding/utf8/utf8.cc, line 45 in ee125da:

uint32_t utf8::rune_to_bytes(uint32_t *str, rune c)
{
    :
    if (c > MAX_RUNE)
        c = ERROR;

I understand to return the test to be the |
This explains the origin of the error: I must have thought that the maximum Unicode value |
OK. |
Oops. |
Great, thank you! |
2.1.1 (2021-03-27)
~~~~~~~~~~~~~~~~~~

- Added missing CMakeLists.txt to release tarballs
  (`#346 <https://github.com/skvadrik/re2c/issues/346>`_).

2.1 (2021-03-26)
~~~~~~~~~~~~~~~~

- Added GitHub Actions CI for Linux, macOS and Windows and fixed numerous build
  issues on those platforms
  (thanks to `Serghei Iakovlev <https://github.com/sergeyklay>`_).

- Added benchmarks for submatch extraction in lexer generators (ragel vs.
  kleenex vs. re2c with TDFA(0), TDFA(1) or sta-DFA algorithms).

  + New Autotools (configure) options: ``--enable-benchmarks``,
    ``--enable-benchmarks-regenerate``
  + New CMake options: ``-DRE2C_BUILD_BENCHMARKS``,
    ``-DRE2C_REGEN_BENCHMARKS``
  + New `json2pgfplot.py <https://github.com/skvadrik/re2c/blob/master/benchmarks/json2pgfplot.py>`_
    script that converts benchmark results in JSON to a PDF with bar charts

- Added option ``--depfile <filename>`` to generate build dependency files
  (allows to track ``/*!include:re2c*/`` dependencies in the build system).

- Added option ``--fixed-tags <none | all | toplevel>`` and improved fixed-tag
  optimization to work with nested tags.

- Added lzip to the distribution tarballs.

- Added registerless-TDFA algorithm in the experimental libre2c library.

- Explicitly disallowed invalid configuration when ``-f``,
  ``--storable-state`` option is used, but ``YYFILL`` is disabled
  (`#306 <https://github.com/skvadrik/re2c/issues/306>`_).

- Fixed bug in UTF-8 decode for 4-bytes rune
  (`#307 <https://github.com/skvadrik/re2c/pull/307>`_,
  thanks to `Satoshi Yasushima <https://github.com/s-yasu>`_).

- Fixed bugs in rare cases of the end-of-input rule ``$`` usage
  (`277f0295 <https://github.com/skvadrik/re2c/commit/277f0295fc77a2dad3b9838e45f787319b54a25f>`_,
  `68611a57 <https://github.com/skvadrik/re2c/commit/68611a57a9683c05801255b35ba6217b91391dd8>`_
  and
  `a9d582f9 <https://github.com/skvadrik/re2c/commit/a9d582f9d2a6d123aa55f3b8b73076aae7cb5616>`_).

- Optimized ``--skeleton`` generation time.

- Renamed internal option ``--dfa`` to ``--nested-negative-tags``.

- Updated documentation for end of input handling and submatch extraction.
The decoded character was masked with MAX_4BYTE_RUNE (0x10FFFF), so any bit set in the 0x0F0000 positions was lost.
For example, 𤰖 (0x024C16) was affected.
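
The effect of the mask can be reproduced in isolation. The following is a minimal sketch, not re2c's actual code: the names OLD_MASK, NEW_MASK and decode4 are hypothetical and introduced only for illustration, and the decoder assumes a well-formed 4-byte sequence without validating it.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not part of re2c.
constexpr std::uint32_t OLD_MASK = 0x10FFFFu; // MAX_4BYTE_RUNE before the fix
constexpr std::uint32_t NEW_MASK = 0x1FFFFFu; // MAX_4BYTE_RUNE after the fix (all 21 payload bits)

// Assemble the 21 payload bits of a 4-byte UTF-8 sequence
// (3 bits from the lead byte, 6 bits from each continuation byte),
// then apply the given mask. Input is assumed well-formed; nothing is validated.
constexpr std::uint32_t decode4(const unsigned char *s, std::uint32_t mask)
{
    return (((s[0] & 0x07u) << 18)
          | ((s[1] & 0x3Fu) << 12)
          | ((s[2] & 0x3Fu) << 6)
          |  (s[3] & 0x3Fu)) & mask;
}
```

For 𤰖, encoded as the bytes F0 A4 B0 96, decode4 with NEW_MASK yields 0x024C16, while OLD_MASK clears the 0x020000 bit and yields 0x004C16.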