Investigate optimized utf8 decoders and validators #2100

thejoshwolfe · 2019-03-24T21:34:49Z

Inspired by some comments in #2099, check out these algorithms for UTF-8 processing:

So far I'm concerned that none of the above properly validates UTF-8. None of them explains which validation checks it performs, and Wikipedia lists several checks that are commonly overlooked. Because the implementations are optimized, it's difficult to know what they actually do without testing them, which is part of the objective of this issue.

In addition to nonsensical byte sequences, we also need to be sure to reject:

Overlong encoding
Surrogate half
Overflow

We already have tests for these in the unicode.zig library (search for testError). Switching to an optimized implementation should not regress those tests.
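
To make the three extra checks concrete, here is a minimal, unoptimized sketch in Python (not Zig, and not the code in unicode.zig). The function name `decode_one` and the error strings are made up for illustration. It decodes a single code point and then applies each check explicitly, which is the behavior an optimized replacement would have to preserve.

```python
def decode_one(data: bytes, i: int = 0) -> tuple:
    """Decode one code point starting at data[i]; return (code_point, length).

    Hypothetical reference helper: raises ValueError on any ill-formed sequence.
    Assumes i < len(data).
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, 1
    # Pick sequence length, payload bits of the lead byte, and the smallest
    # code point that legitimately needs this many bytes.
    if 0xC0 <= b0 < 0xE0:
        length, cp, minimum = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        length, cp, minimum = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        length, cp, minimum = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")  # stray continuation or F8..FF
    if i + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1 : i + length]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < minimum:
        raise ValueError("overlong encoding")  # e.g. C0 80 for U+0000
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate half")     # e.g. ED A0 80 for U+D800
    if cp > 0x10FFFF:
        raise ValueError("overflow")           # e.g. F4 90 80 80
    return cp, length
```

For example, `decode_one(b"\xe2\x82\xac")` returns `(0x20AC, 3)`, while `b"\xc0\x80"`, `b"\xed\xa0\x80"`, and `b"\xf4\x90\x80\x80"` each raise `ValueError` for the overlong, surrogate, and overflow cases respectively.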


shawnl · 2019-03-26T12:51:01Z

Looking at the atomic state graph on this page: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
and comparing it with this chart:


 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 * Table 3-7. Well-Formed UTF-8 Byte Sequences
 *
 * +--------------------+------------+-------------+------------+-------------+
 * | Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0000..U+007F     | 00..7F     |             |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+

The two are identical: the DFA accepts exactly the byte sequences listed in the table.
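
As a rough way to check a candidate validator against the table above, the rows can be transcribed directly into data and walked byte by byte. The sketch below is Python rather than Zig, the names `TABLE_3_7`, `is_well_formed`, and `matches_reference` are hypothetical, and it uses Python's built-in strict `utf-8` codec as the reference to compare against.

```python
CONT = (0x80, 0xBF)  # ordinary continuation-byte range

# One entry per row of Table 3-7: (lead byte range, ranges for trailing bytes).
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [CONT]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), CONT]),
    ((0xE1, 0xEC), [CONT, CONT]),
    ((0xED, 0xED), [(0x80, 0x9F), CONT]),
    ((0xEE, 0xEF), [CONT, CONT]),
    ((0xF0, 0xF0), [(0x90, 0xBF), CONT, CONT]),
    ((0xF1, 0xF3), [CONT, CONT, CONT]),
    ((0xF4, 0xF4), [(0x80, 0x8F), CONT, CONT]),
]


def is_well_formed(data: bytes) -> bool:
    """Return True if data consists only of sequences allowed by the table."""
    i = 0
    while i < len(data):
        for (lo, hi), trail in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # 80..C1 or F5..FF can never start a sequence
        tail = data[i + 1 : i + 1 + len(trail)]
        if len(tail) != len(trail):
            return False  # input ends in the middle of a sequence
        if any(not (tlo <= b <= thi) for b, (tlo, thi) in zip(tail, trail)):
            return False
        i += 1 + len(trail)
    return True


def matches_reference(data: bytes) -> bool:
    """Compare against Python's strict UTF-8 decoder (assumed as the oracle)."""
    try:
        data.decode("utf-8")
        expected = True
    except UnicodeDecodeError:
        expected = False
    return is_well_formed(data) == expected
```

Running `matches_reference` over every one- and two-byte input, plus a batch of random longer inputs, is a cheap differential test; the same harness shape could compare an optimized decoder against a table-driven one.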

matu3ba · 2021-12-22T17:51:15Z

A notable source for cross-validation: libgrapheme from the suckless project.
If SIMD is acceptable: simdutf8, or a port of it.
For writing our own implementation: the article how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8.
Material for benchmarking: utf8.

andrewrk added this to the 1.0.0 milestone Mar 24, 2019

daurnimator added the enhancement and standard library labels Dec 28, 2019
