Investigate optimized utf8 decoders and validators #2100

Open
thejoshwolfe opened this issue Mar 24, 2019 · 2 comments
Labels
enhancement Solving this issue will likely involve adding new logic or components to the codebase. standard library This issue involves writing Zig code for the standard library.


thejoshwolfe commented Mar 24, 2019

Inspired by some comments in #2099, check out these algorithms for UTF-8 processing:

So far I'm concerned that none of the above properly validates UTF-8. None of them explains which validation checks it performs, and Wikipedia lists several checks that are commonly overlooked. Because the above implementations are optimized, it's difficult to know what they do without testing them, and that testing is part of the objective for this issue.

In addition to nonsensical byte sequences, we also need to be sure to reject:

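  • Overlong encoding (a code point encoded in more bytes than necessary, e.g. C0 80 for U+0000)
  • Surrogate half (U+D800..U+DFFF, reserved for UTF-16)
  • Overflow (a decoded value above U+10FFFF)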

We already have tests for these in the unicode.zig library (search for testError). Switching to an optimized implementation should not regress those tests.
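To make the three checks concrete, here is a minimal, unoptimized sketch in Python (not Zig, and not taken from unicode.zig). The function name `check_utf8` and the returned labels are hypothetical, chosen only for illustration. It decodes each sequence and names the first check that fails, which is the kind of transparency the optimized implementations lack.

```python
def check_utf8(data: bytes) -> str:
    """Return "ok", or a label for the first validation check that fails.

    Illustrative sketch only; the labels are not an existing API.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # ASCII: one byte, always valid
            i += 1
            continue
        # Sequence length, payload bits of the lead byte, and the smallest
        # code point that legitimately needs this many bytes.
        if 0xC0 <= b0 <= 0xDF:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            return "invalid start byte"  # stray continuation byte or F8..FF
        if i + length > n:
            return "truncated sequence"
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return "invalid continuation byte"
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            return "overlong encoding"
        if 0xD800 <= cp <= 0xDFFF:
            return "surrogate half"
        if cp > 0x10FFFF:
            return "overflow"
        i += length
    return "ok"
```

For example, `check_utf8(b"\xc0\x80")` reports an overlong encoding, `check_utf8(b"\xed\xa0\x80")` a surrogate half, and `check_utf8(b"\xf4\x90\x80\x80")` an overflow. An optimized replacement should reject the same inputs, even if it cannot say why.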

@andrewrk andrewrk added this to the 1.0.0 milestone Mar 24, 2019

shawnl commented Mar 26, 2019

Looking at the state graph of the automaton on this page: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
and comparing it with this chart:


 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 * Table 3-7. Well-Formed UTF-8 Byte Sequences
 *
 * +--------------------+------------+-------------+------------+-------------+
 * | Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0000..U+007F     | 00..7F     |             |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+
 * | U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |
 * +--------------------+------------+-------------+------------+-------------+

They are identical: the automaton accepts exactly the byte ranges that the table lists as well-formed.
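A comparison like this can also be done mechanically. The sketch below, in Python, transcribes Table 3-7 into a lookup list and checks it against Python's strict built-in UTF-8 decoder, used here as an assumed-correct reference. It does not reproduce the DFA from the linked page; the names `TABLE_3_7`, `is_well_formed`, `reference`, and `two_byte_mismatches` are hypothetical.

```python
# Each row: (first byte low, first byte high,
#            second byte low, second byte high, sequence length).
# Third and fourth bytes are always 80..BF, as in Table 3-7.
TABLE_3_7 = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_well_formed(data: bytes) -> bool:
    """Validate by walking the table rows; no code point is ever decoded."""
    i, n = 0, len(data)
    while i < n:
        for lo, hi, lo2, hi2, length in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        if i + length > n:
            return False  # truncated sequence
        if length >= 2 and not lo2 <= data[i + 1] <= hi2:
            return False  # second byte outside the row's restricted range
        if any(not 0x80 <= b <= 0xBF for b in data[i + 2:i + length]):
            return False
        i += length
    return True


def reference(data: bytes) -> bool:
    """Python's strict decoder, used as the reference validator."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def two_byte_mismatches() -> list:
    """Exhaustively compare both validators on every 2-byte input."""
    return [
        bytes([a, b])
        for a in range(256)
        for b in range(256)
        if is_well_formed(bytes([a, b])) != reference(bytes([a, b]))
    ]
```

The restricted second-byte ranges are what encode the overlong, surrogate, and overflow checks: E0 requires A0..BF, ED stops at 9F, F0 starts at 90, and F4 stops at 8F. Exhaustive comparison is cheap for inputs of up to two bytes; longer sequences would need sampling around those boundaries.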

@daurnimator daurnimator added enhancement Solving this issue will likely involve adding new logic or components to the codebase. standard library This issue involves writing Zig code for the standard library. labels Dec 28, 2019

matu3ba commented Dec 22, 2021

  1. A notable source for cross-validation: libgrapheme from the suckless project.
  2. If SIMD is acceptable: simdutf8, or a port of it.
  3. For our own implementation: the article how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8.
  4. Material for benchmarking UTF-8.
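As a rough illustration of what cross-validation could look like, here is a small differential-testing sketch in Python. It does not use libgrapheme or simdutf8; Python's strict decoder stands in as the reference, and the `lenient` candidate is deliberately flawed (it uses Python's `surrogatepass` error handler, so it accepts surrogate halves). The names `strict`, `lenient`, `corpus`, and `cross_validate` are hypothetical.

```python
import random


def strict(data: bytes) -> bool:
    """Reference validator: Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lenient(data: bytes) -> bool:
    """Deliberately flawed candidate: lets encoded surrogate halves through."""
    try:
        data.decode("utf-8", "surrogatepass")
        return True
    except UnicodeDecodeError:
        return False


def corpus(seed: int = 0, count: int = 2000) -> list:
    """Hand-picked boundary cases plus valid encodings with one byte mutated."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    cases = [
        b"",
        b"\xc0\x80",          # overlong encoding
        b"\xed\xa0\x80",      # surrogate half U+D800
        b"\xed\xbf\xbf",      # surrogate half U+DFFF
        b"\xf4\x90\x80\x80",  # overflow, above U+10FFFF
        b"\xe2\x82",          # truncated sequence
    ]
    for _ in range(count):
        cp = rng.randrange(0x110000)
        if 0xD800 <= cp <= 0xDFFF:
            cp = 0xFFFD  # surrogates cannot be encoded strictly
        raw = bytearray(chr(cp).encode("utf-8"))
        raw[rng.randrange(len(raw))] = rng.randrange(256)
        cases.append(bytes(raw))
    return cases


def cross_validate(candidate, reference, cases) -> list:
    """Return every input on which the two validators disagree."""
    return [c for c in cases if candidate(c) != reference(c)]
```

Running `cross_validate(lenient, strict, corpus())` returns a list that includes `b"\xed\xa0\x80"`, exposing the missing surrogate check. The hand-picked cases matter because random mutation alone rarely lands on these narrow boundaries.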
