Investigate optimized utf8 decoders and validators #2100
Labels: enhancement, standard library
Inspired by some comments in #2099, check out these algorithms for UTF-8 processing:
So far I'm concerned that none of the above properly validates UTF-8. None of them explains which validation checks it performs, and Wikipedia lists several checks that are commonly overlooked. Because the above implementations are optimized, it's difficult to know what they do without testing them, which is part of the objective for this issue.
In addition to nonsensical byte sequences, we also need to be sure to reject:
We already have tests for these in the `unicode.zig` library (search for `testError`). Switching to an optimized implementation should not regress those tests.
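As a way to probe what an optimized validator actually checks, here is a minimal sketch of a slow but readable reference validator plus a few hand-picked byte sequences. It is written in Python rather than Zig purely for illustration. The names `is_valid_utf8`, `python_accepts`, `REJECT`, and `ACCEPT` are hypothetical and do not come from `unicode.zig`. The rejection classes are assumed to be the ones RFC 3629 rules out: overlong encodings, surrogate code points, and values above U+10FFFF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Unoptimized reference check: decode each sequence, then range-check it."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # 1 byte: ASCII
            i += 1
            continue
        if b0 & 0xE0 == 0xC0:              # 110xxxxx: 2-byte sequence
            length, cp, min_cp = 2, b0 & 0x1F, 0x80
        elif b0 & 0xF0 == 0xE0:            # 1110xxxx: 3-byte sequence
            length, cp, min_cp = 3, b0 & 0x0F, 0x800
        elif b0 & 0xF8 == 0xF0:            # 11110xxx: 4-byte sequence
            length, cp, min_cp = 4, b0 & 0x07, 0x10000
        else:
            return False                   # stray continuation or invalid lead byte
        if i + length > n:
            return False                   # truncated sequence
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return False               # not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            return False                   # overlong encoding
        if 0xD800 <= cp <= 0xDFFF:
            return False                   # UTF-16 surrogate code point
        if cp > 0x10FFFF:
            return False                   # beyond the Unicode range
        i += length
    return True


def python_accepts(data: bytes) -> bool:
    """Second opinion from Python's built-in strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Inputs a conforming validator should reject.
REJECT = {
    "overlong '/' in 2 bytes": b"\xC0\xAF",
    "overlong NUL in 3 bytes": b"\xE0\x80\x80",
    "surrogate U+D800": b"\xED\xA0\x80",
    "U+110000, above U+10FFFF": b"\xF4\x90\x80\x80",
    "truncated 3-byte sequence": b"\xE2\x82",
    "stray continuation byte": b"\x80",
}

# Inputs a conforming validator should accept.
ACCEPT = ["$", "\u00e9", "\u20ac", "\U00010348", "\U0010FFFF"]

if __name__ == "__main__":
    for name, raw in REJECT.items():
        print(name, is_valid_utf8(raw), python_accepts(raw))
    for text in ACCEPT:
        print(repr(text), is_valid_utf8(text.encode("utf-8")))
```

The same byte sequences could be fed to any candidate implementation. A validator that accepts an entry from `REJECT` is skipping one of the checks, which is the kind of gap that is otherwise hard to see in optimized code.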