Support Latin 1 <= UTF 8 (AVX) #285

Closed
lemire opened this issue Sep 1, 2023 · 5 comments

@lemire
Member

lemire commented Sep 1, 2023

No description provided.

@Nick-Nuon Nick-Nuon self-assigned this Sep 5, 2023
@aqrit

aqrit commented Sep 5, 2023

When "errors" are expected to be rare:
If we don’t distinguish between invalid UTF-8 and code points above 255 in the fast path…
then validation could use fewer instructions?
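
To make the question concrete, here is a minimal scalar sketch of such a combined check. It is not simdutf code and not the contents of the gist linked in this comment; the function name `chunk_is_latin1_convertible` and the 64-byte chunk size are assumptions chosen for illustration. The bitmask comparison stands in for what a SIMD version might do with compare masks: every byte must be ASCII, a 0xC2/0xC3 lead, or a continuation byte, and continuation bytes must sit exactly one position after a lead. Anything else is rejected without saying whether it was invalid UTF-8 or a code point above 255.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a simdutf API): checks one chunk of at most 64
// bytes. Returns true only if every byte is ASCII or belongs to a complete
// 0xC2/0xC3 + continuation pair inside the chunk. A false result does not
// distinguish invalid UTF-8 from a code point above U+00FF.
inline bool chunk_is_latin1_convertible(const std::uint8_t* in, std::size_t len) {
  if (len > 64) return false;  // assumption: the caller feeds 64-byte chunks
  std::uint64_t lead = 0;      // positions holding 0xC2 or 0xC3
  std::uint64_t cont = 0;      // positions holding 0b10xxxxxx
  std::uint64_t other = 0;     // any other non-ASCII byte
  for (std::size_t i = 0; i < len; ++i) {
    const std::uint8_t b = in[i];
    const std::uint64_t bit = std::uint64_t{1} << i;
    if (b == 0xC2 || b == 0xC3) {
      lead |= bit;
    } else if ((b & 0xC0) == 0x80) {
      cont |= bit;
    } else if (b >= 0x80) {
      other |= bit;
    }
  }
  // A lead in the last slot of a full chunk has its continuation in the
  // next chunk; this sketch leaves that case to a slower path.
  if (lead >> 63) return false;
  // Each lead must be followed by exactly one continuation, and there must
  // be no stray continuation bytes.
  return other == 0 && (lead << 1) == cont;
}
```

A chunk that fails this test would fall back to a slower path that works out what exactly went wrong.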

Converting a valid two-byte sequence (0xC2 0b10xxxxxx or 0xC3 0b10xxxxxx) to Latin-1 should be just blend -> and -> or?
Or does it not pay to do a compress_store?
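
For reference, the per-character arithmetic can be written out in scalar form. The sketch below is an illustration only, not the simdutf implementation; the name `utf8_to_latin1_scalar` and the `SIZE_MAX` error convention are assumptions. It relies on the fact that a 0xC2 lead leaves the continuation byte unchanged (U+0080 to U+00BF), while a 0xC3 lead only sets bit 6 of it (U+00C0 to U+00FF).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a simdutf API): scalar UTF-8 to Latin-1.
// `out` must have room for at least `len` bytes.
// Returns the number of Latin-1 bytes written, or SIZE_MAX at the first
// problem (invalid UTF-8 or a code point above U+00FF, not distinguished).
inline std::size_t utf8_to_latin1_scalar(const std::uint8_t* in, std::size_t len,
                                         std::uint8_t* out) {
  std::size_t pos = 0;
  std::size_t written = 0;
  while (pos < len) {
    const std::uint8_t lead = in[pos];
    if (lead < 0x80) {  // ASCII: copied as-is
      out[written++] = lead;
      pos += 1;
      continue;
    }
    if ((lead == 0xC2 || lead == 0xC3) && pos + 1 < len &&
        (in[pos + 1] & 0xC0) == 0x80) {
      // 0xC2 keeps the continuation byte as-is; 0xC3 sets bit 6.
      out[written++] =
          static_cast<std::uint8_t>(in[pos + 1] | ((lead & 1) << 6));
      pos += 2;
      continue;
    }
    return SIZE_MAX;
  }
  return written;
}
```

For example, the bytes 0xC3 0xA9 (é) become the single byte 0xE9, and 0xC2 0xA0 (no-break space) becomes 0xA0.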

Here is some half-baked, probably incorrect, pseudo code:
https://gist.github.com/aqrit/ebcbd13a43ac4ee4ef05578074ad3631

I’ve been wondering what to do when an issue is detected in the input stream,
maybe have a flag to:

  • stop
  • strip
  • copy verbatim
  • substitute a replacement byte

Also have a callback to "transliterate" unmapped codepoints?
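
As a rough illustration of what such a flag might look like, here is a scalar sketch. Nothing in it is an existing simdutf API: the `OnError` enum, the function name `utf8_to_latin1_with_policy`, and the choice to treat an offending byte plus any continuation bytes that follow it (up to four bytes in total) as one unit are all assumptions. It needs C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical policy flag mirroring the four options listed above.
enum class OnError { Stop, Strip, CopyVerbatim, Replace };

inline bool is_continuation(char c) {
  return (static_cast<std::uint8_t>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: returns std::nullopt only when the policy is Stop
// and an issue (invalid UTF-8 or a code point above U+00FF) is found.
inline std::optional<std::string> utf8_to_latin1_with_policy(
    const std::string& in, OnError policy, char replacement = '?') {
  std::string out;
  out.reserve(in.size());
  std::size_t pos = 0;
  while (pos < in.size()) {
    const auto lead = static_cast<std::uint8_t>(in[pos]);
    if (lead < 0x80) {  // ASCII
      out.push_back(in[pos]);
      pos += 1;
      continue;
    }
    if ((lead == 0xC2 || lead == 0xC3) && pos + 1 < in.size() &&
        is_continuation(in[pos + 1])) {
      const auto cont = static_cast<std::uint8_t>(in[pos + 1]);
      out.push_back(static_cast<char>(cont | ((lead & 1) << 6)));
      pos += 2;
      continue;
    }
    // Offending unit: this byte plus the continuation bytes after it.
    std::size_t end = pos + 1;
    while (end < in.size() && end - pos < 4 && is_continuation(in[end])) ++end;
    switch (policy) {
      case OnError::Stop:
        return std::nullopt;
      case OnError::Strip:
        break;  // drop the unit
      case OnError::CopyVerbatim:
        out.append(in, pos, end - pos);  // pass the raw bytes through
        break;
      case OnError::Replace:
        out.push_back(replacement);  // one byte per offending unit
        break;
    }
    pos = end;
  }
  return out;
}
```

With the input `a€b` (bytes 0x61 0xE2 0x82 0xAC 0x62), this sketch yields nothing under `Stop`, `ab` under `Strip`, the original five bytes under `CopyVerbatim`, and `a?b` under `Replace`. A transliteration callback could take the place of the `Replace` branch.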

@lemire
Member Author

lemire commented Sep 5, 2023

Currently we simply do not support fully transcoding bad inputs (i.e. we return an error) but it is something planned for future versions.

The standard approach is to transcode with replacement: bad characters are replaced by a replacement character (possibly one that the user can specify).

@lemire
Member Author

lemire commented Sep 5, 2023

Please see #147

@lemire
Member Author

lemire commented Sep 5, 2023

“validation could use fewer instructions”

I expect so.

@lemire
Member Author

lemire commented Oct 3, 2023

Closed via #301

@lemire lemire closed this as completed Oct 3, 2023