Some libraries and APIs accept an encoded string only if it fits within a specific size in bytes. In UTF-8, for instance, a single character takes from 1 to 4 bytes (see Wikipedia for details):
- the first 128 characters (US-ASCII) need one byte;
- the next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks;
- three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean characters;
- four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
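As a quick check of the four size classes listed above, the snippet below encodes one sample character from each class and counts the resulting bytes. It is a minimal sketch: the helper name `utf8_len` is hypothetical and not part of this package.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))


# One sample character per size class: US-ASCII, Latin-1 Supplement,
# rest of the Basic Multilingual Plane, and a supplementary plane (emoji).
samples = ["A", "é", "€", "😀"]

for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s)")
```

Note that `len()` on a Python string counts code points, not bytes, so the byte size has to be measured on the encoded form.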
truncate_utf8()
: given a string and a maximum size in bytes, the function checks the size of the string's UTF-8 encoding and truncates the string if it exceeds the limit. The implementation is based on a Stack Overflow question and its answers.
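To illustrate the idea, here is one common way to truncate a string to a byte budget without leaving a broken character at the end. This is a sketch under stated assumptions, not necessarily how `truncate_utf8()` is implemented: the name `truncate_utf8_sketch` and its parameters are hypothetical.

```python
def truncate_utf8_sketch(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Hypothetical helper for illustration only; the package's own
    truncate_utf8() may differ in signature and behavior.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")

    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        # Already within the limit: return the string unchanged.
        return text

    # Cutting the byte string may split a multi-byte character.
    # errors="ignore" drops the incomplete trailing sequence on decode.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8_sketch("héllo", 3))
print(truncate_utf8_sketch("héllo", 2))
```

Because the cut happens on the encoded bytes, the result can be shorter than the limit when the next character would not fit entirely. This approach keeps code points intact, but it may still separate a base character from a following combining mark.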
Please take a moment to read the Code of Conduct.
Please report bugs and suggest features via GitHub Issues.
Before opening an issue, search the tracker for possible duplicates. If you find a duplicate, please add a comment saying that you encountered the problem as well.
Please make sure to read the Contributing Guide before making a pull request.