Skip to content
UTF-8/16/32 C++11 header only library for Windows/Linux/MacOs
C++ Batchfile CMake Shell
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
include/ww898
perf
test Total redesign. Support VS2019. May 6, 2019
.gitignore
CMakeLists.txt Implement UTF converters and tests Feb 13, 2017
LICENSE.md Create LICENSE.md. Jan 29, 2017
README.md
build.cmd Total redesign. Support VS2019. May 6, 2019
build.sh
perf.cmd

README.md

UTF-8/16/32 C++ library

This is the C++11 template based header only library under Windows/Linux/MacOs to convert UFT-8/16/32 symbols and strings. The library transparently support wchar_t as UTF-16 for Windows and UTF-32 for Linux and MacOs.

UTF-8 and UTF-32 (UCS-32) both support 31 bit wide code points [0‥0x7FFFFFFF]with no restriction. UTF-16 supports only unicode code points [0‥0x10FFFF], where high [0xD800‥0xDBFF] and low [0xDC00‥0xDFFF] surrogate regions are prohibited.

The maximum UTF-16 symbol size is 2 words (4 bytes, both words should be in the surrogate region). UFT-32 (UCS-32) is always 1 word (4 bytes). UTF-8 has the maximum symbol size (see conversion table for details):

  • 4 bytes for unicode code points
  • 6 bytes for 31bit code points
UTF-16 surrogate decoder:
High\Low DC00 DC01 DFFF
D800 010000 010001 0103FF
D801 010400 010401 0107FF
DBFF 10FC00 10FC01 10FFFF

UTF-16 Surrogates

Supported compilers

Tested on following compilers:

Usage example

    // यूनिकोड
    static char const u8s[] = "\xE0\xA4\xAF\xE0\xA5\x82\xE0\xA4\xA8\xE0\xA4\xBF\xE0\xA4\x95\xE0\xA5\x8B\xE0\xA4\xA1";
    using namespace ww898::utf;
    std::u16string u16;
    convz<utf_selector_t<decltype(*u8s)>, utf16>(u8s, std::back_inserter(u16));
    std::u32string u32;
    conv<utf16, utf_selector_t<decltype(u32)::value_type>>(u16.begin(), u16.end(), std::back_inserter(u32));
    std::vector<char> u8;
    convz<utf32, utf8>(u32.data(), std::back_inserter(u8));
    std::wstring uw;
    conv<utf8, utfw>(u8s, u8s + sizeof(u8s), std::back_inserter(uw));
    auto u8r = conv<char>(uw);
    auto u16r = conv<char16_t>(u16);
    auto uwr = convz<wchar_t>(u8s);

    auto u32r = conv<char32_t>(std::string_view(u8r.data(), u8r.size())); // C++17 only

    static_assert(std::is_same<utf_selector<decltype(*u8s)>, utf_selector<decltype(u8)::value_type>>::value, "Fail");
    static_assert(
        std::is_same<utf_selector_t<decltype(u16)::value_type>, utf_selector_t<decltype(uw)::value_type>>::value !=
        std::is_same<utf_selector_t<decltype(u32)::value_type>, utf_selector_t<decltype(uw)::value_type>>::value, "Fail");

UTF-8 Conversion table

UTF-8/32 table

You can’t perform that action at this time.