Skip to content

Loading…

CharsetConverter UTF-32 fix #3353

Merged
merged 2 commits into from

2 participants

@Karlson2k
Team Kodi member
  • Most common implementation of iconv (GNU libiconv) use big endian and BOM when output is UTF-32. Most other (but not all) iconv implementations usually works similar. To ensure that conversion will be correct, we need to explicitly specify output endianness.

  • On some platforms (FreeBSD) wchar_t isn't in Unicode, so code checks STDC_ISO_10646 macro before using wchar_t as UTF.

  • If wchar_t is 4 bytes wide and in Unicode, do proper convert from wchar_t to UTF-32 (instead of copy), as wchar_t in UCS-4 (superset of UTF-32). UTF-32 can be simply copied to wchar_t

This PR is alternative for #3351 and #3342

@Karlson2k
Team Kodi member

@jmarshallnz Could you select one of: #3342, #3351, #3353?
I think this one is better that others.

Need it to finish work on CharsetConverter

@Karlson2k
Team Kodi member

jenkins build this please
Thanks!

@jmarshallnz
Team Kodi member

My understanding is that UCS-4 and UTF-32 are identical at this point, as per ISO 10646:2003 no codepoints greater than 10FFFF are allowed anyway (as they can't be represented in UTF-16). This is the same reason why UTF-8 only needs 4 bytes/character max rather than 5 or 6.

@Karlson2k
Team Kodi member

Right. Small addition: UCS-4 ususally not checked for value limits, while UTF-32 is checked. Look at any charset processing engine. So conversion from UCS-4 to UTF-32 is precaution as unchecked value can be used as security hole.
But I'm fine with it removal.
@jmarshallnz Other code is fine?

@jmarshallnz
Team Kodi member

I suggest then clearly labelling WCHAR_IS_UCS4 and then clearly commenting why you're "converting" it when switching to UTF32. Otherwise, all good.

Karlson2k added some commits
@Karlson2k Karlson2k CharsetConverter: add extra care on using wchar_t as UTF-32
* check for __STDC_ISO_10646__
* copy-convert only for UTF-32 -> wchar_t, but use converter for wchar_t -> UTF-32 (UTF-32 has more restrictions)
807747a
@Karlson2k Karlson2k CharsetConverter: use UTF-32 charset with correct endianness d06388d
@Karlson2k
Team Kodi member

@jmarshallnz updated.
jenkins build this please

@Karlson2k
Team Kodi member

Master was fixed, so jenkins build this please again

@Karlson2k
Team Kodi member

Got OK from jmarshallnz.

@Karlson2k Karlson2k merged commit 3a08053 into xbmc:master

1 check passed

Details default Merged build #297 succeeded in 52 min
@Karlson2k Karlson2k deleted the Karlson2k:charsetconverter_fix_02_alt2 branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Oct 2, 2013
  1. @Karlson2k

    CharsetConverter: add extra care on using wchar_t as UTF-32

    Karlson2k committed
    * check for __STDC_ISO_10646__
    * copy-convert only for UTF-32 -> wchar_t, but use converter for wchar_t -> UTF-32 (UTF-32 has more restrictions)
  2. @Karlson2k
Showing with 45 additions and 42 deletions.
  1. +45 −42 xbmc/utils/CharsetConverter.cpp
View
87 xbmc/utils/CharsetConverter.cpp
@@ -31,44 +31,48 @@
#include <errno.h>
#include <iconv.h>
+#if !defined(TARGET_WINDOWS) && defined(HAVE_CONFIG_H)
+ #include "config.h"
+#endif
+
+#ifdef WORDS_BIGENDIAN
+ #define ENDIAN_SUFFIX "BE"
+#else
+ #define ENDIAN_SUFFIX "LE"
+#endif
+
#if defined(TARGET_DARWIN)
- #define WCHAR_IS_UTF32 1
- #undef WCHAR_IS_UTF16
- #ifdef __POWERPC__
- #define WCHAR_CHARSET "UTF-32BE"
- #else
- #define WCHAR_CHARSET "UTF-32LE"
- #endif
+ #define WCHAR_IS_UCS_4 1
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8-MAC"
+ #define WCHAR_CHARSET UTF32_CHARSET
#elif defined(TARGET_WINDOWS)
- #undef WCHAR_IS_UTF32
#define WCHAR_IS_UTF16 1
- #define WCHAR_CHARSET "UTF-16LE"
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
+ #define WCHAR_CHARSET UTF16_CHARSET
#pragma comment(lib, "libfribidi.lib")
#pragma comment(lib, "libiconv.lib")
#elif defined(TARGET_ANDROID)
- #define WCHAR_IS_UTF32 1
- #undef WCHAR_IS_UTF16
+ #define WCHAR_IS_UCS_4 1
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
- #ifdef __BIG_ENDIAN__
- #define WCHAR_CHARSET "UTF-32BE"
- #else
- #define WCHAR_CHARSET "UTF-32LE"
- #endif
+ #define WCHAR_CHARSET UTF32_CHARSET
#else
- #define WCHAR_CHARSET "WCHAR_T"
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
- #ifdef HAVE_CONFIG_H
- #include "config.h"
- #endif // HAVE_CONFIG_H
- #undef WCHAR_IS_UTF32
- #undef WCHAR_IS_UTF16
- #ifdef SIZEOF_WCHAR_T
- #if SIZEOF_WCHAR_T == 4
- #define WCHAR_IS_UTF32 1
- #elif SIZEOF_WCHAR_T == 2
- #define WCHAR_IS_UTF16 1
+ #define WCHAR_CHARSET "WCHAR_T"
+ #if __STDC_ISO_10646__
+ #ifdef SIZEOF_WCHAR_T
+ #if SIZEOF_WCHAR_T == 4
+ #define WCHAR_IS_UCS_4 1
+ #elif SIZEOF_WCHAR_T == 2
+ #define WCHAR_IS_UCS_2 1
+ #endif
#endif
#endif
#endif
@@ -479,7 +483,7 @@ void CCharsetConverter::reset(void)
bool CCharsetConverter::utf8ToUtf32(const std::string& utf8StringSrc, std::u32string& utf32StringDst, bool failOnBadChar /*= true*/)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", utf8StringSrc, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, utf8StringSrc, utf32StringDst, failOnBadChar);
}
std::u32string CCharsetConverter::utf8ToUtf32(const std::string& utf8StringSrc, bool failOnBadChar /*= true*/)
@@ -497,16 +501,16 @@ bool CCharsetConverter::utf8ToUtf32Visual(const std::string& utf8StringSrc, std:
if (!logicalToVisualBiDi(utf8StringSrc, strFlipped, FRIBIDI_UTF8, forceLTRReadingOrder ? FRIBIDI_TYPE_LTR : FRIBIDI_TYPE_PDF))
return false;
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", strFlipped, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, strFlipped, utf32StringDst, failOnBadChar);
}
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", utf8StringSrc, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, utf8StringSrc, utf32StringDst, failOnBadChar);
}
bool CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc, std::string& utf8StringDst, bool failOnBadChar /*= true*/)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToUtf8, m_Utf8CharMaxSize, "UTF-32", "UTF-8", utf32StringSrc, utf8StringDst, failOnBadChar);
+ return convert(m_iconvUtf32ToUtf8, m_Utf8CharMaxSize, UTF32_CHARSET, "UTF-8", utf32StringSrc, utf8StringDst, failOnBadChar);
}
std::string CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc, bool failOnBadChar /*= false*/)
@@ -518,13 +522,13 @@ std::string CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc,
bool CCharsetConverter::utf32ToW(const std::u32string& utf32StringSrc, std::wstring& wStringDst, bool failOnBadChar /*= true*/)
{
-#ifdef WCHAR_IS_UTF32
+#ifdef WCHAR_IS_UCS_4
wStringDst.assign((const wchar_t*)utf32StringSrc.c_str(), utf32StringSrc.length());
return true;
-#else // !WCHAR_IS_UTF32
+#else // !WCHAR_IS_UCS_4
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToW, 1, "UTF-32", WCHAR_CHARSET, utf32StringSrc, wStringDst, failOnBadChar);
-#endif // !WCHAR_IS_UTF32
+ return convert(m_iconvUtf32ToW, 1, UTF32_CHARSET, WCHAR_CHARSET, utf32StringSrc, wStringDst, failOnBadChar);
+#endif // !WCHAR_IS_UCS_4
}
bool CCharsetConverter::utf32logicalToVisualBiDi(const std::u32string& logicalStringSrc, std::u32string& visualStringDst, bool forceLTRReadingOrder /*= false*/)
@@ -539,13 +543,12 @@ bool CCharsetConverter::utf32logicalToVisualBiDi(const std::u32string& logicalSt
bool CCharsetConverter::wToUtf32(const std::wstring& wStringSrc, std::u32string& utf32StringDst, bool failOnBadChar /*= true*/)
{
-#ifdef WCHAR_IS_UTF32
- utf32StringDst.assign((const char32_t*)wStringSrc.c_str(), wStringSrc.length());
- return true;
-#else // !WCHAR_IS_UTF32
+#ifdef WCHAR_IS_UCS_4
+ /* UCS-4 is almost equal to UTF-32, but UTF-32 has strict limits on possible values, while UCS-4 is usually unchecked.
+ * With this "conversion" we ensure that output will be valid UTF-32 string. */
+#endif
CSingleLock lock(m_critSection);
- return convert(m_iconvWToUtf32, 1, WCHAR_CHARSET, "UTF-32", wStringSrc, utf32StringDst, failOnBadChar);
-#endif // !WCHAR_IS_UTF32
+ return convert(m_iconvWToUtf32, 1, WCHAR_CHARSET, UTF32_CHARSET, wStringSrc, utf32StringDst, failOnBadChar);
}
// The bVisualBiDiFlip forces a flip of characters for hebrew/arabic languages, only set to false if the flipping
@@ -706,7 +709,7 @@ bool CCharsetConverter::utf16LEtoW(const std::u16string& utf16String, std::wstri
bool CCharsetConverter::utf32ToStringCharset(const std::u32string& utf32StringSrc, std::string& stringDst)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToStringCharset, 1, g_langInfo.GetGuiCharSet().c_str(), "UTF-32", utf32StringSrc, stringDst);
+ return convert(m_iconvUtf32ToStringCharset, 1, g_langInfo.GetGuiCharSet().c_str(), UTF32_CHARSET, utf32StringSrc, stringDst);
}
bool CCharsetConverter::utf8ToSystem(std::string& stringSrcDst, bool failOnBadChar /*= false*/)
Something went wrong with that request. Please try again.