Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

CharsetConverter UTF-32 fix #3353

Merged
merged 2 commits into from

2 participants

@Karlson2k
Collaborator
  • Most common implementation of iconv (GNU libiconv) use big endian and BOM when output is UTF-32. Most other (but not all) iconv implementations usually works similar. To ensure that conversion will be correct, we need to explicitly specify output endianness.

  • On some platforms (FreeBSD) wchar_t isn't in Unicode, so code checks STDC_ISO_10646 macro before using wchar_t as UTF.

  • If wchar_t is 4 bytes wide and in Unicode, do proper convert from wchar_t to UTF-32 (instead of copy), as wchar_t in UCS-4 (superset of UTF-32). UTF-32 can be simply copied to wchar_t

This PR is alternative for #3351 and #3342

@Karlson2k
Collaborator

@jmarshallnz Could you select one of: #3342, #3351, #3353?
I think this one is better that others.

Need it to finish work on CharsetConverter

@Karlson2k
Collaborator

jenkins build this please
Thanks!

@jmarshallnz
Owner

My understanding is that UCS-4 and UTF-32 are identical at this point, as per ISO 10646:2003 no codepoints greater than 10FFFF are allowed anyway (as they can't be represented in UTF-16). This is the same reason why UTF-8 only needs 4 bytes/character max rather than 5 or 6.

@Karlson2k
Collaborator

Right. Small addition: UCS-4 ususally not checked for value limits, while UTF-32 is checked. Look at any charset processing engine. So conversion from UCS-4 to UTF-32 is precaution as unchecked value can be used as security hole.
But I'm fine with it removal.
@jmarshallnz Other code is fine?

@jmarshallnz
Owner

I suggest then clearly labelling WCHAR_IS_UCS4 and then clearly commenting why you're "converting" it when switching to UTF32. Otherwise, all good.

Karlson2k added some commits
@Karlson2k Karlson2k CharsetConverter: add extra care on using wchar_t as UTF-32
* check for __STDC_ISO_10646__
* copy-convert only for UTF-32 -> wchar_t, but use converter for wchar_t -> UTF-32 (UTF-32 has more restrictions)
807747a
@Karlson2k Karlson2k CharsetConverter: use UTF-32 charset with correct endianness d06388d
@Karlson2k
Collaborator

@jmarshallnz updated.
jenkins build this please

@Karlson2k
Collaborator

Master was fixed, so jenkins build this please again

@Karlson2k
Collaborator

Got OK from jmarshallnz.

@Karlson2k Karlson2k merged commit 3a08053 into xbmc:master
@Karlson2k Karlson2k deleted the Karlson2k:charsetconverter_fix_02_alt2 branch
@ghost Unknown referenced this pull request from a commit
Commit has since been removed from the repository and is no longer available.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Oct 2, 2013
  1. @Karlson2k

    CharsetConverter: add extra care on using wchar_t as UTF-32

    Karlson2k authored
    * check for __STDC_ISO_10646__
    * copy-convert only for UTF-32 -> wchar_t, but use converter for wchar_t -> UTF-32 (UTF-32 has more restrictions)
  2. @Karlson2k
This page is out of date. Refresh to see the latest.
Showing with 45 additions and 42 deletions.
  1. +45 −42 xbmc/utils/CharsetConverter.cpp
View
87 xbmc/utils/CharsetConverter.cpp
@@ -31,44 +31,48 @@
#include <errno.h>
#include <iconv.h>
+#if !defined(TARGET_WINDOWS) && defined(HAVE_CONFIG_H)
+ #include "config.h"
+#endif
+
+#ifdef WORDS_BIGENDIAN
+ #define ENDIAN_SUFFIX "BE"
+#else
+ #define ENDIAN_SUFFIX "LE"
+#endif
+
#if defined(TARGET_DARWIN)
- #define WCHAR_IS_UTF32 1
- #undef WCHAR_IS_UTF16
- #ifdef __POWERPC__
- #define WCHAR_CHARSET "UTF-32BE"
- #else
- #define WCHAR_CHARSET "UTF-32LE"
- #endif
+ #define WCHAR_IS_UCS_4 1
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8-MAC"
+ #define WCHAR_CHARSET UTF32_CHARSET
#elif defined(TARGET_WINDOWS)
- #undef WCHAR_IS_UTF32
#define WCHAR_IS_UTF16 1
- #define WCHAR_CHARSET "UTF-16LE"
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
+ #define WCHAR_CHARSET UTF16_CHARSET
#pragma comment(lib, "libfribidi.lib")
#pragma comment(lib, "libiconv.lib")
#elif defined(TARGET_ANDROID)
- #define WCHAR_IS_UTF32 1
- #undef WCHAR_IS_UTF16
+ #define WCHAR_IS_UCS_4 1
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
- #ifdef __BIG_ENDIAN__
- #define WCHAR_CHARSET "UTF-32BE"
- #else
- #define WCHAR_CHARSET "UTF-32LE"
- #endif
+ #define WCHAR_CHARSET UTF32_CHARSET
#else
- #define WCHAR_CHARSET "WCHAR_T"
+ #define UTF16_CHARSET "UTF-16" ENDIAN_SUFFIX
+ #define UTF32_CHARSET "UTF-32" ENDIAN_SUFFIX
#define UTF8_SOURCE "UTF-8"
- #ifdef HAVE_CONFIG_H
- #include "config.h"
- #endif // HAVE_CONFIG_H
- #undef WCHAR_IS_UTF32
- #undef WCHAR_IS_UTF16
- #ifdef SIZEOF_WCHAR_T
- #if SIZEOF_WCHAR_T == 4
- #define WCHAR_IS_UTF32 1
- #elif SIZEOF_WCHAR_T == 2
- #define WCHAR_IS_UTF16 1
+ #define WCHAR_CHARSET "WCHAR_T"
+ #if __STDC_ISO_10646__
+ #ifdef SIZEOF_WCHAR_T
+ #if SIZEOF_WCHAR_T == 4
+ #define WCHAR_IS_UCS_4 1
+ #elif SIZEOF_WCHAR_T == 2
+ #define WCHAR_IS_UCS_2 1
+ #endif
#endif
#endif
#endif
@@ -479,7 +483,7 @@ void CCharsetConverter::reset(void)
bool CCharsetConverter::utf8ToUtf32(const std::string& utf8StringSrc, std::u32string& utf32StringDst, bool failOnBadChar /*= true*/)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", utf8StringSrc, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, utf8StringSrc, utf32StringDst, failOnBadChar);
}
std::u32string CCharsetConverter::utf8ToUtf32(const std::string& utf8StringSrc, bool failOnBadChar /*= true*/)
@@ -497,16 +501,16 @@ bool CCharsetConverter::utf8ToUtf32Visual(const std::string& utf8StringSrc, std:
if (!logicalToVisualBiDi(utf8StringSrc, strFlipped, FRIBIDI_UTF8, forceLTRReadingOrder ? FRIBIDI_TYPE_LTR : FRIBIDI_TYPE_PDF))
return false;
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", strFlipped, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, strFlipped, utf32StringDst, failOnBadChar);
}
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, "UTF-32", utf8StringSrc, utf32StringDst, failOnBadChar);
+ return convert(m_iconvUtf8ToUtf32, 1, UTF8_SOURCE, UTF32_CHARSET, utf8StringSrc, utf32StringDst, failOnBadChar);
}
bool CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc, std::string& utf8StringDst, bool failOnBadChar /*= true*/)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToUtf8, m_Utf8CharMaxSize, "UTF-32", "UTF-8", utf32StringSrc, utf8StringDst, failOnBadChar);
+ return convert(m_iconvUtf32ToUtf8, m_Utf8CharMaxSize, UTF32_CHARSET, "UTF-8", utf32StringSrc, utf8StringDst, failOnBadChar);
}
std::string CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc, bool failOnBadChar /*= false*/)
@@ -518,13 +522,13 @@ std::string CCharsetConverter::utf32ToUtf8(const std::u32string& utf32StringSrc,
bool CCharsetConverter::utf32ToW(const std::u32string& utf32StringSrc, std::wstring& wStringDst, bool failOnBadChar /*= true*/)
{
-#ifdef WCHAR_IS_UTF32
+#ifdef WCHAR_IS_UCS_4
wStringDst.assign((const wchar_t*)utf32StringSrc.c_str(), utf32StringSrc.length());
return true;
-#else // !WCHAR_IS_UTF32
+#else // !WCHAR_IS_UCS_4
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToW, 1, "UTF-32", WCHAR_CHARSET, utf32StringSrc, wStringDst, failOnBadChar);
-#endif // !WCHAR_IS_UTF32
+ return convert(m_iconvUtf32ToW, 1, UTF32_CHARSET, WCHAR_CHARSET, utf32StringSrc, wStringDst, failOnBadChar);
+#endif // !WCHAR_IS_UCS_4
}
bool CCharsetConverter::utf32logicalToVisualBiDi(const std::u32string& logicalStringSrc, std::u32string& visualStringDst, bool forceLTRReadingOrder /*= false*/)
@@ -539,13 +543,12 @@ bool CCharsetConverter::utf32logicalToVisualBiDi(const std::u32string& logicalSt
bool CCharsetConverter::wToUtf32(const std::wstring& wStringSrc, std::u32string& utf32StringDst, bool failOnBadChar /*= true*/)
{
-#ifdef WCHAR_IS_UTF32
- utf32StringDst.assign((const char32_t*)wStringSrc.c_str(), wStringSrc.length());
- return true;
-#else // !WCHAR_IS_UTF32
+#ifdef WCHAR_IS_UCS_4
+ /* UCS-4 is almost equal to UTF-32, but UTF-32 has strict limits on possible values, while UCS-4 is usually unchecked.
+ * With this "conversion" we ensure that output will be valid UTF-32 string. */
+#endif
CSingleLock lock(m_critSection);
- return convert(m_iconvWToUtf32, 1, WCHAR_CHARSET, "UTF-32", wStringSrc, utf32StringDst, failOnBadChar);
-#endif // !WCHAR_IS_UTF32
+ return convert(m_iconvWToUtf32, 1, WCHAR_CHARSET, UTF32_CHARSET, wStringSrc, utf32StringDst, failOnBadChar);
}
// The bVisualBiDiFlip forces a flip of characters for hebrew/arabic languages, only set to false if the flipping
@@ -706,7 +709,7 @@ bool CCharsetConverter::utf16LEtoW(const std::u16string& utf16String, std::wstri
bool CCharsetConverter::utf32ToStringCharset(const std::u32string& utf32StringSrc, std::string& stringDst)
{
CSingleLock lock(m_critSection);
- return convert(m_iconvUtf32ToStringCharset, 1, g_langInfo.GetGuiCharSet().c_str(), "UTF-32", utf32StringSrc, stringDst);
+ return convert(m_iconvUtf32ToStringCharset, 1, g_langInfo.GetGuiCharSet().c_str(), UTF32_CHARSET, utf32StringSrc, stringDst);
}
bool CCharsetConverter::utf8ToSystem(std::string& stringSrcDst, bool failOnBadChar /*= false*/)
Something went wrong with that request. Please try again.