Daniel-da6a changed the title from "SWIG_AsWCharPtrAndSize does not work correctly on Windows with code point > 2 bye" to "SWIG_AsWCharPtrAndSize does not work correctly on Windows with code point > 2 byte" on May 15, 2024
Current state on master branch / SWIG v4.2.1:
For wide strings, the fragment SWIG_AsWCharPtrAndSize (Lib/python/pywstrings.swg) is used. On Windows, this function does not return the correct wchar_t array if the original Python string contains code points that do not fit into a single 2-byte wchar_t, i.e. code points above U+FFFF, which need a surrogate pair in UTF-16.
For example, the Python string "🤠ABC" is returned as "🤠AB".
This is caused by the use of PyUnicode_GetSize in combination with PyUnicode_AsWideChar and the fact that wchar_t is only two bytes on Windows.
PyUnicode_GetSize is used to obtain the length of the string, which for the example above is 4 (one per code point). The function PyUnicode_AsWideChar then copies at most that many wchar_t elements. This is where the mismatch happens: since wchar_t is only 2 bytes on Windows, "🤠" occupies two wchar_t elements, so the number of wchar_t elements needed (5) is not the same as the length that was obtained (4). As a result, the last character is not copied.
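The mismatch can be reproduced in pure Python, without SWIG. This is only a minimal sketch, assuming that a Windows wchar_t buffer holds UTF-16 code units; the helper names `utf16_units` and `truncated_copy` are made up for illustration and are not part of SWIG or the Python C API.

```python
def utf16_units(text):
    """Number of 2-byte wchar_t elements needed on Windows (no terminator)."""
    return len(text.encode("utf-16-le")) // 2


def truncated_copy(text):
    """Simulate copying only len(text) UTF-16 code units into the buffer."""
    units = text.encode("utf-16-le")
    limit = len(text) * 2  # len() counts code points, 2 bytes per code unit
    # errors="replace" guards against a cut in the middle of a surrogate pair
    return units[:limit].decode("utf-16-le", errors="replace")


s = "\U0001F920ABC"  # "🤠ABC"
print(len(s), utf16_units(s))               # 4 5
print(truncated_copy(s) == "\U0001F920AB")  # True
```

The emoji takes two code units, so a buffer sized by the code point count is one element short and the trailing "C" is dropped.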
https://github.com/swig/swig/blob/7c2b245ceafb49552e559f8056c2618e84aad0b7/Lib/python/pywstrings.swg#L31C1-L44C74
Using PyUnicode_AsWideCharString might be a solution. Alternatively, PyUnicode_AsWideChar(SWIGPY_UNICODE_ARG(obj), NULL, 0) could be used to obtain the correct number of wchar_t elements on Windows; with a NULL buffer the returned count includes the terminating null.
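The size-query variant can be tried out from Python through ctypes. This is a rough sketch rather than a patch for pywstrings.swg: it assumes CPython (for `ctypes.pythonapi`), and the helper name `to_wchar_buffer` is hypothetical. The actual fix would have to be made in C inside the fragment.

```python
import ctypes

# Bind the C API function; PyUnicode_AsWideChar(unicode, wstr, size)
as_wide_char = ctypes.pythonapi.PyUnicode_AsWideChar
as_wide_char.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_wchar),
    ctypes.c_ssize_t,
]
as_wide_char.restype = ctypes.c_ssize_t


def to_wchar_buffer(text):
    """Copy text into a wchar_t buffer sized by the API itself."""
    # NULL buffer: returns the required element count, terminator included
    needed = as_wide_char(text, None, 0)
    if needed < 0:
        raise ValueError("size query failed")
    buf = ctypes.create_unicode_buffer(needed)
    # Returns the number of wchar_t elements copied, terminator excluded
    copied = as_wide_char(text, buf, needed)
    return buf, copied


s = "\U0001F920ABC"  # "🤠ABC"
buf, copied = to_wchar_buffer(s)
print(buf.value == s)  # True
```

Because the buffer size comes from the API rather than from the code point count, the same code works whether wchar_t is 2 bytes (Windows) or 4 bytes (most other platforms).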