Windows: WindowsConsoleIO produces mojibake replacement characters #110913

sorgloomer · 2023-10-16T02:32:01Z

Bug report

Bug description:

Hi! The following code reliably produces some unicode replacement characters �, on Windows, always in the same location. Works fine on Linux.

This report is a follow-up to this other one: #82052

A fix was already attempted, but as you can see, there are still some cases uncovered.

Example 1

print('é'*20001)

Example 2

This is an attempt at making a shorter example, and a bit of a stretch goal.

python -c "import sys;[sys.stdout.buffer.raw.write(b) for b in [b'\xc3', b'\xa9',b'\xc3\xa9']]"

CPython versions tested on:

3.10, 3.12

Operating systems tested on:

Linux, Windows

Linked PRs

zooba · 2023-10-16T16:54:32Z

Example 2 demonstrates the issue, but isn't relevant to our bugfix. It's bypassing our buffer, which is the only point we'd buffer a character and wait for the next one to be written.

I have no idea why example 1 works/fails though, unless it's a straightforward logic error in the code that's supposed to walk back at the end of the buffer.

sorgloomer · 2023-10-17T19:59:35Z

I think it is just that, a straightforward logic error. There was this comment in the original ticket:

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

The assessment here is incorrect, the code does not search for a "final byte", buf[i] & 0x80 != 0 just skips all the bytes, including "final bytes". Here is my attempt at an improved version:

while (wlen > 32766 / sizeof(wchar_t)) {
    len /= 2;
    orig_len = len;
    /* Reduce the length until we hit the final byte of a UTF-8 sequence
     * (top bit is unset). Fix for github issue 82052.
     */
    if (((char *)b->buf)[len-1] & 0x80) != 0) {
        // last byte is not for 1-byte character
        while (len > 0 && (((char *)b->buf)[len-1] & 0xc0) == 0x80)
            --len; // trace back all continuation bytes
        if (len > 0)
            --len; // trace back first byte of character
    }
    /* If we hit a length of 0, something has gone wrong. This shouldn't
     * be possible, as valid UTF-8 can have at most 3 non-final bytes
     * before a final one, and our buffer is way longer than that.
     * But to be on the safe side, if we hit this issue we just restore
     * the original length and let the console API sort it out.
     */
    if (len == 0) {
        len = orig_len;
    }
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

I originally wrote a full utf8 analyzer, but then I realized we don't need this kind of precision here. I'll leave it here for the record:

static PyObject *
_io__WindowsConsoleIO_write_impl(winconsoleio *self, PyTypeObject *cls,
                                 Py_buffer *b)
{
    //...
    DWORD len, wlen, orig_len, last_char_start_idx, n = 0;
    char current_byte;
    // ...
        while (wlen > 32766 / sizeof(wchar_t)) {
            len /= 2;
            orig_len = len;
    
            len = _find_last_complete_char_end((char *)b->buf, len);
            if (len == 0) {
                // found an encoding error, the last character starts at index 0, or the buffer was empty: fall back
                len = orig_len;
            }
            wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
        }
    // ...
}

DWORD _find_last_complete_char_end(char* buf, DWORD len) {
    DWORD i, actual_count, indicated_count;
    char b;
    if (len == 0) {
        // error, empty buffer
        return 0;
    }
    actual_count = 1;
    i = len - 1;
    if ((buf[i] & 0x80) == 0) {
        // ok, last byte is a 1-byte character
        return len;
    }
    // last byte is part of a multibyte character, we have to search the start of the character to see if it is complete
    for (;;) {
        if ((buf[i] & 0xc0) != 0x80) {
            // not a continuation byte, conclude search
            break;
        }
        if (i == 0) {
            // error, only continuation bytes found
            return 0;
        }
        i--;
        actual_count++;
        if (actual_count > 4) {
            // error, did not find a character boundary in 4 bytes
            return 0;
        }
    }
    indicated_count = _get_utf8_byte_count(buf[i]);
    if (indicated_count == 0) {
        // error, invalid start byte
        return 0;
    }
    if (indicated_count < actual_count) {
        // error, there was an excess of continuation bytes
        return 0;
    }
    if (indicated_count == actual_count) {
        // ok, character was actually complete, last byte coincides with buffer end
        return len;
    }
    // ok, last character is incomplete, return its start index
    return i;
}

DWORD _get_utf8_byte_count(char first_byte) {
    if ((first_byte & 0x80) == 0x00)
        return 1;
    if ((first_byte & 0xe0) == 0xc0)
        return 2;
    if ((first_byte & 0xf0) == 0xe0)
        return 3;
    if ((first_byte & 0xf8) == 0xf0)
        return 4;
    return 0;
}

zooba · 2023-10-17T20:52:45Z

Ah, I see, the end byte doesn't necessarily have to have 0x80 unset, but it does have to match the byte count.

Could we use some logic instead where we back up by 4 bytes, read the value there, and move forward by up to 4 bytes based on the character? That should avoid any bad loops, and we know we have at least 4 bytes there already. Basically:

while (wlen > 32766 / sizeof(wchar_t)) {
    len = len / 2;
    if (len > 4) {
        len += _get_utf8_byte_count(((char *)b->buf)[len - 4]) - 4;
    }
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

sorgloomer · 2023-10-17T21:12:33Z

I am afraid we can't avoid a loop here. By backing up exactly 4 bytes we might find ourselves in the middle of another utf8 sequence. _get_utf8_byte_count would return 0 in such case, as only the first byte of a sequence carries the necessary length information.

zooba · 2023-10-17T21:29:24Z

I must be confusing encoding schemes in my (fever-riddled, right now) head, there's at least one out there where you can detect how many bytes are left in a sequence from anywhere in a sequence.

Is it better to back up and scan forward a small amount? Or cap the scan backwards? We ought to be able to get away with checking no more than the longest valid sequence, right?

ChrisDenton · 2023-10-17T22:05:25Z

To be clear, UTF-8 has the follow encoding (in binary) for one to four bytes:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If on a continuation byte (high bits are 10) then we only need to scan backwards until we find a byte whose high bits are 11. Which will be in the previous 1 to 3 bytes, assuming valid UTF-8.

sorgloomer · 2023-10-17T22:07:54Z

We ought to be able to get away with checking no more than the longest valid sequence, right?

Indeed. As long as the encoding is correct, we will only need to back up at most 4 bytes. No more than 3 continuation bytes ( 0b10xxxxxx ) appear in a sequence in a valid utf-8 encoded text, so capping it should not affect legit usecases.

sorgloomer · 2023-10-18T00:47:28Z

How about

DWORD _find_last_utf8_boundary(char *buf, DWORD len) {
    /* This function never returns 0, returns the original len instead */
    DWORD count = 1;
    if (len == 0 || (buf[len - 1] & 0x80) == 0)
        return len;
    for (;; count++) {
        if (count > 3 || count >= len)
            return len;
        if ((buf[len - count] & 0xc0) != 0x80)
            return len - count;
    }
}

...

while (wlen > 32766 / sizeof(wchar_t)) {
    len /= 2;
    /* Fix for github issues gh-110913 and gh-82052. */
    len = _find_last_utf8_boundary(b->buf, len);
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

…H-111007) (cherry picked from commit 11312ea) Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>

(cherry picked from commit 11312ea) Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>

hugovk · 2023-11-09T21:02:50Z

Thank you for the report and fix!

…H-111007)

sorgloomer added the type-bug An unexpected behavior, bug, or error label Oct 16, 2023

Eclips4 added OS-windows topic-IO labels Oct 16, 2023

sorgloomer mentioned this issue Oct 17, 2023

gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text #111007

Merged

zooba pushed a commit that referenced this issue Oct 20, 2023

gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (GH-111007)

11312ea

bedevere-app bot mentioned this issue Oct 20, 2023

[3.12] gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (GH-111007) #111108

Merged

bedevere-app bot mentioned this issue Oct 20, 2023

[3.11] gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (GH-111007) #111109

Merged

zooba pushed a commit that referenced this issue Oct 20, 2023

gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (GH-111007)

69bcaf7

(cherry picked from commit 11312ea) Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>

zooba pushed a commit that referenced this issue Oct 20, 2023

gh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (GH-111007)

dfc8d7a

(cherry picked from commit 11312ea) Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>

hugovk closed this as completed Nov 9, 2023

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024

pythongh-110913: Fix WindowsConsoleIO chunking of UTF-8 text (pythonG…

18f2add

…H-111007)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

sorgloomer commented Oct 16, 2023 •

edited by bedevere-app bot

zooba commented Oct 16, 2023

sorgloomer commented Oct 17, 2023 •

edited

zooba commented Oct 17, 2023

sorgloomer commented Oct 17, 2023 •

edited

zooba commented Oct 17, 2023

ChrisDenton commented Oct 17, 2023

sorgloomer commented Oct 17, 2023

sorgloomer commented Oct 18, 2023

hugovk commented Nov 9, 2023

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

Comments

sorgloomer commented Oct 16, 2023 • edited by bedevere-app bot

Bug report

Bug description:

Example 1

Example 2

CPython versions tested on:

Operating systems tested on:

Linked PRs

zooba commented Oct 16, 2023

sorgloomer commented Oct 17, 2023 • edited

zooba commented Oct 17, 2023

sorgloomer commented Oct 17, 2023 • edited

zooba commented Oct 17, 2023

ChrisDenton commented Oct 17, 2023

sorgloomer commented Oct 17, 2023

sorgloomer commented Oct 18, 2023

hugovk commented Nov 9, 2023

sorgloomer commented Oct 16, 2023 •

edited by bedevere-app bot

sorgloomer commented Oct 17, 2023 •

edited

sorgloomer commented Oct 17, 2023 •

edited