Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

Closed
sorgloomer opened this issue Oct 16, 2023 · 9 comments
Closed

Windows: WindowsConsoleIO produces mojibake replacement characters #110913

sorgloomer opened this issue Oct 16, 2023 · 9 comments
Labels
OS-windows topic-IO type-bug An unexpected behavior, bug, or error

Comments

@sorgloomer
Copy link
Contributor

sorgloomer commented Oct 16, 2023

Bug report

Bug description:

Hi! The following code reliably produces some unicode replacement characters �, on Windows, always in the same location. Works fine on Linux.

This report is a follow-up to this other one: #82052

A fix was already attempted, but as you can see, there are still some cases uncovered.

Example 1

print('é'*20001)

image

Example 2

This is an attempt at making a shorter example, and a bit of a stretch goal.

python -c "import sys;[sys.stdout.buffer.raw.write(b) for b in [b'\xc3', b'\xa9',b'\xc3\xa9']]"

image

CPython versions tested on:

3.10, 3.12

Operating systems tested on:

Linux, Windows

Linked PRs

@sorgloomer sorgloomer added the type-bug An unexpected behavior, bug, or error label Oct 16, 2023
@zooba
Copy link
Member

zooba commented Oct 16, 2023

Example 2 demonstrates the issue, but isn't relevant to our bugfix. It's bypassing our buffer, which is the only point we'd buffer a character and wait for the next one to be written.

I have no idea why example 1 works/fails though, unless it's a straightforward logic error in the code that's supposed to walk back at the end of the buffer.

@sorgloomer
Copy link
Contributor Author

sorgloomer commented Oct 17, 2023

I think it is just that, a straightforward logic error. There was this comment in the original ticket:

One thing this doesn't handle is if there isn't a UTF-8 final byte in the part of the buffer we are planning on converting, so len goes to 0. I can't imagine how that would happen, but I haven't checked if that's a possibility. And if it is, I don't know what else we can do about it...

The assessment here is incorrect, the code does not search for a "final byte", buf[i] & 0x80 != 0 just skips all the bytes, including "final bytes". Here is my attempt at an improved version:

while (wlen > 32766 / sizeof(wchar_t)) {
    len /= 2;
    orig_len = len;
    /* Reduce the length until we hit the final byte of a UTF-8 sequence
     * (top bit is unset). Fix for github issue 82052.
     */
    if (((char *)b->buf)[len-1] & 0x80) != 0) {
        // last byte is not for 1-byte character
        while (len > 0 && (((char *)b->buf)[len-1] & 0xc0) == 0x80)
            --len; // trace back all continuation bytes
        if (len > 0)
            --len; // trace back first byte of character
    }
    /* If we hit a length of 0, something has gone wrong. This shouldn't
     * be possible, as valid UTF-8 can have at most 3 non-final bytes
     * before a final one, and our buffer is way longer than that.
     * But to be on the safe side, if we hit this issue we just restore
     * the original length and let the console API sort it out.
     */
    if (len == 0) {
        len = orig_len;
    }
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

I originally wrote a full utf8 analyzer, but then I realized we don't need this kind of precision here. I'll leave it here for the record:

static PyObject *
_io__WindowsConsoleIO_write_impl(winconsoleio *self, PyTypeObject *cls,
                                 Py_buffer *b)
{
    //...
    DWORD len, wlen, orig_len, last_char_start_idx, n = 0;
    char current_byte;
    // ...
        while (wlen > 32766 / sizeof(wchar_t)) {
            len /= 2;
            orig_len = len;
    
            len = _find_last_complete_char_end((char *)b->buf, len);
            if (len == 0) {
                // found an encoding error, the last character starts at index 0, or the buffer was empty: fall back
                len = orig_len;
            }
            wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
        }
    // ...
}

DWORD _find_last_complete_char_end(char* buf, DWORD len) {
    DWORD i, actual_count, indicated_count;
    char b;
    if (len == 0) {
        // error, empty buffer
        return 0;
    }
    actual_count = 1;
    i = len - 1;
    if ((buf[i] & 0x80) == 0) {
        // ok, last byte is a 1-byte character
        return len;
    }
    // last byte is part of a multibyte character, we have to search the start of the character to see if it is complete
    for (;;) {
        if ((buf[i] & 0xc0) != 0x80) {
            // not a continuation byte, conclude search
            break;
        }
        if (i == 0) {
            // error, only continuation bytes found
            return 0;
        }
        i--;
        actual_count++;
        if (actual_count > 4) {
            // error, did not find a character boundary in 4 bytes
            return 0;
        }
    }
    indicated_count = _get_utf8_byte_count(buf[i]);
    if (indicated_count == 0) {
        // error, invalid start byte
        return 0;
    }
    if (indicated_count < actual_count) {
        // error, there was an excess of continuation bytes
        return 0;
    }
    if (indicated_count == actual_count) {
        // ok, character was actually complete, last byte coincides with buffer end
        return len;
    }
    // ok, last character is incomplete, return its start index
    return i;
}

DWORD _get_utf8_byte_count(char first_byte) {
    if ((first_byte & 0x80) == 0x00)
        return 1;
    if ((first_byte & 0xe0) == 0xc0)
        return 2;
    if ((first_byte & 0xf0) == 0xe0)
        return 3;
    if ((first_byte & 0xf8) == 0xf0)
        return 4;
    return 0;
}

@zooba
Copy link
Member

zooba commented Oct 17, 2023

Ah, I see, the end byte doesn't necessarily have to have 0x80 unset, but it does have to match the byte count.

Could we use some logic instead where we back up by 4 bytes, read the value there, and move forward by up to 4 bytes based on the character? That should avoid any bad loops, and we know we have at least 4 bytes there already. Basically:

while (wlen > 32766 / sizeof(wchar_t)) {
    len = len / 2;
    if (len > 4) {
        len += _get_utf8_byte_count(((char *)b->buf)[len - 4]) - 4;
    }
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

@sorgloomer
Copy link
Contributor Author

sorgloomer commented Oct 17, 2023

I am afraid we can't avoid a loop here. By backing up exactly 4 bytes we might find ourselves in the middle of another utf8 sequence. _get_utf8_byte_count would return 0 in such case, as only the first byte of a sequence carries the necessary length information.

@zooba
Copy link
Member

zooba commented Oct 17, 2023

I must be confusing encoding schemes in my (fever-riddled, right now) head, there's at least one out there where you can detect how many bytes are left in a sequence from anywhere in a sequence.

Is it better to back up and scan forward a small amount? Or cap the scan backwards? We ought to be able to get away with checking no more than the longest valid sequence, right?

@ChrisDenton
Copy link

To be clear, UTF-8 has the follow encoding (in binary) for one to four bytes:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If on a continuation byte (high bits are 10) then we only need to scan backwards until we find a byte whose high bits are 11. Which will be in the previous 1 to 3 bytes, assuming valid UTF-8.

@sorgloomer
Copy link
Contributor Author

We ought to be able to get away with checking no more than the longest valid sequence, right?

Indeed. As long as the encoding is correct, we will only need to back up at most 4 bytes. No more than 3 continuation bytes ( 0b10xxxxxx ) appear in a sequence in a valid utf-8 encoded text, so capping it should not affect legit usecases.
image

@sorgloomer
Copy link
Contributor Author

How about

DWORD _find_last_utf8_boundary(char *buf, DWORD len) {
    /* This function never returns 0, returns the original len instead */
    DWORD count = 1;
    if (len == 0 || (buf[len - 1] & 0x80) == 0)
        return len;
    for (;; count++) {
        if (count > 3 || count >= len)
            return len;
        if ((buf[len - count] & 0xc0) != 0x80)
            return len - count;
    }
}

...

while (wlen > 32766 / sizeof(wchar_t)) {
    len /= 2;
    /* Fix for github issues gh-110913 and gh-82052. */
    len = _find_last_utf8_boundary(b->buf, len);
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
}

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 20, 2023
…H-111007)

(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 20, 2023
…H-111007)

(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>
zooba pushed a commit that referenced this issue Oct 20, 2023
(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>
zooba pushed a commit that referenced this issue Oct 20, 2023
(cherry picked from commit 11312ea)

Co-authored-by: Tamás Hegedűs <sorgloomer@users.noreply.github.com>
@hugovk
Copy link
Member

hugovk commented Nov 9, 2023

Thank you for the report and fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OS-windows topic-IO type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants