Windows: WindowsConsoleIO produces mojibake replacement characters #110913
Comments
Example 2 demonstrates the issue, but isn't relevant to our bugfix. It bypasses our buffer, which is the only place where we'd hold back a character and wait for the next one to be written. I have no idea why example 1 fails, though, unless it's a straightforward logic error in the code that's supposed to walk back at the end of the buffer. |
I think it is just that, a straightforward logic error. There was this comment in the original ticket:
The assessment here is incorrect, the code does not search for a "final byte",
I originally wrote a full utf8 analyzer, but then I realized we don't need this kind of precision here. I'll leave it here for the record:
|
Ah, I see, the end byte doesn't necessarily have to have Could we use some logic instead where we back up by 4 bytes, read the value there, and move forward by up to 4 bytes based on the character? That should avoid any bad loops, and we know we have at least 4 bytes there already. Basically:
|
I am afraid we can't avoid a loop here. By backing up exactly 4 bytes we might find ourselves in the middle of another utf8 sequence. |
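To illustrate the objection above, here is a minimal sketch (the sample string and the `is_continuation` helper are hypothetical, chosen only for this example) in which stepping back exactly 4 bytes from the end of a buffer lands on a continuation byte, so the sequence length cannot be read from that position:

```python
# "x", EURO SIGN (3 bytes in UTF-8), "y", "z" -> b'x\xe2\x82\xacyz'
buf = "x\u20acyz".encode("utf-8")


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


landed = buf[-4]  # the byte we reach by backing up exactly 4 bytes
print(hex(landed), is_continuation(landed))
```

Here `buf[-4]` is the second byte of the three-byte sequence, which carries no length information, so some scanning from that point is still needed.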
I must be confusing encoding schemes in my (fever-riddled, right now) head; there's at least one out there where you can detect how many bytes are left in a sequence from anywhere in the sequence. Is it better to back up and scan forward a small amount, or to cap the backward scan? We ought to be able to get away with checking no more than the longest valid sequence, right? |
To be clear, UTF-8 has the following encoding (in binary) for one to four bytes:
If on a continuation byte (high bits are |
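The capped backward scan discussed above could look roughly like the following. This is a sketch in Python rather than the C code actually under discussion, and `incomplete_tail_length` is a hypothetical name. It relies only on the standard UTF-8 bit layout (listed in the comments) and does not validate the data beyond what is needed to find a dangling sequence.

```python
# Standard UTF-8 layouts (binary):
#   1 byte : 0xxxxxxx
#   2 bytes: 110xxxxx 10xxxxxx
#   3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
#   4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


def incomplete_tail_length(buf: bytes) -> int:
    """Return how many trailing bytes of buf belong to an unfinished sequence.

    Hypothetical helper: looks back at most 3 bytes, since a lead byte
    found 4 or more bytes from the end cannot still be waiting for data.
    """
    for back in range(1, min(3, len(buf)) + 1):
        byte = buf[-back]
        if (byte & 0xC0) == 0x80:
            continue  # continuation byte: keep walking back
        if byte >= 0xF0:
            expected = 4  # 11110xxx (invalid bytes >= 0xF8 are not rejected here)
        elif byte >= 0xE0:
            expected = 3  # 1110xxxx
        elif byte >= 0xC0:
            expected = 2  # 110xxxxx
        else:
            expected = 1  # 0xxxxxxx (ASCII)
        # Incomplete only if the lead byte promises more bytes than we have.
        return back if expected > back else 0
    return 0  # only continuation bytes in range: nothing to hold back


data = b"caf\xc3"  # ends with the first half of a two-byte sequence
split = len(data) - incomplete_tail_length(data)
print(data[:split], data[split:])
```

The loop is bounded by the longest valid sequence, so it cannot run away on malformed input; bytes after `split` would be held back until the next write.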
How about
|
Thank you for the report and fix! |
Bug report
Bug description:
Hi! The following code reliably produces some Unicode replacement characters (�) on Windows, always in the same location. It works fine on Linux.
This report is a follow-up to this other one: #82052
A fix was already attempted, but as you can see, some cases are still not covered.
Example 1
Example 2
This is an attempt at making a shorter example, and a bit of a stretch goal.
python -c "import sys;[sys.stdout.buffer.raw.write(b) for b in [b'\xc3', b'\xa9',b'\xc3\xa9']]"
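As a platform-independent illustration of why the split writes in Example 2 matter, the sketch below decodes the same three chunks twice using only the standard `codecs` module. It is an assumption that this mirrors the mechanism behind the console output; it does not call or reproduce the Windows console code path.

```python
import codecs

chunks = [b"\xc3", b"\xa9", b"\xc3\xa9"]  # same byte writes as Example 2

# Decoding each write on its own: the two halves of the first character
# are each undecodable, so each becomes U+FFFD.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# An incremental decoder holds back the dangling lead byte until the
# next chunk arrives, so the split character is reassembled.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
buffered = "".join(decoder.decode(chunk) for chunk in chunks)

print(ascii(naive))
print(ascii(buffered))
```

The first result contains two replacement characters followed by one `é`, while the second contains `é` twice, which is the output one would expect from the one-liner.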
CPython versions tested on:
3.10, 3.12
Operating systems tested on:
Linux, Windows