PyPy uses wrong file name encoding on Windows #4924

Open
agronholm opened this issue Mar 11, 2024 · 5 comments

@agronholm

The following script demonstrates 2 seemingly different issues:

from pathlib import Path

path = Path("MARKER-ö.txt")
path.touch()

path2 = next(p for p in Path.cwd().iterdir() if p.name.startswith("MARKER-"))
path2.chmod(0o777)
  1. The created file is named MARKER-Ã¶.txt and not MARKER-ö.txt as it should be (and as it is on CPython)
  2. The chmod operation fails with FileNotFoundError, even though the path was just returned by iterdir(), likely also due to a file name encoding mismatch

The issue appears to be caused by some code erroneously using the ISO-8859-1 (Latin-1) encoding, as the UTF-8 bytes 0xC3 0xB6 decoded as Latin-1 translate exactly to Ã¶.
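The suspected mismatch can be reproduced in pure Python on any platform. This is only a minimal sketch of the encoding arithmetic, assuming the name is encoded as UTF-8 somewhere and then decoded as Latin-1; it does not show where in PyPy that happens.

```python
name = "MARKER-ö.txt"

# Encode the intended name as UTF-8: "ö" becomes the two bytes 0xC3 0xB6.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'MARKER-\xc3\xb6.txt'

# Decoding those bytes as Latin-1 maps each byte to one character,
# so 0xC3 0xB6 turns into the two characters "Ã" and "¶".
mojibake = utf8_bytes.decode("latin-1")
assert mojibake == "MARKER-Ã¶.txt"

# The damage is reversible, which is typical of this kind of mismatch.
assert mojibake.encode("latin-1").decode("utf-8") == name
```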

@agronholm agronholm changed the title PyPy uses wong Windows file name encoding PyPy uses wrong Windows file name encoding Mar 11, 2024
@agronholm agronholm changed the title PyPy uses wrong Windows file name encoding PyPy uses wrong file name encoding on Windows Mar 11, 2024
@mattip mattip added the windows label Mar 11, 2024
@mattip
Member

mattip commented Mar 11, 2024

Related to #3890. Is this with latest PyPy 7.3.15?

@agronholm
Author

Yes, I just downloaded it today.

@mattip
Member

mattip commented Mar 11, 2024

When I try with 7.3.15 in a cmd.exe terminal, I get

>chcp
Active code page: 437
>pypy3.9-v7.3.15-win64\python.exe
Python 3.9.18 (9c4f8ef178b6, Jan 14 2024, 12:40:23)
[PyPy 7.3.15 with MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>> import pathlib
>>>> path = Path("MARKER-ö.txt")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc94' in position 20: surrogates not allowed

So that is one problem. To get your filename, I can do

from pathlib import Path
name = b'MARKER-\xc3\xb6.txt'.decode()
ord(name[7])  # 246, which checks out with `ord('ö')`
path = Path(name)
path.touch()

Now I see a file with the name MARKER-Ã¶.txt in File Explorer, just like in the issue report.

@agronholm
Author

For me, chcp reports 850 as the active code page.
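One possible reading of the `'\udc94'` in the UnicodeEncodeError above: both code pages mentioned in this thread, 437 and 850, encode ö as the single byte 0x94, and that byte is not valid UTF-8. The sketch below assumes (this is not confirmed anywhere in the thread) that the console input was decoded as UTF-8 with the `surrogateescape` error handler and later re-encoded strictly.

```python
# Both OEM code pages represent "ö" as the same single byte.
raw = "ö".encode("cp437")
assert raw == b"\x94"
assert "ö".encode("cp850") == raw

# 0x94 is not valid UTF-8 on its own; surrogateescape smuggles it
# through as the lone surrogate U+DC94.
escaped = raw.decode("utf-8", errors="surrogateescape")
assert escaped == "\udc94"

# A strict UTF-8 encode of that surrogate fails with the same reason
# text as in the traceback.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    reason = exc.reason
assert reason == "surrogates not allowed"

# "Position 20" matches the index of "ö" in the typed line.
typed_line = 'path = Path("MARKER-ö.txt")'
assert typed_line.index("ö") == 20
```

If this reading is right, the REPL failure is an input decoding problem that is separate from the file name written to disk.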

@mattip
Member

mattip commented Mar 11, 2024

The path to solving this goes via adding a test that can be run untranslated, trying to figure out where the encoding goes bad, and then fixing it.
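As a starting point for such a test, the check below restates the original report as a self-contained function. It is only a sketch: the function name `check_non_ascii_roundtrip` is made up for illustration, and it is plain application-level Python rather than something that follows PyPy's own test layout.

```python
import tempfile
from pathlib import Path


def check_non_ascii_roundtrip(name="MARKER-ö.txt"):
    """Create a non-ASCII file name, list it back, and chmod it.

    Returns the names reported by iterdir() so the caller can compare
    them with the name that was requested.
    """
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        (directory / name).touch()

        # The listing should contain exactly the requested name,
        # not a re-decoded variant such as "MARKER-Ã¶.txt".
        listed = [p.name for p in directory.iterdir()]

        # A path returned by iterdir() must be usable in later calls;
        # this is the step that raised FileNotFoundError in the report.
        found = next(p for p in directory.iterdir()
                     if p.name.startswith("MARKER-"))
        found.chmod(0o777)
        return listed


assert check_non_ascii_roundtrip() == ["MARKER-ö.txt"]
```

On an affected interpreter, either the final comparison or the chmod call would be expected to fail, which separates the two symptoms from the report.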
