Python cannot run in the ja_JP.sjis locale used windows-31j encoding. #102388

Closed
moriyama opened this issue Mar 3, 2023 · 2 comments
Labels
topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

Contributor

moriyama commented Mar 3, 2023

Bug report

On Linux with glibc, Python fails to start when the ja_JP.sjis locale is created from the WINDOWS-31J charmap and selected as follows.

$ sudo dnf install glibc-locale-source # RHEL or RHEL compatible Linux distribution
$ sudo localedef -f WINDOWS-31J -i ja_JP ja_JP.sjis
$ export LANG=ja_JP.SJIS
$ python3
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = (not set)
  program name = 'python3'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/usr/bin/python3'
  sys.base_prefix = '/usr'
  sys.base_exec_prefix = '/usr'
  sys.platlibdir = 'lib64'
  sys.executable = '/usr/bin/python3'
  sys.prefix = '/usr'
  sys.exec_prefix = '/usr'
  sys.path = [
    '/usr/lib64/python39.zip',
    '/usr/lib64/python3.9',
    '/usr/lib64/python3.9/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
LookupError: unknown encoding: WINDOWS-31J

Current thread 0x00007f49d76ee740 (most recent call first):
<no Python frame>
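
The failure above can be reproduced without changing the locale. The sketch below assumes that the interpreter looks up a codec by the codeset name reported for the locale (here WINDOWS-31J), which is what the LookupError suggests. The helper name python_codec_for is hypothetical, and the result for WINDOWS-31J depends on whether the running interpreter already has the alias.

```python
import codecs


def python_codec_for(codeset):
    """Return the canonical Python codec name for a locale codeset.

    Returns None when Python has no codec or alias for that name, which is
    the situation reported in the fatal error above.
    """
    try:
        return codecs.lookup(codeset).name
    except LookupError:
        return None


# Codeset names as a C library might report them for a locale.
for codeset in ("UTF-8", "SHIFT_JIS", "WINDOWS-31J"):
    print(codeset, "->", python_codec_for(codeset))
```

On an interpreter without the alias, the WINDOWS-31J line prints None. On one that includes the alias it should resolve to cp932.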

The charset name "Windows-31J" is registered in the IANA Charset Registry[1].
Windows-31J is supported by Perl[2], PHP[3], Ruby[4], Java[5], etc.
Python's cp932 is equivalent to Windows-31J, so I propose adding windows_31j as an alias for cp932.
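
As a rough illustration of the proposal, the sketch below works on a copy of the alias table from encodings.aliases and adds a windows_31j entry pointing at cp932. The helpers alias_key and resolve_with_alias are hypothetical, and the normalization is a simplification of what the encodings package does. Note that this only models the lookup; it cannot be used as a workaround, because the fatal error happens during interpreter startup before any user code runs.

```python
import codecs
import encodings.aliases


def alias_key(charset_name):
    # Simplified normalization (assumption): lowercase, hyphens to underscores.
    return charset_name.lower().replace("-", "_")


def resolve_with_alias(charset_name, alias_table):
    """Resolve a charset name through alias_table, then look up the codec."""
    key = alias_key(charset_name)
    target = alias_table.get(key, key)
    return codecs.lookup(target).name


# Work on a copy so the interpreter's real alias table is left untouched.
table = dict(encodings.aliases.aliases)
table.setdefault("windows_31j", "cp932")

print(resolve_with_alias("WINDOWS-31J", table))
```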

[1] https://www.iana.org/assignments/charset-reg/windows-31J
[2] https://perldoc.perl.org/Encode::JP
[3] https://www.php.net/manual/en/mbstring.encodings.php
[4] https://docs.ruby-lang.org/ja/latest/class/Encoding.html
[5] https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html

Your environment

  • CPython versions tested on: 3.9.13, 3.12.0a5+
  • Operating system and architecture: MIRACLE LINUX 8.6 x86_64 (RHEL 8.6 compatible)

Linked PRs

moriyama added the type-bug label Mar 3, 2023
moriyama added a commit to moriyama/cpython that referenced this issue Mar 3, 2023
The charset name "Windows-31J" is registered in the IANA Charset Registry[1]
and is implemented in Python as the cp932 codec.

This commit adds windows_31j to the aliases of cp932.

[1] https://www.iana.org/assignments/charset-reg/windows-31J

Signed-off-by: Masayuki Moriyama <masayuki.moriyama@miraclelinux.com>
Contributor Author

moriyama commented Mar 6, 2023

Here is a supplementary explanation of why Python's cp932 and IANA's Windows-31J are equivalent.

IANA Windows-31J:
The IANA Windows-31J registration[1] lists CP932.TXT[2], published on unicode.org, as its ISO 10646 equivalency table.

Python cp932:
The conversion table between Python's cp932 and Unicode is defined in mappings_jp.h[3] of the cjkcodecs module.
mappings_jp.h is generated by genmap_japanese.py[4], which builds the conversion tables between cp932 and Unicode from CP932.TXT[2].

https://github.com/python/cpython/blob/main/Tools/unicode/genmap_japanese.py#L26

MAPPINGS_CP932 = 'http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT'

Python's cp932 conversion tables are generated from the same CP932.TXT[2] that IANA's Windows-31J registration references, so I think it is safe to add windows_31j as an alias of cp932.

[1] https://www.iana.org/assignments/charset-reg/windows-31J
[2] http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
[3] https://github.com/python/cpython/blob/main/Modules/cjkcodecs/mappings_jp.h
[4] https://github.com/python/cpython/blob/main/Tools/unicode/genmap_japanese.py
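
The equivalence argument above can be spot-checked against the running interpreter. The sketch below parses a few mapping lines written in the layout of CP932.TXT (byte code, Unicode code point, comment) and compares them with Python's cp932 codec. The sample is a small hand-copied excerpt for illustration, not the full file, and parse_mapping is a hypothetical helper rather than anything from genmap_japanese.py.

```python
# A few entries in the layout used by CP932.TXT (hand-copied excerpt).
SAMPLE = """\
0x41\t0x0041\t#LATIN CAPITAL LETTER A
0xA1\t0xFF61\t#HALFWIDTH IDEOGRAPHIC FULL STOP
0x8140\t0x3000\t#IDEOGRAPHIC SPACE
0x8160\t0xFF5E\t#FULLWIDTH TILDE
0x82A0\t0x3042\t#HIRAGANA LETTER A
0x8740\t0x2460\t#CIRCLED DIGIT ONE
"""


def parse_mapping(text):
    """Yield (encoded_bytes, unicode_char) pairs from mapping-table text."""
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) != 2:
            continue  # skip comments, blank lines, and lead-byte markers
        code = int(fields[0], 16)
        ucs = int(fields[1], 16)
        # Codes above 0xFF are double-byte sequences, stored big-endian.
        raw = code.to_bytes(2 if code > 0xFF else 1, "big")
        yield raw, chr(ucs)


# Collect any entry where Python's cp932 codec disagrees with the table.
mismatches = [
    (raw, ch) for raw, ch in parse_mapping(SAMPLE) if raw.decode("cp932") != ch
]
print(mismatches)
```

An empty list means the codec agrees with every sampled entry. Running the same check over the full CP932.TXT would be the more thorough version of this comparison.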

moriyama added a commit to moriyama/cpython that referenced this issue Apr 5, 2023
moriyama added a commit to moriyama/cpython that referenced this issue Feb 19, 2024
moriyama added a commit to moriyama/cpython that referenced this issue Feb 19, 2024
methane pushed a commit that referenced this issue Feb 19, 2024
The charset name "Windows-31J" is registered in the IANA Charset Registry[1]
and is implemented in Python as the cp932 codec.

[1] https://www.iana.org/assignments/charset-reg/windows-31J

Signed-off-by: Masayuki Moriyama <masayuki.moriyama@miraclelinux.com>
woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…02389)
Member

hugovk commented Mar 15, 2024

Closing as the PR has been merged. Thanks!

@hugovk hugovk closed this as completed Mar 15, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…02389)