File read of Chinese utf-16-le treats upper byte 1A as EOF #39981

Closed
rrother mannequin opened this issue Feb 25, 2004 · 3 comments

Comments

@rrother
Mannequin

rrother mannequin commented Feb 25, 2004

BPO 904474
Nosy @malemburg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2005-10-03.01:19:09.000>
created_at = <Date 2004-02-25.19:30:42.000>
labels = ['invalid', 'expert-unicode']
title = 'File read of Chinese utf-16-le treats upper byte 1A as EOF'
updated_at = <Date 2005-10-03.01:19:09.000>
user = 'https://bugs.python.org/rrother'

bugs.python.org fields:

activity = <Date 2005-10-03.01:19:09.000>
actor = 'nnorwitz'
assignee = 'none'
closed = True
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2004-02-25.19:30:42.000>
creator = 'rrother'
dependencies = []
files = []
hgrepos = []
issue_num = 904474
keywords = []
message_count = 3.0
messages = ['20133', '20134', '20135']
nosy_count = 3.0
nosy_names = ['lemburg', 'nnorwitz', 'rrother']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue904474'
versions = []

@rrother
Mannequin Author

rrother mannequin commented Feb 25, 2004

Any utf-16-le Chinese character with 1A as the most
significant byte causes the remainder of the file to be ignored.

code extract:

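import codecs
import sys

(utf16_encoder, utf16_decoder, utf16_reader,
 utf16_writer) = codecs.lookup("utf-16-le")

# Note: the file is opened in text mode ("r"), as in the original report.
ifile = utf16_reader(open(sys.argv[1], "r"))

t = ifile.read()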

When the Chinese character 1A 5C () is encountered,
everything from the 5C is discarded.

These 3 lines:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="* éûUfw"

are input as:
English="You have not selected any books!"
Context=1,[MsgBox "You have not selected any books!"]
Chinese(Simplified)="
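
For readers following along, here is a minimal sketch of where a 1A byte can come from. It assumes that "1A 5C" in the report means the two bytes in file order, which in utf-16-le would correspond to code point U+5C1A; that reading is an assumption, not something the report confirms. The variable names are illustrative only.

```python
# Assumption: "1A 5C" are the bytes in file order, i.e. code point U+5C1A.
ch = "\u5c1a"

# Encode to UTF-16 little-endian: the low byte comes first.
raw = ch.encode("utf-16-le")
hex_bytes = [f"{b:02X}" for b in raw]
print(hex_bytes)  # ['1A', '5C']

# 0x1A is the Ctrl-Z control character in ASCII.
contains_ctrl_z = 0x1A in raw
print(contains_ctrl_z)  # True
```

So a perfectly valid UTF-16 character can contain a byte that looks like a control character when the file is read as plain text.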

@rrother rrother mannequin closed this as completed Feb 25, 2004
@malemburg
Member

Logged In: YES
user_id=38388

I believe there is a misconception here: open(..., "r")
opens the file in the C library's text mode. Since UTF-16
is binary data, this leads to problems with line breaking
and with file handling in general.

You should try:

import codecs
ifile = codecs.open(filename, 'rb', encoding='utf-16-le')
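
A self-contained way to check the binary-mode approach is sketched below. It swaps in a plain binary read followed by bytes.decode() rather than codecs.open(), and the helper name read_utf16_le, the sample line, and the temporary file are all hypothetical, chosen only to make the round trip testable.

```python
import os
import tempfile


def read_utf16_le(path):
    """Read a whole file as UTF-16-LE, opening it in binary mode."""
    # "rb" keeps the C library from interpreting any byte specially.
    with open(path, "rb") as f:
        return f.read().decode("utf-16-le")


# Hypothetical sample line containing U+5C1A (bytes 1A 5C in utf-16-le).
sample = 'Chinese(Simplified)="\u5c1a"\r\n'

fd, path = tempfile.mkstemp()
try:
    # Write the raw UTF-16-LE bytes, also in binary mode.
    with os.fdopen(fd, "wb") as f:
        f.write(sample.encode("utf-16-le"))
    text = read_utf16_le(path)
finally:
    os.remove(path)

print(text == sample)
```

Because nothing between the disk and the decoder treats the data as text, the 1A byte and the CR LF pair should come through unchanged.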

@nnorwitz
Mannequin

nnorwitz mannequin commented Oct 3, 2005

Logged In: YES
user_id=33168

MAL, this seems to come up from time to time. Perhaps we
should update the doc for open()? If it's already
documented, could we make it clearer? Then we should be
able to close this bug. I think I saw another bug recently
that was similar to this one.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022