
Encoding a unicode with unicode() and ignoring errors #63063

Closed
GScottJohnston mannequin opened this issue Aug 28, 2013 · 2 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@GScottJohnston
Mannequin

GScottJohnston mannequin commented Aug 28, 2013

BPO 18863
Nosy @ned-deily

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-08-28.06:45:08.820>
created_at = <Date 2013-08-28.06:28:31.713>
labels = ['type-bug']
title = 'Encoding a unicode with unicode() and ignoring errors'
updated_at = <Date 2013-08-28.06:45:08.782>
user = 'https://bugs.python.org/GScottJohnston'

bugs.python.org fields:

activity = <Date 2013-08-28.06:45:08.782>
actor = 'ned.deily'
assignee = 'none'
closed = True
closed_date = <Date 2013-08-28.06:45:08.820>
closer = 'ned.deily'
components = []
creation = <Date 2013-08-28.06:28:31.713>
creator = 'G..Scott.Johnston'
dependencies = []
files = []
hgrepos = []
issue_num = 18863
keywords = []
message_count = 2.0
messages = ['196350', '196351']
nosy_count = 2.0
nosy_names = ['ned.deily', 'G..Scott.Johnston']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue18863'
versions = ['Python 2.7']

@GScottJohnston
Mannequin Author

GScottJohnston mannequin commented Aug 28, 2013

I've come up with the following series of minimal examples to demonstrate my bug.

>>> unicode("")
u''
>>> unicode("", errors="ignore")
u''


>>> unicode("abcü")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> unicode("abcü", errors="ignore")
u'abc'


>>> unicode(3)
u'3'
>>> unicode(3, errors="ignore")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, int found


>>> unicode(unicode(""))
u''
>>> unicode(unicode(""), errors="ignore")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported

The first two pairs of mini-programs behave reasonably. If the errors parameter is set to "ignore", no errors are raised, and bytes that cannot be decoded are skipped in the output, as expected.

The third pair of mini-programs can be worked around by writing unicode(str(3), errors="ignore") instead. This should arguably be done automatically, given that unicode(3) behaves as expected and properly converts between types. Because the conversion happens automatically without the errors parameter, I believe there is a logic problem in the code: setting errors="ignore" changes the path of execution by more than just skipping characters that cause decoding errors.

The fourth pair of mini-programs is simply baffling. The first mini-program clearly demonstrates that decoding a Unicode object is in fact supported. The fact that the second mini-program claims it's not supported further demonstrates that the logic depends on the errors="ignore" parameter more than it should.
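
For readers who hit the same surprise, the str(3) workaround can be generalised into a small wrapper. The sketch below is only an illustration, not part of the original report: it is written in Python 3 syntax so it runs today, which means Python 3's `str` and `bytes` stand in for Python 2's `unicode` and 8-bit `str`. The name `to_text` and its defaults are hypothetical.

```python
def to_text(obj, encoding="ascii", errors="ignore"):
    """Hypothetical helper: return text for any object, decoding only byte strings."""
    if isinstance(obj, str):
        # Already text: there is nothing to decode, so return it unchanged.
        return obj
    if isinstance(obj, (bytes, bytearray)):
        # Byte strings are decoded; with errors="ignore" undecodable bytes are dropped.
        return bytes(obj).decode(encoding, errors)
    # Anything else (ints, floats, ...) goes through its ordinary string conversion.
    return str(obj)


print(to_text(3))               # an int is converted, not decoded
print(to_text("abc"))           # text passes through untouched
print(to_text(b"abc\xc3\xbc"))  # the two non-ASCII bytes are skipped
```

The design choice here is to decide what to do from the type of the argument rather than from whether an errors argument was passed, which is the behaviour the report was expecting.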

@ned-deily
Member

See http://docs.python.org/2/library/functions.html#unicode. It appears to me that unicode() is behaving exactly as documented. In particular:

"If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding."

"If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied."

One can argue about whether this documented behavior makes the most sense, but since it is documented to behave that way, and since any significant change to that behavior at this late stage in the life of Python 2 could break existing programs, I think there will be little support for making such a change now. Sorry!
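
The two documented modes quoted above can be modelled roughly as follows. This is a sketch of the documented dispatch only, not CPython's implementation. It is written in Python 3 syntax so it can be run today, with Python 3's `str` and `bytes` standing in for Python 2's `unicode` and 8-bit `str`. The name `unicode_model`, the `_MISSING` sentinel, and the ASCII default (chosen to match the 'ascii' codec in the traceback above) are assumptions made for illustration.

```python
_MISSING = object()  # sentinel: tells "argument not passed" apart from any real value


def unicode_model(obj=b"", encoding=_MISSING, errors=_MISSING):
    """Rough model of the documented Python 2 unicode() dispatch (illustrative only)."""
    if encoding is _MISSING and errors is _MISSING:
        # Mode 1: no optional parameters, so mimic a plain string conversion.
        if isinstance(obj, str):
            return obj                  # text is returned without any decoding
        if isinstance(obj, bytes):
            return obj.decode("ascii")  # strict decoding with the assumed default codec
        return str(obj)                 # e.g. 3 -> "3"

    # Mode 2: encoding and/or errors given, so the object must be decodable bytes.
    if isinstance(obj, str):
        raise TypeError("decoding Unicode is not supported")
    if not isinstance(obj, (bytes, bytearray)):
        raise TypeError(
            "coercing to Unicode: need string or buffer, %s found" % type(obj).__name__
        )
    enc = "ascii" if encoding is _MISSING else encoding
    err = "strict" if errors is _MISSING else errors
    return bytes(obj).decode(enc, err)


print(unicode_model(3))                                 # mode 1: converted like str()
print(unicode_model(b"abc\xc3\xbc", errors="ignore"))   # mode 2: undecodable bytes skipped
```

In this model, passing `errors` switches the function into the decode-only mode, so `unicode_model(3, errors="ignore")` and `unicode_model("", errors="ignore")` both raise `TypeError`, mirroring the third and fourth pairs in the report.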

@ned-deily ned-deily added the type-bug An unexpected behavior, bug, or error label Aug 28, 2013
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022