Incorrect length of unicode strings using .encode('utf-8') #41179

edschofield · 2004-11-16T11:58:42Z

BPO	1067294
Nosy	@malemburg
Files	python unicode char length bug.txt: Code example exposing a bug in determining the length of utf-8 encoded strings

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2004-11-16.12:12:44.000>
created_at = <Date 2004-11-16.11:58:42.000>
labels = ['expert-unicode']
title = "Incorrect length of unicode strings using .encode('utf-8')"
updated_at = <Date 2004-11-16.12:12:44.000>
user = 'https://bugs.python.org/edschofield'

bugs.python.org fields:

activity = <Date 2004-11-16.12:12:44.000>
actor = 'lemburg'
assignee = 'lemburg'
closed = True
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2004-11-16.11:58:42.000>
creator = 'edschofield'
dependencies = []
files = ['1487']
hgrepos = []
issue_num = 1067294
keywords = []
message_count = 2.0
messages = ['23167', '23168']
nosy_count = 2.0
nosy_names = ['lemburg', 'edschofield']
pr_nums = []
priority = 'normal'
resolution = 'works for me'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue1067294'
versions = ['Python 2.4']

edschofield · 2004-11-16T11:58:42Z

Python 2.3.4 and Python 2.4b2:

print "x = %-15s" %(x.encode('utf-8'),) + " more text"

gives an incorrect number of spaces when x is a
character such as à, whose UTF-8 encoding is two bytes
long. There is no such problem if x itself is used
rather than the result of its encode(...) method.

The reason seems to be this: if x = u'\u00e0' (the
character à) and s = x.encode('utf-8'), then len(s) == 2,
so %-15s pads s with 13 spaces instead of 14, which
breaks the print command above on a UTF-8 terminal.

A slightly longer example is attached.
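
For readers who want to reproduce the length mismatch described above, here is a minimal sketch. It is written for Python 3.5 or later (an assumption; the report concerns Python 2.3.4 and 2.4b2), where `str` is Unicode text and `bytes` plays the role of the old 8-bit string. The variable names `padded_bytes` and `padded_text` are illustrative only.

```python
# Python 3 sketch of the reported behaviour (the report itself used Python 2).
x = "\u00e0"             # the character à: one code point
s = x.encode("utf-8")    # b'\xc3\xa0': two bytes

print(len(x))            # number of characters
print(len(s))            # number of bytes

# %-15s pads to a width of 15 units of whatever it is formatting:
# bytes for a bytes object, characters for a str.
padded_bytes = b"%-15s" % s
padded_text = "%-15s" % x

# Decoding the padded bytes shows the field is one character short.
print(len(padded_bytes), len(padded_bytes.decode("utf-8")), len(padded_text))
```

The padded byte string is 15 bytes long but decodes to only 14 characters, which is why the column looks one space short on a UTF-8 terminal.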

malemburg · 2004-11-16T12:12:44Z

As you already noted, the problem is that you are mixing Unicode
and 8-bit strings in a way that is bound to fail.

You should use:

print (u"x = %-15s" %x + u" more text").encode('utf-8')

i.e. stay with Unicode as long as you can and only call encode
when doing I/O, as the last step before passing the string
off to an 8-bit stream.
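
The advice above can be sketched as follows. This is a Python 3 rendering (an assumption; the original snippet is Python 2), and the helper name `format_line` is hypothetical, introduced only for illustration.

```python
def format_line(x):
    """Pad while the data is still Unicode, then encode once at the end."""
    text = "x = %-15s" % x + " more text"   # width counted in characters
    return text.encode("utf-8")             # bytes for an 8-bit stream


line = format_line("\u00e0")
decoded = line.decode("utf-8")

# The field holding à is 15 characters wide; the encoded line is one
# byte longer than the decoded text because à takes two bytes.
print(repr(decoded), len(decoded), len(line))
```

Because the padding is computed before encoding, the field width stays at 15 characters regardless of how many bytes each character needs in UTF-8.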

edschofield mannequin closed this as completed Nov 16, 2004

edschofield mannequin assigned malemburg Nov 16, 2004

edschofield mannequin added the topic-unicode label Nov 16, 2004

ezio-melotti transferred this issue from another repository Apr 9, 2022
