Incorrect length of unicode strings using .encode('utf-8') #41179

Closed
edschofield mannequin opened this issue Nov 16, 2004 · 2 comments

edschofield mannequin commented Nov 16, 2004

BPO 1067294
Nosy @malemburg
Files
  • python unicode char length bug.txt: Code example exposing a bug in determining the length of utf-8 encoded strings
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/malemburg'
    closed_at = <Date 2004-11-16.12:12:44.000>
    created_at = <Date 2004-11-16.11:58:42.000>
    labels = ['expert-unicode']
    title = "Incorrect length of unicode strings using .encode('utf-8')"
    updated_at = <Date 2004-11-16.12:12:44.000>
    user = 'https://bugs.python.org/edschofield'

    bugs.python.org fields:

    activity = <Date 2004-11-16.12:12:44.000>
    actor = 'lemburg'
    assignee = 'lemburg'
    closed = True
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2004-11-16.11:58:42.000>
    creator = 'edschofield'
    dependencies = []
    files = ['1487']
    hgrepos = []
    issue_num = 1067294
    keywords = []
    message_count = 2.0
    messages = ['23167', '23168']
    nosy_count = 2.0
    nosy_names = ['lemburg', 'edschofield']
    pr_nums = []
    priority = 'normal'
    resolution = 'works for me'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1067294'
    versions = ['Python 2.4']

    edschofield mannequin commented Nov 16, 2004

    Python 2.3.4 and Python 2.4b2:

    print "x = %-15s" %(x.encode('utf-8'),) + " more text"

    pads with the wrong number of spaces when x is a Unicode
    character whose UTF-8 encoding takes two bytes, such as à.
    There is no such problem if x itself is formatted rather
    than the result of x.encode(...).

    The reason seems to be this: if x = u'\u00e0' (the
    character à) and s = x.encode('utf-8'), then len(s) == 2,
    so %-15s pads by byte count and the output comes out one
    column short on a UTF-8 terminal.

    A slightly longer example is attached.
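
The length mismatch described above can be reproduced with a short sketch. It is written for Python 3 (an assumption; the report concerns Python 2.3.4 and 2.4b2, where `u'...'` is the text type and the encoded result is an 8-bit `str`), and the variable names `text_padded` and `byte_padded` are illustrative only.

```python
# Python 3 sketch of the reported behaviour; not the attached example file.
x = "\u00e0"                 # the character à: one code point
s = x.encode("utf-8")        # b'\xc3\xa0': two bytes

text_padded = "%-15s" % x    # padding counts characters
byte_padded = b"%-15s" % s   # padding counts bytes (bytes % needs Python 3.5+)

print(len(x), len(s))                      # 1 2
print(len(text_padded))                    # 15 characters
print(len(byte_padded))                    # 15 bytes
print(len(byte_padded.decode("utf-8")))    # 14 characters once decoded
```

Padding the encoded value fills the field to 15 bytes, which is only 14 characters on a UTF-8 terminal, hence the missing space.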

    @edschofield edschofield mannequin closed this as completed Nov 16, 2004
    @edschofield edschofield mannequin assigned malemburg Nov 16, 2004
    @edschofield edschofield mannequin added the topic-unicode label Nov 16, 2004
    @malemburg

    As you already noted: the problem is that you are mixing Unicode
    and strings in a way which is bound to fail.

    You should use:

    print (u"x = %-15s" %x + u" more text").encode('utf-8')

    i.e. stay with Unicode as long as you can and call encode only
    as the last step when doing I/O, just before passing the string
    to an 8-bit stream.
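
A minimal sketch of the same advice in Python 3 (an assumption; the reply above uses Python 2 syntax). The helper name `format_line` is hypothetical and only serves to show the ordering: format and pad in text first, encode once at the end.

```python
def format_line(x):
    """Hypothetical helper: pad in text, encode once at the I/O boundary."""
    line = "x = %-15s" % x + " more text"   # padding counts characters
    return line.encode("utf-8")             # bytes for an 8-bit stream


ascii_line = format_line("a")
accent_line = format_line("\u00e0")

# Both lines have the same number of characters once decoded,
# although the accented one is one byte longer.
print(len(ascii_line.decode("utf-8")), len(accent_line.decode("utf-8")))  # 29 29
print(len(ascii_line), len(accent_line))                                  # 29 30
```

In Python 3 the result could then be written with something like `sys.stdout.buffer.write(format_line(x) + b"\n")`, or the encoding step could be left to a text stream opened with the desired encoding.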

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022