Confusing statement about unicode strings in tutorial introduction #64885

Closed
Urhixidur mannequin opened this issue Feb 19, 2014 · 10 comments
Labels
docs Documentation in the Doc dir type-feature A feature request or enhancement

Comments

@Urhixidur
Mannequin

Urhixidur mannequin commented Feb 19, 2014

BPO 20686
Nosy @birkenfeld, @bitdancer, @serhiy-storchaka, @Urhixidur

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2020-05-31.13:04:50.718>
created_at = <Date 2014-02-19.16:24:55.899>
labels = ['type-feature', 'docs']
title = 'Confusing statement about unicode strings in tutorial introduction'
updated_at = <Date 2020-05-31.13:04:50.717>
user = 'https://github.com/Urhixidur'

bugs.python.org fields:

activity = <Date 2020-05-31.13:04:50.717>
actor = 'serhiy.storchaka'
assignee = 'docs@python'
closed = True
closed_date = <Date 2020-05-31.13:04:50.718>
closer = 'serhiy.storchaka'
components = ['Documentation']
creation = <Date 2014-02-19.16:24:55.899>
creator = 'Daniel.U..Thibault'
dependencies = []
files = []
hgrepos = []
issue_num = 20686
keywords = []
message_count = 10.0
messages = ['211627', '211635', '211638', '211721', '211726', '214217', '214223', '214268', '214302', '370436']
nosy_count = 5.0
nosy_names = ['georg.brandl', 'r.david.murray', 'docs@python', 'serhiy.storchaka', 'Daniel.U..Thibault']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue20686'
versions = ['Python 2.7']

@Urhixidur
Mannequin Author

Urhixidur mannequin commented Feb 19, 2014

Near the end of 3.1.3 http://docs.python.org/2/tutorial/introduction.html#unicode-strings you can read:

"When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

This can be interpreted as stating that printing a Unicode string (using the print function or the shell's default print behaviour) results in ASCII printout. It can likewise be interpreted as stating that any write of a Unicode string to a file converts the string to ASCII. Experimentation shows this is not true. Perhaps you meant something like this:

"When a Unicode string is converted with str() in order to be printed or written to a file, conversion takes place using this default encoding."

Grammatical comments: In the statement "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.", the ", or" puts the three elements of the enumeration on the same level (respectively "printed", "written to a file", and "converted with str()"). The confusion seems to arise because "with str()" was meant to apply to the list as a whole, not just its last element.

@Urhixidur Urhixidur mannequin assigned docspython Feb 19, 2014
@Urhixidur Urhixidur mannequin added docs Documentation in the Doc dir type-feature A feature request or enhancement labels Feb 19, 2014
@bitdancer
Member

It seems to me the statement is correct as written. What experiments indicate otherwise?

@birkenfeld
Member

The only problem I can see is that "print" uses the console encoding.

For files and str(), the comment is correct for Python 2.
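The two encodings being discussed here can be inspected directly. This is only a small sketch: the values depend on the interpreter version and on the environment, so no output is shown. On Python 2 the default encoding is normally ASCII, as the tutorial says, while on Python 3 `sys.getdefaultencoding()` returns 'utf-8'.

```python
import sys

# The "default encoding" the tutorial sentence refers to. In Python 2 it is
# used by str() and by implicit unicode-to-bytes conversions.
default_encoding = sys.getdefaultencoding()

# The encoding of the console stream that print writes to. It may be missing
# or None when output is redirected, so fetch it defensively.
console_encoding = getattr(sys.stdout, "encoding", None)

print("default encoding: %s" % default_encoding)
print("console encoding: %s" % console_encoding)
```

If the two values differ, that difference is a likely explanation for print and str() behaving differently on the same Unicode string.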

@Urhixidur
Mannequin Author

Urhixidur mannequin commented Feb 20, 2014

"It seems to me the statement is correct as written. What experiments indicate otherwise?"

Here's a simple one:

>>> print «1»

The guillemets are certainly not ASCII (Unicode AB and BB, well outside ASCII's 7F upper limit) but are rendered as guillemets. (Guillemets are easy for me 'cause I use a French keyboard) I haven't actually checked yet what happens when writing to a file. If Python is unable to write anything but ASCII to file, it becomes nearly useless.

@bitdancer
Member

Thanks, yes, Georg already pointed out the issue with print. I suppose that this is something that changed at some point in Python2's history but this bit of the docs was not updated.

Python can write anything to a file, you just have to tell it what encoding to use, either by explicitly encoding the unicode to binary before writing it to the file, or by using codecs.open and specifying an encoding for the file. (This is all much easier in python3, where the unicode support is part of the core of the language.)
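The two approaches mentioned here (encoding explicitly before writing, or using codecs.open with an encoding) might look roughly like the following. This is a sketch rather than a recommended recipe: the file names and the temporary directory are made up for the illustration, and UTF-8 is an assumed choice of encoding.

```python
import codecs
import os
import tempfile

# Guillemets written as escapes so the source file itself stays ASCII.
text = u"This is a \u00abtest\u00bb\n"

# Hypothetical scratch directory and file names, for the demo only.
workdir = tempfile.mkdtemp()
path_explicit = os.path.join(workdir, "explicit.txt")
path_codecs = os.path.join(workdir, "codecs.txt")

# Option 1: encode the Unicode string yourself, then write the bytes
# to a file opened in binary mode.
with open(path_explicit, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let codecs.open do the encoding as the string is written.
with codecs.open(path_codecs, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare what ended up on disk.
with open(path_explicit, "rb") as f:
    data_explicit = f.read()
with open(path_codecs, "rb") as f:
    data_codecs = f.read()

print(data_explicit == data_codecs)
```

Either way the encoding is chosen explicitly, so nothing falls back to the ASCII default.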

@Urhixidur
Mannequin Author

Urhixidur mannequin commented Mar 20, 2014

"The default encoding is normally set to ASCII [...]. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

>>> u"äöü"
u'\xe4\xf6\xfc'
   Printing a Unicode string uses ASCII encoding: false (the characters are not converted to their ASCII equivalents) (compare with str(), below)

>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
   Converting a Unicode string with str() uses ASCII encoding: true (if print (see above) behaved like str(), you'd get an error too)

>>> f = open('workfile', 'w')
>>> f.write('This is a «test»\n')
>>> f.close()
   Writing a Unicode string to a file uses ASCII encoding: false (examination of the file reveals UTF-8 characters (hex dump: 54 68 69 73 20 69 73 20 61 20 C2 AB 74 65 73 74 C2 BB 0A))

@bitdancer
Member

re: file. You forgot the 'u' in front of the string:

>>> f.write(u'This is a «test»\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 10: ordinal not in range(128)

So you were actually writing binary in your console encoding, which must have been utf-8. (This kind of confusion is the main reason python3 exists).
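The C2 AB and C2 BB bytes in the hex dump are what the guillemets look like once encoded as UTF-8, which can be checked with a short sketch like this one (the variable names are only illustrative):

```python
import binascii

guillemets = u"\u00ab\u00bb"  # the two guillemet characters, as escapes

# UTF-8 encodes each of these code points as two bytes.
utf8_bytes = guillemets.encode("utf-8")
utf8_hex = binascii.hexlify(utf8_bytes).decode("ascii")

# ASCII has no representation for them, so encoding fails.
try:
    guillemets.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_hex)
print(ascii_ok)
```

The same characters therefore produce either those four bytes or a UnicodeEncodeError, depending only on which codec is applied.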

@bitdancer bitdancer changed the title Confusing statement Confusing statement about unicode strings in tutorial introduction Mar 20, 2014
@Urhixidur
Mannequin Author

Urhixidur mannequin commented Mar 20, 2014

>>> mystring="äöü"
>>> myustring=u"äöü"

>>> mystring
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> myustring
u'\xe4\xf6\xfc'

>>> str(mystring)
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> str(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

>>> f = open('workfile', 'w')
>>> f.write(mystring)
>>> f.close()
>>> f = open('workufile', 'w')
>>> f.write(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> f.close()

workfile contains C3 A4 C3 B6 C3 BC

So the Unicode string (myustring) does indeed try to convert to ASCII when written to file. But not when just printed.

It seems really strange that non-Unicode strings (mystring) should actually be more flexible than Unicode strings...

@birkenfeld
Member

First, entering a string at the command prompt like this is not considered "printing"; it's invoking the repr().

Then, when you say flexible, you say it as if it's a good thing. In this context "flexible" means as much as "easy to produce mojibake" and is not desirable.

For all these use cases, there are ways to do the right thing with Unicode strings in Python 2 (e.g. using io.open instead of builtin open). But making these the builtin case was the big gain of Python 3.
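As one illustration of the io.open suggestion, the failing write from the earlier experiment could be done along these lines. This is a sketch under assumptions: UTF-8 is an assumed encoding choice, and the temporary directory is only there to keep the demo self-contained.

```python
import io
import os
import tempfile

text = u"\u00e4\u00f6\u00fc"  # the same three characters as myustring

# Hypothetical location for the demo file.
path = os.path.join(tempfile.mkdtemp(), "workufile")

# io.open takes an encoding and accepts Unicode strings directly,
# so no implicit ASCII conversion is attempted.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding gives the Unicode string back.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# The raw bytes on disk, for comparison with the hex dump of workfile.
with io.open(path, "rb") as f:
    raw = f.read()

print(round_tripped == text)
```

The bytes on disk should match the C3 A4 C3 B6 C3 BC seen in workfile, but this time the encoding is stated rather than inherited from the console.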

@serhiy-storchaka
Member

Python 2.7 is no longer supported.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022