Fix unicode literals #58384

Jean-MichelFauth · 2012-03-02T12:36:56Z

BPO	14176
Nosy	@loewis, @birkenfeld, @terryjreedy, @benjaminp, @ezio-melotti, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2012-03-02.23:21:06.120>
created_at = <Date 2012-03-02.12:36:56.284>
labels = ['type-bug', 'expert-unicode']
title = 'Fix unicode literals'
updated_at = <Date 2012-03-04.20:09:55.462>
user = 'https://bugs.python.org/Jean-MichelFauth'

bugs.python.org fields:

activity = <Date 2012-03-04.20:09:55.462>
actor = 'loewis'
assignee = 'none'
closed = True
closed_date = <Date 2012-03-02.23:21:06.120>
closer = 'terry.reedy'
components = ['Unicode']
creation = <Date 2012-03-02.12:36:56.284>
creator = 'Jean-Michel.Fauth'
dependencies = []
files = []
hgrepos = []
issue_num = 14176
keywords = []
message_count = 15.0
messages = ['154763', '154765', '154782', '154792', '154793', '154794', '154796', '154797', '154798', '154806', '154807', '154808', '154829', '154834', '154907']
nosy_count = 8.0
nosy_names = ['loewis', 'georg.brandl', 'terry.reedy', 'benjamin.peterson', 'ezio.melotti', 'jmfauth', 'r.david.murray', 'Jean-Michel.Fauth']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue14176'
versions = ['Python 3.3']

Jean-MichelFauth · 2012-03-02T12:36:55Z

Now that PEP-414 has been accepted, I can
only strongly recommend fixing the problem
of unicode literals as a partial workaround.

>>> print u'abcœé€'
abcé
>>>

If these six characters are not rendered correctly, you
should read:
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LIGATURE OE
LATIN SMALL LETTER E WITH ACUTE
EURO SIGN

It is not necessary to give here the list of
the numerous libs that do not understand
u'unicode literals' as shown above.

(I wrote all my Py2 code in a u'unicode mode',
and I know how hard it is to have to choose
between the u'' and unicode() variants.)

Face it. Python has never worked [*], Python does
not work, Python will never work. More important,
it is more than clear to me, there is no willingness
to solve this issue. (The holy compatibility with
non-working code).

[*] Except the pure ASCII series (Py 1.5) and the
Python 3.[0,1,2] series.

No offense. I'm pretty sure the creator of this
PEP is not even able to type on his machine the
list of the 42 characters supposed to be available
in the typographies (plural) used by the different
countries speaking French.
The whole free/open source software disaster in all
its splendor.

Regards.
jmf

benjaminp · 2012-03-02T13:53:25Z

What exactly is the bug you're reporting?

Python 2.7.2 (default, Oct 27 2011, 22:35:02) 
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'abcœé€'
abcœé€

loewis · 2012-03-02T17:06:56Z

What operating system and what terminal are you using? If Windows: what code page does your terminal run in?
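The information asked for here can be gathered from inside Python. A minimal sketch, assuming Python 3; the helper name `describe_text_environment` is made up for illustration and is not part of any library:

```python
import locale
import sys


def describe_text_environment():
    """Collect the encodings Python reports for this process (hypothetical helper)."""
    return {
        "platform": sys.platform,
        # Encoding used when printing; may be None if stdout has been replaced.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding the locale prefers for text data.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print(key, "=", value)
```

On a French Windows console the reported values would typically differ between cmd.exe and IDLE, which is why the question matters.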

Jean-MichelFauth · 2012-03-02T20:08:40Z

I deliberately hid the information about the interactive
interpreter I used, just to show you the "experience" of a new Python
user. (This is what I'm showing to potential Python devs who
are interested in this tool; I have known and used Python since
v. 1.5.6, as a non computer scientist).

The interactive interpreter was:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

>>>

In that precise case, it was Windows 7 Pro (Windows 7
Professionnel, in French because it is a Swiss French version)
and IDLE is just the IDLE an end user sees after a fresh
installation.
I can assure you, such behaviour exists / existed on all
Windows versions I used (from Win98, Win2000, ...) with all
the Python 2 versions since the introduction of unicode.

The technical reasons/aspects: "sys.defaultencoding",
non iso-8859-1 chars [#], *non working unicode literals*,
sys.stdout.encoding = 'cp1252' and so on.

[#] For those who do not know, one cannot write text
in French with Latin-1.
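The footnote can be checked directly. A small sketch, assuming Python 3; `can_encode` and the sample table are illustrative names, not from the thread. It tests which of the three non-ASCII characters from the report fit in Latin-1, cp1252 and ISO 8859-15:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The three non-ASCII characters from the report, written as escapes.
SAMPLES = {
    "e acute": "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE
    "oe ligature": "\u0153",  # LATIN SMALL LIGATURE OE
    "euro sign": "\u20ac",    # EURO SIGN
}
ENCODINGS = ("latin-1", "cp1252", "iso8859-15")

coverage = {
    name: {enc: can_encode(char, enc) for enc in ENCODINGS}
    for name, char in SAMPLES.items()
}

for name, row in coverage.items():
    print(name, row)
```

Only the e acute fits in Latin-1; the ligature and the euro sign need cp1252 or ISO 8859-15, and they also sit above code point 255.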

Please do not take my aggressive (I recognize it), but sometimes
necessary, message badly.

IDLE is not the cause; I use IDLE here as an example to show the
disaster of code containing *unicode literals*.

I'm not really happy to see this mess again in Py3.3 [†]; the key
point being *unicode literals*.

Pandora's box has been opened.

[†] In fact, I will somehow never see or suffer from it. Decisions
have been taken.

jmf

birkenfeld · 2012-03-02T20:13:11Z

Well, let me soothe your mind then: in Python 3, '...' and u'...' will be absolutely equal, so you won't find any more "mess" with the changes from PEP-414.

bitdancer · 2012-03-02T20:32:27Z

Unless I'm misunderstanding, this is a duplicate of bpo-1602.

You will note that the problem is *not* with Python (or open source software in general), the problem is that Microsoft treats the command line as a second (or third, or fourth) class citizen.

Jean-MichelFauth · 2012-03-02T20:35:50Z

Sorry, I neglected the most important information.

Python 3.2 is working perfectly. It is simply impossible
to create non-valid strings (type/class 'str') from a
keyboard (that is, strings not created programmatically).

It is like the limited character set I used when I wrote my
first program on a PDP-8.

Porting Py 2 code was child's play.

bitdancer · 2012-03-02T20:40:59Z

OK, so I still don't understand what problem it is you are reporting. What do you mean by "can't create non-valid strings"? Of course you can't. (I don't see how you could do that programmatically, either, although that depends heavily on your definition of non-valid.)

Are you reporting that cmd.exe has no support for entering French characters? That wouldn't be a Python bug.

Are you reporting that idle lacks the keyboard support for French? (I don't use Idle, so I don't know if that is true or not.)

bitdancer · 2012-03-02T20:41:44Z

I'm changing the title since PEP-414 has no bearing here.

terryjreedy · 2012-03-02T22:01:35Z

As I explained to J-M when he posted much the same to python-list, Idle's French keyboard support is faulty because tcl/tk's French keyboard support is faulty. A patch for this was recently applied to tcl/tk. I hope it will be in a released version that we can incorporate in 3.3.

I am sure we all wish that Microsoft (and Apple) would take more of a lead in moving to a one Unicode world from a 200 encodings and codepages world. I am sometimes as frustrated at the current situation as J-M. But unless he can identify a valid *Python* bug, we should close this.

Jean-MichelFauth · 2012-03-02T22:10:57Z

Either you do not get it or I am not explaining it correctly.

I do not care if Py 3.3 accepts '...' or u'...'. I'm only
afraid that Py 3.3 suffers from the same non-working
behaviour Python 2 suffers from. I have seen so many things...

I can only use a Py2/Py3 analogy, the types being different.

In Python 2, u'...' and unicode('...', 'coding') are
not equivalent. This leads and has led to a lot of non-working
code. unicode() always works, while u'...'
may not work. A lot of libs accept unicode() and
fail when they have to accept u'...'.
That would mean in Python 3, '...' works and u'...' will not work.

Once again, an *illustration* with IDLE / Py2.

>>> import unicodedata as ud
>>> for c in u'abcéœ€':
	print ud.name(c)

LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE

Traceback (most recent call last):
  File "<pyshell#3>", line 2, in <module>
    print ud.name(c)
ValueError: no such name
>>> # but
>>> import sys
>>> for c in unicode('abcéœ€', sys.stdout.encoding):
	print ud.name(c)

LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LIGATURE OE
EURO SIGN

>>>

Of course, this is actually not a problem with Py 3.

I know nothing about the internals of Python. I have however
noticed that this guilty behaviour happens especially with non
iso-8859-1 chars, which are valid byte string chars but whose equivalent
unicode code points are > 255. Unfortunately, these are all chars
which are so important in French. (I heard about similar problems
with the mac-roman coding. I do not know the status).
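One possible reading of the session above, offered as an assumption rather than something confirmed in this thread: the shell handed cp1252 bytes to Python, but the u'...' literal was decoded as Latin-1. A Python 3 sketch of that scenario (the variable and function names are illustrative):

```python
import unicodedata

# The six intended characters: a b c, e acute, oe ligature, euro sign.
intended = "abc\u00e9\u0153\u20ac"

# Assumption: the bytes are cp1252 but get decoded as Latin-1.
raw = intended.encode("cp1252")
misdecoded = raw.decode("latin-1")


def names(text):
    """Name each character, using a placeholder where Unicode defines no name."""
    return [unicodedata.name(char, "<no name>") for char in text]


print(names(misdecoded))         # the last two entries have no name
print(names(raw.decode("cp1252")))  # all six names are found
```

Under this assumption the ligature and the euro sign become the C1 control characters U+009C and U+0080, which have no Unicode name. That would match the `ValueError: no such name` raised right after LATIN SMALL LETTER E WITH ACUTE.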

So, if this (u'...') works in Py 3.3, the problem can
be considered as "solved".
At least you have been informed about this potential issue.
It still remains that this is a serious problem on Py 2.

jmf

terryjreedy · 2012-03-02T23:21:06Z

> That would mean in Python 3, '...' works and u'...' will not work.

You misunderstand the PEP: in 3.3, '...' and u'...' will be *exactly* the same. The only change is that the interpreter will ignore the u prefix instead of raising SyntaxError. It will be as if 'u' were not there. The only purpose is to let 2.x code run in 3.x without requiring the user to erase the 'u'.
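A short check of this claim, assuming Python 3.3 or later (earlier 3.x releases reject the prefix with a SyntaxError); the variable names are only for illustration:

```python
# The same six characters, once without and once with the u prefix.
plain = 'abc\u0153\u00e9\u20ac'
prefixed = u'abc\u0153\u00e9\u20ac'

# Both literals produce an ordinary str with identical code points.
same_type = type(plain) is str and type(prefixed) is str
same_value = plain == prefixed
code_points = [hex(ord(char)) for char in prefixed]

print(same_type, same_value, code_points)
```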

I can see how you could misunderstand and think that the 'u' prefix must have some meaning. But it does not. The addition is a bit controversial but Guido approved it with the expectation that it will encourage more conversion of 2.x libraries to run on 3.3. In any case, the tracker is not the place for further discussion of the value of the PEP.

> Once again, an *illustration* with IDLE / Py2.
> ...
> Of course, this is actually a no problem with Py 3.
> ...
> It still remains that this is a serious problem on Py 2.

We are painfully aware that 2.x has problems with unicode. You do not need to tell us. I believe that most of the problems that could be sensibly fixed in 2.x have been fixed. 3.0 fixed more problems by changing the language. 3.3 fixes still more problems by changing the internal implementation of unicode, along with the C api, and the meaning of the language on some systems. People who want to avoid all the problems that have been fixed should use 3.3 either from the repository or when it is released.

> So, if this (u'...') works in Py 3.3, the problem can
> be considered as "solved".

I am glad you agree and I will close the issue.

Please use python-list for any further discussion or questions.

jmfauth · 2012-03-03T11:03:37Z

Preliminary remark. I'm sending this via gmail, so it
may happen that the glyphs you see are ill-formed or
transformed by Google. Rest assured I'm typing the
"right" glyphs.

No, no and no. This is not a tkinter issue. This
"strange" behaviour, I cannot find a better word,
happens with many libraries, be they Python core libs
or external libs.
To tell you the truth, and despite my experience,
I never succeeded in narrowing down the problem exactly.
In Python 2 sometimes, that is with some pieces
of code / software, it "works" and sometimes it
simply does not. The libs used here are just the
first ones that came to my mind.

-----

wxPython 2.8-ansi build.

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1242, in writeOut
    self.write(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\py\shell.py", line 1000, in write
    self.AddText(text)
  File "c:\python27\lib\site-packages\wx-2.8-msw-ansi\wx\stc.py", line 1425, in AddText
    return _stc.StyledTextCtrl_AddText(*args, **kwargs)
  File "c:\python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-5: character maps to <undefined>

abcéœ€

>>>
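The "position 4-5" in this traceback is consistent with a string whose last two characters are U+009C and U+0080 rather than the ligature and the euro sign. This is a guess about what the shell passed on, not something established in the thread. A Python 3 sketch, with illustrative variable names:

```python
# Assumption: the text reached the library with two C1 control characters
# (U+009C, U+0080) in place of the oe ligature and the euro sign.
misdecoded = "abc\u00e9\u009c\u0080"
intended = "abc\u00e9\u0153\u20ac"

try:
    misdecoded.encode("cp1252")
except UnicodeEncodeError as exc:
    # start is the first failing index, end is one past the last failing index.
    failure = (exc.start, exc.end, exc.reason)
else:
    failure = None

# The intended characters encode to cp1252 without any error.
encoded_ok = intended.encode("cp1252")

print(failure)
print(encoded_ok)
```

cp1252 has no mapping for those two control characters, so encoding fails at indexes 4 and 5, while the intended six characters encode cleanly.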

----

PySide, passing "unicode" to a text widget.

Passing u'abcéœ€' works.
Passing unicode('abcéœ€', 'cp1252') works.
Passing 'abcé€œ' doesn't! 'œ€' are missing.

---

My interactive wx interpreter using wxPython. Strings
as frame title.

True

ok

Traceback (most recent call last):
  File "<psi last command>", line 1, in <module>
  File "c:\Python27\lib\site-packages\wx-2.8-msw-ansi\wx\_windows.py", line 505, in __init__
    _windows_.Frame_swiginit(self,_windows_.new_Frame(*args, **kwargs))
  File "c:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 5-6: character maps to <undefined>

True

ok

---

And so on with many libs.

You may argue that these libs are guilty.

I may argue that Python is somehow guilty, because it
lets users write non-working code.
And in practically all cases, the main problem is due
to the usage of unicode literals.

Just to show you that I'm quite comfortable with all this
coding stuff, here are the results from my interactive interpreter.
It uses a special hack, unfortunately not portable, which works
only with Windows and cp1252.

abcé??
>>> unicode('abcéœ€', sys.stdout.encoding)
abcéœ€
>>> print u'abcéœ€'
abcé??
>>> print unicode('abcéœ€', sys.stdout.encoding)
abcéœ€
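The `abcé??` output would be explained if the literal had been mis-decoded as Latin-1 and then written out with a "replace" error handler. Both parts are assumptions about this custom interpreter, which the thread does not describe. A Python 3 sketch, with illustrative names:

```python
intended = "abc\u00e9\u0153\u20ac"  # a b c, e acute, oe ligature, euro sign

# Assumption 1: cp1252 bytes were decoded as Latin-1 when the literal was read.
misdecoded = intended.encode("cp1252").decode("latin-1")


def as_displayed(text, encoding="cp1252"):
    """Assumption 2: output is encoded with errors='replace' before display."""
    return text.encode(encoding, errors="replace").decode(encoding)


shown_bad = as_displayed(misdecoded)
shown_good = as_displayed(intended)

print(shown_bad)
print(shown_good)
```

Each character that cp1252 cannot represent is replaced by one question mark, giving four readable characters followed by `??`.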

As I am aware of this "feature", all my code
works perfectly. I pay attention to whether
u'...' or unicode(...) is needed.
Unfortunately, this is not generally the case in a lot of the
code I see that is supposed to deal with text.

To draw a conclusion.

You are wise enough to understand that, when I
say "Python just does not work", I'm unfortunately
not so far away from reality.

I really, very really, hope all this mess (sorry
for the word) will not reappear in Py 3.3.

Let's wait.

'abcéœ€'
>>> print('abcéœ€')
abcéœ€
>>>

Regards,
Jean-Michel Fauth

PS The u() trick does not help.

birkenfeld · 2012-03-03T13:03:38Z

I'd like to encourage you to not try this sort of thing out from an interactive interpreter (incidentally, where does "<psi last command>" come from? It doesn't look like Python's REPL).

As David and Terry noted, interactions with such a console, be it Windows' "cmd" or IDLE, have their very own idiosyncrasies and bugs.

That said, in Python 2.x *source files* the following two expressions are identical:

u'abcœé€'
unicode('abcœé€', 'encoding the file is in')

Both result in a Unicode string with the six characters/codepoints you mentioned. There won't be any code that works with one but not the other.
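A Python 3 analogue of this equivalence, as a sketch only: Python 3 has no unicode() builtin, so bytes.decode() stands in for it, and the cp1252 source encoding and variable names are chosen for illustration. The source is compiled from bytes so that the coding declaration is honoured:

```python
SOURCE_ENCODING = "cp1252"  # illustrative choice; any declared encoding works

# A tiny "source file" held in memory, with a PEP 263 coding declaration.
source_text = (
    "# -*- coding: cp1252 -*-\n"
    "literal = u'abc\u0153\u00e9\u20ac'\n"
)
source_bytes = source_text.encode(SOURCE_ENCODING)

# compile() decodes byte sources according to the coding declaration.
namespace = {}
exec(compile(source_bytes, "<example>", "exec"), namespace)

# Decoding the same raw bytes explicitly with the file's encoding.
raw_literal_bytes = "abc\u0153\u00e9\u20ac".encode(SOURCE_ENCODING)
decoded = raw_literal_bytes.decode(SOURCE_ENCODING)

print(namespace["literal"] == decoded, len(decoded))
```

When the declared encoding matches the bytes actually in the file, the literal and the explicit decode give the same six code points. A mismatch between the two is where the differences reported earlier in the thread would come from.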

Of course there are libraries that do not handle Unicode strings in general (nothing to do with literals!) correctly, but as you yourself said, that is a problem with the libraries.

Lastly, please read PEP-414 if you are not completely sure what it is proposing. You will see that it merely affects the available syntax for Unicode literals and allows the "u" again.

loewis · 2012-03-04T20:09:55Z

I propose to close this issue as invalid (although out-of-date might be fine as well). Jean-Michel is apparently unable to describe what issue *precisely* he wants to see fixed, rather than just complaining that open source is a disaster. I don't think we can do anything about open source being a disaster, and I'm not able to reproduce that perception.

Jean-Michel: please try to use this bug tracker in the way it is intended, i.e. report one bug at a time, following this structure:

this is what I did
this is what happened
this is what should have happened instead

ezio-melotti added topic-unicode type-bug An unexpected behavior, bug, or error labels Mar 2, 2012

bitdancer changed the title ~~Fix unicode literals (for PEP 414)~~ Fix unicode literals Mar 2, 2012

terryjreedy closed this as completed Mar 2, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
