Certain diacritical marks can and should be capitalized... e.g. ü --> Ü #58792

Closed
ChristianClauss mannequin opened this issue Apr 15, 2012 · 6 comments

Comments

@ChristianClauss
Mannequin

ChristianClauss mannequin commented Apr 15, 2012

BPO 14587
Nosy @loewis, @vstinner, @ezio-melotti, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2012-04-15.14:43:22.938>
created_at = <Date 2012-04-15.14:17:27.733>
labels = ['invalid', 'expert-unicode']
title = 'Certain diacritical marks can and should be capitalized... e.g. \xc3\xbc --> \xc3\x9c'
updated_at = <Date 2012-04-15.18:02:18.621>
user = 'https://bugs.python.org/ChristianClauss'

bugs.python.org fields:

activity = <Date 2012-04-15.18:02:18.621>
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date = <Date 2012-04-15.14:43:22.938>
closer = 'r.david.murray'
components = ['Unicode']
creation = <Date 2012-04-15.14:17:27.733>
creator = 'Christian.Clauss'
dependencies = []
files = []
hgrepos = []
issue_num = 14587
keywords = []
message_count = 6.0
messages = ['158332', '158336', '158339', '158341', '158344', '158346']
nosy_count = 5.0
nosy_names = ['loewis', 'vstinner', 'ezio.melotti', 'r.david.murray', 'Christian.Clauss']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue14587'
versions = ['Python 2.7']


BUGS: certain diacritical marks can and should be capitalized...
str.upper() does not .replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü'), etc.
str.lower() does not .replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü'), etc.
str.title() has the same problems, plus it capitalizes the letter _after_ a diacritic, e.g. 'lüsai'.title() --> 'LÜSai' with a capital 'S'.
myUpper(), myLower(), myTitle() exhibit the correct behavior with a handful of diacritic marks.

def myUpper(inString):
    return inString.upper().replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü')

def myLower(inString):
    return inString.lower().replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü')

def myTitle(inString): # WARNING: converts all whitespace to a single space
    returnValue = []
    for theWord in inString.split():
        returnValue.append(myUpper(theWord[:1]) + myLower(theWord[1:]))
    return ' '.join(returnValue)

@ChristianClauss ChristianClauss mannequin added the topic-unicode label Apr 15, 2012
@bitdancer
Member

It works fine if you use unicode.
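
A minimal sketch of the difference, written for Python 3 (where `bytes` plays the role of the Python 2 `str` and `str` plays the role of `unicode`). The variable names are illustrative only, and the assumption is that the original text was UTF-8 encoded:

```python
# The same word as text and as UTF-8 encoded bytes.
word = "lüsai"
raw = word.encode("utf-8")  # b'l\xc3\xbcsai'

# Byte-string methods only know the ASCII letters, so the two bytes
# that encode 'ü' are left untouched and also act as a word break.
raw_upper = raw.upper()
raw_title = raw.title()
assert raw_upper == b"L\xc3\xbcSAI"   # decodes to 'LüSAI'
assert raw_title == b"L\xc3\xbcSai"   # decodes to 'LüSai'

# Text methods use the Unicode case mappings.
text_upper = word.upper()
text_title = word.title()
assert text_upper == "LÜSAI"
assert text_title == "Lüsai"
```

Decoding the bytes first and calling the case methods on the resulting text avoids the need for hand-written `.replace()` chains.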

@ChristianClauss
Mannequin Author

ChristianClauss mannequin commented Apr 15, 2012

On Apr 15, 2012, at 4:43 PM, R. David Murray wrote:

R. David Murray <rdmurray@bitdance.com> added the comment:

It works fine if you use unicode.

----------
nosy: +r.david.murray
resolution: -> invalid
stage: -> committed/rejected
status: open -> closed


Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue14587>


What does it mean in this context to "use unicode"??
===============================================
In Idle...
===============================================

Python 2.7.3 (v2.7.3:70274d53c1dd, Apr  9 2012, 20:52:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>> lusai = u'lüsai'
Unsupported characters in input
>>> lusai = 'lüsai'
Unsupported characters in input
>>> print "ŠČŽ"
Unsupported characters in input

===============================================
In a script...
Every time that I try to "use unicode" an exception is thrown.
All try blocks in the following code trigger an exception
===============================================
#!/usr/bin/env python
# -*- coding: utf-8 -*-

print '=========='

import sys # sys.version_info = sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
print 'sys.version_info = {}.{}.{} {} {}'.format(sys.version_info[0], sys.version_info[1], sys.version_info[2], sys.version_info[3], sys.version_info[4])

import commands, os
print 'os.name = {}'.format(os.name)
print 'os.uname = {}'.format(os.uname())

print '=========='

def myUpper(inString):
    return inString.upper().replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü').replace('ß', 'ẞ')

def myLower(inString):
    return inString.lower().replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü').replace('ẞ', 'ß')

def myTitle(inString):
    returnValue = []
    for theWord in inString.split():
        returnValue.append(myUpper(theWord[:1]) + myLower(theWord[1:]))
    return ' '.join(returnValue)

def formatted(inValue, inSep = ' '):
    s = str(inValue)
    print ' s={}{}su={}{}sl={}{}st={}...'.format(s, inSep, s.upper(), inSep, s.lower(), inSep, s.title())
    print ' s={}{}mu={}{}ml={}{}mt={}...'.format(s, inSep, myUpper(s), inSep, myLower(s), inSep, myTitle(s))
    u = unicode(inValue, 'utf8')
    try:
        print ' u={}{}uu={}{}ul={}{}ut={}...'.format(u, inSep, u.upper(), inSep, u.lower(), inSep, u.title())
    except:
        print "=== Exception thrown trying to print unicode({}, 'utf8')".format(repr(s))

kolnUpperUnspecified   = str('KÖLN')
kolnUpperAsString      = str('KÖLN')
kolnUpperAsUnicode = unicode('KÖLN', 'utf8')

kolnLowerUnspecified   = str('köln')
kolnLowerAsString      = str('köln')
kolnLowerAsUnicode = unicode('köln', 'utf8')

formatted(kolnUpperUnspecified)
formatted(kolnUpperAsString)
try:
    formatted(kolnUpperAsUnicode)
except:
    pass

formatted(kolnLowerUnspecified)
formatted(kolnLowerAsString)
try:
    formatted(kolnLowerAsUnicode)
except:
    pass

formatted('Ötto Clauß lives in the hamlet of Lüsai in the village of Lü in the valley of Val Müstair in the Canton of Graubünden', '\n')
formatted('ZÜRICH is the largest city in Switzerland and the geographic center of the country is in Älggi-Alp which can be reached via the Lötschberg Tunnel', '\n')
formatted('20% of Swiss people speak Französisch but only 0.5% speak Rätoromanisch', '\n')
formatted('LÜSAI, lüsai, München, Neuchâtel, Ny-Ålesund, Tromsø, ZÜRICH', '\n')

print """BUGS: certain diacritical marks can and should be capitalized...
str.upper() does not .replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü'), etc.
str.lower() does not .replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü'), etc.
str.title() has the same problems, plus it capitalizes the letter _after_ a diacritic, e.g. 'lüsai'.title() --> 'LÜSai' with a capital 'S'.
myUpper(), myLower(), myTitle() exhibit the correct behavior with a handful of diacritic marks."""

@loewis
Mannequin

loewis mannequin commented Apr 15, 2012

In addition to R. David's remark, it also works fine in a German locale. In general, you cannot know whether the byte '\xe4' denotes 'ä' or some other letter. In KOI8-R, for example, it denotes Д, which is already an upper-case letter. So either call setlocale at the start of your program or (better) switch to Unicode strings.

Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'ä'.upper()
Ä
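
The ambiguity of a single byte can be sketched as follows. This is an illustrative Python 3 snippet, not part of the original session; it only relies on the standard `latin-1` and `koi8_r` codecs:

```python
# One byte, two possible meanings depending on the assumed encoding.
raw = b"\xe4"

as_latin1 = raw.decode("latin-1")  # 'ä' (U+00E4), a lower-case letter
as_koi8r = raw.decode("koi8_r")    # 'Д' (U+0414), already upper-case

# Once the encoding is known, upper-casing is well defined.
assert as_latin1.upper() == "\u00c4"   # 'Ä'
assert as_koi8r.upper() == as_koi8r    # unchanged
```

Without knowing the encoding, a byte-string `upper()` has no way to choose between these two results.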

@vstinner
Member

Or you can port your program to Python 3 to avoid such issues :-)
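
As a rough illustration of what that looks like, here are the place names from the report run through the built-in methods under Python 3. The list name and the expectations in the comments are a sketch, assuming the literals are stored as precomposed characters:

```python
# Place names taken from the report; in Python 3 these are text strings.
places = ["LÜSAI", "lüsai", "München", "Neuchâtel", "Ny-Ålesund", "Tromsø", "ZÜRICH"]

upper = [p.upper() for p in places]
title = [p.title() for p in places]

assert upper[2] == "MÜNCHEN"
assert title[1] == "Lüsai"      # no stray capital after the 'ü'
assert title[6] == "Zürich"

# The sharp s follows the Unicode mappings rather than a one-to-one swap:
# 'ß' upper-cases to 'SS', while capital 'ẞ' lower-cases to 'ß'.
assert "Clauß".upper() == "CLAUSS"
assert "ẞ".lower() == "ß"
```

Note that `"ß".upper()` changes the string length, which a per-character `.replace()` approach would not reproduce.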

@bitdancer
Member

Indeed, this type of confusion is a large part of the motivation behind Python 3.

If for some reason you are required to use Python 2 for your program, you might try posting to the python-list mailing list to ask for help.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022