str.title() misbehaves with apostrophes #51257

Closed
nickd mannequin opened this issue Sep 27, 2009 · 28 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@nickd
Mannequin

nickd mannequin commented Sep 27, 2009

BPO 7008
Nosy @malemburg, @rhettinger, @pitrou, @ezio-melotti, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2009-09-28.23:12:59.135>
created_at = <Date 2009-09-27.17:23:25.710>
labels = ['type-bug']
title = 'str.title() misbehaves with apostrophes'
updated_at = <Date 2009-09-29.14:49:21.116>
user = 'https://bugs.python.org/nickd'

bugs.python.org fields:

activity = <Date 2009-09-29.14:49:21.116>
actor = 'gvanrossum'
assignee = 'none'
closed = True
closed_date = <Date 2009-09-28.23:12:59.135>
closer = 'rhettinger'
components = []
creation = <Date 2009-09-27.17:23:25.710>
creator = 'nickd'
dependencies = []
files = []
hgrepos = []
issue_num = 7008
keywords = []
message_count = 28.0
messages = ['93180', '93212', '93220', '93223', '93226', '93227', '93229', '93232', '93235', '93236', '93237', '93238', '93239', '93240', '93241', '93242', '93243', '93244', '93250', '93258', '93260', '93261', '93262', '93264', '93271', '93272', '93274', '93277']
nosy_count = 10.0
nosy_names = ['lemburg', 'nnorwitz', 'rhettinger', 'pitrou', 'christoph', 'ezio.melotti', 'r.david.murray', 'markon', 'twb', 'nickd']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue7008'
versions = ['Python 2.7', 'Python 3.2']

@nickd
Mannequin Author

nickd mannequin commented Sep 27, 2009

str.title() capitalizes the first letter after an apostrophe:

>>> "This isn't right".title()
"This Isn'T Right"

The library function string.capwords, which appears to have exactly the
same responsibility, doesn't exhibit this behavior:

>>> import string
>>> string.capwords("This isn't right")
"This Isn't Right"

Tested on 2.6.2 on Mac OS X

@nickd nickd mannequin added the type-bug An unexpected behavior, bug, or error label Sep 27, 2009
@markon
Mannequin

markon mannequin commented Sep 28, 2009

This was already asked some years ago.

http://mail.python.org/pipermail/python-list/2006-April/549340.html

@twb
Mannequin

twb mannequin commented Sep 28, 2009

The string module, however, fails to properly capitalize anything in quotes:

>>> import string
>>> string.capwords("i pity the 'foo'.")
"I Pity The 'foo'."

The string module could be easily made to work like the object. The
object could be made to work more like the module, only capitalizing
things after a space and the start of the string, but I'm not really
sure that it's any better. (The s.istitle() should also be updated if
s.title() is changed.) The inconsistency is pretty nasty, though, and
the documentation should probably be more specific about what's going on.
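A minimal sketch of the "only capitalize after a space and at the start of the string" idea, for comparison with string.capwords. The helper name `cap_after_space` is hypothetical and not part of the standard library; unlike str.title(), this sketch does not lowercase the remaining characters.

```python
import re
import string


def cap_after_space(s):
    """Uppercase the first character and any character that follows whitespace.

    Hypothetical helper, not a standard-library function. Every other
    character, including the whitespace itself, is left untouched.
    """
    return re.sub(r"(^|\s)(\S)", lambda m: m.group(1) + m.group(2).upper(), s)


print(cap_after_space("this isn't right"))    # This Isn't Right
print(cap_after_space("i pity the 'foo'."))   # I Pity The 'foo'.
print(repr(cap_after_space("a  b\tc")))       # 'A  B\tC' (whitespace preserved)
print(repr(string.capwords("a  b\tc")))       # 'A B C'   (whitespace collapsed)
```

Like capwords, this leaves quoted words such as 'foo' uncapitalized, because the character after the space is the quote mark rather than a letter.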

@rhettinger
Contributor

I agree with the OP that str.title should be made smarter. As it
stands, it is a likely bug factory that would pass unittests, then
generate unpleasant results with real user inputs.

Extending on Thomas's comment, I think string.capwords() needs to be
deprecated and eliminated. It is an egregious hack that has unfortunate
effects such as dropping runs of repeated spaces and incorrectly
handling strings in quotes.

As it stands, we have two methods that both don't quite do what we would
really want in a title casing method (correct handling of apostrophes
and quotation marks, keeping the string length unchanged, and only
changing desired letters from lower to uppercase with no other
side-effects).

@bitdancer
Member

I believe capwords was supposed to be removed in 3.0, but this did not
happen.

@rhettinger
Contributor

If you can find a link to the discussion for removing capwords, we can
go ahead and deprecate it now.

@rhettinger rhettinger self-assigned this Sep 28, 2009
@bitdancer
Member

I haven't been able to find any discussion of deprecating capwords other
than a mention in this thread:

http://mail.python.org/pipermail/python-3000/2007-April/006642.html

Later in the thread Barry says he is neutral on removing capwords, and
it is not mentioned further.

I think Ezio found some other information somewhere.

@twb
Mannequin

twb mannequin commented Sep 28, 2009

If "correct handling of apostrophes and quotation marks, keeping the
string length unchanged, and only changing desired letters from lower to
uppercase with no other side-effects" is the criterion we want, then
what I suggested (toupper() the first character, and any character that
follows a space or punctuation character) should work. (Unless I'm
missing something.) Do we want to tolower() all other characters, like
the interpreter does now?

I can make a test and patch for this if this is what we decide.

@rhettinger
Contributor

I'm still researching what other languages do. MS-Excel matches what
Python currently does. Django uses the python version and then fixes-up
apostrophe errors:
title=lambda value: re.sub("([a-z])'([A-Z])", lambda m: m.group(0).lower(), value.title()).
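
The fix-up quoted above can be wrapped in a small runnable function. This is only a sketch built around the regular expression from the comment; the function name `title_fix_apostrophes` is illustrative and is not Django's actual API.

```python
import re


def title_fix_apostrophes(value):
    """Apply str.title(), then lowercase a capital letter that directly
    follows a lowercase letter plus an ASCII apostrophe.

    Illustrative name; only the regular expression comes from the thread.
    """
    return re.sub(r"([a-z])'([A-Z])", lambda m: m.group(0).lower(), value.title())


print(title_fix_apostrophes("this isn't right"))  # This Isn't Right
print(title_fix_apostrophes("you're"))            # You're
print(title_fix_apostrophes("o'brien"))           # O'Brien
```

Because the pattern requires a lowercase letter before the apostrophe, a leading capital such as the O in O'Brien is not matched and the letter after the apostrophe stays uppercase.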

It would also be nice to handle hyphenates like "xray" --> "X-ray".

Am thinking that it would be nice if the user could pass-in an optional
argument to list all desired characters to prevent transitions (such as
apostrophes and hyphens).

A broader solution would be to replace string.capwords() with a more
sophisticated set of rules that generally match what people are really
trying to accomplish with title casing:

http://aitech.ac.jp/~ckelly/midi/help/caps.html

http://search.cpan.org/dist/Text-Capitalize/Capitalize.pm

"Headline Style" in the Chicago Manual of Style or
Associated Press Stylebook:

http://grammar.about.com/b/2008/04/11/rules-for-capitalizing-the-words-in-a-title.htm

Any such attempt at a broad solution needs to provide ways for users to
modify the list of exception words and options for quoted text.

@rhettinger
Contributor

Thomas, if you write-up an initial patch, aim for the most conservative
version that leaves all of the behavior unchanged except for embedded
single apostrophes (to handle contractions and possessives). That will
assure that we don't muck-up any existing uses for title case:

    i'm        I'm
    you're     You're
    he's       He's
    david's    David's
    'bad'      'Bad'
    f''t       f''t
    'x         'x

Given letters-apostrophe-letter, capitalize only the first letter and
lowercase the rest.
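
One way the letters-apostrophe-letters rule could be sketched is with a regular expression. This is an assumption-laden illustration, not the patch discussed in the thread: the name `titlecase` is made up here, the pattern covers ASCII letters and the ASCII apostrophe only, and only the first five rows of the table are exercised.

```python
import re


def titlecase(s):
    """Capitalize each run of letters, treating letters + apostrophe + letters
    as a single word (ASCII only; illustrative sketch)."""
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?", lambda m: m.group(0).capitalize(), s)


for word in ["i'm", "you're", "he's", "david's", "'bad'"]:
    print(word, "->", titlecase(word))
# i'm -> I'm
# you're -> You're
# he's -> He's
# david's -> David's
# 'bad' -> 'Bad'
```

Under the same rule a name such as "o'brien" comes out as "O'brien", so the sketch helps contractions and possessives at the expense of names.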

@pitrou
Member

pitrou commented Sep 28, 2009

We shouldn't change the current default behaviour, people are probably
relying on it.

Besides, doing the right thing is both (natural) language-dependent and
context-dependent. It would be (very) hard to come up with an
implementation catering to all needs. Perhaps a dedicated typography
module, but str.title() is certainly not the answer.

However, adding an optional argument to str.title() so as to change the
list of recognized separators could be a useful addition for those
people who aren't too perfectionist about the result.

@rhettinger
Copy link
Contributor

Guido, do you have an opinion on whether to have str.title() handle
embedded apostrophes, "you're" --> "You're" instead of "You'Re"?

IMO, the problem comes-up often enough that people are looking for
workarounds (i.e. string.capwords() was a failed hack created to handle
the problem and django.titlecase() is a successful attempt at a workaround).

I'm not worried about Antoine's comment that we can't change anything
ever. I am concerned about his point (mentioned on IRC) that there are
no context free solutions (the absolute right answer is hard). While
the change would seem to always be helpful in an English context, in
French the proper title casing of "l'argent" is "L'Argent". Then again,
there are cases in French that don't work under either method (i.e.
title casing Amaury Forgeot d'Arc ends-up capitalizing the D no matter
what we do).

Options:

  1. Leave everything the same (rejecting requests for apostrophe handling
    and forever live with the likes of You'Re).

  2. Handle embedded single apostrophes, fixing most cases in English, and
    wreaking havoc on the French (who are going to be ill-served under any
    scenario).

  3. Add an optional argument to str.title() with a list of characters
    that will not trigger a transition. This lets people add apostrophes
    and hyphens and other characters of interest. Hyphens are hard because
    cases like mother-in-law should properly be converted to Mother-in-Law
    and hyphens get used in many odd ways.

  4. Add a new string method for handling title case with embedded
    apostrophes but leaving the old version unchanged.

My order of preferences is 2,4,3,1.
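
To make option 3 concrete, here is a rough sketch of title casing with a caller-supplied set of characters that do not start a new word. Everything about it is hypothetical: str.title() takes no such argument, and the name `title_keep` and its `keep` parameter are invented for illustration.

```python
def title_keep(s, keep="'"):
    """Title-case s, but let characters in `keep` pass through without
    triggering a word transition.

    Hypothetical helper sketching option 3; not an existing str method.
    """
    out = []
    prev_cased = False
    for ch in s:
        if ch in keep:
            out.append(ch)  # copied as-is; prev_cased is deliberately unchanged
            continue
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = ch.isupper() or ch.islower() or ch.istitle()
    return "".join(out)


print(title_keep("you're"))                     # You're
print(title_keep("mother-in-law", keep="'-"))   # Mother-in-law
print(title_keep("o'brien"))                    # O'brien
print(title_keep("you're", keep=""))            # You'Re (same as str.title())
```

The hyphen and O'Brien results show why a simple character list is not enough for every case raised in this thread.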

@rhettinger rhettinger assigned gvanrossum and unassigned rhettinger Sep 28, 2009
@ezio-melotti
Member

> I think Ezio found some other information somewhere.

While I was fixing bpo-7000 I found that the tests for capwords had been
removed in r54854 but since the function was already there I added them
back in r75072.
The commit message of r54854 says "Also remove all calls to functions in
the string module (except maketrans)". I'm adding Neal to the nosy list,
maybe he remembers if maketrans really was the only function that was
supposed to survive.

In bpo-6412 other problems of .title() are discussed, and there are also a
couple of links to Technical Reports of the Unicode Consortium about
casing algorithms and similar issues (I didn't have time to read them
yet though).

@pitrou
Member

pitrou commented Sep 28, 2009

> While
> the change would seem to always be helpful in an English context, in
> French the proper title casing of "l'argent" is "L'Argent".

Well I think even in English it doesn't work right.
For example someone named O'Brien would end up as "O'brien".

My point is that capitalization is both language-sensitive and
context-sensitive, and it's a hard problem for a computer to solve.
Since str.title() can only be a very crude approximation of the right
thing, there's no good reason to break backwards compatibility, IMO.

>   1. Leave everything the same (rejecting requests for apostrophe handling
>     and forever live with the likes of You'Re).
>
>   2. Handle embedded single apostrophes, fixing most cases in English, and
>     wreaking havoc on the French (who are going to be ill-served under any
>     scenario).
>
>   3. Add an optional argument to str.title() with a list of characters
>     that will not trigger a transition. This lets people add apostrophes
>     and hyphens and other characters of interest. Hyphens are hard because
>     cases like mother-in-law should properly be converted to Mother-in-Law
>     and hyphens get used in many odd ways.
>
>   4. Add a new string method for handling title case with embedded
>     apostrophes but leaving the old version unchanged.
>
> My order of preferences is 2,4,3,1.

I really think the only reasonable options are 3 and 1.
2 breaks compatibility with no real benefit.
4 is too specific a variation (especially in the unicode case, where you
might want to take into account the different variants of apostrophes
and other characters), and adding a new method for such a subtle
difference is not warranted.

@pitrou
Member

pitrou commented Sep 28, 2009

By the way, we might want to mention in the documentation that the
title() method only gives imperfect results when trying to titlecase
natural language. So that people don't get fooled thinking things are
simple :-) What do you think?

@gvanrossum
Member

Raymond, please refrain from emotional terms like "bug factory".

I have nothing to say about whether string.capwords() should be removed,
but I want to note that it does a split on whitespace and then rejoins
using a single space, so that string.capwords('A B\tC\r\nD') returns 'A
B C D'.
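
The whitespace behaviour described above is easy to reproduce. The second half is a sketch of an equivalent one-liner for the default-separator case; the name `capwords_sketch` is illustrative.

```python
import string

s = 'A B\tC\r\nD'
print(repr(string.capwords(s)))          # 'A B C D'
print(len(s), len(string.capwords(s)))   # 8 7 (the string length changes)


def capwords_sketch(text):
    """Split on runs of whitespace, capitalize each word, rejoin with one space."""
    return ' '.join(word.capitalize() for word in text.split())


print(capwords_sketch(s) == string.capwords(s))   # True
```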

The title() method exists primarily because the Unicode standard has a
definition of "title case". I wouldn't want to change its default
behavior because there is no reasonable behavior that isn't locale-
dependent, and Unicode methods shouldn't depend on locale; and even then
it won't be perfect, as the O'Brien example shows.

Also note that .title() matches .istitle() in the sense that
x.title().istitle() is supposed to be true (except in end cases like a
string containing no letters).
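
A short check of that relationship, using sample strings from this thread. It also shows that the "natural" English spelling is not considered title case by istitle().

```python
samples = ["this isn't right", "i pity the 'foo'.", "RSVP", "x-ray"]
for s in samples:
    # title() output always satisfies istitle() when the string has letters
    assert s.title().istitle()

print("This Isn'T Right".istitle())   # True
print("This Isn't Right".istitle())   # False: lowercase 't' follows an uncased character
print("1234".title().istitle())       # False: no cased characters at all
```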

I worry that providing an API that adds a way to specify a set of
characters to be treated as letters (for the purpose of deciding where
words start) will just make the bugs in apps harder to find because the
examples are rarer (like "l'Aperitif" or "O'Brien" -- or "RSVP" for that
matter). With the current behavior at least app authors will easily
notice the problem, decide whether it matters to them, and implement
their own algorithm if they do. And they are free to be as elaborate or
simplistic as they care.

What's a realistic use case for .title() anyway?

(Proposal: close as won't fix.)

@gvanrossum
Member

A doc fix sounds like a great idea.

@gvanrossum gvanrossum removed their assignment Sep 28, 2009
@rhettinger
Contributor

I will add a comment to the docs.

@nnorwitz
Mannequin

nnorwitz mannequin commented Sep 29, 2009

I don't recall anything specifically wrt removing capwords. Most likely
it was something that struck me as not widely used or really
necessary--a good candidate to be removed. Applications could then
write the function however they chose which would avoid the problem of
Python needing to figure out if it should be Isn'T or Isn't and all the
other variations mentioned here.

@malemburg
Member

Guido van Rossum wrote:

> What's a realistic use case for .title() anyway?

The primary use is when converting a string to be used as
title or sub-title of text - mostly inspired by the way
English treats titles.

The implementation follows the rules laid out in UTR#21:

http://unicode.org/reports/tr21/tr21-3.html

The Python version only implements the basic set of rules, i.e.
"If the preceding letter is cased, choose the lowercase mapping; otherwise choose the titlecase
mapping (in most cases, this will be the same as the uppercase, but not always)."
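
The basic rule can be sketched in pure Python. This is an approximation for illustration, not the C implementation: the name `basic_title` is made up, and "cased" is approximated with the per-character isupper(), islower() and istitle() tests. It is only checked here against str.title() on ASCII input.

```python
def basic_title(s):
    """Sketch of the basic rule: lowercase a character if the preceding
    character is cased, otherwise apply the titlecase mapping."""
    out = []
    prev_cased = False
    for ch in s:
        out.append(ch.lower() if prev_cased else ch.title())
        # Approximation of "cased"; an assumption of this sketch.
        prev_cased = ch.isupper() or ch.islower() or ch.istitle()
    return "".join(out)


print(basic_title("this isn't right"))   # This Isn'T Right
print(basic_title("x-ray 3rd RSVP"))     # X-Ray 3Rd Rsvp
```

The apostrophe, hyphen and digit are all uncased, which is why the letter after each of them gets the titlecase mapping.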

It doesn't implement the special casing rules, since these would
require locale and language dependent context information which
we don't implement/use in Python.

It also doesn't implement mappings that would result in a change of
length (ligatures) or require look-ahead strategies (e.g. if the casing
depends on the code point following the converted code point).

Patches to enhance the code to support those additional rules
are welcome.

Regarding the apostrophe: the Unicode standard doesn't appear to
include any rule regarding that character and its use in titles
or upper-case versions of text. The apostrophe itself is a
non-cased code point.

It's likely that the special use of the apostrophe in English
is actually a language-specific use case. For those, it's (currently)
better to implement your own versions of the conversion functions,
based on the existing methods.

Regarding the idea to add an option to define which characters to
regard as cased/non-cased: This would cause the algorithm to no longer
adhere to the Unicode standard and most probably cause more problems
than it solves.

@malemburg
Member

Marc-Andre Lemburg wrote:

> Regarding the apostrophe: the Unicode standard doesn't appear to
> include any rule regarding that character and its use in titles
> or upper-case versions of text. The apostrophe itself is a
> non-cased code point.
>
> It's likely that the special use of the apostrophe in English
> is actually a language-specific use case. For those, it's (currently)
> better to implement your own versions of the conversion functions,
> based on the existing methods.

Looking at the many different uses in various languages, this
appears to be the better option:

http://en.wikipedia.org/wiki/Apostrophe

To make things even more complicated, the usual typewriter apostrophe
that you find in ASCII is not the only one in Unicode:

http://en.wikipedia.org/wiki/Apostrophe#Unicode
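
As a small illustration of that point, str.title() treats the typographic apostrophe U+2019 the same way as the ASCII one, so an application-level workaround has to list every variant it cares about. The helper below is a hypothetical sketch (ASCII letters only, two apostrophe variants chosen as an assumption), not a library function.

```python
import re

# ASCII apostrophe plus U+2019 RIGHT SINGLE QUOTATION MARK; other variants
# exist and would have to be added explicitly.
WORD = "[A-Za-z]+(['\u2019][A-Za-z]+)?"


def titlecase_any_apostrophe(s):
    """Capitalize each word, keeping letters joined by either apostrophe
    variant together (illustrative sketch)."""
    return re.sub(WORD, lambda m: m.group(0).capitalize(), s)


print("isn\u2019t".title())                                 # Isn’T
print(titlecase_any_apostrophe("isn\u2019t right"))         # Isn’t Right
print(titlecase_any_apostrophe("isn't right"))              # Isn't Right
```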

@ezio-melotti
Member

> Patches to enhance the code to support those additional rules
> are welcome.

bpo-6412 has a patch.

@pitrou
Member

pitrou commented Sep 29, 2009

> To make things even more complicated, the usual typewriter apostrophe
> that you find in ASCII is not the only one in Unicode:
>
> http://en.wikipedia.org/wiki/Apostrophe#Unicode

Yup, and the right one typographically isn't necessarily the ASCII
one :-)
That's why Microsoft Word automatically inserts a non-ASCII apostrophe
when you type « ' », at least in certain languages (apparently
OpenOffice doesn't).

@malemburg
Member

Ezio Melotti wrote:

> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
>
>> Patches to enhance the code to support those additional rules
>> are welcome.
>
> bpo-6412 has a patch.

That patch looks promising.

@christoph
Mannequin

christoph mannequin commented Sep 29, 2009

I admit I don't fully understand the semantics of capwords(). But from
what I believe it should do, this function could be happily
replaced by the word-breaking algorithm as defined in
http://www.unicode.org/reports/tr29/.

This algorithm should be implemented anyway, to properly solve
bpo-6412.

@pitrou
Member

pitrou commented Sep 29, 2009

> This algorithm should be implemented anyway, to properly solve
> bpo-6412.

Sure, but it should be another function, which might have its place in
the wordwrap module.

capwords() itself could be deprecated, since it's an obvious one-liner.
Replacing it with another method, however, will just confuse and annoy
existing users.

@malemburg
Member

Christoph Burgmer wrote:

> Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
>
> I admit I don't fully understand the semantics of capwords().

string.capwords() is an old function from the days before Unicode.
The function is basically defined by its implementation.

> But from
> what I believe it should do, this function could be happily
> replaced by the word-breaking algorithm as defined in
> http://www.unicode.org/reports/tr29/.
>
> This algorithm should be implemented anyway, to properly solve
> bpo-6412.

Simple word breaking would be nice to have in Python as new
Unicode method, e.g. .splitwords().

Note however, that word boundaries are just as complicated as casing:
there are lots of special cases in different languages or locales
(see the notes after the word boundary rules in the TR29).

@christoph
Mannequin

christoph mannequin commented Sep 29, 2009

Antoine Pitrou wrote:

> capwords() itself could be deprecated, since it's an obvious one-liner.
> Replacing it with another method, however, will just confuse and annoy
> existing users.

Yes, sorry, I meant the semantics, whereas you are right for the
specific function.

Marc-Andre Lemburg wrote:

> Note however, that word boundaries are just as complicated as casing:
> there are lots of special cases in different languages or locales
> (see the notes after the word boundary rules in the TR29).

ICU already has the full implementation, so Python could get away with
just supporting the default implementation (as seen with other case
mappings).

>>> from PyICU import UnicodeString, Locale, BreakIterator            
>>> en_US_locale = Locale('en_US')                                    
>>> breakIter = BreakIterator.createWordInstance(en_US_locale)        
>>> s = UnicodeString("There's a hole in the bucket.")                
>>> print s.toTitle(breakIter, en_US_locale)
There's A Hole In The Bucket.
>>> breakIter.setText("There's a hole in the bucket.")
>>> last = 0
>>> for i in breakIter:
...     print s[last:i]
...     last = i
...
There's

A

Hole

In

The

Bucket
.

@gvanrossum gvanrossum removed their assignment Sep 29, 2009
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022