split() breaks no-break spaces #42731

maximrazin · 2005-12-26T15:03:58Z

BPO	1390608
Nosy	@malemburg, @sjoerdmullender, @doerwalter, @hyeshik

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2006-01-03.11:07:39.000>
created_at = <Date 2005-12-26.15:03:58.000>
labels = ['library']
title = 'split() breaks no-break spaces'
updated_at = <Date 2006-01-03.11:07:39.000>
user = 'https://bugs.python.org/maximrazin'

bugs.python.org fields:

activity = <Date 2006-01-03.11:07:39.000>
actor = 'lemburg'
assignee = 'lemburg'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2005-12-26.15:03:58.000>
creator = 'maxim_razin'
dependencies = []
files = []
hgrepos = []
issue_num = 1390608
keywords = []
message_count = 9.0
messages = ['27152', '27153', '27154', '27155', '27156', '27157', '27158', '27159', '27160']
nosy_count = 6.0
nosy_names = ['lemburg', 'sjoerd', 'effbot', 'doerwalter', 'hyeshik.chang', 'maxim_razin']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue1390608'
versions = ['Python 2.4']

maximrazin · 2005-12-26T15:03:58Z

When called without arguments, string.split(), str.split()
and unicode.split() split strings at the no-break space
character (U+00A0). This character is specifically intended
not to be a break point.

>>> u"Hello\u00A0world".split()
[u'Hello', u'world']

effbot · 2005-12-29T20:42:04Z

split isn't a word-wrapping split, so I'm not sure that's
the right place to fix this. ("no-break space" is
whitespace, according to the Unicode standard, and split
breaks on whitespace).
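The classification described here can be checked directly. The following is a minimal sketch, assuming Python 3 (where every str is Unicode, so the u"" prefix from the report is not needed); the variable names are only illustrative.

```python
import unicodedata

NBSP = "\u00a0"

# General category "Zs" is Unicode's "space separator" class.
category = unicodedata.category(NBSP)

# str.isspace() reports whether Python treats the character as whitespace.
is_whitespace = NBSP.isspace()

# With no arguments, split() breaks on any run of whitespace characters.
parts = "Hello\u00a0world".split()

print(category, is_whitespace, parts)  # Zs True ['Hello', 'world']
```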

hyeshik · 2005-12-30T00:30:45Z

The Python documentation says that it splits on "whitespace
characters", not "breaking characters", so the current
behavior is correct according to the documentation.
Moreover, the behavior of the string methods depends heavily
on the ctype functions in libc, so we can't give the NBSP
special treatment.

However, I do feel the need for a splitting function that is
aware of which characters are breaking and which are not.
How about adding it as unicodedata.split()?

doerwalter · 2005-12-30T12:35:00Z

What's wrong with the following?

import sys, unicodedata
spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
                  if unicodedata.category(unichr(c)) == "Zs" and c != 160)
foo.split(spaces)

malemburg · 2005-12-30T13:06:23Z

Maxim, you are right that \xA0 is a no-break space.
However, as the others have already mentioned, the .split()
method defaults to breaking a string on whitespace
characters, not on breakable whitespace characters. The
intent is not typographical; it stems from the desire to
tokenize a string quickly.

If you would rather use a different set of whitespace
characters, you can pass such a template string to the
.split() method (Walter gave an example).

Closing this as "Won't fix".

sjoerdmullender · 2006-01-02T10:48:42Z

Walter and MAL, did you actually try that workaround?  It
doesn't work:
>>> import sys, unicodedata
>>> spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
...                   if unicodedata.category(unichr(c)) == "Zs" and c != 160)
>>> foo = u"Hello\u00A0world"
>>> foo.split(spaces)
[u'Hello\xa0world']

That's because split() treats the whole separator argument
as a single separator string, not as a set of characters to
split on.
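To make the point concrete, here is a sketch of the same experiment, assuming Python 3 (chr and range instead of unichr and xrange). It suggests that the multi-character separator fails to split even on an ordinary space, and only matches where the entire string occurs literally.

```python
import sys
import unicodedata

# Every "Zs" (space separator) character except U+00A0, as in the transcript.
spaces = "".join(
    chr(c) for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Zs" and c != 0xA0
)

foo = "Hello\u00a0world"

# The separator is matched as one long string, so nothing is split here...
unsplit = foo.split(spaces)

# ...and a plain space does not split either, since the text never contains
# the whole multi-character separator.
also_unsplit = "Hello world".split(spaces)

# Only a literal occurrence of the entire separator string acts as a border.
literal = ("Hello" + spaces + "world").split(spaces)

print(unsplit, also_unsplit, literal)
```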

malemburg · 2006-01-02T11:13:24Z

Oops. You're right, Sjoerd.

Still, you could achieve the splitting by using a regular
expression that is built from the set of characters
fetched from the Unicode database and then using the
.split() method of the re object.
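One way this suggestion might look is sketched below, assuming Python 3. The helper name split_keeping_nbsp is hypothetical, and the choice to collect characters with str.isspace() and exclude only U+00A0 is an assumption; an application could equally exclude other non-breaking spaces such as U+202F.

```python
import re
import sys
import unicodedata  # not required by isspace(); kept for category checks

# Every character Python treats as whitespace, except U+00A0 (assumption:
# only the no-break space is exempted from splitting).
breakable = "".join(
    chr(c) for c in range(sys.maxunicode + 1)
    if chr(c).isspace() and c != 0xA0
)

# A character class matching one or more of the collected characters.
splitter = re.compile("[" + re.escape(breakable) + "]+")


def split_keeping_nbsp(text):
    """Split on runs of whitespace other than U+00A0 (hypothetical helper).

    Empty pieces are dropped so that leading and trailing whitespace behave
    the way they do with a bare split().
    """
    return [piece for piece in splitter.split(text) if piece]


print(split_keeping_nbsp("Hello\u00a0world and  more\n"))
```

The pattern is compiled once at import time because building the character set walks the whole code point range, which is too slow to repeat on every call.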

doerwalter · 2006-01-03T10:33:34Z

Seems I confused strip() with split(). I *did* try that
workaround, and it did what I expected: It *didn't* split on
U+00A0 ;)

If we want to fix this discrepancy, we could add methods
stripchars() (as a synonym for strip()) and stripstring(),
as well as splitchars() and splitstring() (as a synonym for
split()).
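The discrepancy between the two methods can be seen with a small sketch, assuming Python 3; the sample strings are made up for illustration.

```python
chars = "xy"

# strip() treats its argument as a set of characters: any mix of "x" and
# "y" is removed from both ends.
stripped = "xyxHelloyx".strip(chars)

# split() treats the same argument as one separator string: only the exact
# sequence "xy" is a border, and a lone "x" or "y" is left alone.
pieces = "axybxcyd".split(chars)

print(stripped, pieces)  # Hello ['a', 'bxcyd']
```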

malemburg · 2006-01-03T11:07:39Z

No.

These things are application-scope details and should thus
be implemented in the application rather than as methods on
an object.

The methods always work on whitespace and that's clearly
defined.

maximrazin mannequin closed this as completed Dec 26, 2005

maximrazin mannequin assigned malemburg Dec 26, 2005

maximrazin mannequin added the stdlib (Python modules in the Lib dir) label Dec 26, 2005

ezio-melotti transferred this issue from another repository Apr 10, 2022
