Undocumented implicit strip() in split(None) string method #41462

Closed
yohell mannequin opened this issue Jan 19, 2005 · 12 comments
Assignees
Labels
docs Documentation in the Doc dir

Comments

@yohell
Mannequin

yohell mannequin commented Jan 19, 2005

BPO 1105286
Nosy @tim-one, @rhettinger, @terryjreedy

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/rhettinger'
closed_at = <Date 2007-01-06.02:16:46.000>
created_at = <Date 2005-01-19.15:04:27.000>
labels = ['docs']
title = 'Undocumented implicit strip() in split(None) string method'
updated_at = <Date 2007-01-06.02:16:46.000>
user = 'https://bugs.python.org/yohell'

bugs.python.org fields:

activity = <Date 2007-01-06.02:16:46.000>
actor = 'rhettinger'
assignee = 'rhettinger'
closed = True
closed_date = None
closer = None
components = ['Documentation']
creation = <Date 2005-01-19.15:04:27.000>
creator = 'yohell'
dependencies = []
files = []
hgrepos = []
issue_num = 1105286
keywords = []
message_count = 12.0
messages = ['23981', '23982', '23983', '23984', '23985', '23986', '23987', '23988', '23989', '23990', '23991', '23992']
nosy_count = 6.0
nosy_names = ['tim.peters', 'rhettinger', 'terry.reedy', 'calvin', 'jimjjewett', 'yohell']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue1105286'
versions = []

@yohell
Mannequin Author

yohell mannequin commented Jan 19, 2005

Hi!

I noticed that the string method split() first does an
implicit strip() before splitting when it's used with
no arguments or with None as the separator (sep in the
docs). There is no mention of this implicit strip() in
the docs.

Example 1:
s = " word1 word2 "

s.split() then returns ['word1', 'word2'] and not ['',
'word1', 'word2', ''] as one might expect.
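A minimal sketch of the behavior described above, assuming a current Python 3 interpreter; the variable names are illustrative only:

```python
s = " word1 word2 "

# With no separator (or None), runs of whitespace delimit words and
# whitespace at the edges does not produce empty strings.
no_sep = s.split()

# With an explicit single-space separator, every space is a delimiter,
# so the leading and trailing spaces yield empty strings.
space_sep = s.split(" ")

print(no_sep)     # ['word1', 'word2']
print(space_sep)  # ['', 'word1', 'word2', '']
```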

WHY IS THIS BAD?

  1. Because it's undocumented. See:
    http://www.python.org/doc/current/lib/string-methods.html#l2h-197

  2. Because it may lead to unexpected behavior in programs.
    Example 2:
    FASTA sequence headers are one-line descriptors of
    biological sequences and have this form:
    ">" + Identifier + whitespace + free text description.

Let sHeader be a Python string containing a FASTA
header. One could then use the following syntax to
extract the identifier from the header:

sID = sHeader[1:].split(None, 1)[0]

However, this does not work if sHeader contains a
faulty FASTA header where the identifier is missing or
consists of whitespace. In that case sID will contain
the first word of the free text description, which is
not the desired behavior.
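The failure mode can be sketched as follows, together with one possible workaround. `fasta_identifier` is a hypothetical helper written for this illustration, and the sample headers are made up; it assumes that a header whose first character after ">" is whitespace should be treated as having no identifier.

```python
def fasta_identifier(header):
    """Return the identifier of a FASTA header, or '' if it is missing.

    Hypothetical helper for illustration only.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    body = header[1:]
    # The identifier must start right after '>'. If whitespace comes
    # first, report a missing identifier instead of letting split()
    # silently skip ahead to the description.
    if not body or body[0].isspace():
        return ""
    return body.split(None, 1)[0]


good = ">seq1 example protein"
bad = "> example protein"

# The one-liner from the report picks up the first word of the description.
naive = bad[1:].split(None, 1)[0]

print(naive)                   # example
print(fasta_identifier(good))  # seq1
print(repr(fasta_identifier(bad)))  # ''
```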

WHAT SHOULD BE DONE?

The implicit strip() should be removed, or at least
programmers should be given the option to turn it off.
At the very least it should be documented so that
programmers have a chance of adapting their code to it.

Thank you for an otherwise splendid language!
/Joel Hedlund
Ph.D. Student
IFM Bioinformatics
Linköping University

@yohell yohell mannequin closed this as completed Jan 19, 2005
@yohell yohell mannequin assigned rhettinger Jan 19, 2005
@yohell yohell mannequin added the docs Documentation in the Doc dir label Jan 19, 2005
@tim-one
Member

tim-one commented Jan 19, 2005

Logged In: YES
user_id=31435

I think the docs for split() under "String Methods" are quite
clear:

"""
...

If sep is not specified or is None, a different splitting
algorithm is applied. Words are separated by arbitrary length
strings of whitespace characters (spaces, tabs, newlines,
returns, and formfeeds). Consecutive whitespace delimiters
are treated as a single delimiter ("'1 2 3'.split()"
returns "['1', '2', '3']"). Splitting an empty string returns "['']".
"""

This won't change, because mountains of code rely on this
behavior -- it's probably the single most common use case
for .split().

@yohell
Mannequin Author

yohell mannequin commented Jan 20, 2005

Logged In: YES
user_id=1008220

In RE to tim_one:

I think the docs for split() under "String Methods" are quite
clear:

On the contrary, my friend, and here's why:

"""
...
If sep is not specified or is None, a different splitting
algorithm is applied.

This sentence does not say that whitespace will be
implicitly stripped from the edges of the string.

Words are separated by arbitrary length strings of whitespace
characters (spaces, tabs, newlines, returns, and formfeeds).

Neither does this one.

Consecutive whitespace delimiters are treated as a single
delimiter ("'1 2 3'.split()" returns "['1', '2', '3']").

And not that one.

Splitting an empty string returns "['']".
"""

And that last one does not mention it either. In fact, there
is no mention in the docs of how separators on the edges of
strings are treated by the split method. Furthermore,
there is no mention that s.split(sep) treats them
differently when sep is None than it does otherwise. Example:

>>> ",2,".split(',')
['', '2', '']
>>> " 2 ".split()
['2']

This inconsistent behavior is not in line with how
beautifully thought out the Python language is otherwise,
and how brilliantly everything else is documented on the
http://python.org/doc/ documentation pages.

This won't change, because mountains of code rely on this
behavior -- it's probably the single most common use case
for .split().

I thought as much. However, it would be really easy for
an admin to add a line of documentation to .split() to
explain this. That would certainly help make me a happier
man, and hopefully others too.

Cheers guys!
/Joel

@rhettinger
Contributor

Logged In: YES
user_id=80475

What new wording do you propose to be added?

@jimjjewett
Mannequin

jimjjewett mannequin commented Jan 20, 2005

Logged In: YES
user_id=764593

Replacing the quoted line:

"""
...

If sep is not specified or is None, a different splitting
algorithm is applied. First whitespace (spaces, tabs,
newlines, returns, and formfeeds) is stripped from both
ends. Then words are separated by arbitrary length
strings of whitespace characters. Consecutive whitespace
delimiters are treated as a single delimiter ("'1 2 3'.split()"
returns "['1', '2', '3']"). Splitting an empty (or whitespace-
only) string returns "['']".
"""

@rhettinger
Contributor

Logged In: YES
user_id=80475

The proposed wording is fine.

If there are no objections or concerns, I'll apply it soon.

@yohell
Mannequin Author

yohell mannequin commented Jan 20, 2005

Logged In: YES
user_id=1008220

Brilliant, guys!

Thanks again for a superb scripting language, and with
documentation to match!

Take care!
/Joel Hedlund

@terryjreedy
Member

Logged In: YES
user_id=593130

To me, the removal of whitespace at the ends (stripping) is
consistent with the removal (or collapsing) of extra
whitespace in between so that .split() does not return empty
words anywhere. Consider:

>>> ',1,,2,'.split(',')
['', '1', '', '2', '']

If ' 1 2 '.split() were to return null strings at the beginning
and end of the list, then to be consistent, it should also put
one in the middle. One can get this by being explicit (mixed
WS can be handled by translation):

>>> ' 1  2 '.split(' ')
['', '1', '', '2', '']

Having said this, I also agree that the extra words proposed
by jj are helpful.

BUG?? In 2.2, splitting an empty or whitespace only string
produces an empty list [], not a list with a null word [''].

>>> ''.split()
[]
>>> '   '.split()
[]

which is what I see as consistent with the rest of the no-null-
word behavior. Has this changed since? (Yes, must
upgrade.) I could find no indication of such change in either
the tracker or CVS.
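For reference, a small sketch of both points on a current Python 3 (not the 2.2 release mentioned above): the empty and whitespace-only cases, and the "translation" approach for mixed whitespace. `split_keep_empty` and `WS_TO_SPACE` are hypothetical names, and the table covers only ASCII whitespace by assumption.

```python
# Map each ASCII whitespace character to a plain space.
# Assumption: only ASCII whitespace matters for this illustration.
WS_TO_SPACE = str.maketrans("\t\n\r\f\v", " " * 5)


def split_keep_empty(text):
    """Split on every single whitespace character, keeping empty words."""
    return text.translate(WS_TO_SPACE).split(" ")


# Tab and newline each count as one delimiter, so empty words appear
# at the edges and in the middle.
print(split_keep_empty(" 1\t\n2 "))  # ['', '1', '', '2', '']

# Without a separator, no empty words are returned anywhere.
print("".split())     # []
print("   ".split())  # []

# With an explicit separator, an empty string yields one empty word.
print("".split(" "))  # ['']
```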

@calvin
Mannequin

calvin mannequin commented Jan 24, 2005

Logged In: YES
user_id=9205

This should probably also be added to rsplit()?
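A quick sketch of why rsplit() is relevant here, assuming current CPython 3 behavior: with no separator it returns the same words as split(), and with maxsplit it leaves the untouched remainder on the left rather than the right.

```python
s = " foo bar "

# Without maxsplit, rsplit() and split() return the same words.
print(s.rsplit())         # ['foo', 'bar']

# With maxsplit=1, splitting proceeds from the right: the trailing
# whitespace is skipped, but the leading space stays in the remainder.
print(s.rsplit(None, 1))  # [' foo', 'bar']
```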

@yohell
Mannequin Author

yohell mannequin commented Nov 7, 2006

Logged In: YES
user_id=1008220

I'm opening this again, since the docs still don't reflect
the behavior of the method.

from the docs:
"""
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from
both ends.
"""

This is not true when maxsplit is given.

Example:

>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripping whitespace from the
ends of the string before splitting the rest of the string.

@yohell
Mannequin Author

yohell mannequin commented Nov 7, 2006

Logged In: YES
user_id=1008220

*resubmission: grammar corrected*

I'm opening this again, since the docs still don't reflect
the behavior of the method.

from the docs:
"""
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from
both ends.
"""

This is not true when maxsplit is given.

Example:
>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripped from the ends before
the rest of the string is split.
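The maxsplit case can be sketched like this, assuming current CPython 3 behavior. Calling strip() explicitly first is one way to get the result the quoted wording suggests; it is shown as a workaround, not as what split() does internally.

```python
s = " foo bar "

# Leading whitespace is always skipped, but once maxsplit is used up
# the remainder is returned as-is, trailing whitespace included.
print(s.split(None, 1))          # ['foo', 'bar ']

# If maxsplit is not exhausted, the trailing whitespace disappears.
print(s.split(None, 2))          # ['foo', 'bar']

# An explicit strip() gives edge-free results regardless of maxsplit.
print(s.strip().split(None, 1))  # ['foo', 'bar']
```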

@rhettinger
Contributor

I think the current wording is clear enough and that further attempts to specify corner cases will only make the docs harder to understand and less useful.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022