bytes and unicode splitlines() methods differ on what is a line break #68789

gpshead · 2015-07-10T02:18:33Z

BPO	24601
Nosy	@gpshead, @stevendaprano, @vadmium
Superseder	bpo-22232: str.splitlines splitting on non-\r\n characters

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2015-07-10.16:52:40.298>
created_at = <Date 2015-07-10.02:18:33.202>
labels = []
title = 'bytes and unicode splitlines() methods differ on what is a line break'
updated_at = <Date 2015-07-10.16:52:40.295>
user = 'https://github.com/gpshead'

bugs.python.org fields:

activity = <Date 2015-07-10.16:52:40.295>
actor = 'gregory.p.smith'
assignee = 'none'
closed = True
closed_date = <Date 2015-07-10.16:52:40.298>
closer = 'gregory.p.smith'
components = []
creation = <Date 2015-07-10.02:18:33.202>
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 24601
keywords = []
message_count = 4.0
messages = ['246538', '246539', '246549', '246568']
nosy_count = 3.0
nosy_names = ['gregory.p.smith', 'steven.daprano', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = None
status = 'closed'
superseder = '22232'
type = None
url = 'https://bugs.python.org/issue24601'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']

gpshead · 2015-07-10T02:18:32Z

for bytes, \v (0x0b) is not considered a line break. for unicode, it is.

this traces back to the Objects/stringlib/ code where unicode defers to the decision made by Objects/unicodeobject.c's ascii_linebreak table which contains 7 line breaks in the 0..127 character range:

static unsigned char ascii_linebreak[] = {
0, 0, 0, 0, 0, 0, 0, 0,
/* 0x000A, * LINE FEED */
/* 0x000B, * LINE TABULATION */
/* 0x000C, * FORM FEED */
/* 0x000D, * CARRIAGE RETURN */
0, 0, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
/* 0x001C, * FILE SEPARATOR */
/* 0x001D, * GROUP SEPARATOR */
/* 0x001E, * RECORD SEPARATOR */
0, 0, 0, 0, 1, 1, 1, 0,
/* ... rest of the table not shown ... */
};

Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.
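A quick way to see the difference from Python 3 is to probe every ASCII code point with both types. This is only a sketch; the function names are made up for illustration and are not part of CPython.

```python
# Probe which ASCII code points each splitlines() treats as a line break.
# The helper names here are hypothetical, chosen only for this illustration.

def ascii_linebreaks_str():
    # A code point is a line break if "a<cp>b" splits into two pieces.
    return [cp for cp in range(128)
            if len(("a" + chr(cp) + "b").splitlines()) == 2]

def ascii_linebreaks_bytes():
    return [cp for cp in range(128)
            if len((b"a" + bytes([cp]) + b"b").splitlines()) == 2]

str_breaks = ascii_linebreaks_str()
bytes_breaks = ascii_linebreaks_bytes()

print([hex(cp) for cp in str_breaks])    # the 7 entries set in ascii_linebreak
print([hex(cp) for cp in bytes_breaks])  # only \n and \r
```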

I think these should be consistent. But making this change likely breaks existing code in weird ways.

This does come up when porting from 2 to 3: a str ('' literal) containing one of those other characters was not split by splitlines in 2.x but is split by splitlines in 3.x.
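For ported code that relied on the 2.x str behaviour, one possible workaround is to split on \r, \n and \r\n explicitly instead of calling str.splitlines(). The helper below is a hypothetical sketch, not an existing stdlib function; it assumes the caller wants the same results bytes.splitlines() would give.

```python
import re

# Only the three classic newline sequences; \r\n is listed first so it is
# consumed as a single break rather than two.
_NEWLINE_RE = re.compile(r"\r\n|\r|\n")

def splitlines_ascii_newlines(text, keepends=False):
    # Hypothetical helper: behaves like str.splitlines() but ignores
    # \v, \f, \x1c-\x1e, \x85, U+2028 and U+2029.
    lines = []
    start = 0
    for match in _NEWLINE_RE.finditer(text):
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    # Like splitlines(), do not emit an empty trailing line.
    if start < len(text):
        lines.append(text[start:])
    return lines

sample = "a\x0bb\r\nc\n"
print(sample.splitlines())                # splits at \x0b as well
print(splitlines_ascii_newlines(sample))  # splits only at \r\n and \n
```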

stevendaprano · 2015-07-10T03:02:41Z

On Fri, Jul 10, 2015 at 02:18:33AM +0000, Gregory P. Smith wrote:

> for bytes, \v (0x0b) is not considered a line break. for unicode, it is.
> [...]
> I think these should be consistent.

I'm not sure that they should. Unicode includes other line breaks which
bytes should not consider line breaks, such as NEL (Next Line), U+0085.
Why should bytes be consistent with only the subset of line breaks that
are in ASCII?
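The NEL case can be checked directly. In this small illustration the encodings are picked only as examples; the byte 0x85 (or the UTF-8 sequence for U+0085) is left alone by bytes.splitlines().

```python
# U+0085 (NEL) is a line break for str but not for bytes.
text = "first\x85second"
str_lines = text.splitlines()

# Latin-1 maps U+0085 to the single byte 0x85; UTF-8 uses b"\xc2\x85".
latin1_lines = text.encode("latin-1").splitlines()
utf8_lines = text.encode("utf-8").splitlines()

print(str_lines)
print(latin1_lines)
print(utf8_lines)
```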

vadmium · 2015-07-10T08:19:59Z

bpo-7643: Originally a complaint about the difference, but was closed after adding more differences!
bpo-22232: Documentation bug, but with some discussion on changing the API. Maybe a duplicate?
bpo-22233: Email and HTTP message parsing bug related to incorrectly using splitlines()
bpo-18291: codecs.StreamReader uses splitlines(), but io.TextIOWrapper uses universal newlines
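The bpo-18291 difference can be reproduced with a short example. The sample data is invented for illustration; io.TextIOWrapper in its default universal newlines mode recognises only \n, \r and \r\n, while str.splitlines() also splits at \x0b and \x1c.

```python
import io

data = b"a\x0bb\r\nc\x1cd\n"

# TextIOWrapper defaults to newline=None (universal newlines): only \n, \r
# and \r\n end a line, and they are translated to \n on reading.
wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii")
universal_lines = wrapper.readlines()

# str.splitlines() additionally breaks at \x0b and \x1c.
split_lines = data.decode("ascii").splitlines(keepends=True)

print(universal_lines)
print(split_lines)
```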

gpshead · 2015-07-10T16:52:40Z

hah, i should've searched the tracker first. looks like the other open issues cover this.

gpshead closed this as completed Jul 10, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
