bytes and unicode splitlines() methods differ on what is a line break #68789

Closed
gpshead opened this issue Jul 10, 2015 · 4 comments

Comments

@gpshead
Member

gpshead commented Jul 10, 2015

BPO 24601
Nosy @gpshead, @stevendaprano, @vadmium
Superseder
  • bpo-22232: str.splitlines splitting on non-\r\n characters
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-07-10.16:52:40.298>
    created_at = <Date 2015-07-10.02:18:33.202>
    labels = []
    title = 'bytes and unicode splitlines() methods differ on what is a line break'
    updated_at = <Date 2015-07-10.16:52:40.295>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2015-07-10.16:52:40.295>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-07-10.16:52:40.298>
    closer = 'gregory.p.smith'
    components = []
    creation = <Date 2015-07-10.02:18:33.202>
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 24601
    keywords = []
    message_count = 4.0
    messages = ['246538', '246539', '246549', '246568']
    nosy_count = 3.0
    nosy_names = ['gregory.p.smith', 'steven.daprano', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '22232'
    type = None
    url = 'https://bugs.python.org/issue24601'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6']

    @gpshead
    Member Author

    gpshead commented Jul 10, 2015

    for bytes, \v (0x0b) is not considered a line break. for unicode, it is.

    this traces back to the Objects/stringlib/ code where unicode defers to the decision made by Objects/unicodeobject.c's ascii_linebreak table which contains 7 line breaks in the 0..127 character range:

    static unsigned char ascii_linebreak[] = {
        0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x000A, * LINE FEED */
        /* 0x000B, * LINE TABULATION */
        /* 0x000C, * FORM FEED */
        /* 0x000D, * CARRIAGE RETURN */
        0, 0, 1, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x001C, * FILE SEPARATOR */
        /* 0x001D, * GROUP SEPARATOR */
        /* 0x001E, * RECORD SEPARATOR */
        0, 0, 0, 0, 1, 1, 1, 0,
        /* ... the remaining entries, up to 0x7F, are all 0 ... */
    };

    Whereas Objects/stringlib/stringdefs.h, used by bytes, only considers \r and \n.
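
To make the difference concrete, here is a minimal sketch that runs each of the seven ASCII code points from the table quoted above through both splitlines() implementations. The helper name split_counts is made up for this illustration; it only relies on str.splitlines() and bytes.splitlines() as they behave in Python 3.

```python
# The seven ASCII code points flagged in the ascii_linebreak table quoted above.
ASCII_LINEBREAKS = [0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E]


def split_counts(code_point):
    """Return (str_parts, bytes_parts) for 'a<ch>b' split with splitlines().

    Hypothetical helper, named only for this illustration.
    """
    text = "a" + chr(code_point) + "b"
    data = text.encode("ascii")
    return len(text.splitlines()), len(data.splitlines())


for cp in ASCII_LINEBREAKS:
    str_parts, bytes_parts = split_counts(cp)
    print(f"0x{cp:02X}: str -> {str_parts} part(s), bytes -> {bytes_parts} part(s)")
```

Only 0x0A and 0x0D should split in both cases; the other five split the str but leave the bytes object in one piece.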

    I think these should be consistent. But making this change likely breaks existing code in weird ways.

    This does come up when porting from 2 to 3: a str ('') value containing one of those other characters was not split by splitlines in 2.x but is split by splitlines in 3.x.
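
For ported code that relied on the old behaviour, one possible workaround (not something proposed in this issue) is to split a 3.x str on \n, \r and \r\n only. The helper below, splitlines_ascii_newlines, is a hypothetical name for this sketch; it assumes the caller wants the same results that bytes.splitlines() would give for the equivalent ASCII data.

```python
import re

# Match CRLF first so it counts as a single line break.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def splitlines_ascii_newlines(text, keepends=False):
    """Split a str on LF, CR and CRLF only (hypothetical helper).

    Other characters that str.splitlines() treats as line boundaries,
    such as VT (0x0B) or FF (0x0C), are left inside the lines.
    """
    lines = []
    start = 0
    for match in _NEWLINE.finditer(text):
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    # A trailing fragment without a line break is still a line.
    if start < len(text):
        lines.append(text[start:])
    return lines


print(splitlines_ascii_newlines("one\x0btwo\nthree\r\nfour"))
```

The regular expression is a design choice for readability; anything that recognises the same three sequences would do.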

    @stevendaprano
    Member

    On Fri, Jul 10, 2015 at 02:18:33AM +0000, Gregory P. Smith wrote:

    > for bytes, \v (0x0b) is not considered a line break. for unicode, it is.
    > [...]
    > I think these should be consistent.

    I'm not sure that they should. Unicode includes other line breaks which
    bytes should not consider line breaks, such as NEL (Next Line), U+0085.
    Why should bytes be consistent with only the subset of line breaks that
    are in ASCII?
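
The NEL case can be checked directly. This is a small sketch, assuming Python 3 semantics; the variable names are only for illustration.

```python
nel = "\x85"  # U+0085 NEXT LINE (NEL)
text = "first" + nel + "second"

# str.splitlines() treats NEL as a line boundary.
str_lines = text.splitlines()

# As bytes, NEL is 0x85 in Latin-1 and 0xC2 0x85 in UTF-8;
# bytes.splitlines() treats neither as a line break.
latin1_lines = text.encode("latin-1").splitlines()
utf8_lines = text.encode("utf-8").splitlines()

print(str_lines)
print(latin1_lines)
print(utf8_lines)
```

Since the byte value of NEL depends on the encoding, bytes.splitlines() has no encoding-independent way to recognise it.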

    @vadmium
    Member

    vadmium commented Jul 10, 2015

    • bpo-7643: Originally a complaint about the difference, but was closed after adding more differences!
    • bpo-22232: Documentation bug, but with some discussion on changing the API. Maybe a duplicate?
    • bpo-22233: Email and HTTP message parsing bug related to incorrectly using splitlines()
    • bpo-18291: codecs.StreamReader uses splitlines(), but io.TextIOWrapper uses universal newlines
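
The last point, about universal newlines versus splitlines(), can be illustrated with a short sketch. It assumes io.TextIOWrapper in its default newline mode and uses made-up sample data; it is not taken from any of the linked issues.

```python
import io

raw = b"one\x0btwo\nthree\r\nfour"

# In its default universal-newlines mode, io.TextIOWrapper recognises
# only \n, \r and \r\n as line endings.
wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii")
lines_from_io = [line.rstrip("\r\n") for line in wrapper]

# str.splitlines() additionally splits on \x0b (among others).
lines_from_splitlines = raw.decode("ascii").splitlines()

print(lines_from_io)
print(lines_from_splitlines)
```

The same input comes out as three lines through the wrapper and four lines through str.splitlines().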

    @gpshead
    Member Author

    gpshead commented Jul 10, 2015

    hah, i should've searched the tracker first. looks like the other open issues cover this.

    @gpshead gpshead closed this as completed Jul 10, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022