Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenize module does not mirror "end-of-input" is newline behavior #78080

Closed
ammaraskar opened this issue Jun 19, 2018 · 27 comments
Closed

Tokenize module does not mirror "end-of-input" is newline behavior #78080

ammaraskar opened this issue Jun 19, 2018 · 27 comments
Assignees
Labels
3.7 3.8 stdlib type-bug

Comments

@ammaraskar
Copy link
Member

@ammaraskar ammaraskar commented Jun 19, 2018

BPO 33899
Nosy @terryjreedy, @gpshead, @taleinat, @benjaminp, @ned-deily, @meadori, @asottile, @ammaraskar, @miss-islington, @asmeurer
PRs
  • #7891
  • #8132
  • #8133
  • #8134
  • #10072
  • #10073
  • #10074
  • #10075
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ammaraskar'
    closed_at = <Date 2018-07-06.10:24:50.699>
    created_at = <Date 2018-06-19.07:41:52.217>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'Tokenize module does not mirror "end-of-input" is newline behavior'
    updated_at = <Date 2019-03-22.11:46:47.154>
    user = 'https://github.com/ammaraskar'

    bugs.python.org fields:

    activity = <Date 2019-03-22.11:46:47.154>
    actor = 'brechtm'
    assignee = 'ammar2'
    closed = True
    closed_date = <Date 2018-07-06.10:24:50.699>
    closer = 'taleinat'
    components = ['Library (Lib)']
    creation = <Date 2018-06-19.07:41:52.217>
    creator = 'ammar2'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33899
    keywords = ['patch']
    message_count = 27.0
    messages = ['319934', '321154', '321162', '321163', '321164', '321165', '328220', '328222', '328226', '328227', '328238', '328283', '328318', '328324', '328353', '328354', '328355', '328356', '328357', '328358', '328359', '328360', '328369', '328383', '328877', '330213', '338601']
    nosy_count = 11.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'taleinat', 'benjamin.peterson', 'ned.deily', 'meador.inge', 'brechtm', 'Anthony Sottile', 'ammar2', 'miss-islington', 'asmeurer']
    pr_nums = ['7891', '8132', '8133', '8134', '10072', '10073', '10074', '10075']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue33899'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @ammaraskar
    Copy link
    Member Author

    @ammaraskar ammaraskar commented Jun 19, 2018

    As was pointed out in https://bugs.python.org/issue33766 there is an edge case in the tokenizer whereby it will implicitly treat the end of input as a newline. The tokenize module in stdlib does not mirror the C code's behavior in this case.

    tokenizer.c:

    ~/cpython $ echo -n 'x' | ./python
    ----------
    NAME ("x")
    NEWLINE
    ENDMARKER

    tokenize module:

    ~/cpython $ echo -n 'x' | ./python -m tokenize
    1,0-1,1: NAME 'x'
    2,0-2,0: ENDMARKER ''

    The instrumentation to have the C tokenizer dump out its tokens is mine, can provide a diff to produce that output if needed.

    @ammaraskar ammaraskar added the 3.8 label Jun 19, 2018
    @ammaraskar ammaraskar self-assigned this Jun 19, 2018
    @ammaraskar ammaraskar added stdlib type-bug labels Jun 19, 2018
    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Jul 6, 2018

    New changeset c4ef489 by Tal Einat (Ammar Askar) in branch 'master':
    bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891)
    c4ef489

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Jul 6, 2018

    New changeset ab75d9e by Tal Einat (Ammar Askar) in branch '3.7':
    [3.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (GH-8132)
    ab75d9e

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Jul 6, 2018

    New changeset 11c36a3 by Tal Einat (Ammar Askar) in branch '3.6':
    [3.6] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (GH-8134)
    11c36a3

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Jul 6, 2018

    New changeset 7829bba by Tal Einat (Ammar Askar) in branch '2.7':
    [2.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (bpo-8133)
    7829bba

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Jul 6, 2018

    Thanks for all of your work on this, Ammar!

    @taleinat taleinat added the 3.7 label Jul 6, 2018
    @taleinat taleinat closed this as completed Jul 6, 2018
    @asottile
    Copy link
    Mannequin

    @asottile asottile mannequin commented Oct 21, 2018

    This change in behaviour is breaking pycodestyle: PyCQA/pycodestyle#786

    Perhaps it shouldn't have been backported (especially all the way to python2.7?)

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 21, 2018

    This was backported since it was considered a bug, but you are right that it broke backwards compatibility, and perhaps shouldn't have been backported.

    Still, with 3.6.6 and 3.7.1 now released, that ship has sailed.

    We could perhaps revert this on the 2.7 branch, but I feel that reverting this change only on 2.7 would just cause even more confusion.

    @asottile
    Copy link
    Mannequin

    @asottile asottile mannequin commented Oct 21, 2018

    I'm surprised this was classified as a bug! Though that's subjective so I get that it's difficult to decide what is and what isn't ¯\(ツ)

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented Oct 22, 2018

    Apparently this change also affected IPython. Perhaps we should add an entry to the whatsnew documents for 3.7.1 and 3.7.6:

    https://docs.python.org/3/whatsnew/3.7.html#notable-changes-in-python-3-7-1

    https://docs.python.org/3.6/whatsnew/3.6.html#notable-changes-in-python-3-6-7

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 22, 2018

    I'm sorry to have caused this mess, it was bad judgement on my part.

    Adding mention in What's is a good idea, Ned, I'll do that.

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 23, 2018

    Ned, should this also be added to the 2.7 What's New? Or perhaps reverted on the 2.7 branch?

    @ned-deily
    Copy link
    Member

    @ned-deily ned-deily commented Oct 23, 2018

    I don't have a strong opinion about 2.7 here. Ultimately, it's Benjamin's call. But it might make sense to revert for 2.7 since it hasn't been released yet.

    @benjaminp
    Copy link
    Contributor

    @benjaminp benjaminp commented Oct 23, 2018

    Please revert in 2.7.

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 24, 2018

    See PR #54281 for reverting in 2.7.

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Oct 24, 2018

    FYI, An example of other fallout from this change - patsy broke and needed this fix:

    pydata/patsy@4f53bba#diff-53c70e68c6dfd4fe9b08427792cb2bd6

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 24, 2018

    See PR #54282 adding mention in "What's New".

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Oct 24, 2018

    some pylint fallout appears to be addressed in PyCQA/pylint@2698cbe

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Oct 24, 2018

    New changeset dfba1f6 by Gregory P. Smith (Tal Einat) in branch 'master':
    bpo-33899: Mention tokenize behavior change in What's New (GH-10073)
    dfba1f6

    @miss-islington
    Copy link
    Contributor

    @miss-islington miss-islington commented Oct 24, 2018

    New changeset 9a04762 by Miss Islington (bot) (Tal Einat) in branch '3.6':
    [3.6] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) (GH-10075)
    9a04762

    @miss-islington
    Copy link
    Contributor

    @miss-islington miss-islington commented Oct 24, 2018

    New changeset b4c9874 by Miss Islington (bot) (Tal Einat) in branch '3.7':
    [3.7] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) (GH-10074)
    b4c9874

    @taleinat
    Copy link
    Contributor

    @taleinat taleinat commented Oct 24, 2018

    Thanks for helping with the fallout from this, Gregory.

    @terryjreedy
    Copy link
    Member

    @terryjreedy terryjreedy commented Oct 24, 2018

    bpo-33766 was about documenting the C tokenizer change, some years ago, that made end-of-file EOF and end-of-string EOS generate the NEWLINE token required to properly terminate statements. "The end of input also serves
    as an implicit terminator for the final physical line."

    Although the tokenizer module intentionally does not exactly mirror the C tokenizer (it adds COMMENT tokens), it plausibly seems like a bug that it was not changed along with the C tokenizer, as it has since been tokenizing valid code as grammatically invalid. But I agree that this fix is too disruptive for 2.7.

    @benjaminp
    Copy link
    Contributor

    @benjaminp benjaminp commented Oct 24, 2018

    New changeset a1f45ec by Benjamin Peterson (Tal Einat) in branch '2.7':
    bpo-33899: Revert tokenize module adding an implicit final NEWLINE (GH-10072)
    a1f45ec

    @gpshead
    Copy link
    Member

    @gpshead gpshead commented Oct 29, 2018

    https://bugs.python.org/issue35107 filed to track further fallout from this API change.

    @asmeurer
    Copy link
    Mannequin

    @asmeurer asmeurer mannequin commented Nov 21, 2018

    Is it expected behavior that comments produce NEWLINE if they don't have a newline and don't produce NEWLINE if they do (that is, '# comment' produces NEWLINE but '# comment\n' does not)?

    @brechtm
    Copy link
    Mannequin

    @brechtm brechtm mannequin commented Mar 22, 2019

    In order to adapt code to this change, can we assume that a NEWLINE token with an empty string only occurs right before the ENDMARKER?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 3.8 stdlib type-bug
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants