fileinput and 'for line in sys.stdin' do strange mockery of input buffering #70478

Closed
donhatch mannequin opened this issue Feb 5, 2016 · 5 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@donhatch
Mannequin

donhatch mannequin commented Feb 5, 2016

BPO 26290
Nosy @vadmium, @serhiy-storchaka, @MojoVampire
Superseder
  • bpo-1633941: for line in sys.stdin: doesn't notice EOF the first time
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-06-21.12:58:01.051>
    created_at = <Date 2016-02-05.02:30:26.343>
    labels = ['type-bug', 'library']
    title = "fileinput and 'for line in sys.stdin' do strange mockery of input buffering"
    updated_at = <Date 2020-11-20.18:48:44.489>
    user = 'https://bugs.python.org/DonHatch'

    bugs.python.org fields:

    activity = <Date 2020-11-20.18:48:44.489>
    actor = 'josh.r'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-06-21.12:58:01.051>
    closer = 'martin.panter'
    components = ['Library (Lib)']
    creation = <Date 2016-02-05.02:30:26.343>
    creator = 'Don Hatch'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 26290
    keywords = []
    message_count = 5.0
    messages = ['259619', '259629', '259631', '268988', '381496']
    nosy_count = 4.0
    nosy_names = ['martin.panter', 'serhiy.storchaka', 'josh.r', 'Don Hatch']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '1633941'
    type = 'behavior'
    url = 'https://bugs.python.org/issue26290'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']

    @donhatch
    Mannequin Author

    donhatch mannequin commented Feb 5, 2016

    Iterating over input using either 'for line in fileinput.input():'
    or 'for line in sys.stdin:' has the following unexpected behavior:
    no matter how many lines of input the process reads, the loop body is not
    entered until either (1) at least 8193 chars have been read and at least one of
    them was a newline, or (2) EOF is read (i.e. the read() system call returns
    zero bytes).

    The behavior I expect instead is what
    "for line in iter(sys.stdin.readline, ''):" does: that is, the loop body is
    entered for the first time as soon as a newline or EOF is read.
    Furthermore strace reveals that this well-behaved alternative code does
    sensible input buffering, in the sense that the underlying system call being
    made is read(0,buf,8192), thereby allowing it to get as many characters as are
    available on input, up to 8192 of them, to be buffered and used in subsequent
    loop iterations. This is familiar and sensible behavior, and is what I think
    of as "input buffering".
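As a rough illustration of the buffering described above, the following sketch issues a single read of up to 8192 bytes on a pipe and gets back whatever is already available, without waiting for the buffer to fill. The pipe and the sample bytes are assumptions made for the demonstration; they stand in for stdin and are not taken from the strace output.

```python
import os

# A pipe stands in for stdin here (an assumption for the demo).
read_fd, write_fd = os.pipe()
try:
    # The producer has written three short lines so far.
    os.write(write_fd, b"a\nb\nc\n")
    # Ask for up to 8192 bytes; read() returns what is available right now.
    chunk = os.read(read_fd, 8192)
finally:
    os.close(read_fd)
    os.close(write_fd)

# The buffered chunk can then be handed out one line at a time.
lines = chunk.splitlines(keepends=True)
print(lines)
```

The point is only that one read call can pick up several lines at once, and those lines can be served to the loop body immediately.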

    I anticipate there will be responses to this bug report of the form "this is
    documented behavior; the fileinput and sys.stdin iterators do input buffering".
    To that, I say: no, these iterators' unfriendly behavior is *not* input
    buffering in any useful sense; my impression is that someone may have
    implemented what they thought the words "input buffering" meant, but if so,
    they really botched it.

    This bug is most noticeable and harmful when using a filter written in python
    to filter the output of an ongoing process that may have long pauses between
    lines of output; e.g. running "tail -f" on a log file. In this case, the
    python filter spends a lot of time in a state where it is paused without
    reason, having read many input lines that it has not yet processed.
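A minimal sketch of such a filter, written so that each line is handled as soon as readline() returns it. The function name `filter_lines`, the `keep` predicate, and the sample log lines are hypothetical and exist only for illustration; a real filter would pass sys.stdin and sys.stdout.

```python
import io

def filter_lines(source, sink, keep):
    """Copy lines accepted by keep() from source to sink, one at a time."""
    # Two-argument iter calls source.readline() until it returns "" (EOF).
    for line in iter(source.readline, ""):
        if keep(line):
            sink.write("line: " + line)
            # Flush so a downstream reader sees the line without delay.
            sink.flush()

# In-memory streams stand in for a live pipe in this demo.
demo_in = io.StringIO("INFO start\nERROR disk full\nINFO done\n")
demo_out = io.StringIO()
filter_lines(demo_in, demo_out, lambda line: "ERROR" in line)
print(demo_out.getvalue(), end="")
```

Flushing after every line trades some throughput for latency, which is usually the right trade when the input is something like "tail -f".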

    If there is any suspicion that the delayed output is due to the previous
    program in the pipeline buffering its output instead, strace can be used on the
    python filter process to confirm that its input lines are in fact being read in
    a timely manner. This is certainly true if the previous process in the
    pipeline is "tail -f", at least on my ubuntu linux system.

    To demonstrate the bug, run each of the following from the bash command line.
    This was observed using bash 4.3.11(1), python 2.7.6, and python 3.4.3,
    on ubuntu 14.04 linux.

    ----------------------------------------------
    { echo a;echo b;echo c;sleep 1;} | python2.7 -c $'import fileinput,sys\nfor line in fileinput.input(): sys.stdout.write("line: "+line)'
    # result (BAD): pauses for 1 second, prints the three lines, returns to prompt

    { echo a;echo b;echo c;sleep 1;} | python2.7 -c $'import sys\nfor line in sys.stdin: sys.stdout.write("line: "+line)'
    # result (BAD): pauses for 1 second, prints the three lines, returns to prompt

    { echo a;echo b;echo c;sleep 1;} | python2.7 -c $'import sys\nfor line in iter(sys.stdin.readline, ""): sys.stdout.write("line: "+line)'
    # result (GOOD): prints the three lines, pauses for 1 second, returns to prompt

    { echo a;echo b;echo c;sleep 1;} | python3.4 -c $'import fileinput,sys\nfor line in fileinput.input(): sys.stdout.write("line: "+line)'
    # result (BAD): pauses for 1 second, prints the three lines, returns to prompt

    { echo a;echo b;echo c;sleep 1;} | python3.4 -c $'import sys\nfor line in sys.stdin: sys.stdout.write("line: "+line)'
    # result (GOOD): prints the three lines, pauses for 1 second, returns to prompt

    { echo a;echo b;echo c;sleep 1;} | python3.4 -c $'import sys\nfor line in iter(sys.stdin.readline, ""): sys.stdout.write("line: "+line)'
    # result (GOOD): prints the three lines, pauses for 1 second, returns to prompt
    ----------------------------------------------
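The shell pipelines above can be approximated in a single Python 3 process. The sketch below is not the original test: it assumes a writer thread in place of `{ echo a;echo b;echo c;sleep 1;}`, and the helper names `slow_writer` and `timed_lines` are made up for the example. It records how long after start each line reaches the loop body when reading with readline().

```python
import os
import threading
import time

def slow_writer(fd, pause):
    """Write three lines, stay idle for `pause` seconds, then close (EOF)."""
    with os.fdopen(fd, "w") as writer:
        writer.write("a\nb\nc\n")
        writer.flush()
        time.sleep(pause)

def timed_lines(pause=1.0):
    """Return (line, seconds_since_start) pairs read via readline()."""
    read_fd, write_fd = os.pipe()
    start = time.monotonic()
    thread = threading.Thread(target=slow_writer, args=(write_fd, pause))
    thread.start()
    results = []
    with os.fdopen(read_fd, "r") as reader:
        for line in iter(reader.readline, ""):
            results.append((line, time.monotonic() - start))
    thread.join()
    return results

arrivals = timed_lines()
for line, seconds in arrivals:
    print("%.3fs %r" % (seconds, line))
```

With the GOOD behavior, all three lines arrive well before the pause ends; with the BAD behavior described in this report, they would arrive only after the writer closes its end.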

    Notice the 'for line in sys.stdin:' behavior is apparently fixed in python 3.4.
    So the matrix of behavior observed above can be summarized as follows:

                                               2.7   3.4
    for line in fileinput.input():             BAD   BAD
    for line in sys.stdin:                     BAD   GOOD
    for line in iter(sys.stdin.readline, ""):  GOOD  GOOD

    Note that adding '-u' to the python args makes no difference in behavior, in
    any of the above 6 command lines.

    Finally, if I insert "strace -T" before "python" in each of the 6 command lines
    above, it confirms that the python process is reading the 3 lines of input
    immediately in all cases, in a single read(..., ..., 4096 or 8192) which seems
    reasonable.

    @donhatch donhatch mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Feb 5, 2016
    @donhatch
    Mannequin Author

    donhatch mannequin commented Feb 5, 2016

    Possibly related to http://bugs.python.org/issue1633941 .
    Note that the matrix of GOOD and BAD versions and input methods is
    exactly the same for this bug as for that one. To verify: run
    each of the 6 python commands I mentioned on its own, being sure to type
    at least one line of input ending in newline before hitting ctrl-D -- if it exits after one ctrl-D it's GOOD; having to type a second ctrl-D is BAD.

    @serhiy-storchaka
    Member

    For fileinput see bpo-15068.

    @vadmium
    Member

    vadmium commented Jun 21, 2016

    bpo-15068 has been fixed in 3.5+ and 2.7, and it looks like that fix resolves the fileinput aspect of this bug. That leaves the sys.stdin aspect, which only affects Python 2 and which I think is a duplicate of bpo-1633941.

    @vadmium vadmium closed this as completed Jun 21, 2016
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Nov 20, 2020

    For those who find this in the future, the simplest workaround for the:

    for line in sys.stdin:

    issue on Python 2 is to replace it with:

    for line in iter(sys.stdin.readline, ''):

    The problem is caused by the way the file iterator's buffering behaves (file.next on Python 2), but file.readline doesn't use that code (it delegates to either fgets or a loop over getc/getc_unlocked that never overbuffers beyond the newline). Two-arg iter lets you make an iterator that calls readline each time you want a line, and treats a return of '' (which is what readline returns when you hit EOF) as the end of iteration.
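A small sketch of the sentinel behavior described above, using an in-memory stream as an assumed stand-in for sys.stdin. The helper name `read_lines` and the sample text are hypothetical. It shows that a blank line comes back as '\n' and so does not end iteration, while the '' returned at EOF does.

```python
import io

def read_lines(stream):
    """Yield lines by calling stream.readline() until it returns ''."""
    # iter(callable, sentinel) stops the first time the callable returns
    # a value equal to the sentinel, here the empty string at EOF.
    for line in iter(stream.readline, ''):
        yield line

# Sample input with a blank line and no trailing newline on the last line.
sample = io.StringIO("a\nb\n\nc")
lines = list(read_lines(sample))
print(lines)
```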

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022