breaking file iter loop leaves file in stale state #36189

Closed
jvr mannequin opened this issue Mar 2, 2002 · 9 comments
jvr mannequin commented Mar 2, 2002

BPO 524804
Nosy @gvanrossum, @tim-one

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/gvanrossum'
closed_at = <Date 2002-03-23.03:40:25.000>
created_at = <Date 2002-03-02.15:44:15.000>
labels = ['library']
title = 'breaking file iter loop leaves file in stale state'
updated_at = <Date 2002-03-23.03:40:25.000>
user = 'https://bugs.python.org/jvr'

bugs.python.org fields:

activity = <Date 2002-03-23.03:40:25.000>
actor = 'gvanrossum'
assignee = 'gvanrossum'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2002-03-02.15:44:15.000>
creator = 'jvr'
dependencies = []
files = []
hgrepos = []
issue_num = 524804
keywords = []
message_count = 9.0
messages = ['9483', '9484', '9485', '9486', '9487', '9488', '9489', '9490', '9491']
nosy_count = 6.0
nosy_names = ['gvanrossum', 'tim.peters', 'jhylton', 'jvr', 'jorend', 'smurf']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue524804'
versions = ['Python 2.2']


jvr mannequin commented Mar 2, 2002

Given a file created with this snippet:

  >>> f = open("tmp.txt", "w")
  >>> for i in range(10000):
  ...     f.write("%s\n" % i)
  ... 
  >>> f.close()

Iterating over a file multiple times has unexpected behavior:

  >>> f = open("tmp.txt")
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  0
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  1861
  >>> 

I expected the last output line to be 1 instead of 1861.

While I understand the cause (the file iterator uses xreadlines, which reads a big chunk ahead, leaving the actual filepos out of sync), this seems to be an undocumented gotcha. The docs say this:

[ ... ] Each iteration returns the same result as file.readline(), and iteration ends when the readline() method returns an empty string.

That is true within one for loop, but not when you break out of the loop and start another one, which I think is a valid idiom.
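
The desync is directly visible with tell() (a hedged illustration: the exact number depends on the read-ahead chunk size, so treat it as indicative only). After pulling a single two-byte line, the real file position is already thousands of bytes in:

  >>> f = open("tmp.txt")
  >>> for line in f:
  ...     break
  ... 
  >>> f.tell()  # not 2 == len("0\n"); a whole chunk was consumed
  8195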

Another example of breakage:

  f = open(...)
  for line in f:
      if somecondition(line):
          break
      ...

  data = f.read()  # read rest in one slurp
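
For comparison, here is a pattern that stays in sync (a sketch, not from the report; it uses a concrete stand-in for somecondition): drive the loop with readline() directly, so there is only one file position in play.

  f = open("tmp.txt")
  while 1:                       # Python 2 idiom for "loop forever"
      line = f.readline()
      if not line:               # readline() returns "" at EOF
          break
      if line.strip() == "3":    # stand-in for somecondition(line)
          break

  data = f.read()                # resumes exactly at line "4"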

The fundamental problem IMO is that the file iterator stacks *another* state on top of an already stateful object. In a sense a file object is already an iterator. The two states get out of sync, causing confusing semantics, to say the least. The current behavior exposes an implementation detail that should be hidden.

I understand that speed is a major issue here, so a solution might not be simple.

Here's a report from an actual user:
http://groups.google.com/groups?hl=en&selm=owen-0B3ECB.10234615022002%40nntp2.u.washington.edu
The rest of the thread suggests possible solutions.

Here's what I *think* should happen (though I'm hardly aware of the fileobject and xreadlines innards): xreadlines should be merged with the file object. The buffer that xreadlines implements should be *the* buffer for the file object, and *all* read methods should use *that* buffer and the corresponding filepos.

Maybe files should grow a .next() method, so iter(f) can return f itself. .next() and .readline() are then 100% equivalent.
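
A minimal pure-Python sketch of that proposal (a hypothetical LineIterFile wrapper, not anything that exists in the stdlib): the object is its own iterator and next() is literally readline(), so breaking out, resuming, and mixing in read()/seek()/tell() all see one consistent file position.

  class LineIterFile:
      # Hypothetical wrapper: iter(f) returns f itself, and
      # next() is 100% equivalent to readline().
      def __init__(self, path, mode="r"):
          self.file = open(path, mode)
      def __iter__(self):
          return self               # the file is its own iterator
      def next(self):               # Python 2 iterator protocol
          line = self.file.readline()
          if not line:              # readline() returns "" at EOF
              raise StopIteration
          return line
      def __getattr__(self, name):  # delegate read(), seek(), tell(), ...
          return getattr(self.file, name)

With a wrapper like this, the session above would print 0 and then 1, at readline() speed.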

@jvr jvr mannequin closed this as completed Mar 2, 2002
@jvr jvr mannequin assigned gvanrossum Mar 2, 2002
@jvr jvr mannequin added the stdlib Python modules in the Lib dir label Mar 2, 2002

jorend mannequin commented Mar 4, 2002


Agreed on all points of fact. Also +1 on fixing it by making iter(f).next() and f.readline() equivalent, one way or another.

...The easy way: make f.__iter__() call readline() instead of being so aggressive. (Better than nothing, in my view.)

...The hard way (JvR's proposal): add a level of input buffering on top of what the C runtime provides. xreadlines() breaks user expectations precisely because it does this *halfway*. Doing it right would not be such a maintenance burden, I think. I'm willing to write the patch, although others wiser in the ways of diverse stdio implementations (<wink>) might want to supervise.

...As it stands, iter(f) seems like a broken optimization. Which is to say: it's not "undocumented surprising behavior"; it's a bug.


jhylton mannequin commented Mar 8, 2002


If I understand the checkin message Guido wrote for 2.113, he didn't intend the current behavior.

  file_getiter(): make iter(file) be equivalent to file.xreadlines().
  This should be faster.

  This means:

  (1) "for line in file:" won't work if the xreadlines module can't be
  imported.

  (2) The body of "for line in file:" shouldn't use the file directly;
  the effects (e.g. of file.readline(), file.seek() or even file.tell())
  would be undefined because of the buffering that goes on in the
  xreadlines module.
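
That undefined behavior is easy to reproduce interactively (hedged: this assumes the tmp.txt from the report and 2.2's ~8K read-ahead; exact values vary with the buffer size):

  >>> f = open("tmp.txt")
  >>> for line in f:
  ...     print line.strip()          # comes from the iterator's buffer
  ...     print f.readline().strip()  # comes from the real file position
  ...     break
  ... 
  0
  1861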


tim-one commented Mar 8, 2002


I'm sure Guido was aware of this. Making the simplest-to-spell idiom as fast as possible was a deliberate decision at the time.


jvr mannequin commented Mar 8, 2002


At the cost of, what, sensible, predictable semantics?

  • fast is better than slow
  • but slow is better than unpredictable

Or something...


jorend mannequin commented Mar 8, 2002


Tim wrote: "I'm sure Guido was aware of this."

...Wow, really? That kind of puts a damper on my theory. I guess it can't be a bug after all. :)

Tim wrote: "Making the simplest-to-spell idiom as fast as possible was a deliberate decision at the time."

...*That* I believe.


smurf mannequin commented Mar 19, 2002


I'm in favor of the "let files be their own iterator and set .next equal to .readline" solution. The above example can _really_ bite you.

The current xreadlinesmodule.c could be kept as a standalone module, if the optimization really is necessary.

The trivial solution for this problem is to change CHUNKSIZE (in Modules/xreadlinesmodule.c) to 1. Or, even better, to convert it into an instance variable, so you can do this:

  f = open(...)
  fi = f.iter(chunk=2000)
  for line in fi:
      ...

if you want speed, or just write

  for line in f:

(which internally converts to f.iter(chunk=1)) if you want safety.

I'm not too firm with Python C interfacing, otherwise I'd write a patch... any takers?
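
For what it's worth, here is a rough pure-Python sketch of that idea (a hypothetical iterlines() helper, with chunk playing the role of CHUNKSIZE; generators need a __future__ import in 2.2):

  from __future__ import generators  # generators are new in Python 2.2

  def iterlines(f, chunk=8192):
      # Read 'chunk' bytes at a time and hand out complete lines.
      # chunk=1 keeps the real file position in sync, at the cost of
      # one read() call per byte; bigger chunks buy speed.
      pending = ""
      while 1:
          data = f.read(chunk)
          if not data:
              break
          pending = pending + data
          lines = pending.split("\n")
          pending = lines.pop()      # last piece may be an unfinished line
          for line in lines:
              yield line + "\n"
      if pending:                    # file didn't end with a newline
          yield pending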

gvanrossum commented

There are two forces at work here.

You want the most common case (a single "for line in file" that consumes the whole file) to run blindingly fast.

And you want full generality, basically equating next() with readline() (except raising StopIteration on EOF).

Unfortunately, the only way to go blindingly fast is to do aggressive buffering, and that's what xreadlines does. Until we rewrite the entire I/O system so we have total control over buffering ourselves, it's not easy to mix xreadlines with other operations (readline, seek, tell).

We could make the default file iterator use readline, but then it would become much slower, and we'd have to teach people about xreadlines if they want speed. Or we could use the current solution, where speed is the default, and you have to be more careful when you use an unusual coding style, like breaking out of the for loop and continuing in a second for loop.

I'm not sure which requirement is more common (speed, or generality), but since looping over all the lines of a (big) file is such a common pattern, I would bet that speed is the more important of the two.

In the past we've had a lot of flak about the slowness of the general solution of looping over all lines in a file; the xreadlines-based iterator is much faster, and I am reluctant to change this back in Python 2.3; I'd rather document it carefully (after all, "for line in file" is a new construct in Python 2.2, and people have to be told about it; we might as well tell them about the limitations and how to work around them).
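
One way to stay on the fast path and still break and resume (a sketch, assuming iter(f) hands out a single buffering iterator object, as it does in 2.2): keep that iterator in a variable and reuse it, instead of asking the file for a fresh one each time.

  f = open("tmp.txt")
  it = iter(f)            # one xreadlines-style iterator, one buffer
  for line in it:
      break               # stop after line "0"
  for line in it:
      print line.strip()  # same iterator resumes correctly: prints 1
      break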

gvanrossum commented

Closing as won't fix.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022