
read_csv with iterator=True does not seem to work as expected without chunksize #3967

Closed
garaud opened this issue Jun 20, 2013 · 13 comments · Fixed by #3978
Labels
Bug IO Data IO issues that don't fit into a more specific label
Comments

@garaud
Contributor

garaud commented Jun 20, 2013

Hi there,

I tested:

import pandas as pd
from io import StringIO

data = """A,B,C
foo,1,2,3
bar,4,5,6
baz,7,8,9
"""
reader = pd.read_csv(StringIO(data), iterator=True)

I thought that I could do:

for row in reader:
    print(row)

since reader is iterable. Unfortunately, it calls the generator TextFileReader.__iter__:

def __iter__(self):
    try:
        while True:
            yield self.read(self.chunksize)
    except StopIteration:
        pass

where self.chunksize is None. Maybe set self.chunksize to 1 when it's not
defined and iterator=True is given. I'll propose a patch as soon as possible
--- today or tomorrow.
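The loop described above can be reproduced without pandas. Here is a minimal sketch; FakeReader and its read method are hypothetical stand-ins for TextFileReader, not the actual pandas code. With chunksize=None, read() means "give me the rest", which returns everything on the first call and an empty (but valid) result on every call after that, so StopIteration is never raised.

```python
# Hypothetical stand-in for TextFileReader, illustrating why
# `while True: yield self.read(self.chunksize)` never terminates
# when chunksize is None.
class FakeReader:
    def __init__(self, lines, chunksize=None):
        self.lines = list(lines)
        self.chunksize = chunksize

    def read(self, nrows=None):
        if nrows is None:
            # nrows=None means "read everything that is left"
            chunk, self.lines = self.lines, []
            return chunk
        chunk, self.lines = self.lines[:nrows], self.lines[nrows:]
        return chunk

    def __iter__(self):
        while True:
            yield self.read(self.chunksize)

it = iter(FakeReader(["foo,1,2,3", "bar,4,5,6", "baz,7,8,9"]))
print(next(it))  # ['foo,1,2,3', 'bar,4,5,6', 'baz,7,8,9'] -- everything at once
print(next(it))  # [] -- and empty chunks forever after
```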

Best regards,
Damien G.

@garaud
Contributor Author

garaud commented Jun 20, 2013

Well, iterator[=False] appears in the keyword arguments of read_csv but is commented out in the _parser_defaults dictionary. I wonder if I should just set chunksize=1 in this dictionary, "in a dummy way", or if it's more complicated than that.

I'm writing a test and waiting for some comments about this.

Edit: Aarghhh. Forcing self.chunksize=1 in TextFileReader throws me a pretty SegFault :-) Actually, it's quite unhappy about a new iterator key in **kwds (something to do with _get_options_with_default). Quite confused.

Damien G.

@hayd
Contributor

hayd commented Jun 20, 2013

Wouldn't making empty DataFrames be Falsey also solve this ?

Edit: No it wouldn't!

@hayd
Contributor

hayd commented Jun 20, 2013

I was thinking you could do:

while (yield self.read(self.chunksize)): pass

but that is probably semantically different.
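A sketch of why it is semantically different: the value of a yield expression is whatever the caller sends back in, not the value that was yielded. A plain next() sends None, which is falsey, so the loop would exit after a single chunk. (reader_like below is a toy, not pandas code.)

```python
# Toy generator (not pandas code) using the proposed pattern.
def reader_like():
    chunks = [["foo"], ["bar"], ["baz"]]
    i = 0
    # The yield *expression* evaluates to the value the caller
    # send()s back; plain next() sends None, which is falsey.
    while (yield chunks[i]):
        i += 1

g = reader_like()
print(next(g))   # ['foo'] -- first chunk is yielded
print(list(g))   # [] -- next() sent None back, so the loop exited

# Only an explicit send(True) keeps the loop alive:
g = reader_like()
next(g)
print(g.send(True))  # ['bar']
```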

@jreback
Contributor

jreback commented Jun 20, 2013

i believe this was fixed #3406

@hayd
Contributor

hayd commented Jun 20, 2013

@jreback This still causes an infinite loop (isn't this the issue?):

In [6]: for row in reader:
    print row

@garaud
Contributor Author

garaud commented Jun 20, 2013

@jreback I don't think it is. I'm on the latest commit of master, and the

while True:
    yield self.read(self.chunksize)

causes the issue when self.chunksize=None.
@hayd Yes, the infinite loop is the issue. Sorry, I was not very clear about that.

@garaud
Contributor Author

garaud commented Jun 20, 2013

Update.

>>> reader = pd.read_csv('SampleData.csv', iterator=True, engine='python')
>>> reader.chunksize is None
True
>>> for row in reader:
        print(row)
     A  B  C
foo  1  2  3
bar  4  5  6
baz  7  8  9

No infinite loop here, but it returns the full DataFrame directly. By the way,
the Python parser engine does not seem to take the value of chunksize into
account.

>>> reader = pd.read_csv('SampleData.csv', chunksize=1, engine='python')
>>> for row in reader:
        print(row)
     A  B  C
foo  1  2  3
bar  4  5  6
     A  B  C
baz  7  8  9

The reader loops over the data two rows at a time instead of one. I'll open another issue about this.

For the C engine:

>>> reader = pd.read_csv('SampleData.csv', iterator=True, engine='c')
>>> reader.chunksize is None
True
>>> for row in reader:
        print(row)
Empty DataFrame
Columns: [Region, Rep, Item, Units, Unit Cost, Total]
Index: []
Empty DataFrame
Columns: [Region, Rep, Item, Units, Unit Cost, Total]
Index: []
...

infinitely. I get the Segmentation Fault when I try to pass the iterator
parameter to TextFileReader as described above.

@jreback
Contributor

jreback commented Jun 20, 2013

Leave it all in this issue; it's all related.

@garaud
Contributor Author

garaud commented Jun 20, 2013

OK. The SegFault is my bad (an inconsistency between my source and install dirs and the *.so files generated by Cython).

Take a look at garaud@b0d8903

Works well with TestCParserHighMemory and TestCParserLowMemory but fails with TestPythonParser. I don't get why the Python engine carries out a full loop over TextFileReader when chunksize=None. Keep going, keep going...

Damien G.

@jreback
Contributor

jreback commented Jun 21, 2013

If you specify iterator=True but not chunksize, this is tantamount to reading the entire file; I don't think you can assume chunksize=1 or really any other number. If you want 1, then you need to specify it.

@jreback
Contributor

jreback commented Jun 21, 2013

@garaud both issues are fixed up in master by #3978, pls confirm when you can

@garaud
Contributor Author

garaud commented Jun 21, 2013

OK @jreback, you were faster than me! It sounds good to me.
When I set iterator=True without chunksize, I expect my reader to be an iterable, like a file I can iterate over line by line.

Setting chunksize implies a "real" iterable whereas setting iterator=True does not. This does not seem very consistent to me. Maybe I don't really get the meaning of the iterator parameter.

Anyway, the chunksize bug with Python engine was fixed, good job and thanks !

Damien G.

@jreback
Contributor

jreback commented Jun 21, 2013

Probably iterator=True should set some default for chunksize (this is what I do in the HDFStore iterator).

I think this was set up to allow differing chunk sizes via get_chunk(int), which you can call repeatedly (with possibly differing sizes)...

Pretty easy, I think, to make a default chunksize if it's not set (and iterator is True), then still allow get_chunk to work (which will just override)... If you think that makes sense, pls make an issue (and PR!)
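The proposal above can be sketched as follows. SketchReader and DEFAULT_CHUNKSIZE are hypothetical names for illustration, not the pandas implementation: iterator=True picks a default chunksize, while get_chunk(size) can still override it on any individual call.

```python
# Hedged sketch of the proposal: iterator=True picks a default
# chunksize, while get_chunk(size) can still override it per call.
class SketchReader:
    DEFAULT_CHUNKSIZE = 1  # hypothetical default

    def __init__(self, rows, iterator=False, chunksize=None):
        if iterator and chunksize is None:
            chunksize = self.DEFAULT_CHUNKSIZE
        self.rows = list(rows)
        self.chunksize = chunksize

    def get_chunk(self, size=None):
        # An explicit size overrides the default set at construction.
        if size is None:
            size = self.chunksize
        chunk, self.rows = self.rows[:size], self.rows[size:]
        return chunk

    def __iter__(self):
        while self.rows:
            yield self.get_chunk()

reader = SketchReader(["foo", "bar", "baz"], iterator=True)
print(next(iter(reader)))   # ['foo'] -- default chunksize of 1
print(reader.get_chunk(2))  # ['bar', 'baz'] -- explicit size overrides
```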
