
read_csv with iterator=True does not seem to work as expected without chunksize #3967

Closed
garaud opened this issue Jun 20, 2013 · 13 comments · Fixed by #3978
Labels
Bug IO Data IO issues that don't fit into a more specific label
Comments

@garaud
Contributor

garaud commented Jun 20, 2013

Hi there,

I tested:

import pandas as pd
from io import StringIO

data = """A,B,C
foo,1,2,3
bar,4,5,6
baz,7,8,9
"""
reader = pd.read_csv(StringIO(data), iterator=True)

I thought that I could do:

for row in reader:
    print(row)

since reader is iterable. Unfortunately, it calls the generator TextFileReader.__iter__:

def __iter__(self):
    try:
        while True:
            yield self.read(self.chunksize)
    except StopIteration:
        pass

where self.chunksize is None. Maybe set self.chunksize to 1 when it's not
defined and iterator=True is given. I'll propose a patch as soon as possible
--- today or tomorrow.
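The loop described above can be reproduced without pandas. Here is a minimal sketch; FakeReader and its read method are hypothetical stand-ins for TextFileReader, not the actual pandas code. With chunksize=None, read() means "give me the rest", which returns everything on the first call and an empty (but valid) result on every call after that, so StopIteration is never raised.

```python
# Hypothetical stand-in for TextFileReader, illustrating why
# `while True: yield self.read(self.chunksize)` never terminates
# when chunksize is None.
class FakeReader:
    def __init__(self, lines, chunksize=None):
        self.lines = list(lines)
        self.chunksize = chunksize

    def read(self, nrows=None):
        if nrows is None:
            # nrows=None means "read everything that is left"
            chunk, self.lines = self.lines, []
            return chunk
        chunk, self.lines = self.lines[:nrows], self.lines[nrows:]
        return chunk

    def __iter__(self):
        while True:
            yield self.read(self.chunksize)

it = iter(FakeReader(["foo,1,2,3", "bar,4,5,6", "baz,7,8,9"]))
print(next(it))  # ['foo,1,2,3', 'bar,4,5,6', 'baz,7,8,9'] -- everything at once
print(next(it))  # [] -- and empty chunks forever after
```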

Best regards,
Damien G.

@garaud
Contributor Author

garaud commented Jun 20, 2013

Well, iterator[=False] appears in the keyword arguments of read_csv but is commented out in the _parser_defaults dictionary. I wonder if I should just set chunksize=1 in this dictionary, "in a dummy way", or if it's more complicated than that.

I'm writing a test and waiting for some comments about this.

Edit: Aarghhh. Forcing self.chunksize=1 in TextFileReader throws me a pretty SegFault :-) Actually, it's quite unhappy about a new iterator key in **kwds (something to do with _get_options_with_default). Quite confused.

Damien G.

@hayd
Contributor

hayd commented Jun 20, 2013

Wouldn't making empty DataFrames be Falsey also solve this ?

Edit: No it wouldn't!

@hayd
Contributor

hayd commented Jun 20, 2013

I was thinking you could do:

while (yield self.read(self.chunksize)): pass

but that is probably semantically different.
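A sketch of why it is semantically different: the value of a yield expression is whatever the caller sends back in, not the value that was yielded. A plain next() sends None, which is falsey, so the loop would exit after a single chunk. (reader_like below is a toy, not pandas code.)

```python
# Toy generator (not pandas code) using the proposed pattern.
def reader_like():
    chunks = [["foo"], ["bar"], ["baz"]]
    i = 0
    # The yield *expression* evaluates to the value the caller
    # send()s back; plain next() sends None, which is falsey.
    while (yield chunks[i]):
        i += 1

g = reader_like()
print(next(g))   # ['foo'] -- first chunk is yielded
print(list(g))   # [] -- next() sent None back, so the loop exited

# Only an explicit send(True) keeps the loop alive:
g = reader_like()
next(g)
print(g.send(True))  # ['bar']
```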

@jreback
Contributor

jreback commented Jun 20, 2013

i believe this was fixed #3406

@hayd
Contributor

hayd commented Jun 20, 2013

@jreback This still causes an infinite loop (isn't this the issue?):

In [6]: for row in reader:
    print row

@garaud
Contributor Author

garaud commented Jun 20, 2013

@jreback I don't think it is. I'm on the latest commit of master, and the

while True:
    yield self.read(self.chunksize)

causes the issue when self.chunksize=None.
@hayd Yes, the infinite loop is the issue. Sorry, I was not very clear about that.

@garaud
Contributor Author

garaud commented Jun 20, 2013

Update.

>>> reader = pd.read_csv('SampleData.csv', iterator=True, engine='python')
>>> reader.chunksize is None
True
>>> for row in reader:
        print(row)
     A  B  C
foo  1  2  3
bar  4  5  6
baz  7  8  9

No infinite loop here, but it returns the full DataFrame directly. By the way,
the Python parser engine does not seem to take the value of chunksize into
account.

>>> reader = pd.read_csv('SampleData.csv', chunksize=1, engine='python')
>>> for row in reader:
        print(row)
     A  B  C
foo  1  2  3
bar  4  5  6
     A  B  C
baz  7  8  9

The reader loops over the data two rows at a time instead of one. I'll open another issue about this.

For the C engine:

>>> reader = pd.read_csv('SampleData.csv', iterator=True, engine='c')
>>> reader.chunksize is None
True
>>> for row in reader:
        print(row)
Empty DataFrame
Columns: [Region, Rep, Item, Units, Unit Cost, Total]
Index: []
Empty DataFrame
Columns: [Region, Rep, Item, Units, Unit Cost, Total]
Index: []
...

infinitely. I get the Segmentation Fault when I try to pass the iterator
parameter to TextFileReader as described above.

@jreback
Contributor

jreback commented Jun 20, 2013

Leave it all in this issue; it's all related.

@garaud
Contributor Author

garaud commented Jun 20, 2013

OK. The SegFault is my bad (an inconsistency between my source and install dirs and the *.so files generated by Cython).

Take a look at garaud@b0d8903

Works well with TestCParserHighMemory and TestCParserLowMemory but fails with TestPythonParser. I don't get why the Python engine carries out a full loop over TextFileReader when chunksize=None. Keep going, keep going...

Damien G.

@jreback
Contributor

jreback commented Jun 21, 2013

If you specify iterator=True but not chunksize, this is tantamount to reading the entire file; I don't think you can assume chunksize=1 or really any other number. If you want 1, then you need to specify it.

@jreback
Contributor

jreback commented Jun 21, 2013

@garaud both issues are fixed up in master by #3978, pls confirm when you can

@garaud
Contributor Author

garaud commented Jun 21, 2013

OK @jreback, you were faster than me! It sounds good to me.
When I set iterator=True without chunksize, I expect my reader to be an iterable, like a file I can iterate over line by line.

Setting chunksize implies a "real" iterable whereas setting iterator=True does not. This does not seem very consistent to me. Maybe I don't really get the meaning of the iterator parameter.

Anyway, the chunksize bug with Python engine was fixed, good job and thanks !

Damien G.

@jreback
Contributor

jreback commented Jun 21, 2013

Probably iterator=True should set some default for chunksize (this is what I do in the HDFStore iterator).

I think this was set up to allow differing chunk sizes via get_chunk(int), which you can call repeatedly (with possibly differing sizes)...

Pretty easy, I think, to make a default chunksize if it's not set (and iterator is True), then still allow get_chunk to work (which will just override)... If you think that makes sense, pls make an issue (and PR!)
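The proposal above can be sketched as follows. SketchReader and DEFAULT_CHUNKSIZE are hypothetical names for illustration, not the pandas implementation: iterator=True picks a default chunksize, while get_chunk(size) can still override it on any individual call.

```python
# Hedged sketch of the proposal: iterator=True picks a default
# chunksize, while get_chunk(size) can still override it per call.
class SketchReader:
    DEFAULT_CHUNKSIZE = 1  # hypothetical default

    def __init__(self, rows, iterator=False, chunksize=None):
        if iterator and chunksize is None:
            chunksize = self.DEFAULT_CHUNKSIZE
        self.rows = list(rows)
        self.chunksize = chunksize

    def get_chunk(self, size=None):
        # An explicit size overrides the default set at construction.
        if size is None:
            size = self.chunksize
        chunk, self.rows = self.rows[:size], self.rows[size:]
        return chunk

    def __iter__(self):
        while self.rows:
            yield self.get_chunk()

reader = SketchReader(["foo", "bar", "baz"], iterator=True)
print(next(iter(reader)))   # ['foo'] -- default chunksize of 1
print(reader.get_chunk(2))  # ['bar', 'baz'] -- explicit size overrides
```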
