
TemporaryFile as input to read_table raises TypeError: '_TemporaryFileWrapper' object is not an iterator #13398

Closed
mbrucher opened this issue Jun 8, 2016 · 26 comments
Labels: Bug, IO CSV (read_csv, to_csv)

@mbrucher (Contributor) commented Jun 8, 2016

Although the docs say that the input can be a file-like object, read_table doesn't work with objects from tempfile. On Windows these files can't be reopened by name, so I need to pass the object itself.

Code Sample, a copy-pastable example if possible

import pandas as pd
from tempfile import TemporaryFile
new_file = TemporaryFile("w+")
dataframe = pd.read_table(new_file, skiprows=3, header=None, sep=r"\s*")

Expected Output

Not an exception!

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64

pandas: 0.18.0

@jreback (Contributor) commented Jun 9, 2016

This happens only with engine='python', because the sep you gave is a regex. If you use sep='\s+' (the more typical whitespace separator), it works as expected.

@jreback (Contributor) commented Jun 9, 2016

In the future, please show the entire show_versions() output. You are missing crucial information there (the platform), though you did put it in the comments. We ask for these things to make issues easier to investigate.

@jreback jreback added this to the Next Major Release milestone Jun 9, 2016
@jreback (Contributor) commented Jun 9, 2016

Pull requests are welcome.

@mbrucher (Author) commented Jun 9, 2016

Do you mean that if I used sep='\s+', there would be no exception?
Yes, I removed some info because it's not relevant here (except the platform; I didn't notice I had removed the OS), and there are some things that I can't share as well.

@jreback (Contributor) commented Jun 10, 2016

Yes. If you were splitting on whitespace, it would use the C engine, which would give you an error that the data file is empty.

Since you used a regex, it went to the Python engine, which gives that weird error (only on Windows).

@mbrucher (Author):

Oh, OK. The thing is that I may have several spaces between columns, so I have to use the regex :(

@jreback (Contributor) commented Jun 11, 2016

\s+ is whitespace with at least a single space; allowing 0 spaces (\s*) is very weird.
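Since the concern was columns separated by a variable number of spaces, a quick sketch (with made-up data values) showing that sep=r"\s+" already handles that case, so the \s* regex is unnecessary:

```python
import io
import pandas as pd

# sep=r"\s+" means "one or more whitespace characters", so columns
# separated by several spaces still split correctly, and pandas can
# use the fast C engine for this separator:
data = "1   2     3\n4  5 6\n"
df = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None)
print(df.shape)  # (2, 3)
```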

@mbrucher (Author):

Yes, agreed that 0 spaces is weird :)
BTW, the data file is not empty; I'm passing the file-like object. It shouldn't fail in any case!

@jreback (Contributor) commented Jun 11, 2016

Oh, the example above IS empty.

In any case, if you would like to debug, I think it's a simple fix.

@mbrucher (Author):

Oh yes, sorry. I forgot I had to remove the data as it is confidential!

@mbrucher (Author):

The issue is that you can't call next() on this file object, apparently.

@gfyoung (Member) commented Jun 18, 2016

@mbrucher:

  1. If you can't provide the original data, create dummy data that triggers the exception, ideally an example that reproduces with just read_table(new_file).

  2. If you have confidentiality issues, can you try reproducing the issue on another machine? Full version output is extremely useful when trying to debug.

  3. How does your tempfile have data? Are you calling new_file.write before you call read_table? If so, make sure to call new_file.seek(0) first to reset the stream position. Otherwise, none of your written data will be read (you can see this for yourself by calling new_file.read() before and after calling new_file.seek(0)).

I should add that this advice also applies to normal file objects (i.e. those created by calling open(...)), so IIUC this issue is not unique to tempfiles.
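Point 3 can be demonstrated with a stdlib-only sketch, no pandas involved (on POSIX, TemporaryFile returns a plain file object, so the bug discussed in this issue does not interfere):

```python
from tempfile import TemporaryFile

f = TemporaryFile("w+")
f.write("0 0\n1 1\n")
f.flush()

# The stream position now sits at end-of-file, so reading yields nothing:
assert f.read() == ""

# After rewinding, the written data is visible again:
f.seek(0)
assert f.read() == "0 0\n1 1\n"
f.close()
```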

@jreback (Contributor) commented Jun 18, 2016

@gfyoung this repros exactly as above with an empty file

@gfyoung (Member) commented Jun 18, 2016

I know, but I thought @mbrucher said the file contained data, and I was addressing that. In any case, unless a more convincing example can be provided, I think this is safe to close, as the function does work with tempfiles in the manner I described, data or no data.

@jreback (Contributor) commented Jun 18, 2016

No, it doesn't on Windows.

@jreback (Contributor) commented Jun 18, 2016

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.18.1+139.ge24ab24'

In [4]: import pandas as pd

In [5]: from tempfile import TemporaryFile

In [6]: new_file = TemporaryFile("w+")

In [7]: dataframe = pd.read_table(new_file, skiprows=3, header=None, sep=r"\s*")
C:\Miniconda2\envs\pandas3.5\Scripts\ipython-script.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not
 support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying
engine='python'.
  if __name__ == '__main__':
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-43d01852f446> in <module>()
----> 1 dataframe = pd.read_table(new_file, skiprows=3, header=None, sep=r"\s*")

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, s
queeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_va
lues, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, itera
tor, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols,
error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lin
es, memory_map, float_precision)
    627                     skip_blank_lines=skip_blank_lines)
    628
--> 629         return _read(filepath_or_buffer, kwds)
    630
    631     parser_f.__name__ = name

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    380
    381     # Create the parser.
--> 382     parser = TextFileReader(filepath_or_buffer, **kwds)
    383
    384     if (nrows is not None) and (chunksize is not None):

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    710             self.options['has_index_names'] = kwds['has_index_names']
    711
--> 712         self._make_engine(self.engine)
    713
    714     def close(self):

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _make_engine(self, engine)
    894             elif engine == 'python-fwf':
    895                 klass = FixedWidthFieldParser
--> 896             self._engine = klass(self.f, **self.options)
    897
    898     def _failover_to_python(self):

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in __init__(self, f, **kwds)
   1742         # infer column indices from self.usecols if is is specified.
   1743         self._col_indices = None
-> 1744         self.columns, self.num_original_columns = self._infer_columns()
   1745
   1746         # Now self.columns has the set of columns that we will process.

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _infer_columns(self)
   2068         else:
   2069             try:
-> 2070                 line = self._buffered_line()
   2071
   2072             except StopIteration:

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _buffered_line(self)
   2136             return self.buf[0]
   2137         else:
-> 2138             return self._next_line()
   2139
   2140     def _empty(self, line):

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _next_line(self)
   2164             while self.pos in self.skiprows:
   2165                 self.pos += 1
-> 2166                 next(self.data)
   2167
   2168             while True:

C:\Users\conda\Documents\pandas3.5\pandas\io\parsers.py in _read()
   1869         else:
   1870             def _read():
-> 1871                 line = next(f)
   1872                 pat = re.compile(sep)
   1873                 yield pat.split(line.strip())

TypeError: '_TemporaryFileWrapper' object is not an iterator

@mbrucher (Author):

So if the file is populated, the same issue of course occurs:

import pandas as pd
from tempfile import TemporaryFile
new_file = TemporaryFile("w+")
new_file.write("0 0")
new_file.flush()
new_file.seek(0)
dataframe = pd.read_table(new_file, header=None, sep=r"\s+", engine="python")
print(dataframe)

Tested on OS X with Python 2.7 (brew version): works like a charm, so there must be a difference in the implementation. I don't have a 3.5 on my Mac, so I can't try it to see whether it's the OS or the Python version :/

@gfyoung I know perfectly well how files work, thank you very much. I've been writing Python for more than a decade now; I hit all these issues in the past and obviously know how to avoid them. But I guess you didn't try my code before posting your message.

As @jreback said, it should be "easy" to fix, so I'll have a try when I have time.
A completely different question: why can't we use a list of strings to generate a DataFrame? (For instance, a filtered file would end up being a list of strings that could be read into pandas; that's actually my use case. I'm using a TemporaryFile because I couldn't figure out another way.)

@jreback (Contributor) commented Jun 18, 2016

@mbrucher what do you mean by a 'list of strings'?

You can! The difference is that this is not very efficient, as the input has to be introspected (to figure out what exactly you are passing, since there are many possibilities) and then converted to a storage format (e.g. numpy). These steps are not necessarily cheap, whereas the parser has more information available (e.g. it already knows the layout and can infer dtypes directly).

In [12]: DataFrame(['foo', 'bar', 'baz'])
Out[12]: 
     0
0  foo
1  bar
2  baz

In [13]: DataFrame([['foo', 'bar', 'baz']])
Out[13]: 
     0    1    2
0  foo  bar  baz

@mbrucher (Author):

Actually I was thinking of something like pd.read_table(["0 0", "1 1"], header=None, sep=r"\s+", engine="python") as the data is not yet parsed in my case (reading a report file that mixes lots of things together, only looking for specific tables that I then append to a list).

@jreback (Contributor) commented Jun 18, 2016

It's much more efficient to do this with the C engine, since you have whitespace separation. Join the strings with a line separator and you are set.

In [5]: pd.read_csv(StringIO('\n'.join(["0 0", "1 1"])), header=None, sep="\s+")
Out[5]: 
   0  1
0  0  0
1  1  1

@mbrucher (Author):

OK, thanks.

It seems that this file-like object doesn't implement next(). The issue comes from the fact that, to select the type of reader, pandas checks for the readline attribute (which is used for separators of length 1), but it calls next() for the other separators.
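The readline-vs-next() mismatch can be illustrated with a toy wrapper (a hypothetical stand-in for tempfile's _TemporaryFileWrapper, which forwards attribute access to the underlying file via __getattr__). In Python 3, next(obj) looks up __next__ on the object's type, bypassing __getattr__, so plain delegation isn't enough:

```python
import io

class Wrapper:
    """Toy stand-in for tempfile's _TemporaryFileWrapper: forwards
    attribute lookups to the wrapped file via __getattr__."""
    def __init__(self, f):
        self._f = f

    def __getattr__(self, name):
        return getattr(self._f, name)

w = Wrapper(io.StringIO("0 0\n1 1\n"))

# Ordinary attribute delegation works, so the length-1 separator path
# (which calls readline) is fine:
assert w.readline() == "0 0\n"

# But next() looks up __next__ on type(w), bypassing __getattr__,
# so the regex-separator path blows up:
try:
    next(w)
except TypeError as exc:
    print(exc)  # 'Wrapper' object is not an iterator
```

In Python 2.7, _TemporaryFileWrapper was an old-style class, for which special-method lookup does fall back to __getattr__; that would presumably explain why the error could not be reproduced under 2.7 above.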

@gfyoung (Member) commented Jun 18, 2016

@mbrucher: Whoa, slow down there; aren't we letting our ego get a bit in the way of rational conversation? First of all, your code gave no indication that you were aware of this, so if you would like to update your code example in the initial post, go right ahead and do so.

Second, I did in fact try it out on a newly-acquired Windows 7 machine with Python 2.7.11 and v0.18.1, and could not reproduce the exception. In addition, I tested the new examples that were later posted and also got no exception.

@mbrucher (Author):

@gfyoung Which is why I specified the Python version, as AFAIK there is a change in the behavior of the next API between versions. Anyway, the pull request fixes it and I'm adding a test as we speak.

@gfyoung (Member) commented Jun 18, 2016

@mbrucher: fair enough, but it's worth noting that the issue you raise isn't then a general Windows bug, but rather a change in the way TemporaryFile is implemented between Python versions.

@mbrucher (Author):

They must have forgotten about it when they changed the next() API :(

@jreback jreback modified the milestones: 0.18.2, Next Major Release Jun 19, 2016
jreback pushed a commit that referenced this issue Jun 22, 2016
dcloses #13398

Author: Matthieu Brucher <matthieu.brucher@gmail.com>

Closes #13481 from mbrucher/issue-13398 and squashes the following commits:

8b52631 [Matthieu Brucher] Yet another small update for more general regex
0d54151 [Matthieu Brucher] Simplified
5871625 [Matthieu Brucher] Grammar
aa3f0aa [Matthieu Brucher] lint change
1c33fb5 [Matthieu Brucher] Simplified test and added what's new note.
d8ceb57 [Matthieu Brucher] lint changes
fd20aaf [Matthieu Brucher] Moved the test to the Python parser test file
98e476e [Matthieu Brucher] Using same way of referencing as just above, consistency.
119fb65 [Matthieu Brucher] Added reference to original issue in the test + test the result itself (assuming that previous test is OK)
5af8465 [Matthieu Brucher] Adding a test with Python engine
d8decae [Matthieu Brucher] #13398 Change the way of reading back to readline (consistent with the test before entering the function)
@gfyoung (Member) commented Jul 2, 2016

@jreback: this issue should have been closed by @mbrucher's commit (I think it wasn't because the commit message says "dcloses" instead of "closes").

@jreback jreback closed this as completed Jul 2, 2016