API: Should this raise an error or a warning in HDFStore? #4189

Closed
jreback opened this issue Jul 10, 2013 · 11 comments · Fixed by #4206
Labels
API Design IO Data IO issues that don't fit into a more specific label IO HDF5 read_hdf, HDFStore

@jreback
Contributor

jreback commented Jul 10, 2013

Currently in 0.12.

In [3]: DataFrame(randn(10,2)).to_hdf('test_storer.h5','df')

In [4]: DataFrame(randn(10,2)).to_hdf('test_table.h5','df',table=True)

Selecting from a table is good with or without a where

In [6]: pd.read_hdf('test_table.h5','df',where='index>5')
Out[6]: 
          0         1
6  0.296947 -1.876055
7 -0.200516 -0.641670
8 -0.177828  0.877286
9  0.836903 -1.626247

Selecting from a storer raises if where is not None, the theory being that passing a non-None where was a mistake.

However, you then have to know a priori whether you are dealing with a Storer or a Table

In [7]: pd.read_hdf('test_storer.h5','df',where='index>5')
TypeError: cannot pass a where specification when reading a Storer

make this a warning instead?

@jtratner
Contributor

This is definitely a confusing error and error message. Is there a reason that someone would want to have a "Storer" vs. a table? (i.e., why is the default to create something that can't be queried later on...speed?) Why is the default not to store it in PyTables format?

You definitely can't just emit a warning, because ignoring the where clause potentially means (unintentionally) loading a big database into memory.

I'm not convinced this is a TypeError either, but I think it needs to emit an error message that not only clearly states what's going on in terms of the file format, but also specifically direct how to make it possible to work that way.

In [7]: pd.read_hdf('test_storer.h5','df',where='index>5')
TypeError: Basic HDF5 tables cannot be queried with a 'where' clause. Either do not supply a where clause or load from a 'PyTable' instead (e.g., HDF5 data created using 'to_hdf(..., table=True)').

You probably also need to document this better, given that the only time table=True is mentioned in the dev docs on IO is in the Notes & Caveats section:

You can not append/select/delete to a non-table (table creation is determined on the first append, or by passing `table=True` in a put operation)

@jreback
Contributor Author

jreback commented Jul 11, 2013

I never actually use put and table=True...you are prob more familiar with append which always creates a table, while put always creates a storer

These are both PyTable formats of course.

There is a speed difference (mainly in writing) and they are written in different formats.
You cannot append to nor select from a storer (you can only get it in its entirety).
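
A minimal sketch of the put/append difference described above, assuming a recent pandas with PyTables installed. The key names `fixed_df` and `table_df` and the temporary file are made up for illustration. Later pandas releases call the storer format "fixed" and accept `format='table'` where 0.12 used `table=True`; `put` and `append` are used here so the sketch does not depend on that spelling.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=["a", "b"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")  # illustrative file name
    with pd.HDFStore(path, mode="w") as store:
        store.put("fixed_df", df)     # put writes the storer (fixed) format by default
        store.append("table_df", df)  # append always writes a table

        # A table can be queried with a where clause.
        subset = store.select("table_df", where="index > 5")

        # The storer can only be read in its entirety; a where clause raises.
        try:
            store.select("fixed_df", where="index > 5")
            fixed_error = None
        except TypeError as exc:
            fixed_error = exc

print(subset)
print(type(fixed_error).__name__)
```

The exact wording of the error message differs between pandas versions, so the sketch only relies on the exception type.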

@jreback
Contributor Author

jreback commented Jul 11, 2013

actually this is documented pretty well in the doc strings. what exactly do you find confusing about it?

@jtratner
Contributor

@jreback now that you've explained, it's a bit clearer.

I went to the IOTools section of the docs to try to understand what the difference is between the two of them. So, are all the examples with the actual HDFStore assuming that you are working with a PyTable?

@jtratner
Contributor

You write:

These are both PyTable formats of course.

Then why can you query one but not the other?

All this said, maybe this is something that becomes really clear when you start to use the functionality. I have a really far removed view of it, since I haven't had the cause to use this (yet).

@jreback
Contributor Author

jreback commented Jul 11, 2013

yes...all are PyTables. There is a distinction between the storer format, which stores PyTables arrays (I could never think of a good name, and didn't want to use basic; maybe array is better), and the table format, which stores in the PyTables table format.

You can actually index into a storer format, you just can't query (that is, you can do things like array[0:100], kind of like a numpy array on disk).

With a table you can select, that is do something like index>0 & index<100. The real difference is that tables can be appended.

The reason I brought this up is that I happen to sometimes write storer's but mostly write tables. However, I always query them (so in my own code I basically catch the exception and act on it). Just wanted to know what other people do.

leaving it as an exception....maybe i'll update the error message a bit (and I think TypeError is correct, as you are basically doing a query on the wrong type here)
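
The catch-the-exception approach mentioned above might look roughly like this. `read_hdf_where` and its `reader` parameter are hypothetical names, not part of pandas. The in-memory fallback assumes the `where` string is also valid for `DataFrame.query` (true for simple cases like `index>5`) and that the whole object fits in memory, which is exactly the risk raised earlier in this thread.

```python
import pandas as pd


def read_hdf_where(path, key, where=None, reader=pd.read_hdf):
    """Read `key` from `path`, applying `where` on disk when the format allows it.

    Hypothetical helper: `reader` defaults to pd.read_hdf and is only a
    parameter so the fallback logic can be exercised without a real file.
    """
    if where is None:
        return reader(path, key)
    try:
        # Table format: the where clause is evaluated by PyTables on disk.
        return reader(path, key, where=where)
    except TypeError:
        # Storer format: load everything, then filter in memory.
        # Note this also swallows TypeErrors raised for other reasons.
        df = reader(path, key)
        return df.query(where)
```

With the files from the original report, `read_hdf_where('test_storer.h5', 'df', where='index>5')` would be expected to take the fallback path, while the same call on `test_table.h5` would be filtered on disk.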

@TomAugspurger
Contributor

Agreed on raising the TypeError, but I only ever use tables.

@jtratner
Contributor

Yeah, I've got no insight into what other people do with PyTables :P

@jreback
Contributor Author

jreback commented Jul 11, 2013

Here's the doc update, should be more clear (this renders nicely, but I can't seem to paste it here)

Storer Format

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the storer format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. These offer very fast writing and slightly faster reading than table stores.

Warning: A storer format will raise a TypeError if you try to retrieve using a where.
DataFrame(randn(10,2)).to_hdf('test_storer.h5','df')

pd.read_hdf('test_storer.h5','df',where='index>5')
TypeError: cannot pass a where specification when reading a non-table
           this store must be selected in its entirety

@jreback
Contributor Author

jreback commented Jul 11, 2013

@jtratner
Contributor

that's much clearer - looks good!

On Thu, Jul 11, 2013 at 7:54 PM, jreback notifications@github.com wrote:

@jtratner (https://github.com/jtratner), @TomAugspurger (https://github.com/TomAugspurger) better?
http://pandas.pydata.org/pandas-docs/dev/io.html#storer-format


