
Automatic detection of HDF5 dataset identifier fails when data contains categoricals #13231

Closed
chrish42 opened this issue May 19, 2016 · 14 comments
Labels
Bug IO HDF5 read_hdf, HDFStore

Comments

@chrish42
Contributor

We use HDF5 to store our pandas dataframes on disk. We only store one dataframe per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF5 file contains a single pandas object is very nice for our workflow.

However, this feature doesn't work when the saved dataframe contains one or more categorical columns:

import pandas as pd

df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})

# This works fine.
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2 = pd.read_hdf('no_cat.hdf5')
print((df == df2).all().all())

# But this produces an exception.
df.assign(col2=pd.Categorical(df.col2)).to_hdf('cat.hdf5', 'data', format='table')
df3 = pd.read_hdf('cat.hdf5')

# ValueError: key must be provided when HDF file contains multiple datasets.

It looks like this is because pandas.read_hdf() doesn't ignore the metadata used to store the categorical codes:

print(pd.HDFStore('cat.hdf5'))

<class 'pandas.io.pytables.HDFStore'>
File path: cat.hdf5
/data                                     frame_table  (typ->appendable,nrows->3,ncols->2,indexers->[index])             
/data/meta/values_block_1/meta            series_table (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[values])

It'd be nice if this feature worked even when some of the columns are categorical. It should be possible to ignore the metadata that pandas creates when checking whether only one dataset is stored, no?
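A minimal sketch of that idea, working on plain key-path strings rather than a real HDFStore (candidate_datasets is a hypothetical helper, not a pandas API): a key that sits under another key's meta subtree is pandas bookkeeping, not a user dataset.

```python
def candidate_datasets(keys):
    """Keep only keys that are not nested under another key's meta subtree.

    Hypothetical helper illustrating the proposed detection fix.
    """
    return [k for k in keys
            if not any(k != other and k.startswith(other + "/meta")
                       for other in keys)]

# The categorical case from above: only one real dataset remains.
print(candidate_datasets(['/data', '/data/meta/values_block_1/meta']))
# ['/data']

# Two genuine datasets are both kept, so the "multiple datasets" error
# would still fire where it should.
print(candidate_datasets(['/data', '/data2']))
# ['/data', '/data2']
```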

@chrish42
Contributor Author

Cc @laufere.

@jreback
Contributor

jreback commented May 19, 2016

yeah this detection needs to be a bit smarter to consider the uniques of the top-level groups (rather than just multiple keys). should be a straightforward fix.
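A sketch of "the uniques of the top-level groups", assuming the key list is exactly what HDFStore.keys() reports for the categorical example above (hard-coded here so the snippet is self-contained):

```python
# Keys as reported by HDFStore.keys() for cat.hdf5 above.
keys = ['/data', '/data/meta/values_block_1/meta']

# The first path component identifies the top-level group each key lives in.
top_level = {'/' + k.split('/')[1] for k in keys}

print(top_level)  # {'/data'}: one unique top-level group, so infer the key
```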

@jreback jreback added this to the 0.18.2 milestone May 19, 2016
@jreback
Contributor

jreback commented May 19, 2016

pull-requests welcome!

@pfrcks
Contributor

pfrcks commented May 19, 2016

@jreback While looking at the code, it seems that in such a case the list returned by store.keys() is empty, which causes the error.
However, as you mention, just checking the keys won't be a good approach.

store.groups()

[/data (Group) ''
children := ['table' (Table), 'meta' (Group)], /data/meta/values_block_1/meta (Group) ''
children := ['table' (Table)]]

Running the code above yields those results. If we compare this with an HDF5 file that actually contains two datasets, we get something like this:

import pandas as pd
df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})
df2 = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'acc']})
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2.to_hdf('no_cat.hdf5', 'data2', format='table')
df3 = pd.read_hdf('no_cat.hdf5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/io/pytables.py", line 336, in read_hdf
    raise ValueError('key must be provided when HDF file contains '
ValueError: key must be provided when HDF file contains multiple datasets.
pd.HDFStore('no_cat.hdf5').groups()
[/data (Group) ''
children := ['table' (Table)], /data2 (Group) ''
children := ['table' (Table)]]

As we can see, when there are actually two datasets, instead of a meta (Group) we get /data2 (Group), where data2 is the key provided when writing to the file.
Maybe we can leverage this?

@pfrcks
Contributor

pfrcks commented May 19, 2016

@jreback One more thing comes to mind along the lines of what you suggested.

pd.HDFStore('cat.hdf5').keys()
['/data', '/data/meta/values_block_1/meta']

Now we can use these keys to get the unique values:

df = pd.HDFStore('cat.hdf5').select('/data')
np.unique(df)
array([11, 21, 31, 'a', 'b'], dtype=object)

df = pd.HDFStore('cat.hdf5').select('/data/meta/values_block_1/meta')
np.unique(df)
array(['a', 'b'], dtype=object)

We can see that the uniques produced by the key /data/meta/values_block_1/meta are a subset of those produced by /data. But if we go down this road, we will also have to consider the key name when making a decision, because it might happen that there are two dataframes in the HDF5 file where the uniques of one are a subset of the other's.
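A tiny illustration of that ambiguity, using plain lists in place of the two tables:

```python
# Unique values of /data and of its meta table in the example above.
data_uniques = [11, 21, 31, 'a', 'b']
meta_uniques = ['a', 'b']

# The subset relation holds for the metadata table...
print(set(meta_uniques) <= set(data_uniques))  # True

# ...but a genuine second dataframe could produce exactly the same uniques,
# so a subset check alone cannot tell metadata apart from real data.
second_df_uniques = ['a', 'b']
print(set(second_df_uniques) <= set(data_uniques))  # True
```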

Am I missing something here?

@jreback
Contributor

jreback commented May 19, 2016

@chrish42 you can just iterate over the groups with tables. look at how .keys() is implemented.

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback did you mean to refer to me, or only to chrish42?

@jreback
Contributor

jreback commented May 20, 2016

oh sorry meant that as a general comment

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback Using the approach you suggested, I can get the key names and then use them to get the individual tables as well. But my question, as I asked in an earlier comment, is: even if we get the unique values from both tables, how can we be certain that one of them holds meta information just because its unique values are a subset of the other's?

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback Any comments? Can you point me in the right direction?

@jreback
Contributor

jreback commented May 20, 2016

@pfrcks just look at the top-level groups.

@chrish42
Contributor Author

chrish42 commented Jun 3, 2016

@jreback I'm working on this at the PyCon sprints (alone so far). So far, I've set up a development environment and added a failing test. I have a couple of questions.

First, should the metadata (the categories, etc.) be hidden from the user by HDFStore or not? (i.e. should the keys(), groups(), etc. methods not show the metadata tables?)

And second, is there a way to know, from the attributes or otherwise, that a table is a metadata table? What would be the best way to do this? I see that the HDFStore.groups() method already does a bunch of filtering out. I'm not sure what the best way to do this for categorical metadata is...

@jreback
Contributor

jreback commented Jun 3, 2016

we don't currently hide the metadata from the main display. It's prob ok to hide it (though do that after).

when you are iterating over the groups, you can tell if there is metadata by seeing if you are on a table node that also has meta.
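A sketch of that check, with the group hierarchy represented as a mapping from group paths to child-node names (a hypothetical structure; a real fix would walk the PyTables nodes instead):

```python
# Toy stand-in for the node tree of cat.hdf5 shown earlier.
groups = {
    '/data': ['table', 'meta'],
    '/data/meta/values_block_1/meta': ['table'],
}

# A table node that also has a 'meta' child keeps its categorical metadata
# beneath it; any group nested inside a meta subtree is therefore bookkeeping.
tables_with_meta = [p for p, children in groups.items() if 'meta' in children]
datasets = [p for p in groups if '/meta/' not in p]

print(tables_with_meta)  # ['/data']
print(datasets)          # ['/data']
```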

@chrish42
Contributor Author

chrish42 commented Jun 3, 2016

Let me know what you think of that pull request. Should I open a separate bug to hide the meta data?

Also, while reading the tests for the pytables IO, I noticed an (old?) var = something_truthy and value1 or value2, which I replaced with a conditional expression. But if the required version of PyTables is now >= 2.2, that line could simply go away.
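For reference, the old idiom and its replacement (a generic example with made-up values; the actual names in the pytables tests differ). The and/or trick silently picks the wrong branch whenever the first value is falsy:

```python
flag = True

# Old pre-conditional-expression idiom: broken when the "true" value is falsy.
old = flag and 0 or 1   # yields 1 even though flag is True, because 0 is falsy

# The conditional expression handles it correctly.
new = 0 if flag else 1  # yields 0

print(old, new)  # 1 0
```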

@jreback jreback closed this as completed in 5a9b498 Jun 5, 2016