
Automatic detection of HDF5 dataset identifier fails when data contains categoricals #13231

Closed
chrish42 opened this issue May 19, 2016 · 14 comments
Labels
Bug IO HDF5 read_hdf, HDFStore

Comments

@chrish42
Contributor

We use HDF5 to store our pandas dataframes on disk. We only store one dataframe per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF5 file contains a single pandas object is very nice for our workflow.

However, this feature doesn't work when the saved dataframe contains one or more categorical columns:

import pandas as pd

df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})

# This works fine.
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2 = pd.read_hdf('no_cat.hdf5')
print((df == df2).all().all())

# But this produces an exception.
df.assign(col2=pd.Categorical(df.col2)).to_hdf('cat.hdf5', 'data', format='table')
df3 = pd.read_hdf('cat.hdf5')

# ValueError: key must be provided when HDF file contains multiple datasets.

It looks like this is because pandas.read_hdf() doesn't ignore the metadata used to store the categorical codes:

print(pd.HDFStore('cat.hdf5'))

<class 'pandas.io.pytables.HDFStore'>
File path: cat.hdf5
/data                                     frame_table  (typ->appendable,nrows->3,ncols->2,indexers->[index])             
/data/meta/values_block_1/meta            series_table (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[values])

It'd be nice if this feature worked even when some of the columns are categorical. It should be possible to ignore the metadata that pandas creates when checking whether only one dataset is stored, no?
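A minimal sketch of that idea, working on plain key-path strings rather than a real HDFStore (candidate_datasets is a hypothetical helper, not a pandas API): a key that sits under another key's meta subtree is pandas bookkeeping, not a user dataset.

```python
def candidate_datasets(keys):
    """Keep only keys that are not nested under another key's meta subtree.

    Hypothetical helper illustrating the proposed detection fix.
    """
    return [k for k in keys
            if not any(k != other and k.startswith(other + "/meta")
                       for other in keys)]

# The categorical case from above: only one real dataset remains.
print(candidate_datasets(['/data', '/data/meta/values_block_1/meta']))
# ['/data']

# Two genuine datasets are both kept, so the "multiple datasets" error
# would still fire where it should.
print(candidate_datasets(['/data', '/data2']))
# ['/data', '/data2']
```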

@chrish42
Contributor Author

Cc @laufere.

@jreback
Contributor

jreback commented May 19, 2016

yeah this detection needs to be a bit smarter to consider the uniques of the top-level groups (rather than just multiple keys). should be a straightforward fix.
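A sketch of "the uniques of the top-level groups", assuming the key list is exactly what HDFStore.keys() reports for the categorical example above (hard-coded here so the snippet is self-contained):

```python
# Keys as reported by HDFStore.keys() for cat.hdf5 above.
keys = ['/data', '/data/meta/values_block_1/meta']

# The first path component identifies the top-level group each key lives in.
top_level = {'/' + k.split('/')[1] for k in keys}

print(top_level)  # {'/data'}: one unique top-level group, so infer the key
```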

@jreback jreback added this to the 0.18.2 milestone May 19, 2016
@jreback
Contributor

jreback commented May 19, 2016

pull-requests welcome!

@pfrcks
Contributor

pfrcks commented May 19, 2016

@jreback While looking at the code, it seems that in such a case the list returned by store.keys() is empty, which causes the error.
However, as you mention, just checking the keys won't be a good approach.

store.groups()

[/data (Group) ''
children := ['table' (Table), 'meta' (Group)], /data/meta/values_block_1/meta (Group) ''
children := ['table' (Table)]]

Running the code above yields those results. If we compare this with an HDF5 file that actually contains two datasets, we get something like this:

import pandas as pd
df = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'a']})
df2 = pd.DataFrame({'col1': [11, 21, 31], 'col2': ['a', 'b', 'acc']})
df.to_hdf('no_cat.hdf5', 'data', format='table')
df2.to_hdf('no_cat.hdf5', 'data2', format='table')
df3 = pd.read_hdf('no_cat.hdf5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/io/pytables.py", line 336, in read_hdf
    raise ValueError('key must be provided when HDF file contains '
ValueError: key must be provided when HDF file contains multiple datasets.
pd.HDFStore('no_cat.hdf5').groups()
[/data (Group) ''
children := ['table' (Table)], /data2 (Group) ''
children := ['table' (Table)]]

As we can see, when there are actually two datasets, instead of a meta (Group) we get /data2 (Group), where data2 is the key provided when writing to the file.
Maybe we can leverage this?

@pfrcks
Contributor

pfrcks commented May 19, 2016

@jreback One more thing comes to mind along the lines of what you suggested.

pd.HDFStore('cat.hdf5').keys()
['/data', '/data/meta/values_block_1/meta']

Now we can use these keys to get the unique values:

df = pd.HDFStore('cat.hdf5').select('/data')
np.unique(df)
array([11, 21, 31, 'a', 'b'], dtype=object)

df = pd.HDFStore('cat.hdf5').select('/data/meta/values_block_1/meta')
np.unique(df)
array(['a', 'b'], dtype=object)

We can see that the uniques produced by the key /data/meta/values_block_1/meta are a subset of those produced by /data. But if we go down this road, we will also have to consider the key name when making a decision, because it might happen that there are two dataframes in the HDF5 file where the uniques of one are a subset of the other's.
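A tiny illustration of that ambiguity, using plain lists in place of the two tables:

```python
# Unique values of /data and of its meta table in the example above.
data_uniques = [11, 21, 31, 'a', 'b']
meta_uniques = ['a', 'b']

# The subset relation holds for the metadata table...
print(set(meta_uniques) <= set(data_uniques))  # True

# ...but a genuine second dataframe could produce exactly the same uniques,
# so a subset check alone cannot tell metadata apart from real data.
second_df_uniques = ['a', 'b']
print(set(second_df_uniques) <= set(data_uniques))  # True
```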

Am I missing something here?

@jreback
Contributor

jreback commented May 19, 2016

@chrish42 you can just iterate over the groups with tables. look at how .keys() is implemented.

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback did you mean to refer to me, or only to chrish42?

@jreback
Contributor

jreback commented May 20, 2016

oh sorry meant that as a general comment

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback Using the approach you suggested, I can get the key names and then use them to get the individual tables as well. But my question, as I asked in an earlier comment, is: even if we get the unique values from both tables, how can we be certain that one of them holds meta information just because its unique values are a subset of the other's?

@pfrcks
Contributor

pfrcks commented May 20, 2016

@jreback Any comments? Can you point me in the right direction?

@jreback
Contributor

jreback commented May 20, 2016

@pfrcks just look at the top-level groups.

@chrish42
Contributor Author

chrish42 commented Jun 3, 2016

@jreback I'm working on this at the PyCon sprints (alone so far). So far, I've set up a development environment and added a failing test. I have a couple of questions.

First, should the metadata (the categories, etc.) be hidden from the user by HDFStore or not? (i.e. should the keys(), groups(), etc. methods not show the metadata tables?)

And second, is there a way to know, from the attributes or otherwise, that a table is a metadata table? What would be the best way to do this? I see that the HDFStore.groups() method already does a bunch of filtering out. I'm not sure what the best way to do this for categorical metadata is...

@jreback
Contributor

jreback commented Jun 3, 2016

we don't currently hide the metadata from the main display. It's prob ok to hide it (though do that after).

when you are iterating over the groups, you can tell if there is metadata by seeing if you are on a table node that also has meta.
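A sketch of that check, with the group hierarchy represented as a mapping from group paths to child-node names (a hypothetical structure; a real fix would walk the PyTables nodes instead):

```python
# Toy stand-in for the node tree of cat.hdf5 shown earlier.
groups = {
    '/data': ['table', 'meta'],
    '/data/meta/values_block_1/meta': ['table'],
}

# A table node that also has a 'meta' child keeps its categorical metadata
# beneath it; any group nested inside a meta subtree is therefore bookkeeping.
tables_with_meta = [p for p, children in groups.items() if 'meta' in children]
datasets = [p for p in groups if '/meta/' not in p]

print(tables_with_meta)  # ['/data']
print(datasets)          # ['/data']
```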

@chrish42
Contributor Author

chrish42 commented Jun 3, 2016

Let me know what you think of that pull request. Should I open a separate bug to hide the meta data?

Also, while reading the tests for the pytables IO, I noticed an (old?) var = something_truthy and value1 or value2, which I replaced with a conditional expression. But if the required version of PyTables is now >= 2.2, that line could simply go away.
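For reference, the old idiom and its replacement (a generic example with made-up values; the actual names in the pytables tests differ). The and/or trick silently picks the wrong branch whenever the first value is falsy:

```python
flag = True

# Old pre-conditional-expression idiom: broken when the "true" value is falsy.
old = flag and 0 or 1   # yields 1 even though flag is True, because 0 is falsy

# The conditional expression handles it correctly.
new = 0 if flag else 1  # yields 0

print(old, new)  # 1 0
```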

@jreback jreback closed this as completed in 5a9b498 Jun 5, 2016