ENH group attribute access for HDFStore #7334

Closed
wants to merge 1 commit into from

Conversation

wabu
Contributor

@wabu wabu commented Jun 4, 2014

Summary:

  • store.set_attrs(key, **attrs) to set attributes of a group
  • store.get_attrs(key, attrs, default) returns a namedtuple with values from a group's attributes, or a single attribute directly if attrs is a single attribute name given as a string
  • to_hdf, put and append have optional attrs=dict(...) argument to update attributes when storing an object

Examples:

from pandas.io.pytables import get_attrs
df.to_hdf('h5', 'df', attrs=dict(a=1, b=2))
a,b = get_attrs('h5', 'df', ['a', 'b'])
df = read_hdf('h5', 'df')
st.put('df', df)
st.set_attrs('df', a=1, b=2)
a,b,c = st.get_attrs('df', 'a b c', default=None)
a = st.get_attrs('df', 'a')
b, = st.get_attrs('df', ['b'])
st.append_to_multiple({'grp/a': ['A1', 'A2'], 'grp/b': ['B1', 'B2']}, 'grp/a')
st.set_attrs('grp', a=1, b=2)
attrs = st.get_attrs('grp', ['a', 'b'])

@jreback jreback added the HDF5 label Jun 4, 2014
@jreback
Contributor

jreback commented Jun 4, 2014

this was originally discussed in #2485. Given that I show this in the cookbook, might be a nice feature.

Couple of issues:

  • you could deserialize this and store it in df.attrs as a dict (would need to add attrs to core/generic/NDFrame._metadata so this is recognized).
  • what happens if you append and specify attrs on multiple calls: merge the data, overwrite, raise?

@jreback
Contributor

jreback commented Jun 4, 2014

@cpcloud I know you use HDF5 a bit....what do you think about this feature?

@cpcloud
Member

cpcloud commented Jun 4, 2014

👍 on this. @wabu can u give a usage example of reading attributes in the top of the PR?

-1 on df.attrs (for this PR anyway, +1 for metadata generally) bc then you can't really have attrs on a group; you can only have them for a particular frame.

in my use cases i have metadata that i don't want to store as an array and it may apply to all the frames in a particular group.

one ubiquitous example is sampling rate, which may apply to multiple frames but i don't want to store as a single element series (which is what i do now).

as for appending data, seems like should overwrite if new values for attributes are provided and keep whatever's there already. similar to how dict.update works.
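
The merge behaviour described here can be modelled with a plain dict standing in for a group's attribute set. This is only a sketch of the semantics: `merge_attrs` is a hypothetical helper and the metadata values are made up, neither comes from pandas or from this PR.

```python
def merge_attrs(existing, new):
    """Return the attributes after an append, following dict.update semantics."""
    merged = dict(existing)  # keep whatever is already stored
    merged.update(new)       # newly supplied values overwrite matching keys
    return merged


# Hypothetical metadata: a sampling rate shared by all frames in a group.
stored = {"sampling_rate": 1000, "units": "mV"}
after_append = merge_attrs(stored, {"sampling_rate": 2000, "subject": "s01"})
print(after_append)
```

Keys that the second call does not mention (`units` here) survive, which is the "keep whatever's there already" part.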

@wabu
Contributor Author

wabu commented Jun 4, 2014

when appending, attribute values are overwritten, but older keys stay:

df.to_hdf('h5file', 'df', attrs=dict(a=1, b=1), append=True)
df.to_hdf('h5file', 'df', attrs=dict(a=2, c=2), append=True)
with pd.get_store('h5file') as st:
    print(repr(st.get_storer('df').attrs))
    # prints:
    #   a := 2,
    #   b := 1,
    #   c := 2,

so I'll document the behavior?
Moreover, as noted in #2485, I'll add a warning about the size limitation of the meta store.

@jreback
Contributor

jreback commented Jun 4, 2014

hmm, maybe better than putting this on the append call, how about a pair of get_attrs and set_attrs? more generic that way (could be on the append call too I guess, I just don't want to add an arg to select to return_attrs; I think that's kind of clunky)

@jreback jreback added this to the 0.14.1 milestone Jun 4, 2014
@cpcloud
Member

cpcloud commented Jun 4, 2014

API like

store.get_attrs('nested/group', ['a', 'b'])
store.set_attrs('nested/group', dict(a=1, b=2))

?

@jreback
Contributor

jreback commented Jun 4, 2014

yep, though could also have a name or something to handle different attr nodes. maybe better than adding in append.

@wabu
Contributor Author

wabu commented Jun 4, 2014

get_attrs and set_attrs sound good. I still would like to have df.to_hdf(..., attrs=...), so one can store metadata easily. I don't know how to handle it for the read_hdf case. Perhaps returning the data and something like a named tuple, so one can:

df, attrs = read_hdf(..., attrs=['a', 'b'])
print(attrs.b)
df, (a,b) = read_hdf(..., attrs=['a', 'b'])

set_attrs could also accept kwargs, and get_attrs returns a namedtuple?

a, b = store.get_attrs('nested/group', 'a', 'b')
store.set_attrs('nested/group', a=1, b=2)

@jreback
Contributor

jreback commented Jun 4, 2014

@wabu i think that is a reasonable interface. You will restrict the keys to only non-space keys

e.g. 'a', 'b' are valid but a column with spaces is not (but no big deal). you will have to check this on creation. the reason is that the named tuple won't work with them. alternatively you can return a dict (might be cleaner).
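
The restriction on key names can be seen in a small model, with a plain dict standing in for the group's stored attributes. This `get_attrs` is a hypothetical stand-alone sketch, not the PR's implementation; it always returns a namedtuple and treats a string as a whitespace-separated list of names.

```python
from collections import namedtuple


def get_attrs(stored, names, default=None):
    """Look up `names` in `stored` and return the values as a namedtuple."""
    if isinstance(names, str):
        names = names.split()  # 'a b c' -> ['a', 'b', 'c']
    for name in names:
        # A namedtuple field must be a valid identifier, so 'sampling rate'
        # is rejected. namedtuple itself additionally rejects keywords and
        # names that start with an underscore.
        if not name.isidentifier():
            raise ValueError(f"attribute name {name!r} cannot be a namedtuple field")
    Attrs = namedtuple("Attrs", names)
    return Attrs(*(stored.get(name, default) for name in names))


stored = {"a": 1, "b": 2}
attrs = get_attrs(stored, ["a", "b"])
print(attrs.b)  # 2
a, b = attrs    # tuple unpacking works as well
```

Returning a dict instead would lift the restriction on names, at the cost of losing positional unpacking.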

you are ONLY going to return the attrs IF attrs is specified (so the API is unchanged) for read_hdf.

also prob then need to add to append/put/select/to_hdf/read_hdf for consistency.

@wabu wabu changed the title from "ENH hdf write accepts attrs dict for group attribtes" to "ENH group attribute access for HDFStore" Jun 8, 2014
@wabu
Contributor Author

wabu commented Jun 9, 2014

Here's my impl. Still have to update docs n' stuff, but wanted to get some first feedback. I updated the first comment to give an overview of the changes. I'm using store.update_attrs, so it's clear that it works like dict.update.

to_hdf, put, append and read_hdf, get, select, select_column, select_as_coordinates all have an optional argument attrs; append_to_multiple/select_as_multiple raise a TypeError, as the attrs argument is invalid there.

@@ -65,6 +66,7 @@ def _ensure_encoding(encoding):

Term = Expr

_raise_attribute = object()
Contributor

this is very odd to do; what is the purpose?

Contributor Author

I use it as a distinguishable default value. If another default value is given, it will be used when an attribute is missing; if no default value is given, an error should be raised. A simple return_attrs(..., default=None) can't be used, as the code can't tell whether it should return None as the default or raise because no default was given.
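
The sentinel idea can be sketched on its own, again with a dict standing in for the stored attributes. The names `_MISSING` and `get_attr` are illustrative only and do not come from the PR.

```python
_MISSING = object()  # unique sentinel: no caller can pass this value by accident


def get_attr(stored, name, default=_MISSING):
    """Return stored[name]; raise KeyError only if no default was supplied."""
    if name in stored:
        return stored[name]
    if default is _MISSING:
        raise KeyError(name)  # caller gave no default, so a missing key is an error
    return default            # an explicit default, including None, is returned


stored = {"a": 1}
print(get_attr(stored, "a"))        # 1
print(get_attr(stored, "b", None))  # None
```

Because the comparison uses `is`, an explicit `default=None` stays distinguishable from "no default given".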

Contributor Author

do you have any better ideas how to handle this? It would also be used for get_attrs. One could also have a look at *args and **kwargs, but I think that's ugly as well.

Contributor

it seems that get_attrs is basically get so on non-existence of an attribute just return default - I think that's fine

@jreback
Contributor

jreback commented Jun 9, 2014

this api is getting to be way too bloated. I would revert back to only having set_attrs/get_attrs (which accept a store or a file). You can then have the default argument there. Anytime you have to multiply name an argument, e.g. default_attrs, then the api needs work. Eliminate everything else.

@wabu
Contributor Author

wabu commented Jun 9, 2014

yes, totally with you on this. where should get_attrs/set_attrs go if they accept a file?
added this to stay consistent over all store functions, but already was not happy with the complexity when implementing it.

@jreback
Contributor

jreback commented Jun 9, 2014

I think it would be ok to directly import, e.g. from pandas.io.pytables import get_attr,set_attr if you want to use them directly with a filename (w/o a store). could also add to global imports at some point (though name would have to be different), or I guess as part of to_hdf/read_hdf (but that gets the same issue about the API bloating)
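
One way module-level functions can accept either a filename or an already open store is a small context manager that opens and closes the file only when it was given a path. The sketch below uses the standard library's `shelve` as a stand-in for `HDFStore` so that it runs without PyTables; `_as_store`, `set_attrs` and `get_attrs` here are hypothetical and not the PR's code.

```python
import os
import shelve
import tempfile
from contextlib import contextmanager


@contextmanager
def _as_store(path_or_store):
    """Yield an open store, closing it afterwards only if we opened it here."""
    if isinstance(path_or_store, str):
        store = shelve.open(path_or_store)
        try:
            yield store
        finally:
            store.close()
    else:
        yield path_or_store  # the caller owns this store and closes it


def set_attrs(path_or_store, key, **attrs):
    with _as_store(path_or_store) as store:
        merged = dict(store.get(key, {}))
        merged.update(attrs)  # dict.update semantics, as discussed above
        store[key] = merged


def get_attrs(path_or_store, key, names, default=None):
    with _as_store(path_or_store) as store:
        stored = store.get(key, {})
        return [stored.get(name, default) for name in names]


path = os.path.join(tempfile.mkdtemp(), "attrs_demo")
set_attrs(path, "df", a=1, b=2)   # called with a filename
with shelve.open(path) as st:
    set_attrs(st, "df", b=3)      # called with an open store
print(get_attrs(path, "df", ["a", "b", "c"]))
```

Keeping the open-or-reuse logic in one helper means each public function needs only a single `default` argument and no per-call variants.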

@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 22, 2014
@jreback
Contributor

jreback commented Jun 22, 2014

@wabu ?

- HDFStore.set_attrs to store attribute of a group
- HDFStore.get_attrs access to attributes
- toplevel set_attrs/get_attrs with hdf5-file passed as string
- attrs dict for to_hdf, put, append to store attributes
@wabu
Contributor Author

wabu commented Jun 23, 2014

ok, did the set_attrs, get_attrs thing, also left the attrs=dict() for the write functions (to_hdf, put, append), so one is encouraged to put metadata with the data and it did not complicate the code, but I can remove this also.

sets attributes of a node

Note that the size of the metastore for a group inside a hdf5 file is
limited and already used for internal metadata, so be carefull about
Contributor

careful

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014
@jreback
Contributor

jreback commented Sep 4, 2014

@wabu status on this?

@jreback
Contributor

jreback commented Sep 14, 2014

@wabu status?

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014
@jreback
Contributor

jreback commented Jan 18, 2015

closing for now. @wabu if you'd like to update, happy to reopen.

@jreback jreback closed this Jan 18, 2015
Labels
API Design IO HDF5 read_hdf, HDFStore
3 participants