Description
renamed carray package: https://github.com/Blosc/bcolz
Soliciting any comments on this proposal to create a columnar access table in `HDFStore`.
This is actually very straightforward to do.

need a new kw argument to describe the type of format for storage: `format='s|t|c'` (also allows expansion in the future to other formats)

- `s` is the `Storer` format (e.g. `store['df'] = value`), implied currently with `put`
- `t` is the `Table` format (e.g. `store.append('df',value)`), created with `table=True` when using `put` or using `append`
- `c` is a `CTable` format (new), which is a column oriented table

so will essentially deprecate the `append=`, `table=` keywords (or just translate them) to a `format=` kw.
df.to_hdf('test.h5','df',format='c')
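The keyword translation could look roughly like the sketch below. This is only an illustration: `resolve_format` is a hypothetical helper, not an existing pandas function, and the choice to raise when `format=` is combined with the legacy keywords is an assumption rather than part of the proposal.

```python
import warnings

# Format codes from the proposal: Storer, Table and the new CTable.
VALID_FORMATS = {"s": "Storer", "t": "Table", "c": "CTable"}


def resolve_format(format=None, table=None, append=None):
    """Hypothetical helper: map legacy table=/append= onto one format code."""
    if format is not None:
        if format not in VALID_FORMATS:
            raise ValueError(
                f"invalid format {format!r}; expected one of {sorted(VALID_FORMATS)}"
            )
        # Assumption: mixing the new and the legacy keywords is an error.
        if table is not None or append is not None:
            raise ValueError("pass either format= or table=/append=, not both")
        return format

    if table is not None or append is not None:
        warnings.warn(
            "table= and append= are deprecated; use format=",
            DeprecationWarning,
            stacklevel=2,
        )
        # table=True or append=True both mean the Table format.
        return "t" if (table or append) else "s"

    # With no keywords at all, keep what put implies today: the Storer format.
    return "s"
```

With this in place, `resolve_format(table=True)` and `resolve_format(format='t')` would select the same format, so existing callers keep working while the warning nudges them toward `format=`.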
Will have a master node which holds the structure.
Each sub-node of the master will store a single column from the `DataFrame`.
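To make the layout concrete, here is a minimal in-memory sketch that uses plain dicts and lists in place of real `PyTables` nodes. The names `build_ctable` and `select_columns`, and the exact fields kept in the master node, are assumptions made for illustration only.

```python
def build_ctable(columns, index):
    """Lay a frame out as a master node plus one sub-node per column.

    columns: dict mapping column name -> list of values
    index:   list of index values, one per row
    """
    nrows = len(index)
    for name, values in columns.items():
        if len(values) != nrows:
            raise ValueError(f"column {name!r} has {len(values)} rows, expected {nrows}")
    return {
        # The master node only describes the structure; it holds no data.
        "master": {"format": "c", "columns": list(columns), "nrows": nrows},
        # The index lives in its own node.
        "index": list(index),
        # Each column is stored separately, standing in for a sub-node.
        "columns": {name: list(values) for name, values in columns.items()},
    }


def select_columns(ctable, names, rows=None):
    """Read only the requested column nodes, optionally at given row coordinates."""
    if rows is None:
        rows = range(ctable["master"]["nrows"])
    rows = list(rows)
    index = [ctable["index"][i] for i in rows]
    # Columns that were not asked for are never touched.
    data = {name: [ctable["columns"][name][i] for i in rows] for name in names}
    return index, data
```

Because every node shares the same row coordinates, one list of coordinates selects matching rows from the index and from any subset of columns.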
advantages:
- index(es) are kept in their own columns (this is true with `Table` now)
- allows easy delete/add of columns (somewhat tricky in the `Table` format)
- allows appends (interesting twist is that have to keep the indices in sync)
- selection is straightforward as everything is indexed the same
- selecting a small number of columns relative to the total should be faster than an equivalent `Table`
- API will be the same as current. This is essentially an extension of the `append_as_multiple / select_as_multiple` multi-table accessors.
- can be included/coexist alongside existing `Table/Storer`s
disadvantages:
- selecting lots of columns will be somewhat slower than an equivalent `Table`
- requires syncing of all the indices (the coordinates of all rows)
- delete operations will be somewhat slower than an equivalent `Table`
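The trade-offs listed above can be seen in a toy model, sketched here with a plain dict standing in for the stored nodes. All function names are hypothetical and nothing below is `PyTables` or pandas API. Adding or dropping a column touches a single node, while appending or deleting rows has to visit every node to keep the row coordinates in sync.

```python
def new_table(index, columns):
    """Toy column store: one list per column plus a separate index list."""
    return {
        "index": list(index),
        "columns": {k: list(v) for k, v in columns.items()},
        "nrows": len(index),
    }


def append_rows(table, new_index, new_columns):
    """Append rows to every node, validating first so nothing ends up half-written."""
    if set(new_columns) != set(table["columns"]):
        raise ValueError("appended data must cover exactly the stored columns")
    n = len(new_index)
    if any(len(v) != n for v in new_columns.values()):
        raise ValueError("each appended column needs one value per new index entry")
    table["index"].extend(new_index)
    for name, values in new_columns.items():
        table["columns"][name].extend(values)
    table["nrows"] += n


def delete_rows(table, rows):
    """Delete row coordinates from the index and from every column node."""
    drop = set(rows)
    keep = [i for i in range(table["nrows"]) if i not in drop]
    table["index"] = [table["index"][i] for i in keep]
    for name in list(table["columns"]):
        values = table["columns"][name]
        table["columns"][name] = [values[i] for i in keep]
    table["nrows"] = len(keep)


def add_column(table, name, values):
    """Adding a column creates one new node; the existing nodes are untouched."""
    if len(values) != table["nrows"]:
        raise ValueError("new column must match the current number of rows")
    table["columns"][name] = list(values)


def delete_column(table, name):
    """Dropping a column removes a single node."""
    del table["columns"][name]
```

In this model the cost of a row delete grows with the number of columns, which matches the expectation that deletes will be slower than in a row-oriented `Table`.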
There are actually 2 different formats that could be used here; I propose just the single-file one for now. However, the sub-nodes could instead be spread out in a directory and stored as separate files. This would allow concurrent access, with some concurrent reads allowed (this is pretty tricky, so hold off on this for now).
This `CTable` format will use the existing `PyTables` infrastructure under the hood; it is possible to use the `ctable` module however: http://carray.pytables.org/docs/manual/ (this is basically what BLAZE uses under the hood for its storage backend)