ENH: allow column oriented table storage in HDFStore #4454

Open
@jreback

Description

Update: the carray package has been renamed to bcolz: https://github.com/Blosc/bcolz

Soliciting any comments on this proposal to create a columnar access table in HDFStore.

This is actually very straightforward to do.

This needs a new kw argument to describe the storage format:
`format='s|t|c'` (this also allows expansion to other formats in the future)

  • s is the Storer format (e.g. store['df'] = value), currently implied
    with `put`
  • t is the Table format (e.g. store.append('df',value)), created
    with table=True when using put, or by using append
  • c is a CTable format (new), which is a column-oriented table

This will essentially deprecate the append= and table= keywords (or just translate them)
in favor of a format= kw.
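
The keyword translation could be sketched as follows. This is only an illustration: `resolve_format` is a hypothetical helper, not an existing pandas function, and the choice to warn rather than raise on the legacy keywords is an assumption.

```python
import warnings

# Single-letter codes from the proposal: Storer, Table, CTable.
VALID_FORMATS = {"s", "t", "c"}


def resolve_format(format=None, table=None, append=None):
    """Map the legacy table=/append= keywords onto one format code.

    Hypothetical helper for illustration; not part of pandas.
    """
    if format is not None:
        if format not in VALID_FORMATS:
            raise ValueError(
                f"invalid format {format!r}; expected one of {sorted(VALID_FORMATS)}"
            )
        if table is not None or append is not None:
            raise ValueError("pass format= or the legacy table=/append=, not both")
        return format

    if table is not None or append is not None:
        warnings.warn(
            "table= and append= are deprecated; use format= instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Either legacy flag being truthy implies the Table format.
        return "t" if (table or append) else "s"

    # Nothing given: the Storer format is the implied default.
    return "s"
```

With this, `resolve_format(table=True)` and `resolve_format(format='t')` resolve to the same code, so existing calls keep working while new code only passes format=.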

df.to_hdf('test.h5','df',format='c')

There will be a master node which holds the structure.
Each sub-node of the master will store a single column from the DataFrame.
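
A minimal in-memory sketch of that layout, using plain dicts and lists in place of real HDF5 nodes. All function names and the dict keys (`master`, `index`, `nodes`) are assumptions made for illustration; the actual node structure is not specified in this proposal.

```python
def to_column_layout(index, columns):
    """Split a table into a master record plus one node per column.

    `index` is a list of row labels; `columns` maps name -> list of values.
    """
    nrows = len(index)
    for name, values in columns.items():
        if len(values) != nrows:
            raise ValueError(f"column {name!r} has {len(values)} rows, expected {nrows}")
    return {
        # The master node only describes the structure.
        "master": {"format": "c", "columns": list(columns), "nrows": nrows},
        # The index lives in its own node, separate from the data.
        "index": list(index),
        # One sub-node per column.
        "nodes": {name: list(values) for name, values in columns.items()},
    }


def select_columns(layout, names):
    """Read only the requested column nodes, plus the shared index."""
    missing = [n for n in names if n not in layout["nodes"]]
    if missing:
        raise KeyError(missing)
    return layout["index"], {n: layout["nodes"][n] for n in names}


def add_column(layout, name, values):
    """Adding a column creates one sub-node and updates the master."""
    if name in layout["nodes"]:
        raise ValueError(f"column {name!r} already exists")
    if len(values) != layout["master"]["nrows"]:
        raise ValueError("new column must have one value per existing row")
    layout["nodes"][name] = list(values)
    layout["master"]["columns"].append(name)


def drop_column(layout, name):
    """Dropping a column removes one sub-node; other nodes are untouched."""
    del layout["nodes"][name]
    layout["master"]["columns"].remove(name)
```

Selecting a few columns touches only those sub-nodes, and adding or dropping a column never rewrites the others, which is the main appeal of this layout.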

advantages:

  • index(s) are kept in their own columns (this is true with Table now)
  • allows easy delete/add of columns (somewhat tricky in the Table format)
  • allows appends (the interesting twist is that the indices have to be kept in sync)
  • selection is straightforward as everything is indexed the same
  • selecting a small number of columns relative to the total should be faster than an equivalent Table
  • API will be the same as current. This is essentially an extension of the append_to_multiple / select_as_multiple multi-table accessors.
  • can be included/coexist alongside existing Table/Storers

disadvantages:

  • selecting lots of columns will be somewhat slower than an equivalent Table
  • requires syncing of all the indices (the coordinates of all rows)
  • delete operations will be somewhat slower than an equivalent Table
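
The syncing cost can be illustrated with a small in-memory model in which every row-level operation has to visit every column. This is a sketch under assumptions: the `store` dict shape and the function names are hypothetical, and a real implementation would operate on PyTables nodes rather than Python lists.

```python
def append_rows(store, index, columns):
    """Append rows to every column; validate first so nothing is half-written."""
    if set(columns) != set(store["columns"]):
        raise ValueError("appended data must contain exactly the stored columns")
    nrows = len(index)
    if any(len(values) != nrows for values in columns.values()):
        raise ValueError("every appended column needs one value per new row")
    store["index"].extend(index)
    for name, values in columns.items():
        store["columns"][name].extend(values)


def delete_rows(store, predicate):
    """Delete rows whose index label matches; returns the number removed.

    The same row positions must be removed from every column, which is
    why deletes cost more here than in a single row-oriented table.
    """
    keep = [i for i, label in enumerate(store["index"]) if not predicate(label)]
    removed = len(store["index"]) - len(keep)
    store["index"] = [store["index"][i] for i in keep]
    for name, values in store["columns"].items():
        store["columns"][name] = [values[i] for i in keep]
    return removed


def is_in_sync(store):
    """True when every column has exactly one value per index entry."""
    nrows = len(store["index"])
    return all(len(values) == nrows for values in store["columns"].values())
```

For example, starting from `store = {"index": [0, 1], "columns": {"x": [1, 2], "y": [3, 4]}}`, an append or delete rewrites both `x` and `y`, whereas a row-oriented Table would touch a single node.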

There are actually 2 different formats that could be used here; I propose just the single-file one for now. However, the sub-nodes could instead be spread out in a directory and stored as separate files. This would allow some concurrent reads (this is pretty tricky, so hold off on this for now).

This CTable format will use the existing PyTables infrastructure under the hood. It is possible to use the ctable module instead, however: http://carray.pytables.org/docs/manual/ (this is basically what BLAZE uses under the hood for its storage backend).
