ENH: allow column oriented table storage in HDFStore

Note: the `carray` package has since been renamed to `bcolz`: https://github.com/Blosc/bcolz

Soliciting any comments on this proposal to create a columnar access table in `HDFStore`.

This is actually very straightforward to do.

This needs a new keyword argument to describe the storage format:
`format='s|t|c'` (which also allows expansion to other formats in the future)
- `s` is the `Storer` format (e.g. `store['df'] = value`), currently implied
  with `put`
- `t` is the `Table` format (e.g. `store.append('df',value)`), created
  with `table=True` when using `put`, or by using `append`
- `c` is a `CTable` format (new), which is a column oriented table

This would essentially deprecate the `append=` and `table=` keywords (or just translate them)
in favor of a single `format=` keyword.

```
df.to_hdf('test.h5','df',format='c')
```
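The keyword translation mentioned above could be sketched roughly as follows. This is only an illustration: `resolve_format` is a hypothetical helper, not an existing pandas function, and the precedence rules (an explicit `format=` wins, otherwise `append`/`table` imply `t`) are an assumption about how the legacy keywords would map.

```python
VALID_FORMATS = ("s", "t", "c")  # Storer, Table, CTable (as proposed above)


def resolve_format(format=None, append=False, table=False):
    """Map the legacy append=/table= keywords onto a single format letter.

    Hypothetical helper for illustration only.
    """
    if format is not None:
        if format not in VALID_FORMATS:
            raise ValueError(f"invalid format {format!r}, expected one of {VALID_FORMATS}")
        return format
    # Legacy behaviour: appending or table=True both mean the Table format.
    if append or table:
        return "t"
    # Plain put() without keywords currently implies the Storer format.
    return "s"
```

With this, `resolve_format(table=True)` gives `'t'`, while `resolve_format(format='c')` selects the new column-oriented format.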

There will be a master node which holds the structure.
Each column of the `DataFrame` will be stored on its own, as a single-column
format, in a sub-node of the master.
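To make the master/sub-node idea concrete, here is a minimal sketch of what the node layout might look like for a given key. The path scheme, the `index` sub-node name, and the `ctable_layout` function are all assumptions made for illustration; the proposal does not fix any of them.

```python
def ctable_layout(key, columns, index_name="index"):
    """Describe a hypothetical CTable layout: one master node, one sub-node per column.

    Illustrative only; names and paths are assumptions, not part of the proposal.
    """
    if index_name in columns:
        raise ValueError(f"column name {index_name!r} clashes with the index node")
    master = "/" + key.strip("/")
    # The index gets its own sub-node, just like every data column.
    nodes = {index_name: f"{master}/{index_name}"}
    for col in columns:
        nodes[col] = f"{master}/{col}"
    return {
        "master": master,          # holds the structure (column order, etc.)
        "columns": list(columns),  # column order recorded on the master
        "nodes": nodes,            # where each single-column sub-node lives
    }


layout = ctable_layout("df", ["A", "B", "C"])
```

Here `layout["nodes"]["B"]` would be `/df/B`, and dropping column `B` would only require removing that one sub-node and updating the master.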

advantages:
- indexes are kept in their own columns (this is true with `Table` now)
- allows easy delete/add of columns (somewhat tricky in the `Table` format)
- allows appends (the interesting twist is that the indices have to be kept in sync)
- selection is straightforward as everything is indexed the same
- selecting a small number of columns relative to the total should be faster than an equivalent `Table`
- API will be the same as current. This is essentially an extension of the `append_to_multiple / select_as_multiple` multi-table accessors.
- can be included/coexist alongside existing `Table/Storer`s

disadvantages:
- selecting lots of columns will be somewhat slower than an equivalent `Table`
- requires syncing of all the indices (the coordinates of all rows)
- delete operations will be somewhat slower than an equivalent `Table`
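The trade-offs listed above can be seen in a toy in-memory model. This is not the proposed implementation and does not touch `PyTables`; `ToyCTable` is a hypothetical stand-in where each column is a separate list (playing the role of a sub-node) and all columns share one index.

```python
class ToyCTable:
    """In-memory stand-in for a column-oriented table (illustration only)."""

    def __init__(self):
        self.index = []    # row labels shared by every column
        self.columns = {}  # column name -> list of values (one "sub-node" each)

    def append(self, index, data):
        # Appends must hit every column, otherwise the sub-nodes fall out of sync.
        if self.index and set(data) != set(self.columns):
            raise ValueError("appended columns must match existing columns")
        if any(len(values) != len(index) for values in data.values()):
            raise ValueError("column lengths must match the index length")
        for name, values in data.items():
            self.columns.setdefault(name, []).extend(values)
        self.index.extend(index)

    def add_column(self, name, values):
        # Adding a column only creates one new sub-node.
        if len(values) != len(self.index):
            raise ValueError("new column must have one value per row")
        self.columns[name] = list(values)

    def delete_column(self, name):
        # Deleting a column only removes one sub-node.
        del self.columns[name]

    def select(self, names, where=None):
        # Only the requested columns are read; rows are picked via the shared index.
        rows = [i for i, label in enumerate(self.index)
                if where is None or where(label)]
        return {name: [self.columns[name][i] for i in rows] for name in names}

    def delete_rows(self, where):
        # Row deletion has to rewrite every column, which is the slow case.
        keep = [i for i, label in enumerate(self.index) if not where(label)]
        for name in self.columns:
            self.columns[name] = [self.columns[name][i] for i in keep]
        self.index = [self.index[i] for i in keep]
```

Selecting one column reads a single list, while `delete_rows` and `append` must visit every column to keep the row coordinates aligned.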

There are actually two different layouts that could be used here; I propose just the single-file one for now. However, the sub-nodes could instead be spread out in a directory and stored as separate files. That would allow some concurrent access, at least for reads (this is pretty tricky, so hold off on it for now).

This `CTable` format will use the existing `PyTables` infrastructure under the hood. It would also be possible to use the `ctable` module instead; see http://carray.pytables.org/docs/manual/ (this is basically what BLAZE uses under the hood for its storage backend).


ENH: allow column oriented table storage in HDFStore #4454
