# A healthy intro to HDF5

Inspired by notes from [Christopher Lovell](https://www.christopherlovell.co.uk/blog/2016/04/27/h5py-intro.html) and [Andrew Collette](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/).


Install h5py with `%pip install h5py`

In [3]:
import h5py

In [8]:
hf = h5py.File('data.h5', 'w')  # Create file, truncate if exists


`hf = h5py.File('data.h5', 'x')` creates the file and fails if it already exists.
`hf = h5py.File('data.h5', 'a')` opens the file for reading and writing if it exists, and creates it otherwise.
[source](https://docs.h5py.org/en/stable/high/file.html?highlight=w%20#opening-creating-files)
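The difference between the modes is easiest to see side by side. The following is a minimal sketch, assuming h5py is installed; the file name `modes_demo.h5` and the dataset names `x` and `y` are made up for illustration, and the file lives in a temporary directory so no real data is touched.

```python
import os
import tempfile

import h5py

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

# 'w': create the file, truncating it if it already exists.
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# 'x': exclusive create. The file exists now, so this should fail.
try:
    h5py.File(path, "x")
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

# 'a': open read/write and keep what is already in the file.
with h5py.File(path, "a") as f:
    f.create_dataset("y", data=[4, 5, 6])
    names = sorted(f.keys())

print(exclusive_create_failed, names)
```

Catching `OSError` is a deliberately broad choice here, since the exact exception subclass may differ between h5py versions.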

In [None]:
hf.create_dataset('tbl_1/column1', data=list(range(10)))

To enable compression, create the dataset with the `compression` argument instead:

`hf.create_dataset('tbl_1/column1', data=list(range(10)), compression="lzf")`

The benefit _may_ be limited, so measure the effect. Reported figures for LZF compression on three kinds of data:

| type               | compress time | decompress time | compression |
|:-------------------|--------------:|----------------:|------------:|
| trivial data       |       18.6 ms |         17.8 ms |      96.65% |
| sine wave w. noise |       65.5 ms |         24.4 ms |      15.53% |
| random data        |       67.8 ms |         24.8 ms |       8.94% |

Source: *Python and HDF5: Unlocking Scientific Data*, by Andrew Collette
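One way to measure the effect on your own data is to store the same array with and without compression and compare the bytes each dataset occupies on disk. This is a rough sketch, not a benchmark: the file name, the group names, and the two sample arrays are assumptions chosen for illustration, and it measures size only, not timing.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "compression_demo.h5")

rng = np.random.default_rng(0)
samples = {
    "trivial": np.zeros(100_000),    # highly compressible
    "random": rng.random(100_000),   # close to incompressible
}

with h5py.File(path, "w") as f:
    for name, data in samples.items():
        f.create_dataset(f"{name}/raw", data=data)
        f.create_dataset(f"{name}/lzf", data=data, compression="lzf")

# Reopen the file so everything is flushed before asking for on-disk sizes.
sizes = {}
with h5py.File(path, "r") as f:
    for name in samples:
        raw_bytes = f[f"{name}/raw"].id.get_storage_size()
        lzf_bytes = f[f"{name}/lzf"].id.get_storage_size()
        sizes[name] = (raw_bytes, lzf_bytes)
        print(f"{name}: raw={raw_bytes} B, lzf={lzf_bytes} B")
```

Swapping in a slice of your real data for `samples` gives a quick indication of whether compression is worth the extra CPU time.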


In [None]:
hf.create_dataset('tbl_1/column2', data=[str(i) for i in range(10)])

In [9]:
hf.close()

In [10]:
hf = h5py.File('data.h5', 'r')  # Read-only, file must exist (default)

In [11]:
hf.keys()

<KeysViewHDF5 ['tbl_1']>

In [14]:
c1 = hf.get('tbl_1/column1')
list(c1)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [17]:
c2 = hf.get('tbl_1/column2')
[i.decode("utf-8") for i in c2]

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
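Strings come back as `bytes`, which is why the explicit `decode` is needed above. As an alternative, h5py 3.0 and newer offer `Dataset.asstr()`, a wrapper that decodes on read. The sketch below assumes such a version; the file name is made up, and `dtype=h5py.string_dtype()` is passed only to make the variable-length UTF-8 string type explicit.

```python
import os
import tempfile

import h5py

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "strings_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset(
        "tbl_1/column2",
        data=[str(i) for i in range(10)],
        dtype=h5py.string_dtype(),  # variable-length UTF-8 strings
    )

with h5py.File(path, "r") as f:
    ds = f["tbl_1/column2"]
    as_bytes = list(ds[:3])          # plain read: bytes objects
    as_text = list(ds.asstr()[:3])   # asstr() wrapper: decoded str objects

print(as_bytes, as_text)
```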

In [18]:
hf.close()

Time to append some data...

In [20]:
hf = h5py.File('data.h5', 'r+')  # Read/write, file must exist

In [21]:
hf.create_dataset('tbl_1/column3', data=[str(i*10) for i in range(10)])

<HDF5 dataset "column3": shape (10,), type "|O">

In [54]:
from datetime import datetime, timedelta
now = datetime.now()

# Store timestamps as ISO 8601 strings, one per day, without microseconds
timestamps = [(now.replace(microsecond=0) + timedelta(days=i)).isoformat() for i in range(10)]
hf.create_dataset('tbl_1/column4', data=timestamps)
list(hf['tbl_1/column4'])

[b'2022-02-05T17:35:08',
 b'2022-02-06T17:35:08',
 b'2022-02-07T17:35:08',
 b'2022-02-08T17:35:08',
 b'2022-02-09T17:35:08',
 b'2022-02-10T17:35:08',
 b'2022-02-11T17:35:08',
 b'2022-02-12T17:35:08',
 b'2022-02-13T17:35:08',
 b'2022-02-14T17:35:08']
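The timestamps are stored as ISO 8601 strings and read back as `bytes`, so turning them into `datetime` objects again takes a decode followed by `datetime.fromisoformat`. A small sketch using only the standard library; the `stored` list is a stand-in for values read from the file, built here from a fixed start time.

```python
from datetime import datetime, timedelta

# Stand-in for the bytes values read back from 'tbl_1/column4'.
start = datetime(2022, 2, 5, 17, 35, 8)
stored = [(start + timedelta(days=i)).isoformat().encode("utf-8") for i in range(3)]

# Decode each value to str, then parse the ISO 8601 text.
parsed = [datetime.fromisoformat(raw.decode("utf-8")) for raw in stored]

print(parsed[0], parsed[1] - parsed[0])
```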

In [55]:
hf.keys()

<KeysViewHDF5 ['tbl_1']>

In [56]:
for k, v in hf['/tbl_1'].items():
    print(k, list(v[:5]))

column1 [0, 1, 2, 3, 4]
column2 [b'0', b'1', b'2', b'3', b'4']
column3 [b'0', b'10', b'20', b'30', b'40']
column4 [b'2022-02-05T17:35:08', b'2022-02-06T17:35:08', b'2022-02-07T17:35:08', b'2022-02-08T17:35:08', b'2022-02-09T17:35:08']


In [57]:
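# Read the first five values of every column in the group
columns = []
for k, v in hf['/tbl_1'].items():
    columns.append(v[:5])

def rotate(columns):
    """Turn a list of columns into rows, one tuple per row."""
    for row_ix, _ in enumerate(columns[0]):
        yield tuple(columns[col_ix][row_ix] for col_ix, _ in enumerate(columns))

for row in rotate(columns):
    print(row)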

(0, b'0', b'0', b'2022-02-05T17:35:08')
(1, b'1', b'10', b'2022-02-06T17:35:08')
(2, b'2', b'20', b'2022-02-07T17:35:08')
(3, b'3', b'30', b'2022-02-08T17:35:08')
(4, b'4', b'40', b'2022-02-09T17:35:08')
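Turning columns into rows is a transpose, and for equally long columns the built-in `zip` gives the same result as the hand-written generator. The sketch below uses plain lists as a stand-in for the data read from the file and repeats the generator under the hypothetical name `rotate_columns` so the two can be compared.

```python
# Stand-in for the columns read from the HDF5 group.
columns = [
    [0, 1, 2],
    [b"0", b"1", b"2"],
    [b"0", b"10", b"20"],
]

def rotate_columns(columns):
    """Index-based transpose, same logic as the generator above."""
    for row_ix, _ in enumerate(columns[0]):
        yield tuple(columns[col_ix][row_ix] for col_ix, _ in enumerate(columns))

# zip(*columns) pairs up the i-th element of every column.
rows = list(zip(*columns))

for row in rows:
    print(row)
```

One difference to keep in mind: `zip` silently stops at the shortest column, while the index-based version raises an `IndexError` if a later column is shorter than the first.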


Finally, always remember to close the file:

In [58]:
hf.close()
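Forgetting `close()` is easy, especially when an exception interrupts the code. `h5py.File` also works as a context manager, which closes the file when the `with` block ends. A minimal sketch, assuming h5py is installed; the file name is made up and the error is simulated.

```python
import os
import tempfile

import h5py

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "close_demo.h5")

try:
    with h5py.File(path, "w") as hf:
        hf.create_dataset("tbl_1/column1", data=list(range(10)))
        open_inside = bool(hf)  # an open File object is truthy
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The with block closed the file despite the error.
open_after = bool(hf)
print(open_inside, open_after)
```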