# A healthy intro to HDF5

Inspired by notes from [Wolfgang Kopp](https://gist.github.com/wkopp/1443c258a95021da2c4e9630da155f13), [Christopher Lovell](https://www.christopherlovell.co.uk/blog/2016/04/27/h5py-intro.html) and [Andrew Collette](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/)

The use case for Tablite is given as:

- Create, read and update larger-than-memory dataset efficiently.
- Have clean exit from Python without loss of data (or having to recompute)
- Have fast calculations (in memory, on demand)
- Be able to append, extend, filter, concatenate, group, etc. using a simple python api.

So I need a wrapper around something that gives the convenience of tablite api. What is that something?

- SQLite? Works very well as file format, but is quite slow. Even with all the locks turned off. On spinning disks we've seen throughput as low as 27,000 rows per second.
- numpy? Is certainly fast enough for most use cases, but requires mmap to get to disk. The stream of bytes is linear, so non-linear read will be slow (proof follows).
- HDF5? Is a bit more bloated than mmap'ed numpy, but since the bloat mostly resides on disk it won't matter. The bloat is indices that overcome the random read access issue.
- Why not use PyTables? Well, I'm no expert, but in the documentation I couldn't find a way to keep data in memory and have Python's `atexit` drop particular tables to disk without a lot of magic.

So what "smells" right? Probably HDF5 (the hint was in the title).

- HDF5 file created using `io.BytesIO` for in-memory usage.
- HDF5 file on disk for larger-than-memory usage.
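- A simple wrapper for the Pythonic functionality I need, which lets me use `__del__` to save to disk. Python normally invokes `__del__` during garbage collection, including at interpreter exit, though it does not guarantee this for objects still alive at exit; an explicit `atexit` hook is the safer route.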
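
A minimal sketch of such a wrapper, assuming an explicit `atexit.register` hook rather than `__del__`. `InMemoryTable` and its `close` method are hypothetical names for illustration, not tablite's API.

```python
import atexit
import io
import pathlib
import tempfile

import h5py


class InMemoryTable:
    """Hypothetical wrapper: HDF5 file held in RAM, written to disk on close."""

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self._buffer = io.BytesIO()
        self.h5 = h5py.File(self._buffer, "w")  # the file lives in the buffer
        atexit.register(self.close)  # runs at normal interpreter exit

    def close(self):
        if self.h5:  # an h5py.File is falsy once it has been closed
            self.h5.close()
            self.path.write_bytes(self._buffer.getvalue())


table = InMemoryTable(pathlib.Path(tempfile.mkdtemp()) / "table.h5")
table.h5.create_dataset("column1", data=[1, 2, 3])
table.close()  # explicit here; the atexit hook then becomes a no-op

with h5py.File(table.path, "r") as f:
    restored = list(f["column1"])
```

Closing the file before copying the buffer keeps the bytes on disk a complete HDF5 file; the second call from `atexit` does nothing because the file is already closed.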

However before I prematurely commit to HDF5, let's do a performance test:

In [1]:
import io
import pathlib
import numpy as np
import h5py
import cProfile

In [2]:
# let's make a tempfolder to wipe at the end.
tmp = pathlib.Path("./tmp")
tmp.mkdir(exist_ok=True)
tmp.exists()
print(str(tmp.absolute()))

c:\Data\github\github-pages\content\tmp


In [3]:
# Ordinary numpy array
arr = np.arange(50_000_000, dtype=np.float64)

In [4]:
# Memory map
nmmarr = np.memmap( shape=arr.shape, filename=tmp /"benchmark.nmm", mode='w+', dtype=np.float64)
nmmarr[:] = arr[:]

In [5]:
# hdf5 file
f = h5py.File(tmp / "benchmark.hdf5", "w", driver='core')
d = f.create_dataset("mydataset", arr.shape, dtype=arr.dtype)
d[:] = arr[:]
f.close()
f = h5py.File(tmp / "benchmark.hdf5", "r", driver='core')

In [6]:
fh = io.BytesIO()
f2 = h5py.File(fh, 'w')  # 'w' creates a fresh HDF5 file inside the buffer
h5io = f2.create_dataset('mydataset', arr.shape, dtype=arr.dtype, data=arr[:])

In [7]:
hdarr = f.get('mydataset')

In [8]:
pyarr = [v for v in arr]

In [9]:
def run(x, bs=10000, check_sum=1249999975000000.0):  # test function!
    j = 0
    for i in range(len(x)//bs):
        j+= sum(x[(i*bs):((i+1)*bs)])
    assert j == check_sum,j

In [10]:
def np_run(x, bs=10000, check_sum=1249999975000000.0):  # test function using numpy's vectorised sum!
    j = 0
    for i in range(len(x)//bs):
        j+= x[(i*bs):((i+1)*bs)].sum()
    assert j == check_sum,j

In [11]:
cProfile.run('run(nmmarr)')  

         50040005 function calls in 29.730 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.014    0.014   29.730   29.730 4288317929.py:1(run)
     5000    0.007    0.000    0.023    0.000 <__array_function__ internals>:2(may_share_memory)
        1    0.000    0.000   29.730   29.730 <string>:1(<module>)
     5000    0.013    0.000    0.039    0.000 memmap.py:288(__array_finalize__)
 50010000   19.611    0.000   19.650    0.000 memmap.py:333(__getitem__)
     5000    0.002    0.000    0.002    0.000 multiarray.py:1368(may_share_memory)
        1    0.000    0.000   29.730   29.730 {built-in method builtins.exec}
     5000    0.003    0.000    0.003    0.000 {built-in method builtins.hasattr}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     5000   10.067    0.002   29.663    0.006 {built-in method builtins.sum}
     5000    0.015    0.000    0.015    0.000 {built-in method num

In [12]:
cProfile.run('run(arr)')  

         5005 function calls in 3.124 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007    3.124    3.124 4288317929.py:1(run)
        1    0.000    0.000    3.124    3.124 <string>:1(<module>)
        1    0.000    0.000    3.124    3.124 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     5000    3.117    0.001    3.117    0.001 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [13]:
cProfile.run('np_run(arr)')

         15005 function calls in 0.117 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005    0.117    0.117 957082684.py:1(np_run)
        1    0.000    0.000    0.117    0.117 <string>:1(<module>)
     5000    0.002    0.000    0.110    0.000 _methods.py:46(_sum)
        1    0.000    0.000    0.117    0.117 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     5000    0.108    0.000    0.108    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     5000    0.003    0.000    0.112    0.000 {method 'sum' of 'numpy.ndarray' objects}




In [14]:
cProfile.run('run(hdarr)') 

         30026 function calls (30024 primitive calls) in 3.274 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.017    0.017    3.274    3.274 4288317929.py:1(run)
        1    0.000    0.000    3.274    3.274 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 base.py:305(id)
        1    0.000    0.000    0.000    0.000 dataset.py:409(shape)
     5000    0.002    0.000    0.002    0.000 dataset.py:469(_fast_reader)
        1    0.000    0.000    0.000    0.000 dataset.py:572(_extent_type)
        1    0.000    0.000    0.000    0.000 dataset.py:631(__len__)
        1    0.000    0.000    0.000    0.000 dataset.py:642(len)
        1    0.000    0.000    0.000    0.000 dataset.py:683(_fast_read_ok)
     5000    0.008    0.000    0.092    0.000 dataset.py:691(__getitem__)
      2/1    0.000    0.000    0.000    0.000 functools.py:973(__get__)
        1    0.000    0.000    3.274    3.274 {built-i

In [15]:
cProfile.run('run(h5io)')

         35025 function calls (35023 primitive calls) in 3.367 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.017    0.017    3.367    3.367 4288317929.py:1(run)
        1    0.000    0.000    3.367    3.367 <string>:1(<module>)
     5003    0.001    0.000    0.001    0.000 base.py:305(id)
        1    0.000    0.000    0.000    0.000 dataset.py:409(shape)
     5000    0.099    0.000    0.100    0.000 dataset.py:469(_fast_reader)
        1    0.000    0.000    0.000    0.000 dataset.py:572(_extent_type)
        1    0.000    0.000    0.000    0.000 dataset.py:631(__len__)
        1    0.000    0.000    0.000    0.000 dataset.py:642(len)
        1    0.000    0.000    0.000    0.000 dataset.py:683(_fast_read_ok)
     5000    0.022    0.000    0.217    0.000 dataset.py:691(__getitem__)
      2/1    0.000    0.000    0.000    0.000 functools.py:973(__get__)
        1    0.000    0.000    3.367    3.367 {built-i

In [16]:
cProfile.run('run(pyarr)') # pure python function for comparison. As this is all in memory. This won't scale.

         5005 function calls in 2.192 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.374    0.374    2.192    2.192 4288317929.py:1(run)
        1    0.000    0.000    2.192    2.192 <string>:1(<module>)
        1    0.000    0.000    2.192    2.192 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     5000    1.818    0.000    1.818    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




#### In summary

- nmmarr = 29.730 seconds (worst, memory-mapped numpy array)
- h5io = 3.367 seconds (Python `sum` on HDF5 with BytesIO)
- hdarr = 3.274 seconds (Python `sum` on HDF5)
- arr = 3.124 seconds (Python `sum` on numpy array)
- pyarr = 2.192 seconds (pure Python)
- arr (np_run) = 0.117 seconds (best, optimised numpy)
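
The profile for `run(nmmarr)` shows about 50 million calls to `memmap.__getitem__`, which suggests the cost is per-element access from Python's built-in `sum` rather than the disk itself. The sketch below applies the block-wise `.sum()` from `np_run` to a memory map. The array is deliberately smaller than in the benchmark, and `chunked_sum` is a hypothetical helper; no timings are claimed.

```python
import pathlib
import tempfile

import numpy as np

small = np.arange(1_000_000, dtype=np.float64)
path = pathlib.Path(tempfile.mkdtemp()) / "small.nmm"
mm = np.memmap(path, mode="w+", dtype=np.float64, shape=small.shape)
mm[:] = small
mm.flush()


def chunked_sum(x, bs=10_000):
    """Sum x in blocks of bs values, letting numpy reduce each block."""
    total = 0.0
    for start in range(0, len(x), bs):
        total += x[start:start + bs].sum()  # one numpy call per block
    return total


total = chunked_sum(mm)
expected = float(small.sum())
del mm  # release the memory map
```

Stepping with `range(0, len(x), bs)` also covers a trailing partial block, which the `len(x)//bs` loop would skip when the length is not a multiple of the block size.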


In [17]:
f.close()
del nmmarr

## Performance says it's HDF5. Can we handle all datatypes?

With the performance question out of the way, the next question is whether hdf5 can handle all the dirty data cases that tablite copes with.

They are:

- Booleans
- Integers
- Floats
- Nones
- Strings
- Datetimes
- Times
- Dates
- A mixture of them all.

The best case is of course when HDF5 can handle the datatype natively.
The [h5py FAQ](https://docs.h5py.org/en/latest/faq.html) answers that well: Integers, Floats, Strings (fixed length, variable length), Booleans are handled natively.

As strings are handled as bytes, the unicode encoding needs to be included. This suits me well, as tablite is often used to process data with various UTF-8 dialects.

> datetime64 and timedelta64, can optionally be stored in HDF5 opaque data using opaque_dtype(). h5py will read this data back with the same dtype, but non-python will probably not understand the datatype.

By sticking to the ISO 8601 format we can store Times as bytes and convert them on demand. Likewise for Dates, as we don't want to impose a false sense of time onto dates. With [Numba's `jit`](https://numba.pydata.org/) compiler this can probably be made very fast, though I'm not fond of the dependency.
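
As a sketch of that round trip, using only the standard library: values are written as ISO 8601 bytes and parsed back according to a datatype tag. The `encode`/`decode` helpers and the `PARSERS` lookup are hypothetical names, not part of tablite or h5py.

```python
from datetime import date, datetime, time


def encode(value):
    """Serialise a datetime, date or time as ISO 8601 bytes."""
    return value.isoformat().encode("utf-8")


# Hypothetical lookup from a stored 'datatype' tag to the matching parser.
PARSERS = {
    "datetime": datetime.fromisoformat,
    "date": date.fromisoformat,
    "time": time.fromisoformat,
}


def decode(raw, datatype):
    """Parse ISO 8601 bytes back into the Python type named by datatype."""
    return PARSERS[datatype](raw.decode("utf-8"))


stored = encode(date(2022, 3, 12))
restored = decode(stored, "date")
```

Because a date is parsed with `date.fromisoformat`, it comes back as a plain `date` and never acquires a time component.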

Finally, the data structure: HDF5 lets each column be its own dataset inside the file. This suits me very well, as all I have to keep track of is that the columns aren't distorted. Renaming a column is solved by manipulating the HDF5 references:

> `>>> myfile["two"] = myfile["one"]`
> `>>> del myfile["one"]`
> [Andrew](https://groups.google.com/g/h5py/c/rGqWfX-H4No)
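
A small runnable version of the rename quoted above, using an in-memory file so nothing touches the disk. The dataset names `one` and `two` follow the quote.

```python
import io

import h5py

with h5py.File(io.BytesIO(), "w") as f:
    f.create_dataset("one", data=[1, 2, 3])
    f["two"] = f["one"]  # new hard link to the same dataset
    del f["one"]         # drop the old name; the data is not copied
    names = list(f.keys())
    values = list(f["two"])
```

h5py groups also provide `move(source, dest)`, which should achieve the same rename in one call.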


The [modes](https://docs.h5py.org/en/stable/high/file.html?highlight=w%20#opening-creating-files) for creating a file with `h5py.File` are:

- `x` tries to create file, but will fail if the file exists.
- `a` tries read/write if exists, otherwise creates the file.
- `w` create file, truncate if it exists



I choose to create the file and fail if it already exists. For tablite I can raise an `IOError` and ask the user to load with `Table.from_file(....h5)`, or I can let the user create the table as `Table(use_disk="table.h5")`, meaning "load if exists, otherwise create."
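
A sketch of the "load if exists, otherwise create" behaviour. `open_table` is a hypothetical helper, not tablite's API, and it catches the broad `OSError` because the exact exception subclass raised for an existing file may differ between h5py versions.

```python
import pathlib
import tempfile

import h5py


def open_table(path):
    """Hypothetical helper: load the file if it exists, otherwise create it."""
    try:
        return h5py.File(path, "x")   # create, fail if the file exists
    except OSError:
        return h5py.File(path, "r+")  # read/write, file must exist


path = pathlib.Path(tempfile.mkdtemp()) / "table.h5"

first = open_table(path)   # file is created
first.create_dataset("column1", data=[1, 2, 3])
first.close()

second = open_table(path)  # file is loaded
columns = list(second.keys())
second.close()
```

Mode `a` gives the same result in a single call; the explicit version is useful when the two cases need different handling, such as raising an error for the user.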

In [18]:
hf = h5py.File(tmp/'table1.h5', 'x')  # Create file, fail if exists

I now want to create a dataset for a column of data and record the Python datatype in the dataset's metadata.

In [19]:
dset = hf.create_dataset('column1', data=list(range(10)))
dset.attrs['datatype'] = 'int'

I could use compression when creating the dataset, but as it adds delay to reads and writes that are otherwise linear, I question whether it's beneficial. For LZF compression on column1, the `create_dataset` call is extended with `compression="lzf"`:

`hf.create_dataset('column1', data=list(range(10)), compression="lzf")`

Andrew Collette recommends measuring the effect, as the benefit of compression depends on the dataset:

| type               | compress time | decompress time | compression |
|:-------------------|--------------:|----------------:|------------:|
| trivial data       |       18.6 ms |         17.8 ms |      96.65% |
| sine wave w. noise |       65.5 ms |         24.4 ms |      15.53% |
| random data        |       67.8 ms |         24.8 ms |       8.94% |
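
One way to take such a measurement yourself is sketched below. `measure` is a hypothetical helper that times only the write and reads the stored size from the low-level dataset id; decompression time is not measured, and the numbers will differ from the table above.

```python
import io
import time

import h5py
import numpy as np


def measure(data, **kwargs):
    """Return (seconds to write, bytes stored) for a single dataset."""
    with h5py.File(io.BytesIO(), "w") as f:
        start = time.perf_counter()
        dset = f.create_dataset("column", data=data, **kwargs)
        elapsed = time.perf_counter() - start
        return elapsed, dset.id.get_storage_size()


trivial = np.zeros(1_000_000, dtype=np.float64)  # highly compressible
raw_time, raw_size = measure(trivial)
lzf_time, lzf_size = measure(trivial, compression="lzf")

print(f"raw: {raw_size} bytes in {raw_time:.4f} s")
print(f"lzf: {lzf_size} bytes in {lzf_time:.4f} s")
```

Swapping `trivial` for a sample of the real column gives a quick indication of whether compression pays off for that data.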



In [20]:
dset = hf.create_dataset('column2', data=[str(i) for i  in range(10)])
dset.attrs['datatype'] = 'str'

In [21]:
hf.close()

Now that I've created the datasets and closed the file, I can reopen and inspect it.

In [22]:
hf = h5py.File(tmp/'table1.h5', 'r')  # Readonly, file must exist (default)

In [23]:
print(hf.keys())
for name, dset in hf.items():
    print(name, {k:v for k,v in dset.attrs.items()}, list(dset))

<KeysViewHDF5 ['column1', 'column2']>
column1 {'datatype': 'int'} [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
column2 {'datatype': 'str'} [b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9']


As you can see above, `column2`'s strings are encoded to bytes. To decode them I need:

In [24]:
c2 = hf.get('column2')
print(c2.attrs['datatype'], [i.decode("utf-8") for i in c2])

str ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
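
As an alternative to decoding each value by hand, and assuming h5py 3.0 or later, a string dataset can be wrapped with `asstr()` so that reads return `str` objects. A small in-memory sketch:

```python
import io

import h5py

with h5py.File(io.BytesIO(), "w") as f:
    dset = f.create_dataset("column2", data=[str(i) for i in range(3)])
    raw = list(dset)                 # bytes objects, as in the output above
    decoded = list(dset.asstr()[:])  # str objects, decoded as UTF-8
```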


And finally - always remember to close the file handle.

In [25]:
hf.close()

Time to append some data...

In [26]:
hf = h5py.File(tmp/'table1.h5', 'r+')  # Read/write, file must exist

In [27]:
dset = hf.create_dataset('column3', data=[float(i*10) for i  in range(10)])
dset.attrs['datatype'] = 'float'

In [28]:
from datetime import datetime, timedelta
now = datetime.now()
data = [(now.replace(microsecond=0) + timedelta(days=i)).isoformat() for i in range(10)]

In [29]:
data

['2022-03-12T10:18:59',
 '2022-03-13T10:18:59',
 '2022-03-14T10:18:59',
 '2022-03-15T10:18:59',
 '2022-03-16T10:18:59',
 '2022-03-17T10:18:59',
 '2022-03-18T10:18:59',
 '2022-03-19T10:18:59',
 '2022-03-20T10:18:59',
 '2022-03-21T10:18:59']

In [30]:
dset = hf.create_dataset('column4', data=data)
dset.attrs['datatype'] = 'datetime'

At this point I now have 4 columns:

In [31]:
hf.keys()

<KeysViewHDF5 ['column1', 'column2', 'column3', 'column4']>

In [32]:
for name, dset in hf.items():
    print(name, dict(dset.attrs), list(dset[:5]))

column1 {'datatype': 'int'} [0, 1, 2, 3, 4]
column2 {'datatype': 'str'} [b'0', b'1', b'2', b'3', b'4']
column3 {'datatype': 'float'} [0.0, 10.0, 20.0, 30.0, 40.0]
column4 {'datatype': 'datetime'} [b'2022-03-12T10:18:59', b'2022-03-13T10:18:59', b'2022-03-14T10:18:59', b'2022-03-15T10:18:59', b'2022-03-16T10:18:59']


To view this as rows I can gather and rotate a sensibly small sample:

In [33]:
L = []
for k, v in hf.items():
    L.append(v[:5])

def rotate(L):
    for row_ix,_ in enumerate(L[0]):
        yield tuple(L[col_ix][row_ix] for col_ix,_ in enumerate(L))

for row in rotate(L):
    print(row)

(0, b'0', 0.0, b'2022-03-12T10:18:59')
(1, b'1', 10.0, b'2022-03-13T10:18:59')
(2, b'2', 20.0, b'2022-03-14T10:18:59')
(3, b'3', 30.0, b'2022-03-15T10:18:59')
(4, b'4', 40.0, b'2022-03-16T10:18:59')
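
For columns that are already in memory, the same rotation can be written with the built-in `zip`. The sample values below are a shortened, hand-written stand-in for the columns read from the file.

```python
# Column-wise sample, one list per column (stand-in for the HDF5 datasets).
columns = [
    [0, 1, 2],
    [b"0", b"1", b"2"],
    [0.0, 10.0, 20.0],
]

# zip(*columns) pairs up the i-th value of every column into one row.
rows = list(zip(*columns))

for row in rows:
    print(row)
```

Note that `zip` stops at the shortest column, so it silently hides columns of unequal length, whereas the index-based `rotate` would raise an `IndexError`.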


Finally - always remember:

In [34]:
hf.close()

In [35]:
for file in tmp.iterdir():
    file.unlink()
tmp.rmdir()

## But what about datatypes?

Since numpy 1.20.0 the [datatypes](https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) are inferred from Python's type system.

So numpy detects the type automatically **if** the datatypes are **homogeneous**.

In [36]:
import numpy as np

In [37]:
L = [ 12345678.8765432, 234565432, True, "Fish"]

for i in L:
    npxed = np.array([i])
    print(type(npxed), npxed.dtype, npxed)

<class 'numpy.ndarray'> float64 [12345678.8765432]
<class 'numpy.ndarray'> int32 [234565432]
<class 'numpy.ndarray'> bool [ True]
<class 'numpy.ndarray'> <U4 ['Fish']


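(!) Be wary though: int32 is limited, and overflow errors are not uncommon. The int32 above is the Windows default for NumPy versions before 2.0; on 64-bit Linux and macOS the same value is inferred as int64.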
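
A short illustration of that overflow, and of sidestepping it by requesting `int64` explicitly. Array arithmetic wraps around silently, so the error is easy to miss.

```python
import numpy as np

big = 2**31 - 1  # largest value an int32 can hold

as_int32 = np.array([big], dtype=np.int32)
as_int64 = np.array([big], dtype=np.int64)  # explicit dtype avoids the limit

wrapped = as_int32 + np.array([1], dtype=np.int32)  # wraps around silently
safe = as_int64 + np.array([1], dtype=np.int64)

print(wrapped[0], safe[0])
```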

However, if the datatypes are **heterogeneous**, numpy falls back to the generic `object` dtype, which stores references to the original Python objects:

In [38]:
nn = np.array([1, 1.23, "fish", False, None])
print(type(nn), nn.dtype, "<-- numpy \"object\" type")
print(nn)
for v in nn:
    print(type(v),v)

<class 'numpy.ndarray'> object <-- numpy "object" type
[1 1.23 'fish' False None]
<class 'int'> 1
<class 'float'> 1.23
<class 'str'> fish
<class 'bool'> False
<class 'NoneType'> None


So how does HDF5 react to that?

I'm first going to create an HDF5 file in memory, and then poke it with some mixed datatypes.

In [39]:
import io
filehandle = io.BytesIO()
h5file = h5py.File(filehandle, "w")  # 'w' creates a fresh HDF5 file inside the buffer

In [40]:
try:
    h5file.create_dataset('column1', data=nn)
except TypeError as e:
    print(e)

Object dtype dtype('O') has no native HDF5 equivalent


So HDF5 does not have an equivalent datatype. What then? Can it at least handle `None`s?

In [41]:
data=[1,2,3,None,5,6]
try:
    h5file.create_dataset('column1', data=data)
except TypeError as e:
    print(e)

Object dtype dtype('O') has no native HDF5 equivalent


Nope. `None`s aren't allowed either. So the fallback option is to turn this mixed pot into bytes.

In [42]:
from collections import defaultdict
data=[1,2,3,None,5,6]*50_000  # 300_000 values.
try:
    dset = h5file.create_dataset('column1', data=data)
except TypeError:
    # count the Python types so the histogram can be stored with the column
    dtypes = defaultdict(int)
    for v in data:
        dtypes[type(v).__name__] += 1
    print("datatype was non HDF5, so utf-8 encoded bytes are used")
    dset = h5file.create_dataset('column1', data=[str(v) for v in data])
    dset.attrs['datatype'] = str(dtypes)

datatype was non HDF5, so utf-8 encoded bytes are used


In [43]:
for name, dset in h5file.items():
    print(name, dict(dset.attrs), list(dset[:5]))

column1 {'datatype': "defaultdict(<class 'int'>, {'int': 250000, 'NoneType': 50000})"} [b'1', b'2', b'3', b'None', b'5']


With this information I can apply tablite's type detection and use the histogram to guess the datatype.
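
A rough sketch of how such a histogram could guide decoding, using only the standard library. The `restore` function is a hypothetical stand-in; tablite's own type detection is not shown here. The histogram is serialised as JSON, which is an assumption made to keep it easy to parse back.

```python
import json
from collections import Counter

data = [1, 2, 3, None, 5, 6]

# Histogram of Python type names, serialised so it fits in a text attribute.
histogram = Counter(type(v).__name__ for v in data)
attr = json.dumps(histogram)

# What ends up in the dataset: every value as UTF-8 encoded bytes.
stored = [str(v).encode("utf-8") for v in data]


def restore(raw, histogram):
    """Hypothetical decoder: try only the types the histogram mentions."""
    text = raw.decode("utf-8")
    if "NoneType" in histogram and text == "None":
        return None
    if "int" in histogram:
        try:
            return int(text)
        except ValueError:
            pass
    return text  # fall back to the plain string


restored = [restore(raw, json.loads(attr)) for raw in stored]
```

One caveat of this approach: a genuine string `"None"` in a column that also contains `None` values would be read back as `None`, so real type detection needs a less ambiguous encoding for missing values.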

In [44]:
h5file.close()