# 3-Chunking in HDF5

> Objectives:
> * Explain the concept of data chunking
> * Show how to create and read datasets that are chunked
> * Learn how to choose reasonable chunk sizes for your datasets

The HDF5 library supports several layouts for storing datasets.

* Contiguous layout:
  ![Contiguous](img/dset_contiguous4x4.jpg)
  More compact and usually faster to read.  Typically used for small datasets (< 1 MB).
  
* Chunked layout:
  ![Chunked](img/dset_chunked4x4.jpg)
  Datasets can be enlarged and compressed, and can still be read quickly when a fast decompressor is used.  Typically used for large datasets.
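
As a minimal sketch of how the two layouts show up in `h5py`, the snippet below stores the same 4x4 array once with each layout and inspects the `chunks` attribute. The file name `layout_demo.h5`, the dataset names and the `(2, 2)` chunk shape are illustrative assumptions; the in-memory `core` driver is used only so that nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store): nothing is written to disk.
# The file name is only a label here.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(16, dtype=np.int64).reshape(4, 4)

    # No `chunks` argument: h5py stores the dataset contiguously.
    contiguous = f.create_dataset("contiguous", data=data)
    # Explicit chunk shape: the 4x4 array is split into four 2x2 chunks.
    chunked = f.create_dataset("chunked", data=data, chunks=(2, 2))

    contiguous_layout = contiguous.chunks  # None for a contiguous dataset
    chunked_layout = chunked.chunks        # the chunk shape

print(contiguous_layout, chunked_layout)
```

For a dataset that is not chunked, `chunks` is `None`; otherwise it holds the chunk shape.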

In [1]:
import numpy as np
import h5py

In [2]:
import os
import shutil
data_dir = "chunking"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## Creating chunked datasets

To make it easy to experiment with the chunksize, we define a function that creates three files holding the same dataset, each stored with a different layout:

In [6]:
def create_files(size, chunksize):
    data = np.arange(size, dtype=np.int64)

    # Contiguous array
    with h5py.File(os.path.join(data_dir, "continuous.h5"), "w") as f:
        f.create_dataset(data=data, name="data", dtype=np.int64)

    # Simple chunking
    with h5py.File(os.path.join(data_dir, "chunked.h5"), "w") as f:
        dset = f.create_dataset("data", (size,), chunks=(chunksize,), dtype=np.int64)
        dset[:] = data

    # Automatic chunking and unlimited resizing
    with h5py.File(os.path.join(data_dir, "automatic.h5"), "w") as f:
        dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype=np.int64)
        dset.resize((size,))
        dset[:] = data
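
The `automatic.h5` case relies on a resizable dataset, which is only possible with the chunked layout. As a hedged sketch of the "enlarge and compress" use case, the snippet below appends data in batches to a compressed dataset. The file name `append_demo.h5`, the batch size and the `(1024,)` chunk shape are assumptions chosen for illustration, and the in-memory `core` driver avoids touching the disk.

```python
import h5py
import numpy as np

with h5py.File("append_demo.h5", "w", driver="core", backing_store=False) as f:
    # Start empty; maxshape=(None,) makes the first axis unlimited.
    dset = f.create_dataset(
        "data", shape=(0,), maxshape=(None,), chunks=(1024,),
        dtype=np.int64, compression="gzip",
    )

    # Append three batches of 1000 values each.
    for start in range(0, 3000, 1000):
        batch = np.arange(start, start + 1000, dtype=np.int64)
        old_size = dset.shape[0]
        dset.resize((old_size + batch.size,))  # grow the dataset
        dset[old_size:] = batch                # fill the new tail

    final_shape = dset.shape
    roundtrip_ok = bool(np.array_equal(dset[:], np.arange(3000)))

print(final_shape, roundtrip_ok)
```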

In [7]:
create_files(size=1000, chunksize=100)

In [8]:
!h5ls -v {data_dir}/chunked.h5

Opened "chunking/chunked.h5" with sec2 driver.
data                     Dataset {1000/1000}
    Location:  1:800
    Links:     1
    Chunks:    {100} 800 bytes
    Storage:   8000 logical bytes, 8000 allocated bytes, 100.00% utilization
    Type:      native long long


In [9]:
!h5ls -v {data_dir}/automatic.h5

Opened "chunking/automatic.h5" with sec2 driver.
data                     Dataset {1000/Inf}
    Location:  1:800
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   8000 logical bytes, 8192 allocated bytes, 97.66% utilization
    Type:      native long long
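
The numbers reported by `h5ls` above can be reproduced with a little arithmetic. The sketch below assumes an uncompressed 1-D dataset in which every chunk is allocated in full; `chunk_storage` is a hypothetical helper written for this illustration, not part of `h5py`.

```python
import math

def chunk_storage(n_elements, chunk_elements, itemsize):
    """Return (chunk_bytes, allocated_bytes, utilization) for a 1-D dataset.

    Assumes no compression and that every chunk is allocated in full.
    """
    chunk_bytes = chunk_elements * itemsize
    n_chunks = math.ceil(n_elements / chunk_elements)
    allocated = n_chunks * chunk_bytes
    logical = n_elements * itemsize
    return chunk_bytes, allocated, logical / allocated

# 1000 int64 values (8 bytes each), with the two chunk sizes seen above.
for label, chunk in (("chunked.h5", 100), ("automatic.h5", 1024)):
    chunk_bytes, allocated, utilization = chunk_storage(1000, chunk, 8)
    print(f"{label}: {chunk_bytes} bytes/chunk, "
          f"{allocated} allocated, {utilization:.2%}")
# chunked.h5: 800 bytes/chunk, 8000 allocated, 100.00%
# automatic.h5: 8192 bytes/chunk, 8192 allocated, 97.66%
```

With a chunk of 1024 elements a single chunk holds all 1000 values, so 24 elements (192 bytes) are allocated but unused.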


In [12]:
!ls -lh chunking

total 36K
-rw-r--r-- 1 tomkooij 197613 12K Jun 22 10:27 automatic.h5
-rw-r--r-- 1 tomkooij 197613 12K Jun 22 10:27 chunked.h5
-rw-r--r-- 1 tomkooij 197613 10K Jun 22 10:27 continuous.h5


# Automatic chunksize in h5py and PyTables

Both `h5py` and `PyTables` can automatically choose a (sane) chunksize. (Both use the same algorithm.)

In `h5py`, pass `chunks=True` or use the `maxshape` kwarg: `create_dataset(..., maxshape=(N, ...))`

In `PyTables`, use the `expectedrows` kwarg: `create_table(..., expectedrows=N)`

**Rule of thumb**: provide `expectedrows` for datasets larger than 10 MB.

## Reading chunked datasets

In [13]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][:]

reading continuous.h5...
861 µs ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading chunked.h5...
888 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading automatic.h5...
838 µs ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Exercise 1

In the example above, set `size` to 10 million and choose the smallest `chunksize` that still gives a reasonable file size and read speed.

In [14]:
#
#
# SOLUTION STARTS HERE
#
#

In [15]:
create_files(size=1000*1000*10, chunksize=100000)
!ls -lh chunking/*.h5

-rw-r--r-- 1 tomkooij 197613 77M Jun 22 10:28 chunking/automatic.h5
-rw-r--r-- 1 tomkooij 197613 77M Jun 22 10:28 chunking/chunked.h5
-rw-r--r-- 1 tomkooij 197613 77M Jun 22 10:28 chunking/continuous.h5


In [16]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][:]

reading continuous.h5...
70 ms ± 8.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
reading chunked.h5...
87.9 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
reading automatic.h5...
193 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Exercise 2

Using the 10-million-element datasets above, retrieve just a small slice (say `[10000:30000]`) from each and measure how long the read takes.  Do you think that the whole dataset needs to be read in every case?

In [17]:
#
#
# SOLUTION STARTS HERE
#
#

In [18]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][10000:30000]

reading continuous.h5...
946 µs ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading chunked.h5...
1.23 ms ± 90.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading automatic.h5...
1.15 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Choosing the right chunksize

Chunks are **atomic** objects in the HDF5 library: accessing any element of a chunk requires the entire chunk to be read or written. (The exception is an *uncompressed* chunk that is larger than the chunkcache, which is read or written partially.)

**Caution**: Although choosing the chunksize (and especially the chunk shape) can improve performance in some cases,
**choosing the wrong chunksize can drastically decrease performance**.

Note that:
- Small chunks: every chunk incurs its own read/write overhead (although the chunkcache helps). Do not use chunks smaller than the disk block size (4k/8k).
- Large chunks: accessing a single item requires reading a large chunk from disk.
- The chunkcache is 1 MB (`h5py`) or 2 MB (`pytables`) by default. Do your chunks fit in the cache? Does a row of chunks fit in the cache?
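- The chunkcache cannot be (easily) disabled in `pytables` and `h5py`. Recent `h5py` versions do allow it to be resized through the `rdcc_nbytes` keyword argument of `h5py.File`.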
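
Because chunks are read as a whole, the cost of an access pattern can be estimated before any file is written. The sketch below counts the chunks and bytes touched when reading one row or one column of a 2-D dataset. `read_cost` is a hypothetical helper, and the chunk shapes are illustrative assumptions; the estimate ignores caching, compression and partial edge chunks.

```python
import math

def read_cost(shape, chunks, itemsize, axis):
    """Chunks and bytes touched when reading one full line along `axis`.

    axis=1 reads a row (dataset[i]); axis=0 reads a column (dataset[:, i]).
    Every touched chunk is counted as read in full.
    """
    n_chunks = math.ceil(shape[axis] / chunks[axis])
    chunk_bytes = chunks[0] * chunks[1] * itemsize
    return n_chunks, n_chunks * chunk_bytes

shape, itemsize = (10000, 1000), 2  # 10000 x 1000 values of type int16
for chunks in ((5000, 10), (100, 100), (10, 1000)):
    row = read_cost(shape, chunks, itemsize, axis=1)
    col = read_cost(shape, chunks, itemsize, axis=0)
    print(chunks, "row:", row, "column:", col)
# (5000, 10) row: (100, 10000000) column: (2, 200000)
# (100, 100) row: (10, 200000) column: (100, 2000000)
# (10, 1000) row: (1, 20000) column: (1000, 20000000)
```

Tall, narrow chunks favour column reads and wide, flat chunks favour row reads; square chunks are a compromise between the two.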


### Exercise 3

Investigate the difference between reading a dataset by row (`dataset[i]`) and by column (`dataset[:, i]`).

 ![Contiguous](img/chunking-contiguous.png)
Remember that the HDF5 library is a C library: it stores data by rows (row-major order), so rows are fast and columns are slow.
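
NumPy uses the same row-major convention by default, so an in-memory array gives a feel for why the two access patterns differ. The sketch below is only an analogy for the on-disk layout of a contiguous dataset, not a measurement of HDF5 itself.

```python
import numpy as np

a = np.ones((10000, 1000), dtype=np.int16)

# Bytes to step to the next row and to the next column.
print(a.strides)  # (2000, 2)

row = a[2]        # 1000 neighbouring values: one contiguous block of memory
column = a[:, 2]  # 10000 values, each 2000 bytes apart

print(row.strides, row.flags.c_contiguous)        # (2,) True
print(column.strides, column.flags.c_contiguous)  # (2000,) False
```

On disk, each of those 2000-byte jumps corresponds to a separate seek.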

We start off with a contiguous dataset. Try to improve on it by choosing a chunksize.

Try automatic chunksize first.

In [21]:
FILENAME = os.path.join(data_dir, "chunksize.h5")
f = h5py.File(FILENAME, 'w')

In [22]:
shape = (10000, 1000)

Contiguous

In [23]:
dset = f.create_dataset('contiguous', shape, dtype=np.int16)
dset[:] = np.ones(shape)

In [24]:
%timeit dset[2]  # read row

154 µs ± 5.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Reading a row is fast: 1 seek, 1 read operation.

Reading a column is slow: many seek operations, 1 read operation per row. (See figure above):

In [27]:
%timeit dset[:, 2]  # read column

8.08 ms ± 849 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Can you improve this?

Try automatic chunksize first:

In [None]:
#
#
# SOLUTION STARTS HERE
#
#

In [None]:
dset = f.create_dataset('auto', shape,  chunks=True, dtype=np.int16)
dset[:] = np.ones(shape)

In [None]:
dset.chunks

In [None]:
%timeit dset[2]

In [None]:
%timeit dset[:, 2]

Columnar chunks

 ![Columnar chunks](img/chunking-colchunks.png)


In [None]:
dset = f.create_dataset('col_chunks', shape, chunks=(5000, 10), dtype=np.int16)
dset[:] = np.ones(shape)

In [None]:
%timeit dset[2]

In [None]:
%timeit dset[:, 2]

In [None]:
f.close()