# Part 3A: Chunking in HDF5

> Objectives:
> * Explain the concept of data chunking
> * Show how to create and read datasets that are chunked
> * Learn how to choose reasonable chunk sizes for your datasets

The HDF5 library supports several layouts for storing datasets.

* Contiguous layout:
  ![Contiguous](img/dset_contiguous4x4.jpg)
  More compact and usually faster to read.  Typically used for small datasets (< 1 MB).
  
* Chunked layout:
  ![Chunked](img/dset_chunked4x4.jpg)
  Datasets can be enlarged and compressed.  Reads can still be fast when a fast decompressor is used.  Typically used for large datasets.
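
As a quick way to check which layout a dataset uses, h5py exposes the chunk shape through the `chunks` attribute, which is `None` for a contiguous dataset. The sketch below is only an illustration: the file and dataset names are made up, and it uses an in-memory file (the `core` driver with `backing_store=False`) so that nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory HDF5 file: the name is a placeholder, nothing reaches the disk.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    # No chunks argument: the dataset is stored contiguously.
    contiguous = f.create_dataset("contiguous", data=np.arange(16).reshape(4, 4))

    # Explicit chunk shape plus maxshape: chunked and resizable along axis 0.
    chunked = f.create_dataset(
        "chunked", (4, 4), chunks=(2, 2), maxshape=(None, 4), dtype=np.int64
    )

    contiguous_chunks = contiguous.chunks  # None for the contiguous layout
    chunked_chunks = chunked.chunks        # the chunk shape requested above

    # Enlarging the dataset is possible because it is chunked.
    chunked.resize((8, 4))
    new_shape = chunked.shape

print(contiguous_chunks, chunked_chunks, new_shape)
```

Calling `resize` on the contiguous dataset would raise an error instead, since only chunked datasets can change shape.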

In [2]:
import numpy as np
import h5py

In [3]:
import os
import shutil
data_dir = "chunking"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## Creating chunked datasets

In [4]:
def create_files(size, chunksize):
    data = np.arange(size, dtype=np.int64)

    # Contiguous array
    with h5py.File(os.path.join(data_dir, "continuous.h5"), "w") as f:
        f.create_dataset("data", data=data, dtype=np.int64)

    # Simple chunking
    with h5py.File(os.path.join(data_dir, "chunked.h5"), "w") as f:
        dset = f.create_dataset("data", (size,), chunks=(chunksize,), dtype=np.int64)
        dset[:] = data

    # Automatic chunking and unlimited resizing
    with h5py.File(os.path.join(data_dir, "automatic.h5"), "w") as f:
        dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype=np.int64)
        dset.resize((size,))
        dset[:] = data

In [5]:
create_files(size=1000, chunksize=100)

In [6]:
!h5ls -v {data_dir}/chunked.h5

Opened "chunking/chunked.h5" with sec2 driver.
data                     Dataset {1000/1000}
    Location:  1:800
    Links:     1
    Chunks:    {100} 800 bytes
    Storage:   8000 logical bytes, 8000 allocated bytes, 100.00% utilization
    Type:      native long long


In [7]:
!h5ls -v {data_dir}/automatic.h5

Opened "chunking/automatic.h5" with sec2 driver.
data                     Dataset {1000/Inf}
    Location:  1:800
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   8000 logical bytes, 8192 allocated bytes, 97.66% utilization
    Type:      native long long
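
The storage figures reported by `h5ls` can be reproduced by hand. The sketch below assumes that every chunk is allocated in full, even when it is only partly filled, and that no compression is applied; `storage_report` is a hypothetical helper name, not part of h5py.

```python
import math

def storage_report(n_elements, chunk_len, itemsize=8):
    """Return (logical bytes, allocated bytes, utilization in %) for a 1-D chunked dataset.

    Assumes whole chunks are allocated and no compression filter is used.
    """
    logical = n_elements * itemsize
    n_chunks = math.ceil(n_elements / chunk_len)
    allocated = n_chunks * chunk_len * itemsize
    return logical, allocated, 100 * logical / allocated

# chunked.h5: 1000 int64 elements in chunks of 100
chunked_report = storage_report(1000, 100)
# automatic.h5: 1000 int64 elements in a single chunk of 1024
automatic_report = storage_report(1000, 1024)

for name, (logical, allocated, pct) in (("chunked", chunked_report),
                                        ("automatic", automatic_report)):
    print(f"{name}: {logical} logical bytes, {allocated} allocated bytes, {pct:.2f}% utilization")
```

With the automatically chosen chunk of 1024 elements, the last 24 slots of the single chunk are allocated but unused, which is where the 97.66% comes from.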


In [8]:
%ls -l chunking

 Volume in drive D is Data
 Volume Serial Number is 2ACD-5F91

 Directory of D:\SciPy2017


 Directory of D:\SciPy2017\chunking

19-06-2017  09:56    <DIR>          .
19-06-2017  09:56    <DIR>          ..
19-06-2017  09:56            11.688 automatic.h5
19-06-2017  09:56            11.496 chunked.h5
19-06-2017  09:56            10.144 continuous.h5
               3 File(s)         33.328 bytes
               2 Dir(s)  13.246.529.536 bytes free


File Not Found


## Reading chunked datasets

In [18]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][:]

reading continuous.h5...
859 µs ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading chunked.h5...
908 µs ± 54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading automatic.h5...
903 µs ± 93.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
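
The timing loop opens each file without closing it, which is convenient for a benchmark but leaves file handles open. Outside of a benchmark, a context manager is the usual way to read a chunked dataset. The following is a minimal sketch, assuming a temporary directory and a made-up file name; it is not one of the files created above.

```python
import os
import tempfile

import numpy as np
import h5py

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunked_demo.h5")  # hypothetical file name

    # Write 1000 int64 values in chunks of 100 elements.
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.arange(1000, dtype=np.int64), chunks=(100,))

    # Open read-only; the file is closed when the block exits.
    with h5py.File(path, "r") as f:
        dset = f["data"]
        chunk_shape = dset.chunks
        whole = dset[:]        # reads every chunk
        part = dset[250:260]   # lies entirely inside the chunk holding elements 200..299

print(chunk_shape, whole.shape, part)
```

Reading is the same for contiguous and chunked datasets: the layout is transparent to the slicing syntax and only affects how much data the library has to fetch.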


### Exercise 1

In the example above, set `size` to 10 million and choose the smallest `chunksize` that still gives a reasonable file size and read speed.

In [19]:
create_files(size=1000*1000*10, chunksize=100000)
!ls -lh chunking/*.h5

-rw-r--r-- 1 tomkooij 197613 77M Jun 19 09:58 chunking/automatic.h5
-rw-r--r-- 1 tomkooij 197613 77M Jun 19 09:58 chunking/chunked.h5
-rw-r--r-- 1 tomkooij 197613 77M Jun 19 09:58 chunking/continuous.h5


In [20]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][:]

reading continuous.h5...
68 ms ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
reading chunked.h5...
89.3 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
reading automatic.h5...
210 ms ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Exercise 2

Using the 10-million-element datasets above, retrieve just a small slice (say `[10000:30000]`) from each and measure how long the read takes.  Do you think that the whole dataset needs to be read in any case?

In [11]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file))['data'][10000:30000]

reading continuous.h5...
707 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading chunked.h5...
720 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
reading automatic.h5...
724 µs ± 4.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Choosing the right chunk size

Chunks are **atomic** objects in the HDF5 library: to access any element of a chunk, the entire chunk has to be read or written.
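
To make this cost concrete, the sketch below estimates how many bytes the library has to handle for a 1-D slice. It is a rough model only: it assumes every touched chunk is transferred in full, with no compression and no help from the chunk cache, and the helper names are hypothetical.

```python
def chunks_touched_1d(start, stop, chunk_len):
    """Number of chunks overlapped by the half-open slice [start:stop]."""
    if stop <= start:
        return 0
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return last - first + 1

def bytes_touched_1d(start, stop, chunk_len, itemsize=8):
    """Bytes in all chunks overlapped by [start:stop], assuming whole-chunk I/O."""
    return chunks_touched_1d(start, stop, chunk_len) * chunk_len * itemsize

# A single int64 element (8 bytes requested) with two different chunk lengths.
single_small = bytes_touched_1d(5, 6, 100)
single_large = bytes_touched_1d(5, 6, 100_000)

# The slice [10000:30000] (160000 bytes requested).
slice_small = bytes_touched_1d(10_000, 30_000, 100)
slice_large = bytes_touched_1d(10_000, 30_000, 100_000)

print(single_small, single_large, slice_small, slice_large)
```

Under these assumptions, large chunks make small reads expensive because much more data is touched than requested, while very small chunks keep the byte count tight at the price of many more chunks to look up.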

Make sure you:
- a 
- b
- c


### Exercise 3

Investigate the difference between reading a dataset by row (`dataset[i]`) and by column (`dataset[:, i]`). Remember that the HDF5 library is a C library: it stores data by rows, so reading rows is fast and reading columns is slow.

We start off with a contiguous dataset. Try to improve on it by choosing a chunk size.

Try automatic chunksize first.

In [77]:
FILENAME = os.path.join(data_dir, "chunksize.h5")
f = h5py.File(FILENAME, 'w')

In [78]:
shape = (10000, 1000)

Contiguous

In [79]:
dset = f.create_dataset('contiguous', shape, dtype=np.int16)
dset[:] = np.ones(shape)

In [80]:
%timeit dset[2]

159 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [81]:
%timeit dset[:, 2]

6.95 ms ± 95.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Automatic chunk size

In [82]:
#
#
# SOLUTION STARTS HERE
#
#

In [84]:
dset = f.create_dataset('auto', shape,  chunks=True, dtype=np.int16)
dset[:] = np.ones(shape)

In [85]:
dset.chunks

(313, 63)

In [86]:
%timeit dset[2]

183 µs ± 5.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [87]:
%timeit dset[:, 2]

2.5 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Columnar chunks

In [88]:
dset = f.create_dataset('col_chunks', shape, chunks=(5000, 10), dtype=np.int16)
dset[:] = np.ones(shape)

In [89]:
%timeit dset[2]

4.33 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [90]:
%timeit dset[:, 2]

1.68 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
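
One way to reason about these timings is to count how many chunks a row read and a column read have to cross for each chunk shape. The sketch below does this arithmetic for the two chunk shapes used above. It assumes whole chunks are read, ignores the chunk cache and any per-chunk overhead, and `access_cost` is a hypothetical helper, so treat the numbers as a rough guide rather than a prediction of the timings.

```python
import math

shape = (10000, 1000)
itemsize = 2  # bytes per int16 element

def access_cost(chunks):
    """Chunks and bytes touched by dset[i] (row) and dset[:, i] (column)."""
    rows_per_chunk, cols_per_chunk = chunks
    chunk_bytes = rows_per_chunk * cols_per_chunk * itemsize
    row_chunks = math.ceil(shape[1] / cols_per_chunk)  # chunks crossed by one row
    col_chunks = math.ceil(shape[0] / rows_per_chunk)  # chunks crossed by one column
    return {
        "chunk_bytes": chunk_bytes,
        "row_chunks": row_chunks,
        "row_bytes": row_chunks * chunk_bytes,
        "col_chunks": col_chunks,
        "col_bytes": col_chunks * chunk_bytes,
    }

auto_cost = access_cost((313, 63))     # chunk shape reported by dset.chunks
col_cost = access_cost((5000, 10))     # the columnar chunk shape

for name, cost in (("auto", auto_cost), ("col_chunks", col_cost)):
    print(name, cost)
```

The counts point in the same direction as the measurements: the tall, narrow chunks cut a column read down to two chunks but force a row read through a hundred of them, while the automatic shape sits in between for both access patterns.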


In [91]:
f.close()