## Manipulating large arrays with HDF5

NumPy arrays can be saved in the **Hierarchical Data Format (HDF5)**. An HDF5 file contains one or more datasets (arrays or heterogeneous tables) organized into a POSIX-like hierarchy. Datasets can be accessed lazily: only the slices that are requested are read from disk.

`h5py` is a Python package for working with HDF5 files through a NumPy-like programming interface.

HDF5 is especially useful when many arrays need to be saved in a single file. It is typically used in large projects where large arrays have to be organized within a hierarchical structure. For example, it is widely used at NASA and other scientific institutions. Researchers can store data recorded across multiple devices, trials, and experiments.

In HDF5, the data is organized within a tree. Nodes are either **groups** (analogous to folders in a file system) or **datasets** (analogous to files). A group can contain subgroups and datasets, whereas a dataset only contains data. Both groups and datasets can carry attributes (metadata) that have a basic data type (integer, floating-point number, string, and so on).
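The tree structure described above can be sketched with a small throwaway file. This is only an illustration: the file name, the group and dataset names, and the attribute values are hypothetical. It relies on `h5py` creating intermediate groups automatically and on `visititems` to walk every node.

```python
import os
import tempfile

import h5py
import numpy as np

nodes = []  # (path, kind, attributes) for every node found in the file

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmpdir, "tree_demo.h5")

    with h5py.File(path, "w") as f:
        # Creating a nested group also creates the intermediate group.
        trial = f.create_group("experiment1/trial1")
        trial.attrs["temperature"] = 21.5       # attribute on a group
        dset = trial.create_dataset("signal", data=np.arange(10))
        dset.attrs["unit"] = "mV"               # attribute on a dataset

    with h5py.File(path, "r") as f:
        def record(name, obj):
            # Called once per node; returning None continues the traversal.
            kind = "group" if isinstance(obj, h5py.Group) else "dataset"
            nodes.append((name, kind, dict(obj.attrs)))

        f.visititems(record)

for name, kind, attrs in nodes:
    print(name, kind, attrs)
```

Each printed line shows a node path, whether it is a group or a dataset, and its attributes, which mirrors the folder/file analogy.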

In HDF5, a dataset may be stored as a single **contiguous** block in the file, or in **chunks**. Chunks are atomic objects: HDF5 can only read and write entire chunks. Chunks are internally indexed with a tree data structure called a B-tree. When we create a new dataset, we can specify the chunk shape.
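As a minimal sketch of the two layouts, the following creates one contiguous and one chunked dataset with `h5py`. The file name, dataset names, and the `(100, 100)` chunk shape are arbitrary choices made for illustration, not recommendations.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(1000 * 1000, dtype=np.float64).reshape(1000, 1000)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmpdir, "chunk_demo.h5")

    with h5py.File(path, "w") as f:
        # Without the `chunks` argument (and without compression or
        # resizing options), the dataset is stored contiguously.
        contiguous = f.create_dataset("contiguous", data=data)
        # Explicit chunk shape: the array is split into 100 x 100 blocks.
        chunked = f.create_dataset("chunked", data=data, chunks=(100, 100))

        contiguous_layout = contiguous.chunks   # None for a contiguous dataset
        chunk_shape = chunked.chunks            # the chunk shape as a tuple

    with h5py.File(path, "r") as f:
        # This selection is aligned with the chunk grid, so it lies
        # within a single chunk.
        block = f["chunked"][0:100, 0:100]

print(contiguous_layout, chunk_shape, block.shape)
```

Since whole chunks are read and written, a chunk shape that matches the typical access pattern (for example, blocks of rows when the data is read row by row) tends to avoid touching more chunks than necessary.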

In [2]:
import numpy as np
import h5py

In [3]:
f = h5py.File('myfile.h5', 'w')
f.create_group('/experiment1')

<HDF5 group "/experiment1" (0 members)>

In [4]:
f['/experiment1'].attrs['date'] = '2018-01-01'

In [5]:
x = np.random.rand(1000, 1000)
f['/experiment1'].create_dataset('array1', data=x)

<HDF5 dataset "array1": shape (1000, 1000), type "<f8">

In [6]:
f.close()

In [7]:
f = h5py.File('myfile.h5', 'r')
f['/experiment1'].attrs['date']

'2018-01-01'

In [8]:
y = f['/experiment1/array1']
type(y)

h5py._hl.dataset.Dataset

In [9]:
np.array_equal(x[0, :], y[0, :])

True

In [10]:
f.close()

In [11]:
import os
os.remove('myfile.h5')
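The session above opens and closes the file explicitly. A possible variant, sketched here with a hypothetical file name, uses `with` blocks so that the file is closed even if an error occurs, and reads only a slice of the dataset to show that an `h5py` dataset is not a NumPy array until it is indexed.

```python
import os
import tempfile

import h5py
import numpy as np

x = np.random.rand(1000, 1000)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmpdir, "lazy_demo.h5")

    # The file is closed automatically when the block exits.
    with h5py.File(path, "w") as f:
        # The intermediate group "experiment1" is created automatically.
        f.create_dataset("experiment1/array1", data=x)

    with h5py.File(path, "r") as f:
        dset = f["experiment1/array1"]
        # The dataset object is a handle to the data on disk, not an ndarray.
        is_ndarray = isinstance(dset, np.ndarray)
        # Indexing reads the requested rows and returns a NumPy array.
        first_rows = dset[:10, :]

rows_match = np.array_equal(first_rows, x[:10, :])
print(is_ndarray, first_rows.shape, rows_match)
```

Arrays obtained by slicing remain valid after the file is closed, whereas the dataset handle itself does not.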