# 1 Structuring Datasets

>Objectives:
>
> * How to use a hierarchy to structure datasets inside the same file
> * Use the hierarchy in h5py and PyTables
> * Interactive auto-completion
> * How to use attributes in h5py and PyTables

## Using the Hierarchy

In HDF5, all nodes stem from a root ("/").  The nodes can be either `Groups` or `Datasets` (also known as `Leaves` in PyTables).  `Groups` are the equivalent of directories on a filesystem and can contain `Datasets` or other `Groups`.  A `Dataset` is a container for data.

The HDF5 file that was shown in the introduction has the following layout:
```
/                        Group
/hisparc                 Group
/hisparc/cluster_aarhus  Group
/hisparc/cluster_aarhus/station_20002 Group
/hisparc/cluster_aarhus/station_20002/blobs Dataset {41496/Inf}
/hisparc/cluster_aarhus/station_20002/events Dataset {20748/Inf}
/hisparc/cluster_amsterdam Group
/hisparc/cluster_amsterdam/station_101 Group
/hisparc/cluster_amsterdam/station_101/blobs Dataset {96688/Inf}
/hisparc/cluster_amsterdam/station_101/events Dataset {24172/Inf}
/hisparc/cluster_amsterdam/station_102 Group
/hisparc/cluster_amsterdam/station_102/blobs Dataset {89792/Inf}
/hisparc/cluster_amsterdam/station_102/config Dataset {1/Inf}
/hisparc/cluster_amsterdam/station_102/events Dataset {44895/Inf}
/hisparc/cluster_amsterdam/station_104 Group
[...]
```

Paths to `Groups` and `Datasets` in the HDF5 hierarchy resemble POSIX-style paths.
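As a minimal sketch of how such a listing can be produced, the snippet below builds a small file with `h5py` and walks it with `visititems()`. The file name and the `cluster_demo`/`station_*` node names are hypothetical and only serve as an illustration; they are not part of the file from the introduction.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name and node names, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "paths_demo.h5")

layout = []


def record(name, obj):
    """Store the absolute path and the kind of every visited node."""
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    layout.append(("/" + name, kind))


with h5py.File(path, "w") as f:
    # Missing intermediate groups in a POSIX-style path are created automatically.
    f.create_dataset("/hisparc/cluster_demo/station_1/events", data=np.arange(5))
    f.create_group("/hisparc/cluster_demo/station_2")

    # visititems() calls record() for every group and dataset below the root.
    f.visititems(record)

for node_path, kind in layout:
    print(f"{node_path:45s}{kind}")
```

The names passed to the callback are relative to the group being visited, which is why the leading "/" is added by hand.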


In [1]:
import numpy as np
import tables
import h5py

In [2]:
import os
import shutil
data_dir = "structuring"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## PyTables

`pytables` provides high-level access to the HDF5 library from Python through a Pythonic API:

In [3]:
import tables

Create an HDF5 file:

In [4]:
FILENAME = os.path.join(data_dir, "layout.h5")
f = tables.open_file(FILENAME, "w")

Create a group:

In [5]:
group = f.create_group('/', 'a_group')
group

/a_group (Group) ''
  children := []

Inside this group we can create many datasets:

In [6]:
f.create_array(group, "my_array1", np.arange(10))
f.create_array(group, "my_array2", np.ones(100).reshape(10, 10));

or another group:

In [7]:
f.create_group('/a_group', 'another_group')

/a_group/another_group (Group) ''
  children := []

Let's look at the structure of the HDF5 file:

In [8]:
print(f)

structuring\layout.h5 (File) ''
Last modif.: 'Tue Jun 27 08:45:14 2017'
Object Tree: 
/ (RootGroup) ''
/a_group (Group) ''
/a_group/my_array1 (Array(10,)) ''
/a_group/my_array2 (Array(10, 10)) ''
/a_group/another_group (Group) ''



With groups and datasets, you can give your data whatever hierarchy best fits your needs.
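Deeper hierarchies do not have to be built one group at a time. The sketch below assumes a hypothetical layout (`/run_1/detector_a`, `/run_1/detector_b`) and uses the `createparents` argument of PyTables to create the missing groups in one call, then lists the result with `walk_nodes()`.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file and node names, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "nested_demo.h5")

with tables.open_file(path, "w") as f:
    # createparents=True creates the missing groups /run_1 and /run_1/detector_a.
    f.create_array("/run_1/detector_a", "counts", np.arange(3), createparents=True)
    f.create_group("/run_1", "detector_b")

    # walk_nodes() iterates recursively over the nodes under the given path.
    node_paths = [node._v_pathname for node in f.walk_nodes("/")]

print(node_paths)
```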

### Natural naming in PyTables

In PyTables, you may access nodes as attributes of a Python object, for example `f.root.a_group.some_data`.  This is known as natural naming.

In [9]:
f.root.a_group.my_array1

/a_group/my_array1 (Array(10,)) ''
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

Compare this with `File.get_node()`, which takes the path as a string:

In [10]:
f.get_node('/a_group/my_array1')

/a_group/my_array1 (Array(10,)) ''
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
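The two access styles can be put side by side. The following sketch uses a throwaway file (the temporary path is an assumption) and checks that natural naming and `get_node()` resolve to the same node path and the same data. `get_node()` is the more convenient form when the node name is only known at run time, for instance when it is held in a variable.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical temporary file, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "naming_demo.h5")

with tables.open_file(path, "w") as f:
    f.create_array("/a_group", "my_array1", np.arange(10), createparents=True)

    # Natural naming: walk the tree through Python attributes.
    by_attribute = f.root.a_group.my_array1

    # Path lookup: pass the full path, or the parent path plus a name.
    by_path = f.get_node("/a_group/my_array1")
    node_name = "my_array1"  # e.g. a name computed at run time
    by_parts = f.get_node("/a_group", node_name)

    same_path = by_attribute._v_pathname == by_path._v_pathname == by_parts._v_pathname
    values = by_parts.read()

print(same_path, values)
```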

Natural naming supports `<TAB>` completion:

In [11]:
f.root.a_group

/a_group (Group) ''
  children := ['my_array1' (Array), 'my_array2' (Array), 'another_group' (Group)]

In [12]:
f.close()

## h5py

`h5py` provides high-level access to the HDF5 library while staying as close as possible to Python `dict`-like objects and `numpy` arrays. Let's look at the HDF5 hierarchy in `h5py`:

In [13]:
import h5py

In [14]:
f = h5py.File(FILENAME, 'a')

In [15]:
list(f)

['a_group']

The `h5py.File` object acts as a dictionary, which exposes the groups and datasets:

In [16]:
f['/a_group']

<HDF5 group "/a_group" (3 members)>

Using the `dict`-like interface of the `h5py.File` object, we can view and access its members:

In [17]:
grp = f['/a_group']
list(grp.items())

[('another_group', <HDF5 group "/a_group/another_group" (0 members)>),
 ('my_array1', <HDF5 dataset "my_array1": shape (10,), type "<i4">),
 ('my_array2', <HDF5 dataset "my_array2": shape (10, 10), type "<f8">)]

Or just call `list()` on the group to get the names of its members:

In [18]:
list(grp)

['another_group', 'my_array1', 'my_array2']
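A few more `dict`-style operations are often useful when navigating a file. The sketch below recreates a small group in a temporary file (the path is an assumption) and shows membership tests, `get()` with its default of `None`, and `require_group()`, which returns an existing group instead of raising an error.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("a_group")
    grp.create_dataset("my_array1", data=np.arange(10))

    # Membership tests work on names and on full paths.
    has_array = "my_array1" in grp
    has_nested = "/a_group/my_array1" in f

    # get() returns None (or a given default) instead of raising KeyError.
    missing = f.get("/a_group/not_there")

    # require_group() opens the group if it exists and creates it otherwise.
    same_group = f.require_group("a_group")
    member_names = sorted(same_group.keys())

    # Datasets support numpy-style slicing.
    first_three = grp["my_array1"][:3]

print(has_array, has_nested, missing, member_names, first_three)
```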

`<TAB>` completion must be enabled in h5py:

In [19]:
h5py.enable_ipython_completer()

In [20]:
# use <TAB> completion:
f['/a_group/my_array1']

<HDF5 dataset "my_array1": shape (10,), type "<i4">

In [21]:
f.create_group('/a_group/YAG')

<HDF5 group "/a_group/YAG" (0 members)>

In [22]:
f.close()

## Attributes

Attributes are small named pieces of data attached directly to `Group` and `Dataset` objects. Attributes are what make HDF5 a “self-describing” format.  They are the official way to store metadata in HDF5.

Investigate the way `h5py` and `PyTables` store and load attributes:

```
# h5py: .attrs dict-like interface
f['some_group'].attrs['name'] = ...

# pytables: f.get_node_attr() and f.set_node_attr()
```
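As a starting point, here is a minimal sketch of each interface on its own, using two separate temporary files. The file names, attribute names and values are hypothetical. Reading attributes written by one package with the other is left for the assignment.

```python
import os
import tempfile

import h5py
import tables

tmp_dir = tempfile.mkdtemp()

# h5py: attributes live in the dict-like .attrs of a group or dataset.
h5py_path = os.path.join(tmp_dir, "attrs_h5py.h5")  # hypothetical file name
with h5py.File(h5py_path, "w") as f:
    grp = f.create_group("a_group")
    grp.attrs["station"] = "demo"
    grp.attrs["n_detectors"] = 4

with h5py.File(h5py_path, "r") as f:
    h5_attrs = dict(f["a_group"].attrs)

# PyTables: set and get attributes through the File object ...
tables_path = os.path.join(tmp_dir, "attrs_tables.h5")  # hypothetical file name
with tables.open_file(tables_path, "w") as f:
    f.create_group("/", "a_group")
    f.set_node_attr("/a_group", "units", "ns")
    f.set_node_attr("/a_group", "threshold", 30)

with tables.open_file(tables_path, "r") as f:
    units = f.get_node_attr("/a_group", "units")
    # ... or through the attribute set of the node itself.
    threshold = f.root.a_group._v_attrs.threshold

print(h5_attrs, units, threshold)
```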

**Assignment**:

Using either `h5py` or `pytables`:
 * create a group (and/or dataset)
 * store some attributes in the group (and/or dataset).
 * read them with the *other* package.

*Optional*:
 * store a string in an attribute using `pytables` and read it with `h5py`.
 * `h5py` can store scalars and `numpy` arrays. `pytables` will also store Python objects. Store a **class** in an attribute.

In the next notebook we will look at datasets and datatypes. 