# Part 1B: Datatypes in HDF5

> Objectives:
> * How to create (and read) HDF5 files with datasets of homogeneous, heterogeneous and nested datatypes
> * See how h5py and PyTables achieve the same thing with their own APIs
> * Be introduced to the `IsDescription` class in PyTables for declaring tables (instead of NumPy dtypes)

In [1]:
import numpy as np
import h5py
import tables

In [2]:
import os
import shutil
data_dir = "datatypes"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## Homogeneous datatypes

In [3]:
arr_to_store = np.arange(10, dtype=np.int8)
arr_to_store

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

### Using h5py

In [4]:
FILENAME = os.path.join(data_dir, "homogenous_h5py.h5")
f = h5py.File(FILENAME, "w")

The `h5py.File` object supports both the `create_dataset` method and `dict`-like access:

In [6]:
f.create_dataset(data=arr_to_store, name="mydata")

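RuntimeError: Unable to create link (Name already exists)

This error appears only because the cell was executed a second time: the first run had already created `mydata`, and HDF5 does not overwrite an existing name.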

In [7]:
f['/mydata2'] = arr_to_store    # dict-like assignment creates a new dataset

In [8]:
list(f)

['mydata', 'mydata2']

Read the dataset with `[:]` or `[...]`:

In [10]:
f['/mydata'][:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

In [11]:
f.close()
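
The same write-and-read cycle can also be sketched with `with` blocks, so that the file is closed even if a cell fails halfway. This is only an illustration: the file name `sketch_h5py.h5` and the temporary directory are assumptions, not part of the tutorial data.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.arange(10, dtype=np.int8)

# Hypothetical file in a throwaway directory, so nothing in data_dir is touched.
path = os.path.join(tempfile.mkdtemp(), "sketch_h5py.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("mydata", data=arr)  # explicit method call
    f["mydata2"] = arr                    # dict-like assignment

with h5py.File(path, "r") as f:
    names = sorted(f)                # dataset names in the root group
    first_three = f["mydata"][:3]    # slicing reads only the selected elements
    everything = f["mydata"][...]    # [...] (or [:]) reads the whole dataset

print(names, first_three, everything.dtype)
```

Indexing a dataset always returns a NumPy array held in memory; the `Dataset` object itself is just a handle to the data on disk.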

The HDF5 library provides `h5ls` and `h5dump` to investigate the contents of HDF5 files:

In [14]:
!h5ls {FILENAME}

mydata                   Dataset {10}
mydata2                  Dataset {10}


In [15]:
!h5ls -rv {FILENAME}

Opened "datatypes\homogenous_h5py.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/mydata                  Dataset {10/10}
    Location:  1:800
    Links:     1
    Storage:   10 logical bytes, 10 allocated bytes, 100.00% utilization
    Type:      native signed char
/mydata2                 Dataset {10/10}
    Location:  1:1672
    Links:     1
    Storage:   10 logical bytes, 10 allocated bytes, 100.00% utilization
    Type:      native signed char


In [16]:
!h5dump {FILENAME}

HDF5 "datatypes\homogenous_h5py.h5" {
GROUP "/" {
   DATASET "mydata" {
      DATATYPE  H5T_STD_I8LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
      DATA {
      (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
      }
   }
   DATASET "mydata2" {
      DATATYPE  H5T_STD_I8LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
      DATA {
      (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
      }
   }
}
}


In [17]:
!ls -lh {data_dir}

total 4.0K
-rw-r--r-- 1 tomkooij 197613 2.2K Jun 22 08:59 homogenous_h5py.h5


### Using PyTables

In [18]:
import tables

In [19]:
FILENAME = os.path.join(data_dir, "homogenous_pytables.h5")
f2 = tables.open_file(FILENAME, "w")

In `PyTables` datasets are wrapped into high-level objects:
* **Array**: homogeneous dataset
* **CArray**: homogeneous dataset, chunked storage (more on this later)
* **EArray**: homogeneous dataset, extendable. Supports `.append()`.
* **Table**: compound dataset, extendable. Supports `.append()`.
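
As a small illustration of what "extendable" means in the list above, the sketch below creates an empty `EArray` and grows it with two `.append()` calls. The node name `growing` and the temporary file are assumptions made for this example only.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "sketch_earray.h5")

with tables.open_file(path, "w") as h5:
    # shape=(0,) marks the first dimension as the one that can grow.
    earr = h5.create_earray(h5.root, "growing",
                            atom=tables.Int8Atom(), shape=(0,))
    earr.append(np.arange(5, dtype=np.int8))      # rows 0..4
    earr.append(np.arange(5, 10, dtype=np.int8))  # rows 5..9
    n_rows = earr.nrows
    data = earr.read()

print(n_rows, data)
```

A plain `Array` has no `.append()`: its shape is fixed when it is created.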

In [21]:
f2.create_array(f2.root, name="mydata", obj=arr_to_store)

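NodeError: group ``/`` already has a child node named ``mydata``

As with the `h5py` cell above, this error comes from re-running the cell after `mydata` had already been created.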

Reading a dataset into memory is similar to `h5py` with `[:]` or `[...]`:

In [27]:
f2.root.mydata[:]  # data can be accessed in a NumPy-like interface

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

`PyTables` also provides a `.read()` method:

In [28]:
f2.root.mydata.read()  

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

In [29]:
f2

File(filename=datatypes\homogenous_pytables.h5, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/mydata (Array(10,)) ''
  atom := Int8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None

In [30]:
f2.close()

In [34]:
!h5ls -r {FILENAME}

/                        Group
/mydata                  Dataset {10}


In [35]:
!ls -lh {data_dir}

total 8.0K
-rw-r--r-- 1 tomkooij 197613 2.2K Jun 22 08:59 homogenous_h5py.h5
-rw-r--r-- 1 tomkooij 197613 2.2K Jun 22 09:05 homogenous_pytables.h5


## Compound Datatypes

In [36]:
dtype = np.dtype([("myfield1", np.int32), ("myfield2", np.float64), ("myfield3", "S5")])
table_to_store = np.fromiter(((i, i**2, "foo_%d"%i) for i in range(10)), dtype=dtype)

In [37]:
table_to_store

array([(0,   0., b'foo_0'), (1,   1., b'foo_1'), (2,   4., b'foo_2'),
       (3,   9., b'foo_3'), (4,  16., b'foo_4'), (5,  25., b'foo_5'),
       (6,  36., b'foo_6'), (7,  49., b'foo_7'), (8,  64., b'foo_8'),
       (9,  81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

### Using h5py

In [38]:
FILENAME = os.path.join(data_dir, "compound_h5py.h5")
f = h5py.File(FILENAME, "w")

In [39]:
f['mydata'] = table_to_store

In [40]:
f['mydata']

<HDF5 dataset "mydata": shape (10,), type "|V17">

In [41]:
f['mydata'].dtype

dtype([('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [42]:
f['mydata'][:]

array([(0,   0., b'foo_0'), (1,   1., b'foo_1'), (2,   4., b'foo_2'),
       (3,   9., b'foo_3'), (4,  16., b'foo_4'), (5,  25., b'foo_5'),
       (6,  36., b'foo_6'), (7,  49., b'foo_7'), (8,  64., b'foo_8'),
       (9,  81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [43]:
f.close()

In [44]:
!h5ls -v {FILENAME}

Opened "datatypes\compound_h5py.h5" with sec2 driver.
mydata                   Dataset {10/10}
    Location:  1:800
    Links:     1
    Storage:   170 logical bytes, 170 allocated bytes, 100.00% utilization
    Type:      struct {
                   "myfield1"         +0    native int
                   "myfield2"         +4    native double
                   "myfield3"         +12   5-byte null-padded ASCII string
               } 17 bytes
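
The offsets and the 17-byte record size reported by `h5ls` can be cross-checked against the NumPy dtype itself. The sketch below uses only NumPy and the dtype defined earlier in this section:

```python
import numpy as np

dtype = np.dtype([("myfield1", np.int32),
                  ("myfield2", np.float64),
                  ("myfield3", "S5")])

# dtype.fields maps each field name to (field dtype, byte offset).
offsets = {name: dtype.fields[name][1] for name in dtype.names}
row_bytes = dtype.itemsize     # 4 + 8 + 5 bytes, packed without padding
total_bytes = 10 * row_bytes   # ten rows, as stored in the file

print(offsets, row_bytes, total_bytes)
```

NumPy packs structured dtypes without padding by default, which is why the HDF5 struct is exactly 17 bytes wide and the ten rows occupy 170 bytes.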


### Exercise

Open `datatypes\compound_h5py.h5` in PyTables and investigate the dataset.

Look at the dataset description. Read it from disk. Look at the dtype.

*Optional: Create a new table. Can you append some data?*

In [46]:
FILENAME = os.path.join(data_dir, "compound_h5py.h5")

In [47]:
!ptdump {FILENAME}

/ (RootGroup) ''
/mydata (Table(10,)) ''


In [48]:
#
#
# SOLUTION STARTS HERE!!!
#
#

In [49]:
fileh = tables.open_file(FILENAME, 'r')

In [50]:
fileh

File(filename=datatypes\compound_h5py.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/mydata (Table(10,)) ''
  description := {
  "myfield1": Int32Col(shape=(), dflt=0, pos=0),
  "myfield2": Float64Col(shape=(), dflt=0.0, pos=1),
  "myfield3": StringCol(itemsize=5, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)

In [51]:
x = fileh.root.mydata[:]
x

array([(0,   0., b'foo_0'), (1,   1., b'foo_1'), (2,   4., b'foo_2'),
       (3,   9., b'foo_3'), (4,  16., b'foo_4'), (5,  25., b'foo_5'),
       (6,  36., b'foo_6'), (7,  49., b'foo_7'), (8,  64., b'foo_8'),
       (9,  81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [52]:
x.dtype

dtype([('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [53]:
x.shape

(10,)

### Using PyTables (using numpy.dtype)

In PyTables a compound dataset is called a `Table`.

To store a table we use `create_table`:

In [80]:
FILENAME = os.path.join(data_dir, "compound_pytables1.h5")
f2 = tables.open_file(FILENAME, "w")

In [81]:
table = f2.create_table(f2.root, name="mydata", description=table_to_store.dtype)
table

/mydata (Table(0,)) ''
  description := {
  "myfield1": Int32Col(shape=(), dflt=0, pos=0),
  "myfield2": Float64Col(shape=(), dflt=0.0, pos=1),
  "myfield3": StringCol(itemsize=5, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)

The `Table` class has high-level methods, such as `append()` and `remove_row()`:

In [82]:
table.append(table_to_store)  

In [83]:
table.read()

array([(0,   0., b'foo_0'), (1,   1., b'foo_1'), (2,   4., b'foo_2'),
       (3,   9., b'foo_3'), (4,  16., b'foo_4'), (5,  25., b'foo_5'),
       (6,  36., b'foo_6'), (7,  49., b'foo_7'), (8,  64., b'foo_8'),
       (9,  81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [84]:
table.remove_row(5)

In [85]:
table.read()

array([(0,   0., b'foo_0'), (1,   1., b'foo_1'), (2,   4., b'foo_2'),
       (3,   9., b'foo_3'), (4,  16., b'foo_4'), (6,  36., b'foo_6'),
       (7,  49., b'foo_7'), (8,  64., b'foo_8'), (9,  81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [89]:
table_to_store[1]

(1,  1., b'foo_1')

In [92]:
f2.close()

### Using PyTables (using tables.IsDescription)

In PyTables it is convenient to define compound datasets by subclassing `tables.IsDescription`, instead of writing (complicated) NumPy dtypes.

In [70]:
class MyTable(tables.IsDescription):
    myfield1 = tables.Int32Col()
    myfield2 = tables.Float64Col()
    myfield3 = tables.StringCol(itemsize=5)

In [71]:
FILENAME = os.path.join(data_dir, "compound_pytables2.h5")
f3 = tables.open_file(FILENAME, "w")

In [72]:
t = f3.create_table(f3.root, "mydata", MyTable)

In [73]:
t.append(table_to_store)

In [74]:
f3.close()
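
To see that an `IsDescription` class and a NumPy dtype describe the same record layout, the class can be converted with `tables.description.dtype_from_descr` (the helper this notebook uses later on). A minimal sketch:

```python
import numpy as np
import tables


class MyTable(tables.IsDescription):
    myfield1 = tables.Int32Col()
    myfield2 = tables.Float64Col()
    myfield3 = tables.StringCol(itemsize=5)


# Convert the class-based description into a NumPy structured dtype.
dtype_from_class = tables.description.dtype_from_descr(MyTable)
field_names = dtype_from_class.names
string_width = dtype_from_class["myfield3"].itemsize

print(dtype_from_class)
```

Note that these columns carry no explicit `pos=` argument. As far as I know PyTables then orders columns alphabetically by name, which happens to match the declaration order here; pass `pos=` when the order matters.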

### Comparing the HDF5 files:

In [76]:
!ls -l {data_dir}

total 156
-rw-r--r-- 1 tomkooij 197613  2314 Jun 22 09:06 compound_h5py.h5
-rw-r--r-- 1 tomkooij 197613 69879 Jun 22 09:09 compound_pytables1.h5
-rw-r--r-- 1 tomkooij 197613 69879 Jun 22 09:09 compound_pytables2.h5
-rw-r--r-- 1 tomkooij 197613  2174 Jun 22 08:59 homogenous_h5py.h5
-rw-r--r-- 1 tomkooij 197613  2154 Jun 22 09:05 homogenous_pytables.h5


The PyTables tables are much larger than the h5py compound dataset (about 70 kB versus 2.3 kB), while the homogeneous files are nearly the same size. Why? Let's look inside the files:

In [77]:
!h5ls {data_dir}/compound_h5py.h5

mydata                   Dataset {10}


In [78]:
!h5ls {data_dir}/compound_pytables1.h5

mydata                   Dataset {9/Inf}


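The shape of the table created with PyTables is reported as `{9/Inf}` (it would be `{10/Inf}` had we not removed a row): the maximum size is unlimited, which HDF5 only allows for chunked datasets. The h5py dataset is just `{10}`, a fixed-size dataset that is not chunked. HDF5 allocates chunked datasets one whole chunk at a time, and with the `chunkshape` of `(3855,)` shown earlier a single chunk of 17-byte rows takes 3855 × 17 = 65535 bytes, even though only a handful of rows are filled. That accounts for nearly all of the size difference.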

The reason why PyTables tables are chunked by default is that they can be enlarged and compressed, and chunking is required in order to allow that.  More on chunking later.
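
The size difference can be checked with a little arithmetic. The sketch below takes the `chunkshape` value and the file sizes from the outputs shown above; it assumes that the first chunk is allocated in full, which is how HDF5 stores uncompressed chunks.

```python
import numpy as np

dtype = np.dtype([("myfield1", np.int32),
                  ("myfield2", np.float64),
                  ("myfield3", "S5")])

chunk_rows = 3855                           # chunkshape reported by PyTables
chunk_bytes = chunk_rows * dtype.itemsize   # storage for one full chunk
data_bytes = 10 * dtype.itemsize            # bytes actually holding rows

# File sizes in bytes, copied from the `ls -l` listing above.
size_pytables = 69879
size_h5py = 2314
size_difference = size_pytables - size_h5py

print(chunk_bytes, data_bytes, size_difference)
```

One chunk (65535 bytes) explains most of the 67565-byte difference; the remainder is presumably metadata that PyTables adds to its files.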

## Nested fields

The `PyTables` API provides some useful tools to handle nested columns in compound datasets (tables).

In [99]:
class NestedTable(tables.IsDescription):
    """A nested table"""
    name = tables.StringCol(10, pos=0)
    
    class momentum(tables.IsDescription):
        p_x = tables.Float64Col()
        p_y = tables.Float64Col()
        p_z = tables.Float64Col() 

In [100]:
FILENAME = os.path.join(data_dir, "nested.h5")
f4 = tables.open_file(FILENAME, "w")

ValueError: The file 'datatypes\nested.h5' is already opened.  Please close it before reopening in write mode.

In [101]:
t = f4.create_table(f4.root, "mydata", NestedTable)

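NodeError: group ``/`` already has a child node named ``mydata``

Both errors above come from re-running these cells while `nested.h5` was still open from the first run; on a fresh kernel they do not occur.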

In [102]:
t

/mydata (Table(0,)) ''
  description := {
  "name": StringCol(itemsize=10, shape=(), dflt=b'', pos=0),
  "momentum": {
    "p_x": Float64Col(shape=(), dflt=0.0, pos=0),
    "p_y": Float64Col(shape=(), dflt=0.0, pos=1),
    "p_z": Float64Col(shape=(), dflt=0.0, pos=2)}}
  byteorder := 'little'
  chunkshape := (1927,)

In [103]:
dtype = t.dtype
dtype

dtype([('name', 'S10'), ('momentum', [('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])])

In [104]:
table_to_store = np.fromiter((("foo_%s"%i, (i, 10+i, 20+i)) for i in range(10)), dtype=dtype)
table_to_store

array([(b'foo_0', ( 0.,  10.,  20.)), (b'foo_1', ( 1.,  11.,  21.)),
       (b'foo_2', ( 2.,  12.,  22.)), (b'foo_3', ( 3.,  13.,  23.)),
       (b'foo_4', ( 4.,  14.,  24.)), (b'foo_5', ( 5.,  15.,  25.)),
       (b'foo_6', ( 6.,  16.,  26.)), (b'foo_7', ( 7.,  17.,  27.)),
       (b'foo_8', ( 8.,  18.,  28.)), (b'foo_9', ( 9.,  19.,  29.))],
      dtype=[('name', 'S10'), ('momentum', [('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])])

Let's investigate the `p_x` nested column:

In [106]:
table_to_store['momentum']['p_x']

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [107]:
t.append(table_to_store)

In [108]:
t.read()

array([(b'foo_0', ( 0.,  10.,  20.)), (b'foo_1', ( 1.,  11.,  21.)),
       (b'foo_2', ( 2.,  12.,  22.)), (b'foo_3', ( 3.,  13.,  23.)),
       (b'foo_4', ( 4.,  14.,  24.)), (b'foo_5', ( 5.,  15.,  25.)),
       (b'foo_6', ( 6.,  16.,  26.)), (b'foo_7', ( 7.,  17.,  27.)),
       (b'foo_8', ( 8.,  18.,  28.)), (b'foo_9', ( 9.,  19.,  29.))],
      dtype=[('name', 'S10'), ('momentum', [('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])])

### Using the `Cols` accessor (PyTables)

`table.col(name)` reads an entire column into memory.

`table.col('momentum')[2:5]` therefore reads the whole `momentum` column (as a NumPy array) and only then slices it:

In [109]:
t.col('momentum')[2:5]

array([( 2.,  12.,  22.), ( 3.,  13.,  23.), ( 4.,  14.,  24.)],
      dtype=[('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])

Using the `cols` accessor, we can slice a column without reading all of it into memory:

In [110]:
t.cols.momentum

/mydata.cols.momentum (Cols), 3 columns
  p_x (Column(10,), float64)
  p_y (Column(10,), float64)
  p_z (Column(10,), float64)

In [111]:
t.cols.momentum[2:5]

array([( 2.,  12.,  22.), ( 3.,  13.,  23.), ( 4.,  14.,  24.)],
      dtype=[('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])

Nested columns can be accessed by the `Cols` accessor using natural naming: 

In [112]:
t.cols.momentum.p_x

/mydata.cols.momentum.p_x (Column(10,), float64, idx=None)

In [113]:
f4.close()
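
The two access paths can be compared side by side in a self-contained sketch. The class name `Particle` and the temporary file are assumptions for this example; the column layout mirrors the `NestedTable` used above.

```python
import os
import tempfile

import numpy as np
import tables


class Particle(tables.IsDescription):  # hypothetical name, same layout as above
    name = tables.StringCol(10, pos=0)

    class momentum(tables.IsDescription):
        p_x = tables.Float64Col()
        p_y = tables.Float64Col()
        p_z = tables.Float64Col()


path = os.path.join(tempfile.mkdtemp(), "sketch_nested.h5")

with tables.open_file(path, "w") as h5:
    t = h5.create_table(h5.root, "mydata", Particle)

    # Fill a structured array field by field, then append it in one go.
    rows = np.zeros(10, dtype=t.dtype)
    rows["name"] = [b"foo_%d" % i for i in range(10)]
    rows["momentum"]["p_x"] = np.arange(10)
    rows["momentum"]["p_y"] = np.arange(10) + 10
    rows["momentum"]["p_z"] = np.arange(10) + 20
    t.append(rows)
    t.flush()

    # col(): the whole 'momentum' column is read, then sliced in memory.
    whole_then_slice = t.col("momentum")[2:5]["p_x"]
    # cols accessor: only rows 2..4 of p_x are requested.
    slice_only = t.cols.momentum.p_x[2:5]

print(whole_then_slice, slice_only)
```

Both expressions return the same values; they differ in how much data is read from the file to get them.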

### Exercise

Investigate reading a small part of a large table from disk.

 * Store the table in an HDF5 file (using either PyTables or h5py).
 * Read elements `[20:30]` from the `p_x` column.

Is the entire data file read from disk?

In [121]:
class NestedTable(tables.IsDescription):
    """A nested table"""
    i = tables.Int32Col()
    
    class momentum(tables.IsDescription):
        p_x = tables.Float64Col()
        p_y = tables.Float64Col()
        p_z = tables.Float64Col() 
        
dtype = tables.description.dtype_from_descr(NestedTable)

Create a "large" dataset:

In [122]:
N = int(1e6)
table_to_store = np.fromiter(((i, (i, i, i)) for i in range(N)), dtype=dtype)

Use the IPython magic `%time` or `%timeit` to time the reading of data from disk.

In [123]:
#
#
# SOLUTION STARTS HERE
#
#

In [126]:
f = tables.open_file('big.h5', 'w')

In [127]:
f.create_table('/', 'mydata', table_to_store)

/mydata (Table(1000000,)) ''
  description := {
  "i": Int32Col(shape=(), dflt=0, pos=0),
  "momentum": {
  "p_x": Float64Col(shape=(), dflt=0.0, pos=0),
  "p_y": Float64Col(shape=(), dflt=0.0, pos=1),
  "p_z": Float64Col(shape=(), dflt=0.0, pos=2)}}
  byteorder := 'little'
  chunkshape := (4681,)

In [128]:
%time f.root.mydata[:]

Wall time: 35.5 ms


array([(     0, (  0.00000000e+00,   0.00000000e+00,   0.00000000e+00)),
       (     1, (  1.00000000e+00,   1.00000000e+00,   1.00000000e+00)),
       (     2, (  2.00000000e+00,   2.00000000e+00,   2.00000000e+00)),
       ...,
       (999997, (  9.99997000e+05,   9.99997000e+05,   9.99997000e+05)),
       (999998, (  9.99998000e+05,   9.99998000e+05,   9.99998000e+05)),
       (999999, (  9.99999000e+05,   9.99999000e+05,   9.99999000e+05))],
      dtype=[('i', '<i4'), ('momentum', [('p_x', '<f8'), ('p_y', '<f8'), ('p_z', '<f8')])])

In [129]:
%time f.root.mydata.cols.momentum.p_x[20:30]

Wall time: 2 ms


array([ 20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.])

In [130]:
f.close()

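In `h5py` there is no direct equivalent of the `cols` accessor: indexing a dataset with a field name reads that field for every row, and the slice is then applied to the resulting in-memory array: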

In [131]:
f = h5py.File('big.h5', 'a')
dset = f['mydata']

In [134]:
%time dset['momentum']['p_x'][20:30]  # this reads the entire nested column and then selects `p_x`

Wall time: 51.6 ms


array([ 20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.])

In [135]:
f.close()
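
With `h5py` the amount of data read can still be limited by slicing the rows first and selecting the fields afterwards, on the small NumPy array that comes back. The sketch below uses a smaller table (`N = 1000`) and a temporary file, both assumptions made to keep the example quick:

```python
import os
import tempfile

import h5py
import numpy as np

momentum = np.dtype([("p_x", "f8"), ("p_y", "f8"), ("p_z", "f8")])
dtype = np.dtype([("i", "i4"), ("momentum", momentum)])

N = 1000  # much smaller than the table in the exercise
table = np.zeros(N, dtype=dtype)
table["i"] = np.arange(N)
for axis in ("p_x", "p_y", "p_z"):
    table["momentum"][axis] = np.arange(N)

path = os.path.join(tempfile.mkdtemp(), "sketch_nested_h5py.h5")
with h5py.File(path, "w") as f:
    f["mydata"] = table

with h5py.File(path, "r") as f:
    dset = f["mydata"]
    # Field name first: the 'momentum' field of all N rows is read.
    field_first = dset["momentum"]["p_x"][20:30]
    # Row slice first: only rows 20..29 are read, with all of their fields.
    rows_first = dset[20:30]["momentum"]["p_x"]

print(rows_first)
```

The row-first form still reads every field of the ten selected rows, so it is not a true column read, but it avoids pulling the full column into memory.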

# Exercise

Using the hierarchy and compound datasets (tables).

`ufo_scrubbed.csv` is a (scrubbed) partial dataset of UFO sightings from across the world.

Store the UFO sightings in HDF5. Assume the real (full) dataset is VERY large, so split the data geographically over multiple tables (for example, one table per country).
Make sure each column gets an appropriate dtype.

Use h5py and/or PyTables.

**The objective of this exercise is to get hands-on experience with compound datatypes, groups, dtypes, etc.**

-------------



Please, do not use `pandas` in this exercise (we will use it later on). But let's use pandas for a quick overview of the dataset:

In [284]:
import pandas as pd
df = pd.read_csv('datasets/ufo_scrubbed.csv', low_memory=False)
df.head()

| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700.0 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | | light | 7200.0 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | | gb | circle | 20.0 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20.0 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900.0 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 |


Read the CSV file into a list of tuples, one tuple per line:

In [285]:
import csv

In [286]:
with open('datasets/ufo_scrubbed.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    dataset = [tuple(line) for line in reader]

In [287]:
dataset[:2]

[('datetime',
  'city',
  'state',
  'country',
  'shape',
  'duration (seconds)',
  'duration (hours/min)',
  'comments',
  'date posted',
  'latitude',
  'longitude'),
 ('10/10/1949 20:30',
  'san marcos',
  'tx',
  'us',
  'cylinder',
  '2700',
  '45 minutes',
  'This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit',
  '4/27/2004',
  '29.8830556',
  '-97.9411111')]

Let's create a dictionary of sightings per country:

In [288]:
from collections import defaultdict
sightings = defaultdict(list)

for sighting in dataset[1:]:
    dt, city, state, country, _, duration, _, comments, _, lat, lon = sighting
    sightings[country].append((dt, city, state, duration, lat, lon))

In [289]:
sightings.keys()

dict_keys(['us', '', 'gb', 'ca', 'au', 'de'])

In [290]:
len(sightings['de'])

105

Each list in the dictionary is a list of equal-length tuples, which maps naturally onto a NumPy structured array (record array):

In [291]:
sightings['de'][:3]

[('10/13/2006 00:02', 'berlin (germany)', '', '120', '52.516667', '13.4'),
 ('10/20/2012 18:00', 'berlin (germany)', '', '1500', '52.516667', '13.4'),
 ('10/8/2012 17:10', 'obernheim (germany)', '', '2', '49.366667', '7.583333')]

In [292]:
#
#
# SOLUTION STARTS HERE
#
#

Manually set the `dtype` (or use `tables.IsDescription`)

In [293]:
dtype=np.dtype([('datetime', 'S16'), ('city', 'S20'), ('state', 'S2'), ('duration', np.float32), ('lat', 'f8'), ('lon', 'f8')])

In [294]:
np.array(sightings['us'], dtype=dtype)

array([ (b'10/10/1949 20:30', b'san marcos', b'tx',  2700.,  29.8830556,  -97.9411111),
       (b'10/10/1956 21:00', b'edna', b'tx',    20.,  28.9783333,  -96.6458333),
       (b'10/10/1960 20:00', b'kaneohe', b'hi',   900.,  21.4180556, -157.8036111),
       ...,
       (b'9/9/2013 22:00', b'napa', b'ca',  1200.,  38.2972222, -122.2844444),
       (b'9/9/2013 22:20', b'vienna', b'va',     5.,  38.9011111,  -77.2655556),
       (b'9/9/2013 23:00', b'edmond', b'ok',  1020.,  35.6527778,  -97.4777778)],
      dtype=[('datetime', 'S16'), ('city', 'S20'), ('state', 'S2'), ('duration', '<f4'), ('lat', '<f8'), ('lon', '<f8')])
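
The conversion above relies on NumPy turning the CSV strings into bytes and floats implicitly. A more explicit variant, sketched below on the three German sightings shown earlier, converts each value by hand before building the array. This is an alternative under stated assumptions, not the notebook's own method: rows with an empty or malformed duration would raise `ValueError` in `float()` and need their own handling.

```python
import numpy as np

dtype = np.dtype([('datetime', 'S16'), ('city', 'S20'), ('state', 'S2'),
                  ('duration', np.float32), ('lat', 'f8'), ('lon', 'f8')])

# Sample rows copied from the sightings['de'] output above.
rows = [
    ('10/13/2006 00:02', 'berlin (germany)', '', '120', '52.516667', '13.4'),
    ('10/20/2012 18:00', 'berlin (germany)', '', '1500', '52.516667', '13.4'),
    ('10/8/2012 17:10', 'obernheim (germany)', '', '2', '49.366667', '7.583333'),
]

# Text fields become ASCII bytes (non-ASCII characters are replaced by '?'),
# numeric fields become Python floats.
converted = [
    (dt.encode('ascii', 'replace'),
     city.encode('ascii', 'replace'),
     state.encode('ascii', 'replace'),
     float(duration), float(lat), float(lon))
    for dt, city, state, duration, lat, lon in rows
]

arr = np.array(converted, dtype=dtype)
print(arr['city'], arr['duration'])
```

Fixed-width string fields silently truncate longer values, so a city name of more than 20 characters would be cut off by `'S20'`.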

In [303]:
fn = os.path.join(data_dir, 'ufo.h5')
f = h5py.File(fn, 'w')

In [304]:
for country in sightings.keys():
    if country=='':
        group_name = 'other'
    else:
        group_name = country
    f.create_group(group_name)
    f.create_dataset('%s/data' % group_name, data=np.array(sightings[country], dtype=dtype))

In [305]:
list(f)

['au', 'ca', 'de', 'gb', 'other', 'us']

In [306]:
f.close()

In [309]:
!h5ls -r {fn}

/                        Group
/au                      Group
/au/data                 Dataset {538}
/ca                      Group
/ca/data                 Dataset {3000}
/de                      Group
/de/data                 Dataset {105}
/gb                      Group
/gb/data                 Dataset {1905}
/other                   Group
/other/data              Dataset {9670}
/us                      Group
/us/data                 Dataset {65114}


# Exercise (Optional)

In `PyTables` datasets are wrapped into high-level objects:
* **Array**: homogeneous dataset
* **CArray**: homogeneous dataset, chunked storage (more on this later)
* **EArray**: homogeneous dataset, extendable. Supports `.append()`.
* **Table**: compound dataset, extendable. Supports `.append()`.

1) Create an HDF5 file in `PyTables` containing each of the above four dataset types. Open it in `h5py` (and/or use `h5ls` etc.). Note the equivalent `h5py` datatype.

2) Create the same file in `h5py`, making sure the datasets end up as the above four `PyTables` dataset types. Use `ptdump` to view the file in `PyTables` format:

In [None]:
#
#
# SOLUTION STARTS HERE
#
#


In [339]:
fn = os.path.join(data_dir, 'types.h5')
with h5py.File(fn, 'w') as f:
    # Array
    f['/array'] = np.arange(10)
    # CArray
    dset = f.create_dataset('/carray', (10,), chunks=True)
    dset[:] = np.arange(10)
    # EArray
    dset = f.create_dataset('/earray', (10,), maxshape=(None,))
    dset[:] = np.arange(10)
    dset.resize(20, 0)
    dset[10:] = np.arange(10)
    # Table
    dtype=np.dtype([('c1', 'i4'), ('c2','f4')])
    table = np.array([(1, 2.5), (3, 6.7)], dtype=dtype)
    f['/table'] = table             

In [340]:
!ptdump {fn}

/ (RootGroup) ''
/array (Array(10,)) ''
/carray (CArray(10,)) ''
/earray (EArray(20,)) ''
/table (Table(2,)) ''
