# zh5

Yet another HDF5 reader.

In [1]:
import zh5

## Performance

In [2]:
import time

Performance description:

- Remote data access to object storage (the OpenStack Object Storage project, known as **Swift**).
- The same file, formatted as:
  - HDF5 page-aggregated file (page size of 4096 bytes).
  - HDF5 split file (separate metadata and raw data files).
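
To make the paged layout concrete, the sketch below maps a byte range to the 4096-byte pages that cover it and builds the HTTP `Range` header value for a single page. The helper names `pages_for_range` and `range_header` are hypothetical and are not part of zh5; this only illustrates page-aligned remote access under the page size stated above.

```python
PAGE_SIZE = 4096  # page size of the paged file described above


def pages_for_range(offset, length, page_size=PAGE_SIZE):
    """Return the indices of the pages covering bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))


def range_header(page, page_size=PAGE_SIZE):
    """Build an HTTP Range header value that requests exactly one page."""
    start = page * page_size
    end = start + page_size - 1  # HTTP byte ranges are inclusive
    return f"bytes={start}-{end}"


# Bytes 4000..4199 straddle the boundary between page 0 and page 1.
print(pages_for_range(4000, 200))
print(range_header(1))
```

A small read that crosses a page boundary therefore costs two page fetches unless the pages are already cached.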

In [3]:
results = []

### Page file

In [4]:
for run in range(3):
    # Time opening the remote paged file.
    open_start = time.perf_counter()
    f = zh5.PagedFile("https://api.cloud.ifca.es:8080/swift/v1/tests/ch330a.pc19790301-def.nc")
    open_end = time.perf_counter()

    # Time locating the dataset (metadata lookup).
    dataset_start = time.perf_counter()
    ds = f["UM_m01s30i204_vn1106"]
    dataset_end = time.perf_counter()

    # Time reading the selection and averaging it.
    read_start = time.perf_counter()
    ts = ds[:, 0, 0:960, :].mean(axis=(1, 2, 3))
    read_end = time.perf_counter()

    f.close()

    # Record elapsed times in seconds plus the page cache counters.
    results.append({
        "type": "page",
        "run": run,
        "open": open_end - open_start,
        "lookup": dataset_end - dataset_start,
        "read": read_end - read_start,
        "cache_hits": f.cache_hits,
        "cache_misses": f.cache_misses,
    })

### Split file

In [5]:
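for run in range(3):
    # Time opening the remote split file (metadata and raw data files).
    open_start = time.perf_counter()
    f = zh5.SplitFile("https://api.cloud.ifca.es:8080/swift/v1/tests/ch330a.pc19790301-def")
    open_end = time.perf_counter()

    # Time locating the dataset (metadata lookup).
    dataset_start = time.perf_counter()
    ds = f["UM_m01s30i204_vn1106"]
    dataset_end = time.perf_counter()

    # Time reading the selection and averaging it.
    read_start = time.perf_counter()
    ts = ds[:, 0, 0:960, :].mean(axis=(1, 2, 3))
    read_end = time.perf_counter()

    f.close()

    # Record elapsed times in seconds; no cache counters are recorded for split files.
    results.append({
        "type": "split",
        "run": run,
        "open": open_end - open_start,
        "lookup": dataset_end - dataset_start,
        "read": read_end - read_start,
        "cache_hits": None,
        "cache_misses": None,
    })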

### Regular file

Don't ever bother ...

## Analysis

In [6]:
import pandas as pd

In [7]:
df = pd.DataFrame.from_records(results)
df

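| | type | run | open | lookup | read | cache_hits | cache_misses |
|---|---|---|---|---|---|---|---|
| 0 | page | 0 | 1.691056 | 12.839349 | 5.881265 | 4121.0 | 56.0 |
| 1 | page | 1 | 1.814325 | 13.172610 | 6.382858 | 4121.0 | 56.0 |
| 2 | page | 2 | 1.966927 | 12.867188 | 6.005512 | 4121.0 | 56.0 |
| 3 | split | 0 | 1.923662 | 0.007544 | 5.495594 | NaN | NaN |
| 4 | split | 1 | 1.499715 | 0.015313 | 5.598474 | NaN | NaN |
| 5 | split | 2 | 2.020635 | 0.015629 | 6.057909 | NaN | NaN |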


Pay attention to the `lookup` column, which records the time in seconds required to locate the dataset in the file. In the HDF5 paged file, **56** `cache_misses` occur, which cause 56 HTTP connections to locate metadata in the file, and lookup takes about 13 seconds. In the HDF5 split file, all metadata is loaded into memory when the file is opened, hence the low lookup values (about 0.01 seconds).
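
The relationship between cache misses and remote requests can be illustrated with a small page cache that counts hits and misses. This is a minimal sketch under stated assumptions, not zh5's implementation: the class `CountingPageCache`, the `fake_fetch` function, and the LRU eviction policy are all hypothetical, and the fetch function merely stands in for one HTTP request per missed page.

```python
from collections import OrderedDict


class CountingPageCache:
    """Hypothetical LRU page cache that counts hits and misses."""

    def __init__(self, fetch_page, max_pages=64):
        self.fetch_page = fetch_page  # callable: page index -> bytes
        self.max_pages = max_pages
        self.pages = OrderedDict()
        self.cache_hits = 0
        self.cache_misses = 0

    def get(self, page):
        if page in self.pages:
            self.cache_hits += 1
            self.pages.move_to_end(page)  # mark as most recently used
            return self.pages[page]
        self.cache_misses += 1
        data = self.fetch_page(page)  # stands in for one HTTP request
        self.pages[page] = data
        if len(self.pages) > self.max_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        return data


requests = []  # page indices that would have been fetched remotely


def fake_fetch(page):
    requests.append(page)
    return bytes(4096)


cache = CountingPageCache(fake_fetch)
for page in [0, 1, 0, 0, 2, 1]:
    cache.get(page)

print(cache.cache_hits, cache.cache_misses, len(requests))
```

Every miss maps to exactly one call of the fetch function, so scattered metadata pages translate directly into sequential round trips to the object store.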

In [8]:
df.groupby("type").mean()

| type | run | open | lookup | read | cache_hits | cache_misses |
|---|---|---|---|---|---|---|
| page | 1.0 | 1.824102 | 12.959716 | 6.089878 | 4121.0 | 56.0 |
| split | 1.0 | 1.814671 | 0.012829 | 5.717326 | NaN | NaN |
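
As a rough back-of-the-envelope check, the mean values can be turned into a cost per cache miss, a cache hit ratio, and a lookup slowdown factor. The numbers are copied from the means above; the calculation assumes that all 56 misses happen during the dataset lookup, which is how the text interprets them, so the per-miss figure is an estimate rather than a measurement.

```python
# Mean values copied from the table above (seconds and counts).
page_lookup = 12.959716
split_lookup = 0.012829
cache_hits = 4121
cache_misses = 56

# Average time attributed to each miss, assuming all misses occur during lookup.
seconds_per_miss = page_lookup / cache_misses

# Fraction of page requests served from the cache.
hit_ratio = cache_hits / (cache_hits + cache_misses)

# How much slower the paged lookup is than the split lookup.
slowdown = page_lookup / split_lookup

print(f"{seconds_per_miss:.3f} s per miss")
print(f"{hit_ratio:.1%} hit ratio")
print(f"{slowdown:.0f}x slower lookup")
```

Under that assumption each miss costs roughly 0.23 seconds, so even a hit ratio close to 99% leaves the paged lookup about three orders of magnitude slower than the split lookup.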


In [9]:
df.groupby("type").std()

| type | run | open | lookup | read | cache_hits | cache_misses |
|---|---|---|---|---|---|---|
| page | 1.0 | 0.138195 | 0.184896 | 0.261222 | 0.0 | 0.0 |
| split | 1.0 | 0.277036 | 0.004579 | 0.299406 | NaN | NaN |
