# Creating Generic HDF5 Files

The DL1 files shown in the previous tutorials are created and read by subclasses of the `HDF5Writer` and `HDF5Reader` classes, respectively. These classes can also be used for more custom purposes, such as storing any data that fits a tabular format. I personally find this very useful: many of my scripts store data in an HDF5 file as an intermediate step (using `HDF5Writer`), and a second script then creates the plot from this file (using `HDF5Reader`).

## Reminder about HDF5 and DataFrames

The .h5 extension is used by HDF5 files: https://support.hdfgroup.org/HDF5/whatishdf5.html.

Inside the HDF5 files are HDFStores, the format in which pandas DataFrames are stored in HDF5 files. You can read about HDFStores here: https://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables.

Pandas DataFrames are a tabular data structure widely used by data scientists for data analysis in Python: https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe. They allow easy querying, sorting, grouping, and processing of data.
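
As a quick illustration of those operations, the following sketch builds a small DataFrame and then queries, sorts, and groups it. The column names and values are made up for the example and are not taken from any CHECLabPy file.

```python
import numpy as np
import pandas as pd

# Hypothetical table: charge measured in four pixels over three events
df = pd.DataFrame(dict(
    event=np.repeat(np.arange(3), 4),
    pixel=np.tile(np.arange(4), 3),
    charge=np.arange(12, dtype=float),
))

# Querying: keep only the rows belonging to pixel 2
pixel_2 = df.query("pixel == 2")

# Sorting: order the rows by descending charge
sorted_df = df.sort_values("charge", ascending=False)

# Grouping: average charge per pixel across all events
mean_charge = df.groupby("pixel")["charge"].mean()

print(pixel_2)
print(sorted_df.head())
print(mean_charge)
```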

## HDF5Writer Example

The most straightforward way to write to an HDF5 file is via the `write` method:

In [None]:
import pandas as pd
import numpy as np
from CHECLabPy.core.io import HDF5Writer

x = np.arange(100)

y2 = x**2
df2 = pd.DataFrame(dict(
    x=x,
    y=y2,
))

y5 = x**5
df5 = pd.DataFrame(dict(
    x=x,
    y=y5,
))

metadata_2 = dict(
    size=x.size,
    power=2,
)
metadata_5 = dict(
    size=x.size,
    power=5,
)

with HDF5Writer("refdata/data1.h5") as writer:
    writer.write(data_2=df2, data_5=df5)
    writer.add_metadata(key='data_2', **metadata_2)
    writer.add_metadata(key='data_5', **metadata_5)
    # Add a second metadata field for the data_5 table
    writer.add_metadata(key='data_5', name='test', **metadata_5)

However, if you are iterating through a dataset and cannot hold the entire result in memory before storing it, you can use the `append` method instead. This is the approach used in the `extract_dl1` script.

In [None]:
import pandas as pd
import numpy as np
from CHECLabPy.core.io import HDF5Writer

metadata = dict(
    size=100*3,
)

with HDF5Writer("refdata/data2.h5") as writer:
    for x in range(100):
        power = np.array([2, 4, 5])
        y = x**power
        df = pd.DataFrame(dict(
            x=x,
            y=y,
            power=power,
        ))
        writer.append(df, key='data')
    writer.add_metadata(key='data', **metadata)
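
To see what table this loop builds, the sketch below stacks the same per-iteration DataFrames in memory with `pd.concat` instead of writing them to disk. It only illustrates the resulting layout, under the assumption that `append` adds each chunk's rows to the end of the stored table; it does not reproduce how `HDF5Writer` buffers or writes data.

```python
import numpy as np
import pandas as pd

chunks = []
for x in range(100):
    # int64 so that 99**5 does not overflow where the default integer is 32-bit
    power = np.array([2, 4, 5], dtype=np.int64)
    # The scalar x is broadcast to match the length of the arrays
    chunk = pd.DataFrame(dict(
        x=x,
        y=x**power,
        power=power,
    ))
    chunks.append(chunk)

# In-memory stand-in for the appended table (assumption: rows are stacked in order)
table = pd.concat(chunks, ignore_index=True)

print(table.shape)  # 100 iterations x 3 powers = 300 rows
print(table.head(6))
```

Each value of `x` contributes one row per power, which is why the metadata above records `size=100*3`.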

If you are processing data from a TIO or DL1 file, you may wish to store the pixel mapping inside the HDF5 file with your results, which could be useful for plotting the results later:

In [None]:
# Store the charge extracted per pixel together with the pixel mapping,
# so that a camera image can be plotted from this file later
import pandas as pd
from CHECLabPy.core.io import HDF5Writer
from CHECLabPy.core.io import DL1Reader

dl1_path = "refdata/Run17473_dl1.h5"
reader = DL1Reader(dl1_path)

pixel, charge = reader.select_columns(['pixel', 'charge_cc'])

df = pd.DataFrame(dict(
    pixel=pixel,
    charge=charge,
))

with HDF5Writer("refdata/data3.h5") as writer:
    writer.write(data=df)
    writer.add_mapping(reader.mapping)

## HDF5Reader Example

The `dataframe_keys` and `metadata_keys` attributes show which contents of a file are accessible:

In [None]:
from CHECLabPy.core.io import HDF5Reader

with HDF5Reader("refdata/data1.h5") as reader:
    print(reader.dataframe_keys)
    print(reader.metadata_keys)

Reading the data back from the file is achieved as follows:

In [None]:
from CHECLabPy.core.io import HDF5Reader

with HDF5Reader("refdata/data1.h5") as reader:
    df_2 = reader.read("data_2")
    df_5 = reader.read("data_5")
    metadata_2 = reader.get_metadata(key='data_2')
    metadata_5 = reader.get_metadata(key='data_5', name='test')
    
print(df_2)
print(metadata_2)

In [None]:
from CHECLabPy.core.io import HDF5Reader

with HDF5Reader("refdata/data2.h5") as reader:
    df = reader.read("data")
    metadata = reader.get_metadata(key='data')
    
print(df)
print(metadata)

In [None]:
from CHECLabPy.core.io import HDF5Reader

with HDF5Reader("refdata/data3.h5") as reader:
    df = reader.read("data")
    mapping = reader.get_mapping()
    
print(df)
print(mapping)