# Exploring a `qp` file

This notebook walks through the data structure of an Ensemble and the contents of a `qp` HDF5 file.

In [2]:
import qp
import h5py
import tables_io
import numpy as np
from scipy import stats

## What's in a `qp` file?

First, let's read in an Ensemble from an HDF5 file using `qp` and take a look at the metadata.

In [None]:
ens_i = qp.read("../assets/interp-ensemble.hdf5")
ens_i

Ensemble(the_class=interp,shape=(3, 50))

In [72]:
ens_i.metadata

{'pdf_name': array([b'interp'], dtype='|S6'),
 'pdf_version': array([0]),
 'xvals': array([-1.        , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
        -0.3877551 , -0.26530612, -0.14285714, -0.02040816,  0.10204082,
         0.2244898 ,  0.34693878,  0.46938776,  0.59183673,  0.71428571,
         0.83673469,  0.95918367,  1.08163265,  1.20408163,  1.32653061,
         1.44897959,  1.57142857,  1.69387755,  1.81632653,  1.93877551,
         2.06122449,  2.18367347,  2.30612245,  2.42857143,  2.55102041,
         2.67346939,  2.79591837,  2.91836735,  3.04081633,  3.16326531,
         3.28571429,  3.40816327,  3.53061224,  3.65306122,  3.7755102 ,
         3.89795918,  4.02040816,  4.14285714,  4.26530612,  4.3877551 ,
         4.51020408,  4.63265306,  4.75510204,  4.87755102,  5.        ])}

### Using h5py
Now that we know this file contains an Ensemble, let's open it with `h5py` and look at how the data is laid out.

In [5]:
fileobj = h5py.File("../assets/interp-ensemble.hdf5", "r")
list(fileobj.keys())

['ancil', 'data', 'meta']

This HDF5 file has three top-level keys: **meta** for metadata, **data** for objdata, and **ancil** for ancillary data. Each of these is a group object. Let's look at what is stored in each group:

In [7]:
print(f"meta: {list(fileobj['meta'].keys())}")
print(f"data : {list(fileobj['data'].keys())}")
print(f"ancil: {list(fileobj['ancil'].keys())}")

meta: ['pdf_name', 'pdf_version', 'xvals']
data : ['yvals']
ancil: ['ids']


Each of these groups contains at least one dataset. If you look back at the Ensemble metadata dictionary we printed out earlier, you can see that the **meta** group has the same metadata keys as the Ensemble metadata dictionary. Each of these keys is its own dataset. If we print out the datasets for all the metadata keys, we can see the same information that we saw earlier:

In [None]:
# print out the contents of the metadata datasets
print(f"pdf_name: {fileobj['meta']['pdf_name'][:]}")
print(f"pdf_version: {fileobj['meta']['pdf_version'][:]}")
print(f"xvals: {fileobj['meta']['xvals'][:]}")

pdf_name: [b'interp']
pdf_version: [0]
xvals: [[-1.         -0.87755102 -0.75510204 -0.63265306 -0.51020408 -0.3877551
  -0.26530612 -0.14285714 -0.02040816  0.10204082  0.2244898   0.34693878
   0.46938776  0.59183673  0.71428571  0.83673469  0.95918367  1.08163265
   1.20408163  1.32653061  1.44897959  1.57142857  1.69387755  1.81632653
   1.93877551  2.06122449  2.18367347  2.30612245  2.42857143  2.55102041
   2.67346939  2.79591837  2.91836735  3.04081633  3.16326531  3.28571429
   3.40816327  3.53061224  3.65306122  3.7755102   3.89795918  4.02040816
   4.14285714  4.26530612  4.3877551   4.51020408  4.63265306  4.75510204
   4.87755102  5.        ]]
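
Beyond the values themselves, it is often useful to see the shape and dtype of every dataset at once. The sketch below walks a file with `h5py`'s `visititems` and collects that information. The helper name `describe_hdf5` and the stand-in file (written to a temporary directory with placeholder values) are assumptions made for illustration; they are not part of `qp`.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_hdf5(path):
    """Return {"group/dataset": (shape, dtype)} for every dataset in the file."""
    summary = {}

    def visitor(name, obj):
        # visititems calls this for every group and dataset below the root
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visitor)
    return summary


# Build a small stand-in file with the same three groups (placeholder values).
tmp_path = os.path.join(tempfile.mkdtemp(), "demo-ensemble.hdf5")
with h5py.File(tmp_path, "w") as f:
    meta = f.create_group("meta")
    meta.create_dataset("pdf_name", data=np.array([b"interp"]))
    meta.create_dataset("pdf_version", data=np.array([0]))
    meta.create_dataset("xvals", data=np.linspace(-1, 5, 50)[None, :])
    f.create_group("data").create_dataset("yvals", data=np.ones((3, 50)))
    f.create_group("ancil").create_dataset("ids", data=np.array([1.0, 2.0, 3.0]))

layout = describe_hdf5(tmp_path)
for name, (shape, dtype) in sorted(layout.items()):
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `describe_hdf5` at the example file used in this notebook should list the same five datasets seen above.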


### Using tables_io

Now let's read the same file with `tables_io` and look at how the data is formatted. `tables_io` also uses `h5py` to read the file, but a single function call reads the whole file into a dictionary of dictionaries, or a `TableDict-like` object. For more information about `tables_io`, check out its [documentation](https://tables-io.readthedocs.io/en/latest/index.html).

In [None]:
file_tab_i = tables_io.read("../assets/interp-ensemble.hdf5")
file_tab_i.keys()

odict_keys(['ancil', 'data', 'meta'])

It's an ordered dictionary with three keys: **meta** for metadata, **data** for objdata, and **ancil** for ancillary data. Let's take a look at each of these dictionaries to see how they're formatted:

In [74]:
file_tab_i["meta"]

OrderedDict([('pdf_name', array([b'interp'], dtype='|S6')),
             ('pdf_version', array([0])),
             ('xvals',
              array([[-1.        , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
                      -0.3877551 , -0.26530612, -0.14285714, -0.02040816,  0.10204082,
                       0.2244898 ,  0.34693878,  0.46938776,  0.59183673,  0.71428571,
                       0.83673469,  0.95918367,  1.08163265,  1.20408163,  1.32653061,
                       1.44897959,  1.57142857,  1.69387755,  1.81632653,  1.93877551,
                       2.06122449,  2.18367347,  2.30612245,  2.42857143,  2.55102041,
                       2.67346939,  2.79591837,  2.91836735,  3.04081633,  3.16326531,
                       3.28571429,  3.40816327,  3.53061224,  3.65306122,  3.7755102 ,
                       3.89795918,  4.02040816,  4.14285714,  4.26530612,  4.3877551 ,
                       4.51020408,  4.63265306,  4.75510204,  4.87755102,  5.        ]]))])

In [75]:
file_tab_i["data"]

OrderedDict([('yvals',
              array([[5.62574037e-11, 8.27495969e-10, 1.03658971e-08, 1.10586652e-07,
                      1.00473916e-06, 7.77425458e-06, 5.12293674e-05, 2.87497474e-04,
                      1.37405425e-03, 5.59279029e-03, 1.93868833e-02, 5.72324401e-02,
                      1.43890240e-01, 3.08088307e-01, 5.61789868e-01, 8.72423581e-01,
                      1.15381371e+00, 1.29956738e+00, 1.24657012e+00, 1.01833210e+00,
                      7.08462651e-01, 4.19758300e-01, 2.11805109e-01, 9.10182281e-02,
                      3.33100380e-02, 1.03818963e-02, 2.75570708e-03, 6.22937139e-04,
                      1.19925131e-04, 1.96621492e-05, 2.74540600e-06, 3.26465262e-07,
                      3.30614716e-08, 2.85142657e-09, 2.09438737e-10, 1.31010660e-11,
                      6.97928711e-13, 3.16643299e-14, 1.22344470e-15, 4.02580927e-17,
                      1.12817599e-18, 2.69249754e-20, 5.47253547e-22, 9.47276278e-24,
                      1.3964312

In [76]:
file_tab_i["ancil"]

OrderedDict([('ids', array([1., 2., 3.]))])
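
To make the dictionary-of-dictionaries layout concrete, here is a minimal sketch that builds the same kind of nested structure using only `h5py`. It is not how `tables_io` is implemented; the helper `load_nested` is hypothetical and assumes every top-level entry is a group that contains only datasets. A small stand-in file is written first so the example runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def load_nested(path):
    """Read every dataset of every top-level group into {group: {dataset: array}}."""
    tables = {}
    with h5py.File(path, "r") as f:
        for group_name, group in f.items():
            # dset[()] reads the whole dataset into memory as a NumPy array
            tables[group_name] = {key: dset[()] for key, dset in group.items()}
    return tables


# Stand-in file with placeholder values, mirroring the meta/data/ancil layout.
tmp_path = os.path.join(tempfile.mkdtemp(), "nested-demo.hdf5")
with h5py.File(tmp_path, "w") as f:
    meta = f.create_group("meta")
    meta.create_dataset("pdf_name", data=np.array([b"interp"]))
    meta.create_dataset("pdf_version", data=np.array([0]))
    meta.create_dataset("xvals", data=np.array([np.linspace(0, 5, 10)]))
    f.create_group("data").create_dataset("yvals", data=np.ones((3, 10)))
    f.create_group("ancil").create_dataset("ids", data=np.array([1.0, 2.0, 3.0]))

tables = load_nested(tmp_path)
print(sorted(tables))
print(tables["ancil"])
```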

We can get a similar data structure from the Ensemble itself with the `build_tables` method, which is called to create a dictionary of the three main data tables before an Ensemble is written to file.

In [77]:
tables_i = ens_i.build_tables()
tables_i.keys()

dict_keys(['meta', 'data', 'ancil'])

We can compare the metadata tables generated by the `build_tables` method to the ones read in from file by `tables_io`: 

In [78]:
print("From build_tables method:")
print(tables_i["meta"])
print("From file:")
print(file_tab_i["meta"])

From build_tables method:
{'pdf_name': array([b'interp'], dtype='|S6'), 'pdf_version': array([0]), 'xvals': array([[-1.        , -0.87755102, -0.75510204, -0.63265306, -0.51020408,
        -0.3877551 , -0.26530612, -0.14285714, -0.02040816,  0.10204082,
         0.2244898 ,  0.34693878,  0.46938776,  0.59183673,  0.71428571,
         0.83673469,  0.95918367,  1.08163265,  1.20408163,  1.32653061,
         1.44897959,  1.57142857,  1.69387755,  1.81632653,  1.93877551,
         2.06122449,  2.18367347,  2.30612245,  2.42857143,  2.55102041,
         2.67346939,  2.79591837,  2.91836735,  3.04081633,  3.16326531,
         3.28571429,  3.40816327,  3.53061224,  3.65306122,  3.7755102 ,
         3.89795918,  4.02040816,  4.14285714,  4.26530612,  4.3877551 ,
         4.51020408,  4.63265306,  4.75510204,  4.87755102,  5.        ]])}
From file:
OrderedDict({'pdf_name': array([b'interp'], dtype='|S6'), 'pdf_version': array([0]), 'xvals': array([[-1.        , -0.87755102, -0.75510204, -0.6326

They are essentially identical. One thing to note is that `build_tables` can also encode any strings in the **ancil** table, if you provide it with the appropriate arguments. This is useful if you will be writing to HDF5 files. 
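
To see why encoding matters for HDF5, note that `h5py` cannot store NumPy unicode (`U` dtype) arrays directly, while byte-string (`S` dtype) arrays are written without trouble. The sketch below encodes a string column by hand with NumPy, writes it, and decodes it after reading. The `names` column is a made-up example, and this only illustrates the idea; it does not show which arguments `build_tables` accepts.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical ancillary column of string labels (not part of the example file).
names = np.array(["gal_a", "gal_b", "gal_c"])

# Encode the unicode array to fixed-width byte strings before writing.
encoded = np.char.encode(names, "utf-8")

tmp_path = os.path.join(tempfile.mkdtemp(), "ancil-strings.hdf5")
with h5py.File(tmp_path, "w") as f:
    f.create_group("ancil").create_dataset("names", data=encoded)

with h5py.File(tmp_path, "r") as f:
    stored = f["ancil"]["names"][()]

# Decode back to regular strings after reading.
decoded = np.char.decode(stored, "utf-8")
print(encoded.dtype, decoded)
```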

## Creating a `qp` file from scratch 

Now let's try to create an Ensemble file from scratch, and see if we can read it in as an Ensemble. First, we need a metadata table (dictionary) with the appropriate keys. Let's make it an **interpolation** parameterized Ensemble as well, so it will need: "pdf_name", "pdf_version", and "xvals". Note that for this to be a `Table-like` object, there are a few things we have to make sure to do:
1. All values must be iterable, so they all must be arrays
2. They must all have the same length, or first dimension. Since "xvals" will inherently have more values than any other value in the metadata here, we make it a 2D array, so the first dimension is 1. 

In [12]:
xvals = np.array([np.linspace(0, 5, 10)])  # 2D array with shape (1, 10)
new_meta = {"pdf_name": np.array(["interp".encode()]), "pdf_version": np.array([0]), "xvals": xvals}
new_meta

{'pdf_name': array([b'interp'], dtype='|S6'),
 'pdf_version': array([0]),
 'xvals': array([[0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
         2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ]])}
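
The two requirements listed above can be checked with a few lines of NumPy. This is a minimal sketch: `check_table_like` is a hypothetical helper written for this notebook, not a function provided by `qp` or `tables_io`.

```python
import numpy as np


def check_table_like(table):
    """Return the shared first-dimension length, or raise ValueError."""
    lengths = {}
    for key, value in table.items():
        arr = np.asarray(value)
        if arr.ndim == 0:
            raise ValueError(f"'{key}' is a scalar; wrap it in an array")
        lengths[key] = arr.shape[0]
    if len(set(lengths.values())) != 1:
        raise ValueError(f"first dimensions differ: {lengths}")
    return next(iter(lengths.values()))


good_meta = {
    "pdf_name": np.array([b"interp"]),
    "pdf_version": np.array([0]),
    "xvals": np.array([np.linspace(0, 5, 10)]),  # shape (1, 10)
}
print(check_table_like(good_meta))  # every column has first dimension 1

# A 1D xvals array has first dimension 10, so it no longer matches the others.
bad_meta = dict(good_meta, xvals=np.linspace(0, 5, 10))
try:
    check_table_like(bad_meta)
except ValueError as err:
    print("rejected:", err)
```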

So far, that looks like it matches the format of our Ensemble tables above. Now let's make a **data** table with the "yvals" for 3 distributions.

In [13]:
yvals = np.array([[0, 1, 2, 3, 4, 4, 3, 2, 1, 0],
                  [0, 0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0],
                  [0, 1, 2, 3, 4, 5, 4, 2, 1, 0]])
new_data = {"yvals": yvals}
new_data

{'yvals': array([[0. , 1. , 2. , 3. , 4. , 4. , 3. , 2. , 1. , 0. ],
        [0. , 0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0. ],
        [0. , 1. , 2. , 3. , 4. , 5. , 4. , 2. , 1. , 0. ]])}

Finally, let's make an **ancil** table with one "ids" value for each of the 3 distributions.

In [14]:
ancil = np.linspace(0,2,3)
new_ancil = {"ids": ancil}
new_ancil

{'ids': array([0., 1., 2.])}
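
Before writing anything to file, it can help to confirm that the three tables agree with each other. The conditions in this sketch (one row of "yvals" per distribution, one "xvals" entry per column of "yvals", one ancillary entry per distribution) are inferred from the shapes used in this notebook rather than taken from a documented `qp` requirement, and `check_interp_tables` is a hypothetical helper.

```python
import numpy as np


def check_interp_tables(meta, data, ancil):
    """Return (n_pdf, n_grid) if the three tables have mutually consistent shapes."""
    n_pdf, n_grid = data["yvals"].shape
    if meta["xvals"].shape != (1, n_grid):
        raise ValueError("xvals must have shape (1, n_grid)")
    for key, column in ancil.items():
        if len(column) != n_pdf:
            raise ValueError(f"ancil column '{key}' needs one entry per distribution")
    return n_pdf, n_grid


# Same values as the tables built in the cells above.
meta = {"xvals": np.array([np.linspace(0, 5, 10)])}
data = {"yvals": np.array([[0, 1, 2, 3, 4, 4, 3, 2, 1, 0],
                           [0, 0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1, 0],
                           [0, 1, 2, 3, 4, 5, 4, 2, 1, 0]])}
ancil = {"ids": np.linspace(0, 2, 3)}

shape = check_interp_tables(meta, data, ancil)
print(shape)  # (3, 10)
```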

### Using `tables_io`

Now we can use `tables_io` to write out the HDF5 file:

In [None]:
data_tables = {"meta": new_meta, "data": new_data, "ancil": new_ancil}
tables_io.write(data_tables, "../assets/new-interp-ensemble.hdf5")

'new-interp-ensemble.hdf5'

The tables have now been written to file, but is it a valid `qp` file? Let's check:

In [None]:
qp.is_qp_file("../assets/new-interp-ensemble.hdf5")

True

Yay! We've successfully created a `qp` file. Let's try reading it in as an Ensemble to make sure. 

In [None]:
new_ens = qp.read("../assets/new-interp-ensemble.hdf5")
new_ens

Ensemble(the_class=interp,shape=(3, 10))

### Using `h5py` 

We can also use `h5py` to write out the HDF5 file manually, like so:

In [None]:
# create the file object
f = h5py.File("../assets/h5py-interp-ensemble.hdf5", "w")

# create the necessary groups
meta_g = f.create_group("meta")
data_g = f.create_group("data")
ancil_g = f.create_group("ancil")

# populate the groups with the datasets
# (assigning an array to a key creates a dataset with that name)

# metadata
meta_g["pdf_name"] = new_meta["pdf_name"]
meta_g["pdf_version"] = new_meta["pdf_version"]
meta_g["xvals"] = new_meta["xvals"]

# data
data_g["yvals"] = new_data["yvals"]

# ancil
ancil_g["ids"] = new_ancil["ids"]

# make sure the file object has the right groups before closing
list(f.keys())

In [19]:
f.close()

Now let's check that this is also a valid `qp` file:

In [20]:
qp.is_qp_file("../assets/h5py-interp-ensemble.hdf5")

True
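
The manual steps above can also be wrapped in a loop, with a context manager so the file is closed even if a write fails. This is a sketch under the assumption that each table is a flat dictionary of NumPy arrays; `write_tables` is a hypothetical helper, and the output path is a temporary file rather than the notebook's assets directory.

```python
import os
import tempfile

import h5py
import numpy as np


def write_tables(path, tables):
    """Write {group: {dataset: array}} to an HDF5 file, one group per table."""
    with h5py.File(path, "w") as f:
        for group_name, table in tables.items():
            group = f.create_group(group_name)
            for key, values in table.items():
                group.create_dataset(key, data=values)


tables = {
    "meta": {
        "pdf_name": np.array([b"interp"]),
        "pdf_version": np.array([0]),
        "xvals": np.array([np.linspace(0, 5, 10)]),
    },
    "data": {"yvals": np.ones((3, 10))},  # placeholder values
    "ancil": {"ids": np.linspace(0, 2, 3)},
}

out_path = os.path.join(tempfile.mkdtemp(), "loop-interp-ensemble.hdf5")
write_tables(out_path, tables)

# Read the file back to confirm the groups and datasets were written.
with h5py.File(out_path, "r") as f:
    written = {name: sorted(group.keys()) for name, group in f.items()}
    yvals_back = f["data"]["yvals"][()]
print(written)
```

Whether a file written this way passes `qp.is_qp_file` depends on the tables having the keys that `qp` expects, as in the cells above.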