# Example Usage

This notebook demonstrates how to use this package.
The main features are:

- A `Cuts` class that can be used to select jets.
- A set of `Flavours` defining common jet flavours.
- An `H5Reader` class allowing for batched reading of jets across multiple files.
- An `H5Writer` class allowing for batched writing of jets.

We can start by getting some dummy data to work with.

In [None]:
from ftag.hdf5 import get_dummy_file

fname, f = get_dummy_file()
jets = f["jets"]

### Cuts

The `Cuts` class provides an interface for applying selections to structured NumPy arrays loaded from HDF5 files.
To take a look, first import `Cuts`:


In [None]:
from ftag import Cuts

Instances of `Cuts` can be defined from lists of strings or from tuples of strings and values. For example:

In [None]:
kinematic_cuts = Cuts.from_list(["pt > 20e3", "abs_eta < 2.5"])
flavour_cuts = Cuts.from_list([("HadronConeExclTruthLabelID", "==", 5)])

Cuts can be combined by adding them together:

In [None]:
combined_cuts = kinematic_cuts + flavour_cuts
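
Conceptually, adding two sets of cuts means a jet has to pass every one of them. The sketch below reproduces that idea with plain NumPy boolean masks on a toy structured array. It illustrates the concept and is not the package's implementation; the array `toy_jets` and its values are invented for the example.

```python
import numpy as np

# Toy structured array standing in for the jets dataset (values are invented).
toy_jets = np.array(
    [(25e3, 1.0, 5), (15e3, 0.5, 5), (40e3, 3.0, 5), (30e3, 2.0, 4)],
    dtype=[("pt", "f4"), ("abs_eta", "f4"), ("HadronConeExclTruthLabelID", "i4")],
)

# One boolean mask per group of cuts.
kinematic_mask = (toy_jets["pt"] > 20e3) & (toy_jets["abs_eta"] < 2.5)
flavour_mask = toy_jets["HadronConeExclTruthLabelID"] == 5

# Combining cuts corresponds to a logical AND of the masks.
combined_mask = kinematic_mask & flavour_mask
toy_idx = np.flatnonzero(combined_mask)  # indices of the passing jets
toy_selected = toy_jets[toy_idx]         # the passing jets themselves
```

Only the first toy jet passes both the kinematic and the flavour selection here.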

Then apply them to a structured array by calling the `Cuts` instance:

In [None]:
idx, selected_jets = combined_cuts(jets)

Both the selected indices and the selected jets are returned. The indices can be used to reapply the same selection to another array (e.g. tracks). The return values `idx` and `values` can also be accessed by name:

In [None]:
# Apply the cuts once and read both fields from the result
result = combined_cuts(jets)
idx = result.idx
selected_jets = result.values
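
To make the point about reusing indices concrete, here is a minimal NumPy sketch. It assumes the tracks array has one row per jet, so its leading dimension matches the jets array; the toy arrays and their values are invented.

```python
import numpy as np

# Toy jets and a per-jet tracks array with the same leading dimension
# (4 jets, up to 3 tracks each); values are invented for illustration.
toy_jets = np.array([(25e3,), (15e3,), (40e3,), (10e3,)], dtype=[("pt", "f4")])
toy_tracks = np.zeros((4, 3), dtype=[("deta", "f4"), ("dphi", "f4")])
toy_tracks["deta"] = np.arange(12, dtype="f4").reshape(4, 3)

# Indices of the jets passing a pt > 20 GeV selection (pt stored in MeV).
toy_idx = np.flatnonzero(toy_jets["pt"] > 20e3)

# The same indices pick out the matching rows of the tracks array.
selected_tracks = toy_tracks[toy_idx]
```

Because the selection is expressed as indices, the jets and their tracks stay aligned after the cut.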

### Flavours

A set of predefined flavours is provided. Each one can be accessed as an attribute of `Flavours`:

In [None]:
from ftag import Flavours

Flavours.bjets

Flavour(name='bjets', label='$b$-jets', cuts=['HadronConeExclTruthLabelID == 5'], colour='#1f77b4')

`dict`-like access is also supported:

In [None]:
Flavours["qcd"]

Flavour(name='qcd', label='QCD', cuts=['R10TruthLabel_R22v1 == 10'], colour='#38761D')

As you can see from the output, each flavour has a `name`, a `label` and `colour` (used for plotting), and a `Cuts` instance, which can be used to select jets of the given flavour.
For example:

In [None]:
bjets = Flavours.bjets.cuts(jets).values

Probability names are also accessible using `.px`:

In [None]:
[f.px for f in Flavours]

['pb', 'pc', 'pu', 'ptau', 'phbb', 'phcc', 'ptop', 'pqcd']
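
As a rough mental model, a flavour can be pictured as a small record with a derived probability name. The class `ToyFlavour` below is hypothetical and is not the package's own class. The naming rule in `px` is an assumption inferred from the outputs shown above (`bjets` gives `pb`, `qcd` gives `pqcd`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFlavour:
    """Hypothetical stand-in for a flavour definition."""

    name: str
    label: str
    cuts: tuple
    colour: str

    @property
    def px(self) -> str:
        # Assumed convention: "p" followed by the name without a trailing "jets".
        stem = self.name[: -len("jets")] if self.name.endswith("jets") else self.name
        return f"p{stem}"


toy_flavours = [
    ToyFlavour("bjets", "$b$-jets", ("HadronConeExclTruthLabelID == 5",), "#1f77b4"),
    ToyFlavour("qcd", "QCD", ("R10TruthLabel_R22v1 == 10",), "#38761D"),
]
toy_px = [flav.px for flav in toy_flavours]
```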

### H5Reader

The `H5Reader` class allows you to read batches of jets from one or more HDF5 files.

- Variables are specified as `dict[str, list[str]]`.
- By default the reader will randomly access chunks in the file, giving you a weakly shuffled set of jets.

For example, to load 300 jets using three batches of size 100:


In [None]:
from ftag.hdf5 import H5Reader

reader = H5Reader(fname, batch_size=100)
data = reader.load({"jets": ["pt", "eta"]}, num_jets=300)
len(data["jets"])

300
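
One way to picture a "weakly shuffled" read is to visit contiguous chunks in a random order while keeping the order inside each chunk. The NumPy sketch below illustrates only that idea, with an index array standing in for the file. The reader's actual strategy may differ, for example it may also shuffle within each batch.

```python
import numpy as np

rng = np.random.default_rng(42)
n_jets, batch_size = 1000, 100
jet_index = np.arange(n_jets)  # stands in for the jets stored on disk

# Pick three chunk start positions in a random order...
starts = rng.permutation(np.arange(0, n_jets, batch_size))[:3]

# ...and read each chunk as one contiguous slice.
loaded = np.concatenate([jet_index[s : s + batch_size] for s in starts])
```

Contiguous slices are cheap to read from HDF5, which is why shuffling at the chunk level is a common compromise between randomness and read speed.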

To transparently load jets across several files, `fname` can also be a pattern including wildcards (`*`).
Behind the scenes files are globbed and merged into a [virtual dataset](https://docs.h5py.org/en/stable/vds.html).
So the following also works:

In [None]:
from pathlib import Path

sample_dir = Path(fname).parent
reader = H5Reader(sample_dir / "*.h5", batch_size=100)
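
For readers unfamiliar with wildcard patterns, the following standard-library sketch shows how a pattern such as `*.h5` expands to a list of matching files. It uses empty placeholder files in a temporary directory and does not build a virtual dataset; the file names are invented.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Create empty placeholder files (names are invented for illustration).
    for name in ["sample_a.h5", "sample_b.h5", "notes.txt"]:
        (tmp_dir / name).touch()

    # Expand the wildcard pattern into a sorted list of matching file names.
    pattern = tmp_dir / "*.h5"
    matched = sorted(p.name for p in pattern.parent.glob(pattern.name))
```

Only the two `.h5` files match; `notes.txt` is ignored.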

If you have globbed several files, you can easily get the total number of jets across all files with

In [None]:
reader.num_jets

4000

You can also load tracks alongside jets (or by themselves) by specifying an additional entry in the `variables` dict:

In [None]:
data = reader.load({"jets": ["pt", "eta"], "tracks": ["deta", "dphi"]}, num_jets=300)
data["tracks"].dtype

dtype([('deta', '<f4'), ('dphi', '<f4')])

You can apply cuts to the jets as they are loaded. For example, to load 1000 jets which satisfy $p_T > 20$ GeV:

In [None]:
data = reader.load({"jets": ["pt"]}, num_jets=1000, cuts=Cuts.from_list(["pt > 20e3"]))
assert data["jets"]["pt"].min() > 20e3
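
Loading a fixed number of jets that pass a cut generally means reading more jets than are returned. The sketch below shows one plausible way to do this with plain NumPy: read batch by batch, keep the passing jets, and stop once the target is reached. It is an illustration under that assumption, not the reader's actual code, and the toy `all_pt` values are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy pt values in MeV standing in for a file on disk.
all_pt = rng.uniform(0, 100e3, size=5000).astype("f4")

target, batch_size = 1000, 100
kept, n_kept = [], 0
for start in range(0, len(all_pt), batch_size):
    batch = all_pt[start : start + batch_size]
    passing = batch[batch > 20e3]  # apply the cut to this batch only
    kept.append(passing)
    n_kept += len(passing)
    if n_kept >= target:
        break  # enough jets collected, stop reading

# Trim any surplus from the last batch.
selected_pt = np.concatenate(kept)[:target]
```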

Rather than return a single `dict` of arrays, the reader can also return a generator of batches.
This is useful when you want to work with a large number of jets, but don't want to load them all into memory at once.

In [None]:
reader = H5Reader(fname, batch_size=100)
stream = reader.stream({"jets": ["pt", "eta"]}, num_jets=300)
for batch in stream:
    jets = batch["jets"]
    # do processing on batch...
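
As an example of per-batch processing, the sketch below accumulates a running mean of `pt` without holding all jets in memory at once. The generator `toy_stream` is a hypothetical stand-in that yields `dict` batches in the same shape as above; it is not part of the package.

```python
import numpy as np


def toy_stream(array, batch_size):
    """Yield dict batches of a structured array (hypothetical helper)."""
    for start in range(0, len(array), batch_size):
        yield {"jets": array[start : start + batch_size]}


toy_jets = np.zeros(300, dtype=[("pt", "f4"), ("eta", "f4")])
toy_jets["pt"] = np.linspace(10e3, 100e3, 300)

# Accumulate a sum and a count so only one batch is needed at a time.
pt_sum, n_seen = 0.0, 0
for batch in toy_stream(toy_jets, batch_size=100):
    pt_sum += float(batch["jets"]["pt"].sum(dtype="f8"))
    n_seen += len(batch["jets"])
mean_pt = pt_sum / n_seen
```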

### H5Writer


The `H5Writer` class complements the reader class by allowing you to easily write batches of jets to a target file.

In [None]:
from ftag.hdf5 import H5Writer
from tempfile import NamedTemporaryFile

out_fname = NamedTemporaryFile(suffix=".h5").name
variables = {"jets": reader.get_dtype("jets").names}
writer = H5Writer(
    src=fname,
    dst=out_fname,
    variables=variables,
    num_jets=1000,
    shuffle=False,
)

To write jets in batches to the output file, you can use the `write` method:

In [None]:
reader = H5Reader(fname, batch_size=100, shuffle=False)
stream = reader.stream(variables, num_jets=1000)
for batch in stream:
    writer.write(batch)
writer.close()

When you are finished, you need to close the file manually using `H5Writer.close()`, as done above.
The two files will now have the same contents (since we disabled shuffling):

In [None]:
import h5py
assert (h5py.File(fname)["jets"][:] == h5py.File(out_fname)["jets"][:]).all()
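
The batched-write pattern can be pictured as filling a preallocated output with a moving offset. The class `ToyWriter` below is a hypothetical in-memory stand-in written for illustration; it says nothing about how `H5Writer` is implemented internally.

```python
import numpy as np


class ToyWriter:
    """Hypothetical in-memory stand-in for a batched writer."""

    def __init__(self, dtype, num_jets):
        self.out = np.zeros(num_jets, dtype=dtype)  # reserve space up front
        self.offset = 0                             # next free position

    def write(self, batch):
        end = self.offset + len(batch)
        if end > len(self.out):
            raise ValueError("more jets written than were reserved")
        self.out[self.offset : end] = batch
        self.offset = end


source = np.zeros(1000, dtype=[("pt", "f4"), ("eta", "f4")])
source["pt"] = np.arange(1000)

# Write the source in order, 100 jets at a time.
toy_writer = ToyWriter(source.dtype, num_jets=1000)
for start in range(0, len(source), 100):
    toy_writer.write(source[start : start + 100])
```

Since the batches are written in order without shuffling, the output matches the source element by element, mirroring the check above.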