# The HEP dataset

High Energy Physics data consists of many statistically independent collision events. Laying the data out as an "event class" and then serialising and writing `N` instances of that class into a file would be very inefficient. In ROOT, a dataset is instead organised in columns that can store elements of any C++ type:
* fundamental types: `int`, `float`
* C++ standard collections: `std::vector`, `std::map`
* user-defined C++ classes

The ROOT dataset is represented by the `TTree` class and is often simply called a tree. Columns in the dataset are instances of the `TBranch` class and are also called branches.

<center><img src="images/rdf_2.png"></center>

The ROOT format is logically and physically (on disk) a columnar format. Different columns can be read independently from disk. When an analysis needs only a subset of the columns, this translates into faster I/O performance with respect to other dataset formats (HDF5, SQL).
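
To make the difference between the two layouts concrete, here is a toy in-memory sketch in plain Python. It is not how ROOT stores data on disk, and the field names and values are made up for illustration.

```python
# Toy illustration of row-wise versus columnar layout.
# Field names and values are hypothetical.

# Row-wise: one record per event, all fields stored together.
events_as_rows = [
    {"nMuon": 2, "met": 31.5, "run": 1},
    {"nMuon": 0, "met": 12.0, "run": 1},
    {"nMuon": 1, "met": 48.2, "run": 2},
]

# Columnar: one contiguous sequence per field.
events_as_columns = {
    "nMuon": [2, 0, 1],
    "met": [31.5, 12.0, 48.2],
    "run": [1, 1, 2],
}

# Summing one field from rows has to walk through every full record ...
total_from_rows = sum(event["nMuon"] for event in events_as_rows)

# ... while the columnar layout only touches the single column it needs.
total_from_columns = sum(events_as_columns["nMuon"])

print(total_from_rows, total_from_columns)
```

Both sums agree. The point is only that the columnar version never looks at `met` or `run`.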

# HEP data analysis with RDataFrame
RDataFrame allows reading and writing trees, aiming at making HEP analysis easy to write and fast to perform.

In [None]:
import ROOT

treename = "myDataset"
filename = "https://github.com/root-project/root/raw/master/tutorials/dataframe/df017_vecOpsHEP.root"
df = ROOT.RDataFrame(treename, filename)

print(f"Columns in the dataset: {df.GetColumnNames()}")

# Working with collections and object selections

RDataFrame reads collections as the special type [ROOT::VecOps::RVec](https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html) (e.g. a branch containing an array of floating point numbers can be read as a `ROOT::VecOps::RVec<float>`). C-style arrays (with variable or static size), `std::vector`s and most other collection types can be read this way. When reading ROOT data, column values of type `ROOT::VecOps::RVec<T>` perform no copy of the underlying array.

ROOT::VecOps::RVec is a container similar to std::vector (and can be used just like a std::vector) but it also offers a rich interface to operate on the array elements in a vectorised fashion, similarly to Python's NumPy arrays.

In [None]:
npy_dict = df.AsNumpy(["E"])

for row, vec in enumerate(npy_dict["E"]):
    print(f"\nRow {row} contains:\n{vec}")

### Define a new column with operations on RVecs

In [None]:
df1 = df.Define("good_pt", "sqrt(px*px + py*py)[E>100]")

`sqrt(px*px + py*py)[E>100]`:
* `px`, `py` and `E` are columns whose elements are `RVec`s.
* Operations on `RVec`s like sum, product, sqrt preserve the dimensions of the vector.
* The `operator[]` selects elements of the RVec that pass a certain mask.
* `E > 100` returns a mask, that is a vector representing values that pass the selection (e.g. `[0, 1, 0, 0]`)
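
The same steps can be mimicked with plain Python lists to see what each piece of the expression does. This is only a sketch of the semantics, not how `RVec` is implemented, and the values for one event are invented.

```python
import math

# Hypothetical values for one event with four particles.
px = [3.0, 6.0, 5.0, 8.0]
py = [4.0, 8.0, 12.0, 15.0]
E = [150.0, 40.0, 101.0, 90.0]

# Element-wise sqrt(px*px + py*py): the result has the same length as the inputs.
pt = [math.sqrt(x * x + y * y) for x, y in zip(px, py)]

# E > 100 evaluated element by element gives a mask of 0s and 1s.
mask = [int(e > 100) for e in E]

# Indexing with the mask keeps only the elements whose mask entry is 1.
good_pt = [p for p, keep in zip(pt, mask) if keep]

print(mask)     # [1, 0, 1, 0]
print(good_pt)  # [5.0, 13.0]
```

With `RVec` the three steps collapse into the single expression shown above, evaluated once per event.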

### Now we can plot the newly defined column values in a histogram

In [None]:
c = ROOT.TCanvas()
h = df1.Histo1D(("pt", "pt", 16, 0, 4), "good_pt")
h.Draw()
c.Draw()

# Save dataset to ROOT file after processing

With RDataFrame, you can read your dataset, add new columns with processed values and finally use `Snapshot` to save the resulting data to a ROOT file in TTree format.

In [None]:
out_treename = "outtree"
out_filename = "outtree.root"
snapdf = df1.Snapshot(out_treename, out_filename)

# Result of a Snapshot is still an RDataFrame that can be further used
snapdf.Display().Print()

# Cutflow reports
Filters applied to the dataset can be given a name. The `Report` method will gather information about filter efficiency and show the data flow between subsequent cuts on the original dataset.
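
As a rough picture of what a cutflow counts, the sketch below applies two named cuts in sequence to a handful of invented events and records, for each cut, how many events reached it and how many passed. The helper `cutflow` and the event values are hypothetical and only meant to illustrate the bookkeeping; they are not part of ROOT.

```python
# Hypothetical per-event values; in the real dataset these would be columns.
events = [
    {"lepton_eta": 0.5, "lepton_phi": 0.2},
    {"lepton_eta": -1.1, "lepton_phi": 0.4},
    {"lepton_eta": 1.7, "lepton_phi": 2.5},
    {"lepton_eta": 0.3, "lepton_phi": -0.9},
]

# Named cuts applied one after the other, like a chain of Filter calls.
cuts = [
    ("Lepton eta cut", lambda ev: ev["lepton_eta"] > 0),
    ("Lepton phi cut", lambda ev: ev["lepton_phi"] < 1),
]


def cutflow(events, cuts):
    """Return (name, n_in, n_pass, efficiency_percent) for each cut in order."""
    rows = []
    surviving = events
    for name, passes in cuts:
        n_in = len(surviving)
        # Only events that passed all previous cuts reach this one.
        surviving = [ev for ev in surviving if passes(ev)]
        n_pass = len(surviving)
        eff = 100.0 * n_pass / n_in if n_in else 0.0
        rows.append((name, n_in, n_pass, eff))
    return rows


report = cutflow(events, cuts)
for name, n_in, n_pass, eff in report:
    print(f"{name}: pass={n_pass} all={n_in} eff={eff:.2f} %")
```

Note that the second cut only sees the events that survived the first one, which is what makes the report a flow rather than a set of independent efficiencies.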


In [None]:
df = ROOT.RDataFrame("sig_tree", "https://root.cern/files/Higgs_data.root")

filter1 = df.Filter("lepton_eta > 0", "Lepton eta cut")
filter2 = filter1.Filter("lepton_phi < 1", "Lepton phi cut")

rep = df.Report()
rep.Print()

# Using C++ functions in Python
We still want to steer complex operations from Python, but plain Python code tends to be slow and is not thread-safe. Instead, you can inject C++ functions at runtime and let them do the work in your event loop. This mechanism uses cling, the C++ interpreter shipped with ROOT, which makes this possible in a single line of code.

In [None]:
ROOT.gInterpreter.Declare(
    """
    float getfloatvalue(unsigned long long entrynumber){
        return entrynumber;
    }
    
    float squareval(float val){
        return val * val;
    }
    """)

In [None]:
# Create a new RDataFrame from scratch with 100 consecutive entries
df = ROOT.RDataFrame(100)

# Create a new column using the previously declared C++ functions
df1 = df.Define("a", "getfloatvalue(rdfentry_)")
df2 = df1.Define("b", "squareval(a)")

# Show the two columns created in a graph
c = ROOT.TCanvas()
graph = df2.Graph("a","b")
graph.SetMarkerStyle(20)
graph.SetMarkerSize(0.5)
graph.SetMarkerColor(ROOT.kBlue)
graph.SetTitle("My graph")
graph.Draw("AP")
c.Draw()


# Using all cores of your machine with multithreaded RDataFrame
RDataFrame can transparently perform multi-threaded event loops to speed up the execution of its actions. Users have to call `ROOT::EnableImplicitMT()` before constructing the RDataFrame object to indicate that it should take advantage of a pool of worker threads. Each worker thread processes a distinct subset of entries, and their partial results are merged before returning the final values to the user.

RDataFrame operations such as `Histo1D` or `Snapshot` are guaranteed to work correctly in multi-threaded event loops. User-defined expressions, such as strings or lambdas passed to `Filter`, `Define`, `Foreach`, `Reduce` or `Aggregate`, must be thread-safe, i.e. it should be possible to call them concurrently from different threads.
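
The split-and-merge pattern described above can be sketched with Python's standard thread pool. This is an illustration of the idea only, not ROOT's scheduler: the helper names and the data are hypothetical, and plain Python threads will not give the speed-up that ROOT's C++ event loop does.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical column of per-event muon counts.
n_muon = [i % 4 for i in range(1000)]


def split_ranges(n_entries, n_workers):
    """Split [0, n_entries) into contiguous, non-overlapping ranges."""
    step = -(-n_entries // n_workers)  # ceiling division
    return [(start, min(start + step, n_entries))
            for start in range(0, n_entries, step)]


def partial_sum(entry_range):
    """Sum one range of entries; it modifies no shared state, so it is thread-safe."""
    start, stop = entry_range
    return sum(n_muon[start:stop])


ranges = split_ranges(len(n_muon), n_workers=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each worker processes a distinct subset of entries.
    partials = list(pool.map(partial_sum, ranges))

# Merge the per-worker partial results into the final value.
total = sum(partials)
print(ranges, partials, total)
```

`partial_sum` only reads the shared list and returns a new value, which is the property user-defined expressions need in order to be safe in a multi-threaded event loop.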

In [None]:
%%time
# Get a first baseline measurement

treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)

df.Sum("nMuon").GetValue()

In [None]:
%%time
# Activate multithreading capabilities
# By default takes all available cores on the machine
ROOT.EnableImplicitMT()

treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)

df.Sum("nMuon").GetValue()