# Working with collections and object selections

RDataFrame reads collections as the special type [ROOT::RVec](https://root.cern/doc/master/classROOT_1_1VecOps_1_1RVec.html) (e.g. a branch containing an array of floating point numbers can be read as a `ROOT::RVec<float>`). C-style arrays (with variable or static size), `std::vector`s and most other collection types can be read this way. When reading ROOT data, column values of type `ROOT::RVec<T>` perform no copy of the underlying array.

`RVec` is a container similar to `std::vector` (and can be used just like a `std::vector`) but it also offers a rich interface to operate on the array elements in a vectorised fashion, similarly to Python's NumPy arrays.

In [None]:
import ROOT

treename = "myDataset"
filename = "https://github.com/root-project/root/raw/master/tutorials/dataframe/df017_vecOpsHEP.root"
df = ROOT.RDataFrame(treename, filename)

print(f"Columns in the dataset: {df.GetColumnNames()}")

Let's now export the data as a dictionary of NumPy arrays to quickly inspect it. For each row, `E` is an array of values:

In [None]:
npy_dict = df.AsNumpy(["E"])

for row, vec in enumerate(npy_dict["E"]):
    print(f"\nRow {row} contains:\n{vec}")

### Define a new column with operations on RVecs

In [None]:
df1 = df.Define("good_pt", "sqrt(px*px + py*py)[E>100]")

`sqrt(px*px + py*py)[E>100]`:
* `px`, `py` and `E` are columns the elements of which are `RVec`s
* Operations on `RVec`s like sum, product, sqrt preserve the dimensionality of the array
* `[E>100]` selects the elements of the array that satisfy the condition
* `E > 100`: boolean expressions on `RVec`s such as `E > 100` return a mask, that is an array with information on which values pass the selection (e.g. `[0, 1, 0, 0]` if only the second element satisfies the condition)
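To make the mask semantics concrete, here is a minimal pure-Python sketch of what the expression computes for a single row. The input values are made up for illustration and `select_good_pt` is a hypothetical helper, not part of ROOT; `RVec` performs the equivalent element-wise work in C++.

```python
import math

def select_good_pt(px, py, E, threshold=100.0):
    """Plain-Python model of `sqrt(px*px + py*py)[E>100]` for one row."""
    # Element-wise operations keep the array length unchanged
    pt = [math.sqrt(x * x + y * y) for x, y in zip(px, py)]
    # The comparison yields a mask with one entry per element
    mask = [e > threshold for e in E]
    # Indexing with the mask keeps only the elements where it is true
    return [p for p, keep in zip(pt, mask) if keep]

# Made-up values for one row with four particles
px = [3.0, 6.0, 0.0, 5.0]
py = [4.0, 8.0, 1.0, 12.0]
E = [50.0, 150.0, 20.0, 90.0]

good_pt = select_good_pt(px, py, E)
print(good_pt)  # [10.0]
```

Only the second particle has `E > 100`, so the mask is `[0, 1, 0, 0]` and a single value survives.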

### Now we can plot the newly defined column values in a histogram

In [None]:
c = ROOT.TCanvas()
h = df1.Histo1D(("pt", "pt", 16, 0, 4), "good_pt")
h.Draw()
c.Draw()

# Save dataset to ROOT file after processing

With RDataFrame, you can read your dataset, add new columns with processed values and finally use `Snapshot` to save the resulting data to a ROOT file in TTree format.

In [None]:
out_treename = "outtree"
out_filename = "outtree.root"
snapdf = df1.Snapshot(out_treename, out_filename)

# Result of a Snapshot is still an RDataFrame that can be further used
snapdf.Display().Print()

# Cutflow reports
Filters applied to the dataset can be given a name. The `Report` method will gather information about filter efficiency and show the data flow between subsequent cuts on the original dataset.


In [None]:
df = ROOT.RDataFrame("sig_tree", "https://root.cern/files/Higgs_data.root")

filter1 = df.Filter("lepton_eta > 0", "Lepton eta cut")
filter2 = filter1.Filter("lepton_phi < 1", "Lepton phi cut")

rep = df.Report()
rep.Print()
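The quantities in a cutflow report can be reproduced by hand. The sketch below models two consecutive cuts in plain Python on made-up values; `cutflow` is a hypothetical helper, not a ROOT function, and the printed layout is only illustrative. For each cut it records how many entries reached it, how many passed, the efficiency of that cut alone, and the cumulative efficiency relative to the full dataset.

```python
def cutflow(rows, cuts):
    """Apply named cuts in order and collect per-cut statistics."""
    total = len(rows)
    report = []
    for name, predicate in cuts:
        n_all = len(rows)  # entries reaching this cut
        rows = [r for r in rows if predicate(r)]
        n_pass = len(rows)  # entries surviving this cut
        eff = 100.0 * n_pass / n_all if n_all else 0.0
        cumulative = 100.0 * n_pass / total if total else 0.0
        report.append((name, n_pass, n_all, eff, cumulative))
    return report

# Made-up (lepton_eta, lepton_phi) pairs
rows = [(0.5, 0.2), (-1.0, 0.3), (1.2, 2.0), (0.1, -0.4)]
cuts = [
    ("Lepton eta cut", lambda r: r[0] > 0),
    ("Lepton phi cut", lambda r: r[1] < 1),
]

report = cutflow(rows, cuts)
for name, n_pass, n_all, eff, cumulative in report:
    print(f"{name}: pass={n_pass} all={n_all} eff={eff:.2f}% cumulative eff={cumulative:.2f}%")
```

Note that the second cut only sees the entries that survived the first one, which is why its "all" count is smaller than the dataset size.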

# Using C++ functions in Python
Complex operations written in plain Python tend to be slow and are not thread-safe, so you can instead inject C++ functions at runtime to do the work in your event loop. This mechanism uses cling, the C++ interpreter shipped with ROOT, which makes this possible in a single line of code. Let's start by defining a function that converts the RDataFrame entry numbers (stored in the special column `rdfentry_`) from `unsigned long long` to `float`.

In [None]:
ROOT.gInterpreter.ProcessLine(
"""
float getfloatvalue(unsigned long long entrynumber){
    return entrynumber;
}
""")

Then let's define another function that takes a `float` value and computes its square.

In [None]:
ROOT.gInterpreter.ProcessLine(
"""
float squareval(float val){
    return val * val;
}
""")

And now let's use these functions with RDataFrame! We start by creating an empty RDataFrame with 100 consecutive entries and defining new columns on it:

In [None]:
# Create a new RDataFrame from scratch with 100 consecutive entries
df = ROOT.RDataFrame(100)

# Create a new column using the previously declared C++ functions
df1 = df.Define("a", "getfloatvalue(rdfentry_)")
df2 = df1.Define("b", "squareval(a)")
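As a rough cross-check of what these two columns should contain, the following sketch models them in plain Python. It assumes a single-threaded run in which `rdfentry_` goes from 0 to 99, and it emulates the C++ `float` type with the standard `struct` module; `to_float32` is a hypothetical helper introduced only for this illustration.

```python
import struct

def to_float32(value):
    """Round a Python number to the nearest 32-bit float, like a C++ `float`."""
    return struct.unpack("f", struct.pack("f", value))[0]

# Model of columns "a" and "b" for entries 0..99
a = [to_float32(entry) for entry in range(100)]
b = [to_float32(x * x) for x in a]
print(a[:4], b[:4])  # [0.0, 1.0, 2.0, 3.0] [0.0, 1.0, 4.0, 9.0]

# Beyond 2**24 not every integer fits exactly in a 32-bit float
print(to_float32(2**24 + 1))  # 16777216.0
```

For 100 entries the conversion is exact, but the last line shows why converting entry numbers to `float` would lose precision on datasets with more than about 16.7 million entries.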

We can now plot the values of the columns in a graph:

In [None]:
# Show the two columns created in a graph
c = ROOT.TCanvas()
graph = df2.Graph("a","b")
graph.SetMarkerStyle(20)
graph.SetMarkerSize(0.5)
graph.SetMarkerColor(ROOT.kBlue)
graph.SetTitle("My graph")
graph.Draw("AP")
c.Draw()

# Using all cores of your machine with multi-threaded RDataFrame
RDataFrame can transparently perform multi-threaded event loops to speed up the execution of its actions. Users have to call `ROOT::EnableImplicitMT()` before constructing the RDataFrame object to indicate that it should take advantage of a pool of worker threads. Each worker thread processes a distinct subset of entries, and their partial results are merged before returning the final values to the user.

RDataFrame operations such as `Histo1D` or `Snapshot` are guaranteed to work correctly in multi-threaded event loops. User-defined expressions, such as strings or lambdas passed to `Filter`, `Define`, `Foreach`, `Reduce` or `Aggregate`, must be thread-safe, i.e. it must be possible to call them concurrently from different threads.
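The split-and-merge pattern described above can be sketched with the Python standard library. This is only a conceptual model with made-up data, not how ROOT schedules its work, and `partial_sum` and `parallel_sum` are hypothetical helpers. Python threads also do not speed up pure-Python arithmetic, so the sketch illustrates the structure, not the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(values, start, stop):
    """Sum one contiguous range of entries; touches no shared state."""
    return sum(values[start:stop])

def parallel_sum(values, n_workers=4):
    # Split the entry range into distinct, non-overlapping chunks
    chunk = max(1, -(-len(values) // n_workers))  # ceiling division
    ranges = [(i, min(i + chunk, len(values))) for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda r: partial_sum(values, *r), ranges)
        # Merge the partial results into the final value
        return sum(partials)

n_muon = [i % 5 for i in range(1000)]  # made-up per-event muon counts
total = parallel_sum(n_muon)
print(total)  # 2000
```

`partial_sum` is thread-safe because it only reads its input and returns a new value; a function that updated a shared counter without locking would not be.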

In [None]:
%%time
# Get a first baseline measurement

treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)

df.Sum("nMuon").GetValue()

In [None]:
%%time
# Activate multithreading capabilities
# By default takes all available cores on the machine
ROOT.EnableImplicitMT()

treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)

df.Sum("nMuon").GetValue()

# Disable implicit multithreading when done
ROOT.DisableImplicitMT()

# Working with `numpy` arrays

In [None]:
import numpy

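# Three columns, each holding 100 uniformly distributed random numbers
np_dict = {colname: numpy.random.rand(100) for colname in ["a","b","c"]}

# Build an RDataFrame whose columns are the NumPy arrays
df = ROOT.RDF.MakeNumpyDataFrame(np_dict)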

print(f"Columns in the RDataFrame: {df.GetColumnNames()}")

# Multiple concurrent RDataFrame runs

In [None]:
ROOT.EnableImplicitMT()
treename1 = "myDataset"
filename1 = "https://github.com/root-project/root/raw/master/tutorials/dataframe/df017_vecOpsHEP.root"
treename2 = "mydataset"
filename2 = "data/example_dataset.root"

df1 = ROOT.RDataFrame(treename1, filename1)
df2 = ROOT.RDataFrame(treename2, filename2)
h1 = df1.Histo1D("px")
h2 = df2.Histo1D("c")

# Run both event loops concurrently instead of one after the other
ROOT.RDF.RunGraphs((h1, h2))

In [None]:
c = ROOT.TCanvas()
h1.Draw()
c.Draw()

In [None]:
c = ROOT.TCanvas()
h2.Draw()
c.Draw()