
# ROOT training: RDataFrame basics

## RDataFrame in a nutshell: from data to aggregations

<img src="images/rdf_overview.png">

## One high-level interface for many use cases

<img src="images/rdf_platforms.png">

## Checking file contents

In [1]:
import ROOT

treename = "dataset"
filename = "data/example_file.root" # this can be a list of file names, or a glob like `data/*.root`
df = ROOT.RDataFrame(treename, filename)
df.Display().Print()

Welcome to JupyROOT 6.27/01
+-----+------------+------------+------------+--------------+
| Row | a          | b          | vec1       | vec2         | 
+-----+------------+------------+------------+--------------+
| 0   | 0.97771140 | 0.99974175 | -3.22012f  | 0.894402f    | 
+-----+------------+------------+------------+--------------+
| 1   | 2.2802012  | 0.48497361 | -1.80835f  | 0.0800873f   | 
|     |            |            | 0.236065f  | 0.479906f    | 
|     |            |            | -3.97713f  | 0.519888f    | 
|     |            |            | -0.293643f | 0.317273f    | 
+-----+------------+------------+------------+--------------+
| 2   | 0.56348245 | 0.39231399 |            |              | 
+-----+------------+------------+------------+--------------+
| 3   | 3.0421559  | 0.33353925 | 0.727539f  | 0.796610f    | 
|     |            |            | -3.81258f  | 0.331128f    | 
|     |            |            | -2.87416f  | -0.00277938f | 
+-----+------------+------------

In [2]:
df.GetColumnType("vec1")

'ROOT::VecOps::RVec<float>'

## Defining new quantities, selecting events

- use `Define` to tell RDataFrame how to compute a new variable (on-demand, as needed, per event)
- use `Filter` to select events/rows based on boolean conditions

Here we pass simple C++ expressions as strings, but in general we can use arbitrarily complex functions, functors and lambda expressions.

In [3]:
full_count = df.Count()
# define a new column c, select some events based on the values of c
filtered_df = df.Define("c", "a+b").Filter("c < 0.5")
filtered_count = filtered_df.Count()
print("full count: ", full_count.GetValue())
print("filtered count: ", filtered_count.GetValue())

full count:  2000
filtered count:  111
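
To make the per-event semantics of the `Define`/`Filter` chain above concrete, here is a plain-Python sketch that needs no ROOT. It only mimics the behaviour on four rows copied (and rounded) from the `Display` output shown earlier. The helper names `define_c` and `passes` and the threshold `1.0` are made up for illustration and are not part of RDataFrame.

```python
# Rows 0-3 of the example dataset, rounded from the Display output above.
rows = [
    {"a": 0.9777, "b": 0.9997},
    {"a": 2.2802, "b": 0.4850},
    {"a": 0.5635, "b": 0.3923},
    {"a": 3.0422, "b": 0.3335},
]

def define_c(row):
    # Mimics Define("c", "a+b"): add a column computed from existing ones
    return {**row, "c": row["a"] + row["b"]}

def passes(row):
    # Mimics Filter("c < 1.0"): one boolean decision per row
    # (hypothetical threshold, chosen so that one of the four rows passes)
    return row["c"] < 1.0

selected = [r for r in map(define_c, rows) if passes(r)]
print(len(rows), len(selected))
```

Only the row with `a = 0.5635` survives the selection; the other three have `c` well above the threshold.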


## The RDataFrame workflow

1. build a dataframe object by specifying your dataset
2. apply transformations:
   * filter interesting events, or
   * define or redefine quantities (e.g. select some muons)
3. request results (e.g. histograms or new ROOT files)

### The golden rule
Book all transformations and request results _before you access any of the results_: this lets RDataFrame accumulate work and then produce all results at the same time, upon first access to any of them.
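
The idea behind the golden rule can be modelled with a toy lazy dataframe in plain Python. This is only a sketch of the booking mechanism, not how RDataFrame is implemented: `ToyFrame`, `ToyResult` and all their methods are hypothetical names invented for this example. Results are booked first, and the first access to any of them triggers a single loop that fills every result booked so far.

```python
class ToyFrame:
    """Toy stand-in for a lazy dataframe; not ROOT's implementation."""

    def __init__(self, values):
        self.values = list(values)
        self.n_runs = 0        # how many times the data was looped over
        self._pending = []     # results booked but not yet computed

    def _book(self, update, init):
        result = ToyResult(self, init, update)
        self._pending.append(result)
        return result

    def count(self):
        return self._book(lambda acc, x: acc + 1, 0)

    def sum(self):
        return self._book(lambda acc, x: acc + x, 0.0)

    def run(self):
        # One loop over the data fills every pending result
        pending, self._pending = self._pending, []
        for x in self.values:
            for r in pending:
                r.acc = r.update(r.acc, x)
        for r in pending:
            r.ready = True
        self.n_runs += 1


class ToyResult:
    def __init__(self, frame, init, update):
        self.frame, self.acc, self.update, self.ready = frame, init, update, False

    def get_value(self):
        if not self.ready:     # first access triggers the loop
            self.frame.run()
        return self.acc


# Good: book both results, then access them (one loop)
good = ToyFrame([1.0, 2.0, 3.0])
n, s = good.count(), good.sum()
print(n.get_value(), s.get_value(), good.n_runs)

# Bad: access the first result before booking the second (two loops)
bad = ToyFrame([1.0, 2.0, 3.0])
n = bad.count()
n.get_value()
s = bad.sum()
s.get_value()
print(bad.n_runs)
```

In the toy model the "good" ordering loops over the data once and the "bad" ordering loops twice, which is the same effect `GetNRuns()` reports for the real RDataFrame.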

## Good and bad workflow examples

In [4]:
# Good: RDataFrame can produce both results in one loop over the data
print("n. runs before: ", df.GetNRuns())
count = df.Count()
mean = df.Mean("a")
print("count: ", count.GetValue())
print(f"mean: {mean.GetValue():.3}")
print("n. runs after: ", df.GetNRuns())

n. runs before:  2
count:  2000
mean: 13.9
n. runs after:  3


In [5]:
# Bad: RDataFrame must loop over data twice
print("n. runs before: ", df.GetNRuns())
count = df.Count()
print("count: ", count.GetValue())
mean = df.Mean("a")
print(f"mean={mean.GetValue():.3}")
print("n. runs after: ", df.GetNRuns())

n. runs before:  3
count:  2000
mean=13.9
n. runs after:  5


## Filling histograms

In [6]:
%jsroot on
c = ROOT.TCanvas()
# Histo2D, Histo3D and HistoND are also available 
# A weight variable can be passed as an extra argument
h = df.Histo1D("vec1") 
h.Draw()
c.Draw()

## Histograms with model and weight

In [7]:
histo_name = "myhist"
histo_title = "My Weighted Histogram"
nbinsx = 100
xlow = -6
xup = 6

# The traditional TH1D constructor
# ROOT.TH1D(histo_name, histo_title, nbinsx, xlow, xup)

# With RDataFrame
c = ROOT.TCanvas()
h = df.Define("weight", "a*b")\
      .Histo1D((histo_name, histo_title, nbinsx, xlow, xup), "vec1", "weight")
h.Draw("HIST")
c.Draw()

## Working with collections and object selections

RDataFrame reads collections as the special type `ROOT::RVec`.
`RVec` is a container similar to `std::vector` (and can be used just like a `std::vector`) but it also offers a rich interface to operate on the array elements in a vectorised fashion, similarly to Python's NumPy arrays.

In [8]:
df.Define("good_v1", "vec1[vec2 > 0]").Display(["good_v1", "vec1", "vec2"]).Print()

+-----+------------+------------+--------------+
| Row | good_v1    | vec1       | vec2         | 
+-----+------------+------------+--------------+
| 0   | -3.22012f  | -3.22012f  | 0.894402f    | 
+-----+------------+------------+--------------+
| 1   | -1.80835f  | -1.80835f  | 0.0800873f   | 
|     | 0.236065f  | 0.236065f  | 0.479906f    | 
|     | -3.97713f  | -3.97713f  | 0.519888f    | 
|     | -0.293643f | -0.293643f | 0.317273f    | 
+-----+------------+------------+--------------+
| 2   |            |            |              | 
+-----+------------+------------+--------------+
| 3   | 0.727539f  | 0.727539f  | 0.796610f    | 
|     | -3.81258f  | -3.81258f  | 0.331128f    | 
|     |            | -2.87416f  | -0.00277938f | 
+-----+------------+------------+--------------+
| 4   | -4.70625f  | -4.70625f  | 0.427770f    | 
|     | 0.0288365f | -4.44909f  | -0.800848f   | 
|     |            | 0.0288365f | 0.398534f    | 
+-----+------------+------------+--------------+
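
The expression `vec1[vec2 > 0]` combines two steps: an element-wise comparison that produces a boolean mask, and a selection of the elements of `vec1` where the mask is true. The plain-Python sketch below spells these steps out for row 3 of the output above. It uses ordinary lists rather than `RVec`, so it illustrates only the semantics, not the `RVec` API.

```python
# Row 3 of the example dataset, copied from the Display output above
vec1 = [0.727539, -3.81258, -2.87416]
vec2 = [0.796610, 0.331128, -0.00277938]

# Step 1: `vec2 > 0` yields one boolean per element (a mask)
mask = [x > 0 for x in vec2]

# Step 2: `vec1[mask]` keeps the elements of vec1 where the mask is true
good_v1 = [v for v, keep in zip(vec1, mask) if keep]
print(good_v1)
```

The third element is dropped because its `vec2` partner is negative, matching the `good_v1` column in row 3 of the table.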


## Writing new ROOT datasets

With RDataFrame, you can read your dataset, add new columns with processed values and finally use `Snapshot` to save the resulting data to a ROOT file in `TTree` format.

In [9]:
df = ROOT.RDataFrame("dataset","data/example_file.root")
df = df.Define("c", "a+b")\
       .Filter("c > 0.5")
df.Snapshot(treename="outtree", filename="newfile.root", columnList=["a","b","c"]);

In [10]:
%%bash
rootls -lt newfile.root

TTree  Sep 12 08:22 2022 outtree;1 "outtree" 
  a  "a/D"  15183
  b  "b/D"  15183
  c  "c/D"  15183
  Cluster INCLUSIVE ranges:
   - # 0: [0, 1888]
  The total number of clusters is 1


## Using all your CPU cores with multi-thread RDataFrame
Call `ROOT.EnableImplicitMT()` before constructing the RDataFrame object to turn on multi-thread execution.

All RDataFrame operations such as `Histo1D` or `Snapshot` will run in parallel over all available cores.

User-defined expressions, such as strings or lambdas passed to `Filter` and `Define`, have to be thread-safe, i.e. it must be possible to call them concurrently from different threads.

In [11]:
df = ROOT.RDataFrame("dataset","data/example_file.root") # a single-thread RDataFrame
print(df.GetNSlots())
ROOT.EnableImplicitMT()
df = ROOT.RDataFrame("dataset","data/example_file.root") # a multi-thread RDataFrame
print(df.GetNSlots())

1
16
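
As a rough illustration of what thread safety means for user code, the plain-Python sketch below (standard library only, no ROOT) lets several workers process disjoint parts of the data. Each worker writes only to its own entry of `partial_counts`, so no state is shared between threads, and the partial results are merged once all workers have finished. The names `n_slots`, `process_chunk` and `partial_counts` are made up for this example, and RDataFrame's internals may well differ.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1000))
n_slots = 4                          # hypothetical number of worker slots
partial_counts = [0] * n_slots       # one accumulator per slot, never shared

def process_chunk(slot):
    # Each slot handles its own slice of the data
    # and touches only its own accumulator
    for x in data[slot::n_slots]:
        if x % 2 == 0:               # stand-in for a Filter condition
            partial_counts[slot] += 1

with ThreadPoolExecutor(max_workers=n_slots) as pool:
    # Consume the iterator so that all slots have finished before merging
    list(pool.map(process_chunk, range(n_slots)))

total = sum(partial_counts)          # merge step
print(total)
```

A function that instead incremented one counter shared by all workers would not be safe to call concurrently without extra synchronisation.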


## Dataframe transformations

| Method   | Description                                           |
|----------|-------------------------------------------------------|
| Alias    | Introduce an alias for an existing column.            |
| Define   | Create a new quantity, computed on-demand per event.  |
| Filter   | Select events/rows.                                   |
| Redefine | Change the value and/or type of an existing column.   |
| Vary     | Define systematic variations for one or more columns. |
| ...      | ...                                                   |

## Data aggregations

| Method                        | Description                                                                 |
|-------------------------------|-----------------------------------------------------------------------------|
| Count                         | Return the number of events processed.                                      |
| Display                       | Return a printable object representing the dataset contents.                |
| Graph                         | Fill a TGraph with given column values.                                     |
| Histo1D, Histo2D, Histo3D     | Fill a one-/two-/three-dimensional histogram with given values and weights. |
| Max, Min, Mean, StdDev, Stats | Return common statistics for processed column values.                       |
| Report                        | Return a cut-flow report with stats about applied Filters.                  |
| Snapshot                      | Write the processed dataset to a new TTree.                                 |
| ...                           | ...                                                                         |

## Queries

These operations do not modify the dataframe or book computations but simply return information on the RDataFrame object.

| Method         | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
| GetColumnNames | Return the names of all the available columns of the dataset.                         |
| GetColumnType  | Return the type of a given column as a string.                                        |
| SaveGraph      | Export the computation graph of an RDataFrame in graphviz format for easy inspection. |
| ...            | ...                                                                                   |

## Time for a stretch! And an RDataFrame exercise!