# ROOT RDataFrame

RDataFrame is ROOT's declarative analysis interface. Users define their analysis as a sequence of operations to be performed on the data-frame object; the framework manages the loop over entries as well as low-level details such as I/O and parallelisation. RDataFrame provides methods for the most common operations required by ROOT analyses; at the same time, users can just as easily specify custom code that will be executed in the event loop.
<img src="images/rdf_1.png">

# HEP data analysis with RDataFrame
RDataFrame can read and write ROOT trees, and aims to make HEP analyses easy to write and fast to run.

In [None]:
import ROOT

treename = "mydataset"
filename = "data/example_dataset.root"
df = ROOT.RDataFrame(treename, filename)

print(f"Columns in the dataset: {df.GetColumnNames()}")

In [None]:
def1 = df.Define("e", "a+b")

fil1 = def1.Filter("e < 75")

count = fil1.Count()
mean = fil1.Mean("e")
display = fil1.Display()

print(f"Number of rows after filter: {count.GetValue()}")
print(f"Mean of column e after filter: {mean.GetValue()}")
print("Dataset contents:")
display.Print()

# Think about data-flow
RDataFrame is built with a modular and flexible workflow in mind, summarised as follows:

* build a data-frame object by specifying your data-set
* apply a series of transformations to your data
  * filter (e.g. apply some cuts) or
  * define a new column (e.g. the result of an expensive computation on columns)
* apply actions to the transformed data to produce results (e.g. fill a histogram)

### Important Note
Make sure to book all transformations and actions before you access the contents of any of the results: this lets RDataFrame accumulate work and then produce all results at the same time, upon first access to any of them.

In [None]:
# Keep only the events where column a is at least 50
df1 = df.Filter("a >= 50")

# Book in advance all Mean operations on the dataset columns
cols = df.GetColumnNames()
mean_ops = [df1.Mean(col) for col in cols]

# Ask for the result of one mean operation.
# RDataFrame will run the event loop once and fill every booked result
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")
print(f"First mean result is: {mean_ops[0].GetValue()}")
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")

In [None]:
# Print all results: the event loop will not run again
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")
for col, mean_op in zip(cols, mean_ops):
    print(f"Mean value of {col}: {mean_op.GetValue()}")
print(f"Number of RDataFrame runs so far: {df.GetNRuns()}")
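
To make the reason for booking first more concrete, here is a toy pure-Python model of lazy results. It is only a sketch of the idea and not how RDataFrame is implemented: the `LazyFrame` and `LazyMean` classes, their methods, and the two-row dataset are all hypothetical names invented for this illustration.

```python
class LazyFrame:
    """Toy stand-in for a data frame that books work and runs it in one pass."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.booked = []   # results booked so far
        self.n_runs = 0    # how many times the loop over rows has run

    def mean(self, column):
        # Booking only registers the result; no data is read here.
        result = LazyMean(self, column)
        self.booked.append(result)
        return result

    def run(self):
        # One pass over the data fills every booked result that is still pending.
        pending = [r for r in self.booked if not r.done]
        for row in self.rows:
            for r in pending:
                r.fill(row)
        for r in pending:
            r.done = True
        self.n_runs += 1


class LazyMean:
    """Toy lazy result: triggers the loop on first access only."""

    def __init__(self, frame, column):
        self.frame, self.column = frame, column
        self.total, self.count, self.done = 0.0, 0, False

    def fill(self, row):
        self.total += row[self.column]
        self.count += 1

    def get_value(self):
        if not self.done:
            self.frame.run()
        return self.total / self.count


rows = [{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 30.0}]

# Book both results first, then read them: a single loop serves both.
frame = LazyFrame(rows)
mean_a, mean_b = frame.mean("a"), frame.mean("b")
print(mean_a.get_value(), mean_b.get_value(), frame.n_runs)

# Read a result before booking the next one: the loop runs twice.
eager = LazyFrame(rows)
first = eager.mean("a").get_value()
second = eager.mean("b").get_value()
print(first, second, eager.n_runs)
```

Both variants produce the same means, but the second one loops over the data twice. On a large dataset that extra pass is the cost the note above warns about.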

# Operation categories in RDataFrame
There are two main types of operations you can perform on your dataset with RDataFrame:

**Transformations**: manipulate the dataset without producing a final result.

| Transformation    | Description                                                |
|-------------------|------------------------------------------------------------|
| Define()          | Create a new column in the dataset.                        |
| Filter()          | Filter rows based on user-defined conditions.              |

**Actions**: aggregate (parts of) the dataset into a result.

| Action                             | Description                                                                          |
|------------------------------------|--------------------------------------------------------------------------------------|
| Count()                            | Return the number of events processed.                                               |
| Display()                          | Provide a printable object representing the dataset contents.                        |
| Graph()                            | Fill a TGraph with the two columns provided.                                         |
| Histo1D(), Histo2D(), Histo3D()    | Fill a one-, two-, three-dimensional histogram with the processed column values.     |
| Max(), Min()                       | Return the maximum (minimum) of the processed column values.                         |
| Snapshot()                         | Write the processed dataset to disk.                                                 |
| ...                                | ...                                                                                  |

There are also other operations that do not fall in the previous categories, useful to query information about your dataset and the RDataFrame status.

| Operation           | Description                                                                              |
|---------------------|------------------------------------------------------------------------------------------|
| Alias()             | Introduce an alias for a particular column name.                                         |
| GetColumnNames()    | Get the names of all the available columns of the dataset.                               |
| GetColumnType()     | Return the type of a given column as a string.                                           |
| SaveGraph()         | Store the computation graph of an RDataFrame in graphviz format for easy inspection.     |
| ...                 | ...                                                                                      |

# Documentation
https://root.cern/doc/master/classROOT_1_1RDataFrame.html

* ROOT website: https://root.cern
* Have a question about ROOT? https://root-forum.cern.ch
* Have a bug to report? https://github.com/root-project/root/issues
* Have some code ready to go in the next ROOT release?
  * https://github.com/root-project/root/pulls
  * GitHub pull requests are always welcome: simple (and not so simple) bug fixes, typos, missing documentation, tutorials...
