# ROOT RDataFrame

RDataFrame is ROOT's declarative analysis interface. Users define their analysis as a sequence of operations to be performed on the data-frame object; the framework manages the loop over entries as well as low-level details such as I/O and parallelisation. RDataFrame provides methods for the most common operations required by ROOT analyses; at the same time, users can just as easily specify custom code to be executed in the event loop.
<img src="images/rdf_1.png">

# Creating an RDataFrame
An RDataFrame can be created from various sources. It shows its strengths when working with ROOT datasets, but it also supports `.csv` files and NumPy arrays:

In [None]:
import ROOT

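# Build a data frame from a remote CSV file
# (in recent ROOT releases the equivalent factory is ROOT.RDF.FromCSV)
df = ROOT.RDF.MakeCsvDataFrame("https://root.cern/files/tutorials/tdf014_CsvDataSource_MuRun2010B.csv")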

print(f"Columns in the RDataFrame: {df.GetColumnNames()}")

In [None]:
import numpy

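# One NumPy array of 100 random values per column
np_dict = {colname: numpy.random.rand(100) for colname in ["a","b","c"]}

# Build a data frame from the dictionary of arrays
# (in recent ROOT releases the equivalent factory is ROOT.RDF.FromNumpy)
df = ROOT.RDF.MakeNumpyDataFrame(np_dict)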

print(f"Columns in the RDataFrame: {df.GetColumnNames()}")

In [None]:
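# Transformation: define a new column "d" from the existing ones
def1 = df.Define("d", "a+b-c")

# Transformation: keep only the rows where d > 1
fil1 = def1.Filter("d > 1")

# Actions: booked here, computed lazily on first access to a result
count = fil1.Count()
mean = fil1.Mean("d")
display = fil1.Display()

print(f"Number of rows after filter: {count.GetValue()}")
print(f"Mean of column d after filter: {mean.GetValue()}")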
print("Dataset contents:")
display.Print()

# Think about data-flow
RDataFrame is built with a modular and flexible workflow in mind, summarised as follows:

* build a data-frame object by specifying your data-set
* apply a series of transformations to your data
  * filter (e.g. apply some cuts) or
  * define a new column (e.g. the result of an expensive computation on columns)
* apply actions to the transformed data to produce results (e.g. fill a histogram)

### Important Note
Make sure to book all transformations and actions before you access the contents of any of the results: this lets RDataFrame accumulate work and then produce all results at the same time, upon first access to any of them.

In [None]:
# Grab mean values of multiple columns in the dataset
df = ROOT.RDF.MakeCsvDataFrame("https://root.cern/files/tutorials/tdf014_CsvDataSource_MuRun2010B.csv")

# Consider only events with muons with opposite charge
df1 = df.Filter("Q1 != Q2")
# Book in advance all Mean operations on the desired columns
good_cols = ["px1","py1","pz1","pt1","px2","py2","pz2","pt2"]
mean_ops = [df1.Mean(good_col) for good_col in good_cols]

# Later get the result from the operations
# The RDataFrame event loop will process all actions in one go
for col, mean_op in zip(good_cols, mean_ops):
    print(f"Mean value of {col}: {mean_op.GetValue()}")

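Why booking first matters can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not ROOT code and not how RDataFrame is implemented: `ToyFrame`, `ToyResult` and `sum_of` are hypothetical names invented for this illustration. The toy frame runs one loop over its rows the first time any booked result is read, and every action booked before that point is filled in the same loop.

```python
class ToyResult:
    """Lazy handle to a result: the value is computed on first access."""

    def __init__(self, frame, update, initial):
        self.frame = frame
        self.update = update  # callable: (accumulator, row) -> new accumulator
        self.value = initial
        self.ready = False

    def get_value(self):
        if not self.ready:
            self.frame.run()  # fills every action booked so far
        return self.value


class ToyFrame:
    """Toy stand-in for a lazily evaluated data frame (hypothetical, not ROOT)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.pending = []  # actions waiting for the next loop
        self.n_loops = 0   # number of passes over the rows so far

    def book(self, update, initial):
        result = ToyResult(self, update, initial)
        self.pending.append(result)
        return result

    def run(self):
        booked, self.pending = self.pending, []
        self.n_loops += 1
        for row in self.rows:
            for result in booked:
                result.value = result.update(result.value, row)
        for result in booked:
            result.ready = True


def sum_of(column):
    """Return an update function that accumulates the sum of one column."""
    return lambda acc, row: acc + row[column]


rows = [{"pt1": 1.0, "pt2": 4.0}, {"pt1": 3.0, "pt2": 6.0}]

# Book both actions first, then read: a single loop fills both results
frame = ToyFrame(rows)
sum1 = frame.book(sum_of("pt1"), 0.0)
sum2 = frame.book(sum_of("pt2"), 0.0)
totals = (sum1.get_value(), sum2.get_value())
loops_booked_first = frame.n_loops

# Read each result right after booking it: one loop per result
frame2 = ToyFrame(rows)
a = frame2.book(sum_of("pt1"), 0.0).get_value()
b = frame2.book(sum_of("pt2"), 0.0).get_value()
loops_interleaved = frame2.n_loops

print(f"Booked first: {loops_booked_first} loop(s), interleaved: {loops_interleaved} loop(s)")
```

Both patterns give the same sums, but the interleaved one traverses the rows once per result. On a real dataset each extra traversal means reading and processing the data again.
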

# Operation categories in RDataFrame
There are two main types of operations you can perform on your dataset with RDataFrame:

**Transformations**: manipulate the dataset, not asking for a final result.

| Transformation    | Description                                                |
|-------------------|------------------------------------------------------------|
| Define()          | Creates a new column in the dataset.                       |
| Filter()          | Filters rows based on user-defined conditions.             |

**Actions**: aggregate (parts of) the dataset into a result.

| Action                             | Description                                                                          |
|------------------------------------|--------------------------------------------------------------------------------------|
| Count()                            | Return the number of events processed.                                               |
| Display()                          | Provides a printable object representing the dataset contents.                       |
| Graph()                            | Fills a TGraph with the two columns provided.                                        |
| Histo1D(), Histo2D(), Histo3D()    | Fill a one-, two-, three-dimensional histogram with the processed column values.     |
| Max(), Min()                       | Return the maximum (minimum) of processed column values.                             |
| Snapshot()                         | Writes processed data-set to disk.                                                   |
| ...                                | ...                                                                                  |

There are also other operations that do not fall in the previous categories, useful to query information about your dataset and the RDataFrame status.

| Operation           | Description                                                                              |
|---------------------|------------------------------------------------------------------------------------------|
| Alias()             | Introduce an alias for a particular column name.                                         |
| GetColumnNames()    | Get the names of all the available columns of the dataset.                               |
| GetColumnType()     | Return the type of a given column as a string.                                           |
| SaveGraph()         | Store the computation graph of an RDataFrame in graphviz format for easy inspection.     |
| ...                 | ...                                                                                      |

# Documentation
https://root.cern/doc/master/classROOT_1_1RDataFrame.html

* Mattermost: https://mattermost.web.cern.ch/root
* Have a question about ROOT? https://root-forum.cern.ch
* Have an idea about evolving ROOT?
  * https://root-forum.cern.ch/c/my-root-app-and-ideas
* Have a bug to report? https://root.cern/guidelines-submitting-bug
* Have some code ready to go in the next ROOT release?
  * https://github.com/root-project/root/pulls
  * GitHub pull requests are always welcome: simple (and not so simple) bug fixes, typos, missing documentation, tutorials...
