## ROOT dataframe tutorial: Dimuon spectrum

This tutorial shows you how to analyze datasets using RDataFrame from a Python notebook. The example analysis performs the following steps:

* Connect a ROOT dataframe to a dataset containing 61 million events recorded by CMS in 2012
* Filter the events relevant to your analysis
* Compute the invariant mass of the selected dimuon candidates
* Plot the invariant mass spectrum showing resonances up to the Z mass

This material is based on the analysis done by Stefan Wunsch, available [here](http://opendata.web.cern.ch/record/12342) in CERN's Open Data portal.

<center><img src="../../../images/dimuonSpectrum.png"></center>

In [None]:
import ROOT   

## Create a ROOT dataframe in Python
First we will create a ROOT dataframe that is connected to a dataset named `Events` stored in a ROOT file. The file is pulled in via [XRootD](http://xrootd.org/) from EOS public, but note how it could also be stored in your CERNBox space or in any other EOS repository accessible from SWAN (e.g. the experiment ones).

The dataset Events is a TTree and has the following branches:

| Branch name | Data type | Description |
|-------------|-----------|-------------|
| `nMuon` | `unsigned int` | Number of muons in this event |
| `Muon_pt` | `float[nMuon]` | Transverse momentum of the muons stored as an array of size `nMuon` |
| `Muon_eta` | `float[nMuon]` | Pseudo-rapidity of the muons stored as an array of size `nMuon` |
| `Muon_phi` | `float[nMuon]` | Azimuth of the muons stored as an array of size `nMuon` |
| `Muon_charge` | `int[nMuon]` | Charge of the muons stored as an array of size `nMuon` and either -1 or 1 |
| `Muon_mass` | `float[nMuon]` | Mass of the muons stored as an array of size `nMuon` |

In [None]:
treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)

## Run only on a part of the dataset

The full dataset contains half a year of CMS data taking in 2012, with 61 million events. For the purpose of this example, we use the [Range](https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html#a1b36b7868831de2375e061bb06cfc225) node to run on only a small part of the dataset. This feature also comes in handy during the development phase of your analysis.

Feel free to experiment with this parameter!

In [None]:
# Take only the first 1M events
df_range = # do something here
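
One possible way to fill in the cell above, as a sketch rather than the reference solution: `Range` takes the number of entries to keep and returns a new node, leaving `df` itself untouched. The helper name `first_n_events` is hypothetical and only wraps the call so that the number of events is easy to change. Note that `Range` is not available when ROOT's implicit multithreading is enabled.

```python
def first_n_events(df, n_events=1_000_000):
    """Return a node restricted to the first `n_events` entries.

    Assumes `df` is an RDataFrame-like object exposing `Range(end)`.
    """
    return df.Range(n_events)
```

In the notebook this would read `df_range = first_n_events(df)`, which is equivalent to `df_range = df.Range(1000000)`.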

## Filter relevant events for this analysis

Physics datasets are often general-purpose datasets and therefore need extensive event filtering before the actual analysis. Here, we implement only a simple selection based on the number of muons and their charge, to reduce the dataset to the events that are relevant for our study.

In particular, we are applying two filters to keep:
1. Events with exactly two muons
2. Events with muons of opposite charge

In [None]:
# Replace the first string argument of each Filter call below with a valid C++ expression
# Use points 1 and 2 above as hints for what to write in each expression
df_2mu = df_range.Filter("DO SOMETHING WITH COLUMN nMuon", "Events with exactly two muons")
df_oc = df_2mu.Filter("DO SOMETHING WITH COLUMN Muon_charge", "Muons with opposite charge")
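
As a hint for the cell above, the two selections can be prototyped in plain Python before writing them as C++ strings. This is a minimal sketch on hand-made toy events (not real CMS data). The function names are hypothetical, and the two expression strings at the end are one possible choice, not the only valid one.

```python
# Toy events mimicking the branch layout of the table above (made-up values).
toy_events = [
    {"nMuon": 2, "Muon_charge": [1, -1]},     # two muons, opposite charge
    {"nMuon": 2, "Muon_charge": [1, 1]},      # two muons, same charge
    {"nMuon": 3, "Muon_charge": [1, -1, 1]},  # three muons
    {"nMuon": 0, "Muon_charge": []},          # no muons
]


def has_two_muons(event):
    return event["nMuon"] == 2


def has_opposite_charge(event):
    # Only safe after has_two_muons, which guarantees that two entries exist.
    return event["Muon_charge"][0] != event["Muon_charge"][1]


# `and` short-circuits, so the charge check never runs on events without two muons.
selected = [e for e in toy_events if has_two_muons(e) and has_opposite_charge(e)]
print(len(selected))

# Candidate C++ expressions implementing the same logic in the Filter calls:
two_muons_expr = "nMuon == 2"
opposite_charge_expr = "Muon_charge[0] != Muon_charge[1]"
```

The order of the filters matters for the same reason as in the toy version: the charge expression indexes the first two muons, so it should run only on events that have already passed the two-muon selection.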

## Perform complex operations in Python, efficiently!

Operations in the RDataFrame event loop are executed in C++ to ensure performance and to allow multithreaded scaling. In many cases, the functions needed for the analysis can already be found in the C++ standard library, in ROOT, or in your favourite analysis framework. Here, we use a `Define` node to compute the invariant mass of the muons in the dataset. An implementation of this function is already available in the [`ROOT::VecOps`](https://root.cern/doc/master/group__vecops.html) namespace.

In [None]:
df_mass = df_oc.Define("Dimuon_mass", "SOME FUNCTION TO COMPUTE THE INVARIANT MASS WITH COLUMNS (Muon_pt, Muon_eta, Muon_phi, Muon_mass)")
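
To make explicit what the `Define` expression has to compute, here is a plain-Python sketch of the invariant mass of a set of particles given in (pt, eta, phi, mass) coordinates. It only illustrates the formula. In the notebook, the C++ function `ROOT::VecOps::InvariantMass(Muon_pt, Muon_eta, Muon_phi, Muon_mass)` does this job inside the event loop. The name `invariant_mass` and the example values are made up for illustration.

```python
import math


def invariant_mass(pt, eta, phi, mass):
    """Invariant mass of a system of particles.

    Each argument is a sequence with one entry per particle
    (pt and mass in GeV, phi in radians).
    """
    e_sum = px_sum = py_sum = pz_sum = 0.0
    for pt_i, eta_i, phi_i, m_i in zip(pt, eta, phi, mass):
        # Convert (pt, eta, phi, m) to Cartesian momentum and energy.
        px = pt_i * math.cos(phi_i)
        py = pt_i * math.sin(phi_i)
        pz = pt_i * math.sinh(eta_i)
        energy = math.sqrt(px * px + py * py + pz * pz + m_i * m_i)
        e_sum += energy
        px_sum += px
        py_sum += py
        pz_sum += pz
    m_squared = e_sum ** 2 - (px_sum ** 2 + py_sum ** 2 + pz_sum ** 2)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(m_squared, 0.0))


# Two back-to-back muons in the transverse plane (illustrative values).
m_mu = 0.1057  # approximate muon mass in GeV
print(invariant_mass([45.0, 45.0], [0.0, 0.0], [0.0, math.pi], [m_mu, m_mu]))
```

For two back-to-back muons of equal momentum the total momentum cancels, so the result is simply the sum of the two energies.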

## Make a histogram of the newly created column

In [None]:
# These are the parameters you would give to a histogram object constructor
# Put them in the right order inside the parentheses below
# You are effectively passing a tuple to the `Histo1D` operation as seen previously in other notebooks
nbins = 30000
low = 0.25
up = 300
histo_name = "Dimuon_mass"
histo_title = histo_name

h = df_mass.Histo1D(("PUT HISTOGRAM PARAMETERS HERE IN THE CORRECT ORDER"), "Dimuon_mass")
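
As a hint on the ordering: ROOT's one-dimensional histogram constructors take the name, the title, the number of bins and then the axis limits, and the tuple passed to `Histo1D` follows the same order. The sketch below only builds the tuple and does not need ROOT. The variable name `histo_model` is hypothetical.

```python
nbins = 30000
low = 0.25
up = 300
histo_name = "Dimuon_mass"
histo_title = histo_name

# Same order as the TH1D constructor: name, title, nbins, xlow, xup
histo_model = (histo_name, histo_title, nbins, low, up)
print(histo_model)
```

In the notebook, `histo_model` would replace the placeholder string, as in `h = df_mass.Histo1D(histo_model, "Dimuon_mass")`.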

## Book a Report of the dataframe filters

In [None]:
report = # your code here
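
A possible way to book the report, sketched as a small wrapper (the name `book_report` is hypothetical). `Report` is lazy like the other RDataFrame actions: booking it does not start the event loop. When it is called on a node other than the root dataframe, it covers the named filters between the root and that node.

```python
def book_report(node):
    """Book a cut-flow report on `node`.

    Assumes `node` is an RDataFrame node exposing `Report()`, for example
    the `df_mass` node defined earlier, so that both named filters are covered.
    """
    return node.Report()
```

In the notebook this would read `report = book_report(df_mass)`, which is equivalent to `report = df_mass.Report()`.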

## Start data processing
This is the final step of the analysis: retrieving the result. We expect to see a plot of the dimuon invariant mass spectrum similar to the one shown at the beginning of this exercise (remember that we are running on fewer entries here). Finally, in the last cell, we should see a report of the filters applied to the dataset.

In [None]:
ROOT.gStyle.SetOptStat(0)
ROOT.gStyle.SetTextFont(42)
c = ROOT.TCanvas("c", "", 800, 700)
c.SetLogx()
c.SetLogy()
h.SetTitle("")
h.GetXaxis().SetTitle("m_{#mu#mu} (GeV)")
h.GetXaxis().SetTitleSize(0.04)
h.GetYaxis().SetTitle("N_{Events}")
h.GetYaxis().SetTitleSize(0.04)
h.Draw()

label = ROOT.TLatex()
label.SetNDC(True)
label.SetTextSize(0.040)
label.DrawLatex(0.100, 0.920, "#bf{CMS Open Data}")
label.SetTextSize(0.030)
label.DrawLatex(0.500, 0.920, "#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}")

In [None]:
%jsroot on
c.Draw()

In [None]:
report.Print()

## Additional: store all your custom functions in a separate header file

It is also possible to store user-defined functions (like the invariant mass computation shown before) in a separate `.h` file. In this way, the notebook contains only the RDataFrame operations.

To achieve this, **open** the file `rdataframe-dimuon.h`, located in the same folder as this notebook, and edit it so that it performs the same operation as in the previous section.\
If in doubt, look at the invariant mass definition in the [docs](https://root.cern.ch/doc/master/group__vecops.html#gaa5798925785053643e12a326044fab37)!

In [None]:
#EDIT the .h file

Now you are ready to load it in the notebook execution with this command:

In [None]:
def my_initialization_function():
    ROOT.gInterpreter.Declare("#include \"rdataframe-dimuon.h\"")

my_initialization_function()

And finally, define a new column with the custom function:

In [None]:
df_mass_custom = #your code here

Now we can repeat the previous steps to check that the results are identical.

In [None]:
# your code here

## Additional: run the example on the `KubeCluster`

We can also run this example on the `KubeCluster`.

- Create a new `KubeCluster`;
- Scale it with a few workers (2-3 will be enough);
- Connect to the cluster using a `Client` object.

In [None]:
# paste your Cluster code here!

After creating and scaling the cluster, you can simply copy the code above! 

The only differences:
- Remember to use `ROOT.RDF.Experimental.Distributed.Dask.RDataFrame` instead of `ROOT.RDataFrame`;
- You cannot use the `Range` method (yet!). However, you are now running on the full dataset, with ~61M events!
- If you want to use custom functions stored in a header file, register them with `ROOT.RDF.Experimental.Distributed.initialize(my_initialization_function)` so that they are also loaded on the workers.

In [None]:
# your code here