# ROOT dataframe tutorial: Dimuon spectrum

The ROOT dataframe tutorial shows you how to analyze datasets using `RDataFrame`. The example analysis performs the following steps:

1. Connect a ROOT dataframe to a dataset containing 66 million events recorded by CMS in 2012
2. Filter the events that are relevant for your analysis
3. Compute the invariant mass of the selected dimuon candidates
4. Plot the invariant mass spectrum showing resonances up to the Z mass

The notebook runs out of the box. However, you are encouraged to tweak the code and see how the result changes!

Questions that will improve your understanding of the technology **are marked in bold.**

## How to use the notebook

In short: You can execute a cell by selecting it and pressing Ctrl+Enter.

For the full documentation, you can click on `Help` above.

## Outline

The full tutorial consists of three stages and shows you how to use ROOT dataframes ...

1. ... in C++
2. ... in Python
3. ... in Python with advanced features

## Stage 1: Using C++

Since ROOT is a C++ framework, the first part of the ROOT dataframe tutorial introduces you to the C++ API. C++ may be less convenient than Python, but the resulting program is fast, which matters for large-scale physics analyses.

Have a look at the following code to understand how a computation graph can be built using `RDataFrame` in C++.

## Create a ROOT dataframe

The following ROOT dataframe is connected to a dataset named `Events` in two ROOT files. These files are not stored locally but are read from a remote server via [XRootD](http://xrootd.org/).

The dataset `Events` is a `TTree` (to a first approximation, a "table") and has the following branches (also referred to as "columns"):

| Branch name | Data type | Description |
|-------------|-----------|-------------|
| `nMuon` | `unsigned int` | Number of muons in this event |
| `Muon_pt` | `float[nMuon]` | Transverse momentum of the muons stored as an array of size `nMuon` |
| `Muon_eta` | `float[nMuon]` | Pseudo-rapidity of the muons stored as an array of size `nMuon` |
| `Muon_phi` | `float[nMuon]` | Azimuth of the muons stored as an array of size `nMuon` |
| `Muon_charge` | `int[nMuon]` | Charge of the muons (either -1 or 1) stored as an array of size `nMuon` |
| `Muon_mass` | `float[nMuon]` | Mass of the muons stored as an array of size `nMuon` |

In [None]:
ROOT::RDataFrame df("Events",
                  {"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root",
                   "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012C_DoubleMuParked.root"});

## Filter relevant events for this analysis

Physics datasets are often general-purpose datasets and therefore need extensive filtering of the events before the actual analysis. Here, we implement only a simple selection based on the number of muons and their charge to reduce the dataset to the events that are relevant for our study.

**Fill in the correct expressions to select ...**

1. Events with exactly two muons
2. Events with muons of opposite charge

See the table above for the column names and the data types.

In [None]:
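// Exercise: the expressions below are placeholders. Replace them so that they match the filter names.
auto df_2mu = df.Filter("nMuon == 3", "Events with exactly two muons");
auto df_os = df_2mu.Filter("Muon_charge[0] == Muon_charge[1]", "Muons with opposite charge");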

## Compute the invariant mass of the dimuon system

Since we want to see the resonances in the mass spectrum, where dimuon events are more likely, we need to compute the invariant mass from the four-vectors of the muon candidates. Because this operation is non-trivial, we use ROOT's `Math::LorentzVector` to do the job for us.
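
To make explicit what the four-vector addition involves, here is a minimal sketch in plain C++ (no ROOT required). It converts (pt, eta, phi, mass) to Cartesian components and evaluates `m^2 = E^2 - |p|^2` for the summed four-vector. The names `FourVector`, `from_pt_eta_phi_m` and `invariant_mass` are hypothetical helpers introduced for illustration only; in the analysis itself, ROOT's vector classes do this work.

```cpp
#include <cmath>

// Hypothetical helper type for illustration: a four-vector in Cartesian
// components (natural units, GeV).
struct FourVector {
    double px, py, pz, e;
};

// Convert (pt, eta, phi, mass) into Cartesian components.
FourVector from_pt_eta_phi_m(double pt, double eta, double phi, double mass) {
    const double px = pt * std::cos(phi);
    const double py = pt * std::sin(phi);
    const double pz = pt * std::sinh(eta);
    const double e = std::sqrt(px * px + py * py + pz * pz + mass * mass);
    return {px, py, pz, e};
}

// Invariant mass of the two-particle system: m^2 = E^2 - |p|^2 of the sum.
double invariant_mass(const FourVector &a, const FourVector &b) {
    const double e = a.e + b.e;
    const double px = a.px + b.px;
    const double py = a.py + b.py;
    const double pz = a.pz + b.pz;
    const double m2 = e * e - (px * px + py * py + pz * pz);
    // Guard against tiny negative values caused by floating-point rounding.
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}
```

For two back-to-back muons with a transverse momentum of 45 GeV each at eta = 0, `invariant_mass` returns approximately 90 GeV, which is close to the Z mass.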

The `Define` method below can also create new columns from jitted strings, as done for the `Filter` above. Here, however, we implement a C++ callable, the lambda function `compute_mass`, which is passed to the `Define` method and executed in the event loop.

In case you haven't used lambda functions before, this is how they work:

```cpp
auto my_lambda =                  // The variable that holds the lambda function.
                 []               // In these brackets you can capture variables from the outer scope.
                 (int x)          // The parameter list of your function.
                 { return 2*x; }; // The body of the function with your implementation.
```
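
The capture list is the part that is easiest to get wrong, so the following small sketch in standard C++ contrasts capture by value with capture by reference. The functions `make_scaler` and `count_above` are hypothetical names chosen for this illustration and are not part of ROOT.

```cpp
#include <vector>

// Capture by value: the lambda stores its own copy of `factor`.
auto make_scaler(int factor) {
    return [factor](int x) { return factor * x; };
}

// Capture by reference: the lambda modifies `count` from the enclosing scope.
int count_above(const std::vector<float> &values, float threshold) {
    int count = 0;
    auto check = [&count, threshold](float v) {
        if (v > threshold) ++count;
    };
    for (float v : values) check(v);
    return count;
}
```

For example, `make_scaler(2)(21)` evaluates to 42. A lambda that captures by reference must not outlive the variables it refers to, which is one reason why the `compute_mass` lambda in this tutorial captures nothing and receives all inputs as arguments.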

In [None]:
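// Alias for the array columns, passed by const reference to avoid copies
using Vec_t = const ROOT::VecOps::RVec<float> &;
auto compute_mass = [](Vec_t pt, Vec_t eta, Vec_t phi, Vec_t mass) {
    // Build the four-vectors of the two muons from (pt, eta, phi, mass)
    ROOT::Math::PtEtaPhiMVector p1(pt[0], eta[0], phi[0], mass[0]);
    ROOT::Math::PtEtaPhiMVector p2(pt[1], eta[1], phi[1], mass[1]);
    // The dimuon invariant mass is the mass of the summed four-vector
    return (p1 + p2).mass();
};
// Add the new column "Dimuon_mass", computed from the four input columns
auto df_mass = df_os.Define("Dimuon_mass", compute_mass, {"Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"});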

## Run only on a part of the dataset

The full dataset contains half a year of CMS data taking in 2012, with 66 million events. For the purpose of this example, we use the `Range` node to run only on a small part of the dataset. This feature also comes in handy during the development phase of your analysis.

Feel free to experiment with this parameter!

In [None]:
auto df_range = df_mass.Range(100000);

## Make a histogram of the dimuon spectrum

As (almost) always in physics, we have a look at the results in the form of a histogram. Let's book a histogram as one endpoint of our computation graph.
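
A one-dimensional histogram with `nbins` equal-width bins between `low` and `up` assigns each value to a bin by simple arithmetic. The sketch below shows that arithmetic in plain C++. The function `bin_index` is a hypothetical helper for illustration: it uses zero-based indices and returns -1 for values outside the range, whereas ROOT histograms keep dedicated underflow and overflow bins.

```cpp
#include <cmath>

// Hypothetical helper: zero-based index of the equal-width bin containing x,
// or -1 if x lies outside the half-open range [low, up).
int bin_index(double x, int nbins, double low, double up) {
    if (!(x >= low && x < up)) return -1;
    const double width = (up - low) / nbins;
    const int idx = static_cast<int>(std::floor((x - low) / width));
    // Rounding can push values just below `up` into index nbins; clamp them.
    return idx < nbins ? idx : nbins - 1;
}
```

With 10 bins between 0 and 10, the value 3.5 falls into bin index 3. Since every bin has the same width in GeV, the bins look narrower towards high masses once the x-axis is drawn on a logarithmic scale.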

**Where do you expect resonances in the dimuon spectrum? Adjust the plotting range accordingly!**

In [None]:
const auto nbins = 30000;
const auto low = 100;
const auto up = 300;
auto h = df_range.Histo1D({"Dimuon_mass", "Dimuon_mass", nbins, low, up}, "Dimuon_mass");

## What are the cuts doing?

To find out how many events your cuts are throwing away, we can book another endpoint of the graph that reports the efficiency of the cuts.
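
As a rough mental model of what such a report contains, the sketch below applies a sequence of named cuts to a toy event list and records, for each cut, how many events reached it and how many passed. Everything in it (`Event`, `Cut`, `CutStat`, `cut_flow`) is hypothetical and written in plain C++; it is not how `RDataFrame` implements `Report`.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical toy event with just two quantities.
struct Event {
    unsigned int n_muon;
    float leading_pt;
};

// A named selection.
struct Cut {
    std::string name;
    std::function<bool(const Event &)> pass;
};

// Per-cut bookkeeping: events that reached the cut and events that passed it.
struct CutStat {
    std::string name;
    unsigned long all = 0;
    unsigned long passed = 0;
    // Efficiency in percent, relative to the events that reached this cut.
    double efficiency() const { return all ? 100.0 * passed / all : 0.0; }
};

// Apply the cuts in order; an event that fails a cut is not seen by later cuts.
std::vector<CutStat> cut_flow(const std::vector<Event> &events, const std::vector<Cut> &cuts) {
    std::vector<CutStat> stats;
    for (const auto &cut : cuts) {
        CutStat stat;
        stat.name = cut.name;
        stats.push_back(stat);
    }
    for (const auto &event : events) {
        for (std::size_t i = 0; i < cuts.size(); ++i) {
            ++stats[i].all;
            if (!cuts[i].pass(event)) break;
            ++stats[i].passed;
        }
    }
    return stats;
}
```

Because the cuts are applied in sequence, the `all` count of each cut equals the `passed` count of the previous one, so the order of the cuts affects the individual efficiencies but not the final number of selected events.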

In [None]:
auto report = df_range.Report();

## Plot the dimuon spectrum

Now, the computation graph is set up. Next, we want to have a look at the result.

**Can you figure out where the event loop actually runs?**

Note that `%%time` measures the time spent in the full cell.

In [None]:
%%time
gStyle->SetOptStat(0); gStyle->SetTextFont(42);
auto c = new TCanvas("c", "", 800, 700);
c->SetLogx(); c->SetLogy();
h->SetTitle("");
h->GetXaxis()->SetTitle("m_{#mu#mu} (GeV)"); h->GetXaxis()->SetTitleSize(0.04);
h->GetYaxis()->SetTitle("N_{Events}"); h->GetYaxis()->SetTitleSize(0.04);
h->Draw();

TLatex label; label.SetNDC(true);
label.SetTextSize(0.040); label.DrawLatex(0.100, 0.920, "#bf{CMS Open Data}");
label.SetTextSize(0.030); label.DrawLatex(0.630, 0.920, "#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}");

For notebooks, ROOT provides a JavaScript front end for drawing the canvas. Click and drag along an axis to zoom in, and double-click to reset the view.

**Hint: It is possible to see the [eta meson](https://de.wikipedia.org/wiki/%CE%97-Meson)!**

Don't forget that you can improve the statistics by increasing the number of events given to `Range`.

In [None]:
%jsroot on
c->Draw()

## Inspecting the cut-flow

As a final study, we look at the efficiency of the applied cuts.

**Does the event loop run again when executing the following line of code?**

In [None]:
report->Print();

## Additional tasks

Try to implement a second histogram and measure the time to compute both from the input dataset. The dataset contains the column `PV_npvs` representing the number of primary vertices per event, which you can study for this purpose.

**Is the time spent doubled? What do you expect?**