# Physics Analysis: Dimuon spectrum

This tutorial shows you how to analyze datasets using `RDataFrame` from a C++ notebook. The example analysis performs the following steps:

1. Connect a ROOT dataframe to a dataset containing 61 million events recorded by CMS in 2012
2. Filter the events that are relevant for your analysis
3. Compute the invariant mass of the selected dimuon candidates
4. Plot the invariant mass spectrum showing resonances up to the Z mass

The notebook runs out-of-the-box. However, you are encouraged to tweak the code to see the effect on the result!

## Prelude: using a C++ notebook

ROOT comes with a [C++ interpreter](https://root.cern/cling/), ideal for fast prototyping in C++, which is available in SWAN via the ROOT C++ Jupyter kernel. Modern C++ and features such as polymorphism, templates or lambdas are at your disposal!

In [1]:
std::cout << "This is C++!" << std::endl;

This is C++!


The one-definition rule of C++ could get in your way when doing interactive analysis in C++ from a notebook, since it is quite common to modify and rerun cells that contain definitions. Luckily, ROOT supports redefinitions! In order to see that, try running the following cell multiple times.

In [2]:
// This cell can be run multiple times without error!
float i = 0;

## Analysis in C++ from a notebook

This notebook introduces the RDataFrame C++ API (ROOT is a C++ framework). Although C++ is probably not as convenient as Python, the resulting program is very performant. This can be interesting if the code of a notebook is intended to be later compiled and applied to larger-scale physics analysis.

Have a look at the following cells to understand how a computation graph can be built using `RDataFrame` in C++.

## Create a ROOT dataframe

First we will create a ROOT dataframe that is connected to a dataset named `Events` stored in a ROOT file. The file is pulled in via [XRootD](http://xrootd.org/) from EOS public, but note how it could also be stored in your CERNBox space or in any other EOS repository accessible from SWAN (e.g. the experiment ones).

The dataset `Events` is a `TTree` (to a first approximation, a table) and has the following branches (also referred to as "columns"):

| Branch name | Data type | Description |
|-------------|-----------|-------------|
| `nMuon` | `unsigned int` | Number of muons in this event |
| `Muon_pt` | `float[nMuon]` | Transverse momentum of the muons stored as an array of size `nMuon` |
| `Muon_eta` | `float[nMuon]` | Pseudo-rapidity of the muons stored as an array of size `nMuon` |
| `Muon_phi` | `float[nMuon]` | Azimuth of the muons stored as an array of size `nMuon` |
| `Muon_charge` | `int[nMuon]` | Charge of the muons (either -1 or 1) stored as an array of size `nMuon` |
| `Muon_mass` | `float[nMuon]` | Mass of the muons stored as an array of size `nMuon` |

In [3]:
ROOT::RDataFrame df("Events", "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root");

## Inspect the dataset

There are a few operations that we can use to inspect the dataset we just created. First, we can see the column names:

In [4]:
df.GetColumnNames()

(ROOT::RDF::RInterface::ColumnNames_t) { "nMuon", "Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass", "Muon_charge" }


Now we can check the type of one of those columns:

In [5]:
df.GetColumnType("nMuon")

(std::string) "UInt_t"


We can also see what the dataset looks like (this will read the first five rows):

In [6]:
auto display = df.Display();
display->Print();

nMuon | Muon_pt  | Muon_eta   | Muon_phi    | Muon_mass | Muon_charge | 
2     | 10.7637f | 1.06683f   | -0.0342727f | 0.105658f | -1          | 
      | 15.7365f | -0.563787f | 2.54262f    | 0.105658f | -1          | 
2     | 10.5385f | -0.427780f | -0.274792f  | 0.105658f | 1           | 
      | 16.3271f | 0.349225f  | 2.53978f    | 0.105658f | -1          | 
1     | 3.27533f | 2.21086f   | -1.22341f   | 0.105658f | 1           | 
4     | 11.4292f | -1.58824f  | -2.07730f   | 0.105658f | 1           | 
      | ...      | ...        | ...         | ...       | ...         | 
      | 3.50223f | -1.65596f  | -1.84997f   | 0.105658f | 1           | 
4     | 3.28344f | -2.17248f  | -2.37001f   | 0.105658f | -1          | 
      | ...      | ...        | ...         | ...       | ...         | 
      | 23.7218f | -1.16290f  | -0.773005f  | 0.105658f | 1           | 


## Filter relevant events for this analysis

Physics datasets are often general-purpose datasets and therefore need extensive filtering of the events for the actual analysis. Here, we implement only a simple selection based on the number of muons and their charge to reduce the dataset to the events that are relevant for our study.

In particular, we are applying two filters to keep:
1. Events with exactly two muons
2. Events with muons of opposite charge

In [7]:
auto df_2mu = df.Filter("nMuon == 2", "Events with exactly two muons");
auto df_os = df_2mu.Filter("Muon_charge[0] != Muon_charge[1]", "Muons with opposite charge");
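The order of the two filters matters: the charge filter indexes elements 0 and 1 of `Muon_charge`, which is only safe because the preceding filter has already rejected every event that does not have exactly two muons. The following plain C++ sketch illustrates that logic outside of `RDataFrame`. The `Event` struct and the three function names are hypothetical and exist only for this illustration.

```cpp
#include <vector>

// Hypothetical stand-in for one row of the dataset, reduced to the
// two columns that the selection reads.
struct Event {
    unsigned int nMuon;
    std::vector<int> Muon_charge;
};

// First cut: exactly two muons.
bool has_two_muons(const Event& e) { return e.nMuon == 2; }

// Second cut: opposite charge. Assumes at least two muons are present.
bool has_opposite_charge(const Event& e)
{
    return e.Muon_charge[0] != e.Muon_charge[1];
}

// && short-circuits, so the charges are only indexed for events that
// already passed the first cut, mirroring the chained Filter calls.
bool passes_selection(const Event& e)
{
    return has_two_muons(e) && has_opposite_charge(e);
}
```

An event with a single muon is rejected by the first cut and never reaches the charge comparison, just as the second `Filter` node only sees events accepted by the first one.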

## Compute the invariant mass of the dimuon system

Since we want to see the resonances in the mass spectrum, where dimuon events are more likely, we need to compute the invariant mass from the four-vectors of the muon candidates. Because this operation is non-trivial, we will use ROOT's `ROOT::VecOps::InvariantMass` function to do the job for us.

The invariant mass will be stored in a new column that we will create with the `Define` operation of `RDataFrame`. The parameters of `Define` are: name of the new column, function to execute to generate the new column, dataset columns to be used as arguments for the function.

In [8]:
auto df_mass = df_os.Define("Dimuon_mass", ROOT::VecOps::InvariantMass<float>, {"Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"});
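To make the physics behind this step more concrete, here is a minimal sketch of the invariant mass computation for exactly two muons, written in plain C++ without ROOT. It is not the implementation of `ROOT::VecOps::InvariantMass`; the function name `dimuon_mass` is hypothetical. It converts the (pt, eta, phi, mass) coordinates to Cartesian momentum components and evaluates M² = E² − |p|² for the summed four-vector.

```cpp
#include <cmath>

// Hypothetical helper (not part of ROOT): invariant mass in GeV of a two-muon
// system, given pt (GeV), eta, phi (rad) and mass (GeV) of each muon.
double dimuon_mass(double pt1, double eta1, double phi1, double m1,
                   double pt2, double eta2, double phi2, double m2)
{
    // Summed Cartesian momentum components of the pair
    const double px = pt1 * std::cos(phi1) + pt2 * std::cos(phi2);
    const double py = pt1 * std::sin(phi1) + pt2 * std::sin(phi2);
    const double pz = pt1 * std::sinh(eta1) + pt2 * std::sinh(eta2);

    // Energy of each muon: E^2 = |p|^2 + m^2, with |p| = pt * cosh(eta)
    const double p1 = pt1 * std::cosh(eta1);
    const double p2 = pt2 * std::cosh(eta2);
    const double e = std::sqrt(p1 * p1 + m1 * m1) + std::sqrt(p2 * p2 + m2 * m2);

    // M^2 = E^2 - |p|^2; clamp tiny negative values caused by rounding
    const double mass2 = e * e - (px * px + py * py + pz * pz);
    return mass2 > 0.0 ? std::sqrt(mass2) : 0.0;
}
```

As a sanity check, two massless muons with pt = 45 GeV at eta = 0 that fly back to back (phi = 0 and phi = π) have no net momentum, so the function returns the total energy, 90 GeV.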

## Run only on a part of the dataset

The full dataset contains half a year of CMS data taking in 2012 with 61 million events. For the purpose of this example, we use the `Range` node to run only on a small part of the dataset. This feature also comes in handy in the development phase of your analysis.

Feel free to experiment with this parameter!

In [9]:
auto num_events = df.Count();
std::cout << "Number of events in this dataset: " << *num_events << std::endl;

Number of events in this dataset: 61540413


In [10]:
// Keep only the first 1M events that pass the selection above
auto df_range = df_mass.Range(1000000);

## What are the cuts doing?

To find out how many events the cuts are throwing away, we can book another endpoint of the graph to look at the efficiency of the placed cuts.

In [11]:
auto report = df_range.Report();
report->Print();

Events with exactly two muons: pass=1315892    all=2676243    -- eff=49.17 % cumulative eff=49.17 %
Muons with opposite charge: pass=1000000    all=1315892    -- eff=75.99 % cumulative eff=37.37 %
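The numbers in the report can be reproduced by hand. Note that only 2676243 events were read rather than the full 61 million: the event loop stopped as soon as 1000000 events had passed both cuts and reached the `Range` node. The sketch below recomputes the efficiencies from the printed counts; `efficiency_percent` is a hypothetical helper name, not a ROOT function.

```cpp
// Hypothetical helper: fraction of events passing a cut, in percent.
// Returns 0 when no events were seen, to avoid dividing by zero.
double efficiency_percent(unsigned long long pass, unsigned long long all)
{
    return all == 0 ? 0.0 : 100.0 * static_cast<double>(pass) / static_cast<double>(all);
}

// Counts copied from the report output above
const unsigned long long n_read     = 2676243; // events processed by the loop
const unsigned long long n_two_muon = 1315892; // pass "exactly two muons"
const unsigned long long n_opposite = 1000000; // pass "opposite charge"

// Per-cut efficiencies use the events entering that cut as the denominator;
// the cumulative efficiency uses all events read.
const double eff_two_muon   = efficiency_percent(n_two_muon, n_read);
const double eff_opposite   = efficiency_percent(n_opposite, n_two_muon);
const double eff_cumulative = efficiency_percent(n_opposite, n_read);
```

The cumulative efficiency is the product of the per-cut efficiencies, which is why it refers to the events read and not to the events entering the last cut.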


## Make a histogram of the dimuon spectrum

As (almost) always in physics, we have a look at the results in the form of a histogram. Let's book a histogram as one endpoint of our computation graph.

In [12]:
const auto nbins = 30000;
const auto low = 0.25;
const auto up = 300;
auto h = df_range.Histo1D({"Dimuon_mass", "Dimuon_mass", nbins, low, up}, "Dimuon_mass");
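With these settings the histogram has 30000 bins of equal width between 0.25 and 300 GeV, so each bin covers roughly 10 MeV. The sketch below shows the arithmetic for such a fixed-width axis, following ROOT's convention that bin 0 is the underflow bin and bin `nbins + 1` is the overflow bin. The function names are hypothetical and the code is an illustration, not ROOT's own implementation.

```cpp
// Width of one bin on a uniformly binned axis.
double bin_width(int nbins, double low, double up)
{
    return (up - low) / nbins;
}

// Bin number for a value x: 0 = underflow, 1..nbins = regular bins,
// nbins + 1 = overflow. The upper edge itself belongs to the overflow bin.
int find_bin(double x, int nbins, double low, double up)
{
    if (x < low) return 0;
    if (x >= up) return nbins + 1;
    return 1 + static_cast<int>(nbins * (x - low) / (up - low));
}
```

Dimuon masses below 0.25 GeV or at and above 300 GeV therefore end up in the underflow and overflow bins and do not appear in the plotted range.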

## Plot the dimuon spectrum

Now, the computation graph is set up. Next, we want to have a look at the result.

Note that the event loop actually runs the first time we try to access the histogram object (results of an `RDataFrame` graph are computed lazily).
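The idea of lazy evaluation can be illustrated with a few lines of plain C++. The class below is a hypothetical stand-in, not ROOT's result type: it stores a computation when the result is booked and runs it only on first access, caching the value afterwards.

```cpp
#include <functional>
#include <utility>

// Hypothetical illustration of a lazily evaluated result.
// T must be default-constructible in this simplified sketch.
template <typename T>
class LazyResult {
public:
    explicit LazyResult(std::function<T()> compute) : compute_(std::move(compute)) {}

    // Runs the stored computation on first access only;
    // later accesses return the cached value.
    const T& operator*()
    {
        if (!done_) {
            value_ = compute_();
            done_ = true;
        }
        return value_;
    }

    // True once the computation has been triggered.
    bool ready() const { return done_; }

private:
    std::function<T()> compute_;
    T value_{};
    bool done_ = false;
};
```

Constructing a `LazyResult` costs nothing beyond storing the callable, which is analogous to booking a histogram or a count on the computation graph; the work happens when the result is dereferenced for the first time.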

`%%time` measures the time spent in the full cell.

In [13]:
%%time

gStyle->SetOptStat(0); gStyle->SetTextFont(42);
auto c = new TCanvas("c", "", 800, 700);
c->SetLogx(); c->SetLogy();
h->SetTitle("");
h->GetXaxis()->SetTitle("m_{#mu#mu} (GeV)"); h->GetXaxis()->SetTitleSize(0.04);
h->GetYaxis()->SetTitle("N_{Events}"); h->GetYaxis()->SetTitleSize(0.04);
h->Draw();

TLatex label; label.SetNDC(true);
label.SetTextSize(0.040); label.DrawLatex(0.100, 0.920, "#bf{CMS Open Data}");
label.SetTextSize(0.030); label.DrawLatex(0.500, 0.920, "#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}");

Time: 5.941462516784668 seconds.



ROOT provides interactive JavaScript graphics for Jupyter, which can be activated with the `%jsroot` magic. Click and drag on the axis to zoom in and double click to reset the view.

Don't forget that you can improve the statistics by increasing the number of events given to `Range`.

In [14]:
%jsroot on
c->Draw()