# RooFit Tutorial: Introduction to Unbinned Likelihood Models

Jonas Rembser (CERN), 2022

<https://github.com/root-project/training/blob/master/RooFit/2022/roofit-tutorial-01.ipynb>

## Setup

Import ROOT and NumPy:

In [None]:
import ROOT
import numpy as np

Silence the RooFit logging:

In [None]:
ROOT.RooMsgService.instance().setGlobalKillBelow(ROOT.RooFit.FATAL)

## The basics

Mathematical concepts are represented by C++ objects:


![roofit_classes.png](roofit_classes.png)

### Creating your first RooFit model

In [None]:
import ROOT

Observable:

In [None]:
x = ROOT.RooRealVar("x", "x", 0, 0, 10)

Parameters:

In [None]:
mean = ROOT.RooRealVar("mean", "mean of gaussian", 5, 0, 10)
sigma = ROOT.RooRealVar("sigma", "width of gaussian", 1, 0.1, 10)

Gaussian PDF:

In [None]:
gauss = ROOT.RooGaussian("gauss", "gaussian PDF", x, mean, sigma)
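
As a rough illustration of what such a PDF object represents, the following plain-Python sketch (no ROOT involved) evaluates a Gaussian density that is normalized to unit integral over the range of the observable, here assumed to be [0, 10] as for `x` above. The function name `gauss_pdf` and the midpoint-rule check are illustrative choices, not part of RooFit.

```python
import math

def gauss_pdf(x, mean, sigma, lo=0.0, hi=10.0):
    """Gaussian density normalized to unit integral on [lo, hi]."""
    if not lo <= x <= hi:
        return 0.0
    shape = math.exp(-0.5 * ((x - mean) / sigma) ** 2)
    # Integral of the unnormalized shape over [lo, hi], via the error function.
    scale = sigma * math.sqrt(math.pi / 2.0)
    norm = scale * (math.erf((hi - mean) / (sigma * math.sqrt(2.0)))
                    - math.erf((lo - mean) / (sigma * math.sqrt(2.0))))
    return shape / norm

# Midpoint-rule check that the density integrates to one over the range.
n_steps = 10000
width = 10.0 / n_steps
integral = sum(gauss_pdf((i + 0.5) * width, 5.0, 1.0) * width for i in range(n_steps))
print(round(integral, 6))
```

The normalization depends on the parameter values and on the range, which is why a PDF needs to know its observable.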

PDF inspection:

In [None]:
gauss.Print("t")

### Toy dataset generation and fitting

Generate a toy dataset with 9000 entries sampled from the Gaussian PDF:

In [None]:
data = gauss.generate({x}, 9000)

In [None]:
data.Print()

Fit the PDF to the toy data, saving the fit result:

In [None]:
fit_result = gauss.fitTo(data, PrintLevel=-1, Save=True)

In [None]:
fit_result.Print()
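
Conceptually, an unbinned fit minimizes the negative log-likelihood (NLL) summed over all events. The plain-Python sketch below illustrates this for a Gaussian, under the simplifying assumption that the truncation to the range of `x` can be ignored, so the minimum is known in closed form. The helper `nll` and the toy sample are illustrative and not taken from RooFit.

```python
import math
import random

random.seed(1)
sample = [random.gauss(5.0, 1.0) for _ in range(2000)]

def nll(mean, sigma, values):
    """Negative log-likelihood of an (untruncated) Gaussian, summed per event."""
    return sum(0.5 * ((v - mean) / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))
               for v in values)

# For an untruncated Gaussian the minimum is known in closed form.
mean_hat = sum(sample) / len(sample)
sigma_hat = math.sqrt(sum((v - mean_hat) ** 2 for v in sample) / len(sample))

best = nll(mean_hat, sigma_hat, sample)
# Moving either parameter away from the estimate increases the NLL.
print(nll(mean_hat + 0.1, sigma_hat, sample) > best)
print(nll(mean_hat, sigma_hat * 1.1, sample) > best)
```

For models without a closed-form solution, the minimum has to be found numerically, which is what the fit does.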

Inspect the correlation of your model parameters:

In [None]:
fit_result.correlationMatrix().Print()
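
The correlation matrix is the covariance matrix with each element divided by the product of the two corresponding parameter uncertainties. A minimal plain-Python sketch of that relation, using a made-up covariance matrix (the numbers are hypothetical, not a fit output):

```python
import math

def correlation_from_covariance(cov):
    """Convert a covariance matrix (list of lists) into a correlation matrix."""
    sigmas = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sigmas[i] * sigmas[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical covariance matrix for (mean, sigma); the numbers are made up.
cov = [[4.0e-4, 1.0e-5],
       [1.0e-5, 1.0e-4]]
corr = correlation_from_covariance(cov)
print(corr)
```

The diagonal is one by construction, and off-diagonal values close to +1 or -1 indicate strongly (anti)correlated parameters.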

## Plotting the data and the model

Create a `RooPlot` object on which the data and the PDF are plotted:

In [None]:
x_frame = x.frame(Title="Gaussian PDF with data")

In [None]:
data.plotOn(x_frame)
gauss.plotOn(x_frame);

Draw the RooPlot on a `TCanvas`:

In [None]:
c1 = ROOT.TCanvas("c1", "c1", 500, 300)
x_frame.Draw()
c1.Draw()

### Exporting your RooFit datasets

You can export a RooDataSet to NumPy or Pandas:

In [None]:
df = data.to_pandas()

In [None]:
df

## Composite PDFs

Composite PDF: a model with multiple components, like signal and background.

Add random "background" values sampled from an exponential PDF to the dataset:

In [None]:
exp_tau = -0.18
arr_new = np.concatenate([data.to_numpy()["x"],
                          np.random.exponential(-1. / exp_tau, size=4 * data.numEntries())])
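
The scale argument `-1./exp_tau` follows from matching conventions: a density proportional to `exp(exp_tau * x)` with negative `exp_tau` is an exponential distribution with rate `-exp_tau` and mean `-1/exp_tau`. The sketch below checks this with the standard library's `random.expovariate` instead of NumPy, which is a substitution made only to keep the example dependency-free. It also estimates which fraction of the sampled values falls inside [0, 10].

```python
import random

random.seed(2)
exp_tau = -0.18  # same slope as in the cell above

# Density proportional to exp(exp_tau * x): rate = -exp_tau, mean = -1 / exp_tau.
values = [random.expovariate(-exp_tau) for _ in range(50000)]
sample_mean = sum(values) / len(values)

# Fraction that lands inside the observable range [0, 10].
frac_in_range = sum(v < 10.0 for v in values) / len(values)
print(sample_mean, frac_in_range)
```

Because a sizeable fraction of the values lies above the upper limit of `x`, the number of background entries that survive a range cut is smaller than the number sampled.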

Import the new array back to a RooDataSet:

In [None]:
data_x = ROOT.RooDataSet.from_numpy({"x" : arr_new[arr_new < x.getMax()]}, {x}, name="data_new")

In [None]:
data_x.Print()

Visualize the dataset:

In [None]:
x_frame = x.frame(Title="Plotting Gaussian plus exp. background")
data_x.plotOn(x_frame)

c2 = ROOT.TCanvas()
x_frame.Draw()
c2.Draw()

### Creating the composite fit model

Create exponential PDF with parameter "tau":

In [None]:
tau = ROOT.RooRealVar("tau", "tau", -0.2, -10.0, -0.01)
expo = ROOT.RooExponential("expo", "expo", x, tau)

Define parameters for the number of signal and background events:

In [None]:
n_sig = ROOT.RooRealVar("n_sig", "n_sig", 10000, 1000, 100000)
n_bkg = ROOT.RooRealVar("n_bkg", "n_bkg", 50000, 5000, 500000)

Create a composite model that automatically includes a Poisson term for the total number of events:

In [None]:
model = ROOT.RooAddPdf("model", "model", [gauss, expo], [n_sig, n_bkg])
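
The effect of the Poisson term can be seen in a small plain-Python sketch of the extended negative log-likelihood, written up to a constant as the expected total yield minus the sum over events of the log of the yield-weighted component densities. The per-event densities below are made-up numbers, and `extended_nll` is an illustrative helper, not a RooFit function.

```python
import math

def extended_nll(n_sig, n_bkg, sig_densities, bkg_densities):
    """Extended NLL (up to a constant) given per-event component densities."""
    total = n_sig + n_bkg
    log_sum = sum(math.log(n_sig * s + n_bkg * b)
                  for s, b in zip(sig_densities, bkg_densities))
    return total - log_sum

# Made-up per-event densities for five events, for illustration only.
sig = [0.30, 0.05, 0.40, 0.01, 0.20]
bkg = [0.10, 0.12, 0.08, 0.15, 0.09]

# Scan the total yield at a fixed signal fraction of 0.4.
scan = {n_tot: extended_nll(0.4 * n_tot, 0.6 * n_tot, sig, bkg) for n_tot in range(1, 11)}
best_total = min(scan, key=scan.get)
print(best_total)
```

At a fixed signal fraction, the scan is minimal where the total yield equals the number of observed events, which is how the yields `n_sig` and `n_bkg` become measurable in the fit.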

Do the fit:

In [None]:
fit_result = model.fitTo(data_x, PrintLevel=-1, Save=True)
fit_result.Print()

## Creating a nice plot

Create RooPlot and draw data, PDF, and components:

In [None]:
x_frame = x.frame(Title="Gaussian plus exp. background")

data_x.plotOn(x_frame, Name="data")

model.plotOn(x_frame, Components=gauss, LineColor="r", LineStyle="--", Name="gauss")
model.plotOn(x_frame, Components=expo, LineColor="k", LineStyle="--", Name="expo")
model.plotOn(x_frame, Name="model");

Add a legend:

In [None]:
legend = ROOT.TLegend(0.7, 0.55, 0.92, 0.87)
legend.SetBorderSize(0)
legend.SetFillStyle(0)
legend.AddEntry(x_frame.findObject("data"), "data", "P")

for name in ["model", "gauss", "expo"]:
    legend.AddEntry(x_frame.findObject(name), name, "L")

Create a second frame with the residuals:

In [None]:
resid_hist = x_frame.residHist()

resid_frame = x.frame(Title=";x;residuals")
resid_frame.addPlotable(resid_hist, "P")
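
A residual is the difference between the data and the model curve in each bin. Dividing it by the uncertainty of the data point gives a pull, which `RooPlot` can provide via `pullHist()`. The sketch below shows the two quantities side by side for made-up bin contents. Approximating the uncertainty by `sqrt(n)` is an assumption of this sketch and may differ from the error bars RooFit draws.

```python
import math

# Made-up bin contents and model predictions, for illustration only.
observed = [95, 110, 130, 102, 88]
expected = [100.0, 105.0, 125.0, 108.0, 90.0]

residuals = [n - mu for n, mu in zip(observed, expected)]
# Pull: residual in units of the data uncertainty, here approximated as sqrt(n).
pulls = [(n - mu) / math.sqrt(n) for n, mu in zip(observed, expected)]
print(residuals)
print(pulls)
```

Pulls are easier to judge by eye than residuals when the bin contents vary strongly across the range, since they are expressed in units of the uncertainty.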

Create a canvas that is divided into two drawing pads:

In [None]:
c3 = ROOT.TCanvas("c3", "c3", 600, 600)
c3.Divide(1, 2)

First pad is for the main plot and the legend:

In [None]:
pad_1 = c3.cd(1)
x_frame.Draw()
legend.Draw()
pad_1.SetPad(0.0, 0.2, 1, 1)

Second pad is for the residuals:

In [None]:
pad_2 = c3.cd(2)
pad_2.SetPad(0., 0.0, 1, 0.25)
resid_frame.Draw()
resid_frame.GetXaxis().SetLabelSize(0.12)
resid_frame.GetYaxis().SetLabelSize(0.12)
resid_frame.GetYaxis().SetTitleSize(0.12)
resid_frame.GetYaxis().SetTitleOffset(0.25)

Draw the canvas:

In [None]:
c3.Draw()

## Template fits with convolutions

In [None]:
template_hist = ROOT.TH1D("h1", "h1", 100, 0, 10)
f1 = ROOT.TF1("f1", "std::exp(-std::abs((x-5)))", 0, 10)
template_hist.FillRandom("f1", 100000)

In [None]:
c4 = ROOT.TCanvas()
template_hist.Draw()
c4.Draw()

In [None]:
y = ROOT.RooRealVar("y", "y", 0, 10)

roo_template_hist = ROOT.RooDataHist("roo_template_hist", "roo_template_hist", y, template_hist)

sig_raw_y = ROOT.RooHistPdf("sig_raw_y", "sig_raw_y", y, roo_template_hist)

resolution = ROOT.RooRealVar("resolution", "resolution", 0.2, 0.1, 1.0)
sig_smearing_y = ROOT.RooGaussian("sig_smearing_y", "sig_smearing_y", y, ROOT.RooFit.RooConst(0.0), resolution)

sig_y = ROOT.RooFFTConvPdf("sig_y", "sig_y", y, sig_raw_y, sig_smearing_y)

bkg_y = ROOT.RooChebychev("bkg_y", "bkg_y", y, [-0.5, 0.1])

model_y = ROOT.RooAddPdf("model_y", "model_y", [sig_y, bkg_y], [n_sig, n_bkg])
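
To give an intuition for the convolution step, the following plain-Python sketch smears a binned template of the same shape, `exp(-|y - 5|)` on 100 bins over [0, 10], with a zero-mean Gaussian kernel. It uses a direct double sum rather than an FFT and renormalizes at the end, so it only approximates what `RooFFTConvPdf` does. The function name and the boundary handling are assumptions of this sketch.

```python
import math

def convolve_with_gaussian(bin_contents, bin_width, resolution):
    """Smear a binned shape with a zero-mean Gaussian kernel (direct sum, no FFT)."""
    n = len(bin_contents)
    smeared = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            distance = (i - j) * bin_width
            kernel = math.exp(-0.5 * (distance / resolution) ** 2)
            total += bin_contents[j] * kernel
        smeared.append(total)
    # Renormalize so the smeared shape has the same sum as the input.
    scale = sum(bin_contents) / sum(smeared)
    return [value * scale for value in smeared]

# Template shape exp(-|y - 5|) on 100 bins over [0, 10], as in the histogram above.
bin_width = 0.1
template = [math.exp(-abs((i + 0.5) * bin_width - 5.0)) for i in range(100)]
smeared = convolve_with_gaussian(template, bin_width, resolution=0.2)
print(max(template), max(smeared))
```

The smearing lowers and rounds the sharp peak of the template while keeping its overall normalization, and a larger `resolution` broadens it further.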

In [None]:
data_y = model_y.generate(y)

In [None]:
fit_result = model_y.fitTo(data_y, PrintLevel=-1, Save=True)
fit_result.Print()

In [None]:
y_frame = y.frame(Title="Model for y")

data_y.plotOn(y_frame)
model_y.plotOn(y_frame)

c5 = ROOT.TCanvas()
y_frame.Draw()
c5.Draw()

### Overview of other PDF types

RooFit provides a collection of standard PDF classes, e.g.:

![roofit_pdfs.png](roofit_pdfs.png)

It is easy to **extend the library**: each PDF is a separate C++ class.

## Multivariate fit

In [None]:
model_sig_xy = ROOT.RooProdPdf("model_sig_xy", "model_sig_xy", [gauss, sig_y])
model_bkg_xy = ROOT.RooProdPdf("model_bkg_xy", "model_bkg_xy", [expo, bkg_y])

model_xy = ROOT.RooAddPdf("model_xy", "model_xy", [model_sig_xy, model_bkg_xy], [n_sig, n_bkg])

In [None]:
data_xy = model_xy.generate({x, y}, 10000)

In [None]:
fit_result_xy = model_xy.fitTo(data_xy, PrintLevel=-1, Save=True)
fit_result_xy.Print()

In [None]:
x_frame = x.frame(Title="Model for x")
y_frame = y.frame(Title="Model for y")

data_xy.plotOn(x_frame)
model_xy.plotOn(x_frame)

data_xy.plotOn(y_frame)
model_xy.plotOn(y_frame)

c6 = ROOT.TCanvas("c6", "c6", 800, 400)
c6.Divide(2)

c6.cd(1)
x_frame.Draw()
c6.cd(2)
y_frame.Draw()

c6.Draw()

## Model inspection

In [None]:
model.Print("t")

In [None]:
model_xy.graphVizTree("model.dot")

In [None]:
!dot -Tgif -o model.gif model.dot

![model.gif](model.gif)

## The RooWorkspace

In [None]:
ws = ROOT.RooWorkspace("myworkspace")
ws.Import(model_xy);

In [None]:
ws.Print()

In [None]:
ws["model_xy"].Print()

In [None]:
ws.writeToFile("myworkspace.root");

## Exercises

1. Further improve the plot with the residual distribution by also visualizing the post-fit uncertainty of the model. Figure out how to do this by reading the documentation of [RooAbsPdf::plotOn()](https://root.cern.ch/doc/master/classRooAbsPdf.html#aa0f2f98d89525302a06a1b7f1b0c2aa6). *Hint: use one of the many keyword arguments.*

2. Look at the [rf203_ranges.py RooFit tutorial](https://root.cern/doc/master/rf203__ranges_8py.html) to learn how to restrict the fit to a subrange. Redo the convolved template fit to the $y$ variable, but restricted to the range from 3 to 7.

   Why does the uncertainty of the `resolution` parameter increase, even though we are not excluding that much signal and `resolution` doesn't affect the background?

3. In a fresh notebook, open the `RooWorkspace` we wrote to disk and create new toy data according to the multidimensional model. Then, fit the model to the new data and look at the fit result. Is it compatible with the one in this notebook?

4. By calling [correlationMatrix()](https://root.cern.ch/doc/master/classRooFitResult.html#afed3209d7be07a028e5c2131666c9c19) on a `RooFitResult` object, you can inspect the correlation between the fit parameters:

   `fit_result.correlationMatrix().Print()`

   Which parameters are strongly (anti)correlated in the final multidimensional fit? Can you explain why?

5. For the multidimensional model, why did we not just create a single `RooProdPdf` that multiplies the model for $x$ and the model for $y$? *Hint: think about the model from a mathematical point of view.*

## Exercise solutions

### Exercise 1 - Visualizing fit uncertainties

### Exercise 2 - Ranged fit

### Exercise 3 - Reusing the RooWorkspace

### Exercise 4 - The correlation matrix

### Exercise 5 - Why can't we take the naive product?

In [None]:
df_xy = data_xy.to_pandas()
df_xy_sel = df_xy.query("x >= 3 & x <= 6")
data_xy_sel = ROOT.RooDataSet.from_pandas(df_xy_sel, {x, y})

In [None]:
x_frame = x.frame(Title="Data for x")
y_frame = y.frame(Title="Data for y")

data_xy.plotOn(x_frame)
data_xy.plotOn(y_frame)
data_xy_sel.plotOn(x_frame, MarkerColor="r")
data_xy_sel.plotOn(y_frame, MarkerColor="r")

c7 = ROOT.TCanvas()
c7.Divide(2)

c7.cd(1)
x_frame.Draw()
c7.cd(2)
y_frame.Draw()

c7.Draw()
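
The same point can be made with a toy calculation in plain Python. The sketch below replaces the continuous observables by two coarse regions each ("peak" and "tail") and uses made-up probabilities; all names and numbers are hypothetical. It compares a sum of per-component products, which is the structure of `model_xy`, with the naive product of the two one-dimensional sums.

```python
# Made-up signal fraction and per-component probabilities, for illustration only.
f_sig = 0.3
sig_x = {"peak": 0.9, "tail": 0.1}
bkg_x = {"peak": 0.2, "tail": 0.8}
sig_y = {"peak": 0.8, "tail": 0.2}
bkg_y = {"peak": 0.3, "tail": 0.7}

def joint(x, y):
    """Sum over components of per-component products (the model_xy structure)."""
    return f_sig * sig_x[x] * sig_y[y] + (1 - f_sig) * bkg_x[x] * bkg_y[y]

def naive(x, y):
    """Product of the two one-dimensional sums (the naive alternative)."""
    px = f_sig * sig_x[x] + (1 - f_sig) * bkg_x[x]
    py = f_sig * sig_y[y] + (1 - f_sig) * bkg_y[y]
    return px * py

# Probability of y == "peak" given x == "peak" under each construction.
cond_joint = joint("peak", "peak") / (joint("peak", "peak") + joint("peak", "tail"))
cond_naive = naive("peak", "peak") / (naive("peak", "peak") + naive("peak", "tail"))
marginal_y = f_sig * sig_y["peak"] + (1 - f_sig) * bkg_y["peak"]
print(cond_joint, cond_naive, marginal_y)
```

In the naive product, selecting on x leaves the distribution of y unchanged. In the sum of products, selecting the x peak enriches the signal component and therefore changes the y distribution, which is the kind of effect the selection on x in the cells above is meant to reveal.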