# Workshop: Advanced Fitting with zfit

**Course:** Statistics for Particle Physics  
**Duration:** 2-3 Hours  
**Level:** Advanced (Closing Session)

---

## Introduction

Welcome to the final exercise of the course. Over the last 11 hours, we have covered probability theory, PDFs, likelihoods, and confidence intervals. Now we will apply this knowledge using **zfit**.

### Why zfit?
In High Energy Physics (HEP), we often deal with:
* **Complex Models:** Sums of signals and backgrounds, convolutions, etc.
* **High Statistics:** Millions of events.
* **Many Parameters:** Systematics, nuisance parameters.

**zfit** is a Python library built on top of **TensorFlow**. It is designed to be:
* **Scalable:** Runs on GPUs/CPUs efficiently.
* **Pythonic:** Integrates with the Scikit-HEP ecosystem (`uproot`, `mplhep`, `awkward`).
* **Flexible:** Allows custom PDFs and arbitrary loss functions.

## Agenda
1.  **The Basics:** Spaces, Parameters, and PDFs.
2.  **The Fit:** Unbinned Likelihoods and Minuit.
3.  **Visualization:** Data, Models, and Pull Plots.
4.  **Extended Likelihoods:** Fitting Yields.
5.  **Constraints:** Handling Systematic Uncertainties.
6.  **Simultaneous Fits:** Signal + Control Regions.
7.  **2D Fits:** Correlations.

### 0. Setup and Imports

In [None]:
# Requires zfit and mplhep, e.g.: pip install zfit mplhep

import zfit
import numpy as np
import matplotlib.pyplot as plt
import mplhep as hep

# Set HEP plotting style (optional but looks nice)
plt.style.use(hep.style.ROOT)

print(f"zfit version: {zfit.__version__}")
print(f"numpy version: {np.__version__}")



---

## 1. The Building Blocks

In zfit, everything is an object. We need to define the "Universe" our physics lives in before we can do anything.

### 1.1 The Observable Space (`zfit.Space`)
The domain of definition for our PDF. If we fit a mass peak, this is the mass range.

$$ \text{Observable} \in [\text{min}, \text{max}] $$

In [None]:
# Define the B meson mass range
obs = zfit.Space("mass", limits=(5000, 6000)) # MeV

### 1.2 Parameters (`zfit.Parameter`)
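Variables that the fitter can change (floating) or that we keep constant (fixed).

Syntax: `zfit.Parameter(name, value, [lower, upper], [step_size], [floating])`

* Parameters are **floating** by default (the fit can change them), whether or not limits are provided.
* Limits only restrict the range in which the minimizer may move the parameter.
* To **fix** a parameter, pass `floating=False` at creation or set `param.floating = False` later.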

In [None]:
# Signal parameters (Gaussian)
mu = zfit.Parameter("mu", 5279, 5200, 5350)     # Floating
sigma = zfit.Parameter("sigma", 30, 1, 100)     # Floating

# Background parameters (Exponential)
lam = zfit.Parameter("lambda", -0.002, -0.01, 0) # Floating

### 1.3 Probability Density Functions (PDFs)
zfit provides standard shapes in `zfit.pdf` (Gauss, CrystalBall, Exponential, Voigt, etc.).

We will build a simple **Signal + Background** model.

$$ PDF_{total}(x) = f_{sig} \cdot G(x; \mu, \sigma) + (1 - f_{sig}) \cdot E(x; \lambda) $$

In [None]:
# Create component PDFs
signal_pdf = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs, name="Signal")
bkg_pdf = zfit.pdf.Exponential(lambda_=lam, obs=obs, name="Background")

# Create the combined model
frac = zfit.Parameter("frac", 0.3, 0, 1) # 30% signal fraction guess
model = zfit.pdf.SumPDF([signal_pdf, bkg_pdf], fracs=frac)
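
To make the formula above concrete, here is a minimal NumPy-only sketch (not zfit code) of what "a PDF on an observable space" means: each component is normalised over the mass window, so the fraction-weighted sum integrates to one as well. The helper names `integrate` and `normalised` and the parameter values are illustrative assumptions.

```python
import numpy as np

LOW, HIGH = 5000.0, 6000.0  # same mass window as `obs`, in MeV


def integrate(y, x):
    """Trapezoidal rule on a fixed grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def normalised(shape, x):
    """Divide a non-negative shape by its integral over the window."""
    return shape / integrate(shape, x)


x = np.linspace(LOW, HIGH, 20001)
gauss = normalised(np.exp(-0.5 * ((x - 5280.0) / 25.0) ** 2), x)
expo = normalised(np.exp(-0.003 * x), x)

# Fraction-weighted sum of two normalised components
f_sig = 0.2
total = f_sig * gauss + (1.0 - f_sig) * expo

print(f"integral of total PDF over the window: {integrate(total, x):.6f}")
```

zfit performs this normalisation for you, using the limits of the `zfit.Space` passed as `obs`.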

### 1.4 Generating Data
For this exercise, we will generate "Toy MC" data from the model itself. In a real analysis, you would load this from a ROOT file or Pandas DataFrame.

Note: We first set the parameters to their "true" values to generate the data, then move them away from those values before fitting.

In [None]:
# Set "True" values for generation
mu.set_value(5280)
sigma.set_value(25)
lam.set_value(-0.003)
frac.set_value(0.2)

# Sample 5000 events
n_events = 5000
data = model.sample(n=n_events)

# Convert to numpy for basic plotting
data_np = data.numpy()
print(f"Generated {len(data_np)} events.")

# Simple check
plt.hist(data_np, bins=50, histtype='step');

---

## 2. The Fit (Unbinned Likelihood)

### 2.1 Theory Recap
We use the **Unbinned Negative Log-Likelihood (NLL)**.

$$ -\ln \mathcal{L}(\theta) = - \sum_{i=1}^{N} \ln P(x_i; \theta) $$

Minimizing this quantity yields the best estimators for parameters $\theta$.
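
As a minimal illustration of the formula, independent of zfit, the sketch below evaluates the NLL of a plain (untruncated) Gaussian on toy data and scans it over $\mu$ on a grid. The dataset, the grid, and the function name `gauss_nll` are assumptions made for this example; a real fit uses a proper minimizer instead of a scan.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=5280.0, scale=25.0, size=2000)  # toy "masses" in MeV


def gauss_nll(mu, sigma, x):
    """Unbinned NLL: minus the sum of the log-densities over all events."""
    log_pdf = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return -np.sum(log_pdf)


# Scan mu with sigma held at its true value; the minimum is the estimate.
mu_grid = np.linspace(5270.0, 5290.0, 2001)
nll_values = np.array([gauss_nll(mu, 25.0, sample) for mu in mu_grid])
mu_hat = mu_grid[np.argmin(nll_values)]

print(f"mu_hat from scan = {mu_hat:.2f}, sample mean = {sample.mean():.2f}")
```

For a Gaussian with known width, the minimum of the NLL coincides with the sample mean, which is what the scan recovers up to the grid spacing.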

In [None]:
# 1. Move the parameters away from the true values so the fit has work to do
mu.set_value(5250)
sigma.set_value(50)
lam.set_value(-0.001)
frac.set_value(0.5)

# 2. Define the Loss Function
nll = zfit.loss.UnbinnedNLL(model=model, data=data)

# 3. Define the Minimizer (Minuit)
minimizer = zfit.minimize.Minuit()

# 4. Run the minimization
result = minimizer.minimize(nll)

### 2.2 Analyzing the Result
The `result` object contains the parameter values, validity checks, and can compute errors.

In [None]:
print(f"Fit Converged: {result.converged}")
print(f"Fit Valid: {result.valid}")

# Compute Hessian Errors (Parabolic approximation)
result.hesse()

print(result.params)

### 2.3 Visualization (Best Practices)
A fit is worthless if you don't check it visually. In HEP, we almost always plot:
1.  **Data** (Error bars)
2.  **Model** (Line)
3.  **Pulls** (Residuals normalized by error): $\frac{N_{data} - N_{model}}{\sigma}$

In [None]:
def plot_fit_with_pulls(model, data, obs, bins=50):
    # Prepare Data
    data_np = data.numpy()
    lower, upper = obs.limit1d  # (lower, upper) of a 1D space
    
    # Histogram the data
    counts, edges = np.histogram(data_np, bins=bins, range=(lower, upper))
    centers = (edges[:-1] + edges[1:]) / 2
    yerr = np.sqrt(counts)

    # Evaluate Model
    x_plot = np.linspace(lower, upper, 1000)
    y_pdf = model.pdf(x_plot).numpy()
    # Scale PDF to data count
    scale = len(data_np) * (upper - lower) / bins
    y_model = y_pdf * scale
    
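    # Evaluate Model at Bin Centers for Pulls
    y_model_at_centers = model.pdf(centers).numpy() * scale
    # Empty bins have yerr = 0: leave their pull undefined (NaN) instead of dividing by zero
    pulls = np.full(len(counts), np.nan)
    filled = counts > 0
    pulls[filled] = (counts[filled] - y_model_at_centers[filled]) / yerr[filled]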

    # --- Plotting ---
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(9, 8), sharex=True, gridspec_kw={'height_ratios': [3, 1]})
    
    # Main Plot
    ax1.errorbar(centers, counts, yerr=yerr, fmt='ko', label="Data")
    ax1.plot(x_plot, y_model, 'b-', label="Total Fit", linewidth=2)
    ax1.set_ylabel("Events")
    ax1.legend()
    
    # Pull Plot
    ax2.errorbar(centers, pulls, yerr=1, fmt='ko')
    ax2.axhline(0, color='b', linestyle='--')
    ax2.fill_between([lower, upper], -2, 2, color='y', alpha=0.2, label=r"$2\sigma$")
    ax2.set_ylabel("Pull")
    ax2.set_xlabel(f"{obs.obs[0]} [MeV]")  # obs.obs holds the observable names, here ("mass",)
    ax2.set_ylim(-5, 5)
    
    plt.subplots_adjust(hspace=0.05)
    plt.show()

plot_fit_with_pulls(model, data, obs)

### 📝 Exercise 1: Improving the Model
1.  Replace the `zfit.pdf.Gauss` signal with a `zfit.pdf.CrystalBall`.
    * You will need two new parameters: `alpha` (try ~1.0) and `n` (try ~2.0).
2.  Perform the fit again.
3.  Compare the standard deviation (`sigma`) obtained from the Gaussian fit vs the CB fit.

---

## 3. Extended Likelihoods (Fitting Yields)

So far, we fitted the **fraction** (`frac`) of signal. The total number of events was fixed to the observed number.

In physics, we want to know the **Yield** ($N_{sig}$, $N_{bkg}$) and its uncertainty.
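We use the **Extended Likelihood**, which adds a Poisson term for the total number of events to the NLL (the constant $\ln N_{obs}!$ is dropped):

$$ -\ln \mathcal{L}_{ext}(\theta) = -\sum_{i=1}^{N_{obs}} \ln P(x_i; \theta) + (N_{exp} - N_{obs} \ln N_{exp}) $$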

In zfit, we just attach a yield parameter to the PDF.
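
The effect of the Poisson term can be checked in isolation. The following NumPy-only sketch (illustrative, not zfit code; `poisson_term` is a name chosen for this example) scans the extra term over the expected yield for a dataset of 5000 events and reads off the interval where it rises by 0.5 above its minimum.

```python
import numpy as np

n_obs = 5000  # number of events in the toy dataset above


def poisson_term(n_exp, n_obs):
    """Extra NLL term of the extended likelihood (constant ln(n_obs!) dropped)."""
    return n_exp - n_obs * np.log(n_exp)


n_grid = np.arange(4000, 6001)  # candidate expected yields
n_best = n_grid[np.argmin(poisson_term(n_grid, n_obs))]

# Half-width of the interval where the term rises by 0.5 above its minimum
rise = poisson_term(n_grid, n_obs) - poisson_term(n_best, n_obs)
inside = n_grid[rise <= 0.5]
half_width = 0.5 * (inside.max() - inside.min())

print(f"best N_exp = {n_best}, half-width = {half_width:.1f}, sqrt(N_obs) = {np.sqrt(n_obs):.1f}")
```

The term alone pulls the total expected yield towards the observed count, with a width close to $\sqrt{N_{obs}}$. How this total is split into $N_{sig}$ and $N_{bkg}$ is decided by the shape part of the likelihood.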

In [None]:
# 1. Define Yield Parameters (Floating)
n_sig = zfit.Parameter("n_sig", 1000, 0, 10000)
n_bkg = zfit.Parameter("n_bkg", 4000, 0, 10000)

# 2. Extend the PDFs
# Note: We don't need 'frac' anymore. The sum is determined by n_sig + n_bkg
signal_ext = signal_pdf.create_extended(n_sig)
bkg_ext = bkg_pdf.create_extended(n_bkg)

# 3. Create Sum
model_ext = zfit.pdf.SumPDF([signal_ext, bkg_ext])

# 4. Loss (Note: ExtendedUnbinnedNLL)
nll_ext = zfit.loss.ExtendedUnbinnedNLL(model=model_ext, data=data)

# 5. Fit
result_ext = minimizer.minimize(nll_ext)
result_ext.hesse()
print(result_ext.params)

### 📝 Exercise 2: Uncertainty Scaling
In a simple counting experiment, the error on $N$ is $\sqrt{N}$.
1.  Check the error on `n_sig` from the fit output.
2.  Is it larger or smaller than $\sqrt{N_{sig}}$?
3.  **Thought experiment:** Why is it different? (Hint: Does the background shape look like the signal?)

---

## 4. Constraints (Systematics)

Often, we know something about a parameter from an external source (e.g., the Jet Energy Scale is known to 1%). We can add this information as a **Constraint**.

Mathematically, this multiplies the Likelihood by a Gaussian:
$$ \mathcal{L}_{total} = \mathcal{L}_{data} \times G(\theta; \mu_{ext}, \sigma_{ext}) $$

Let's assume we have a theoretical prediction for the background slope `lambda` = -0.003 +/- 0.0001.

In [None]:
# 1. Define the Constraint
# We constrain the parameter 'lam' to -0.003 with width 0.0001
constraint = zfit.constraint.GaussianConstraint(params=lam, observation=-0.003, uncertainty=0.0001)

# 2. Add to Loss
# Constraints are just added to the likelihood
nll_constrained = zfit.loss.ExtendedUnbinnedNLL(model=model_ext, data=data, constraints=constraint)

# 3. Fit
result_c = minimizer.minimize(nll_constrained)
result_c.hesse()

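# The Hesse errors are stored under "hesse" in recent zfit releases ("minuit_hesse" in older ones)
def hesse_error(result, param):
    info = result.params[param]
    return info["hesse" if "hesse" in info else "minuit_hesse"]["error"]

print(f"Fitted lambda without constraint: {result_ext.params[lam]['value']:.5f} +/- {hesse_error(result_ext, lam):.5f}")
print(f"Fitted lambda WITH constraint:    {result_c.params[lam]['value']:.5f} +/- {hesse_error(result_c, lam):.5f}")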

Note how the error on `lambda` decreases, and this might propagate to improve the error on `n_sig`!
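
Why the error shrinks can be seen in the parabolic approximation: if the data alone give a Gaussian-shaped likelihood for the slope, multiplying by the Gaussian constraint is an inverse-variance weighted average. The sketch below uses plain Python; the data-only numbers are hypothetical values chosen for illustration, not the output of the fit above, and `combine` is a helper defined only for this example.

```python
import math


def combine(value_data, sigma_data, value_ext, sigma_ext):
    """Minimum and width of the sum of two parabolic NLL terms."""
    w_data = 1.0 / sigma_data**2
    w_ext = 1.0 / sigma_ext**2
    value = (w_data * value_data + w_ext * value_ext) / (w_data + w_ext)
    sigma = 1.0 / math.sqrt(w_data + w_ext)
    return value, sigma


# Hypothetical data-only result for the slope (illustrative numbers, not fit output)
lam_data, sigma_lam_data = -0.0031, 0.0002
# External knowledge used in the constraint above
lam_ext, sigma_lam_ext = -0.003, 0.0001

lam_comb, sigma_lam_comb = combine(lam_data, sigma_lam_data, lam_ext, sigma_lam_ext)
print(f"combined slope: {lam_comb:.5f} +/- {sigma_lam_comb:.5f}")
```

The combined uncertainty is always smaller than either input, and the central value is pulled towards the more precise of the two.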

---

## 5. Simultaneous Fits (Signal + Control Regions)

Simultaneous fits are one of the most powerful techniques in HEP.
Imagine we have:
1.  **Signal Region (SR):** Has Signal and Background.
2.  **Control Region (CR):** Has **only** Background (e.g., sideband, or failing a cut).

If the background shape is the same in both (e.g. same exponential slope $\lambda$), we can fit them **simultaneously** to constrain the background much better.

In [None]:
# --- 1. Setup Data ---
# SR Data is 'data' (already generated)

# CR Data: Create a new dataset that is PURE background, but shares the same physics (same lambda)
# Let's assume CR has 2000 events
obs_cr = zfit.Space("mass_cr", limits=(5000, 6000))
bkg_cr_gen = zfit.pdf.Exponential(lambda_=-0.003, obs=obs_cr) # Use true lambda
data_cr = bkg_cr_gen.sample(2000)

# --- 2. Setup Models ---

# Shared Parameter (Slope)
lambda_shared = zfit.Parameter("lambda_shared", -0.002, -0.01, 0)

# SR Model (Sig + Bkg)
sig_sr = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)
bkg_sr = zfit.pdf.Exponential(lambda_=lambda_shared, obs=obs) # Uses shared
model_sr = zfit.pdf.SumPDF([sig_sr, bkg_sr], fracs=frac)

# CR Model (Bkg only)
model_cr = zfit.pdf.Exponential(lambda_=lambda_shared, obs=obs_cr) # Uses shared

# --- 3. Combined Loss ---
nll_sr = zfit.loss.UnbinnedNLL(model_sr, data)
nll_cr = zfit.loss.UnbinnedNLL(model_cr, data_cr)

simultaneous_nll = nll_sr + nll_cr

# --- 4. Fit ---
result_sim = minimizer.minimize(simultaneous_nll)
print(result_sim.params)
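
To see why adding the two NLLs helps, here is a NumPy-only sketch (not zfit code) with two toy exponential samples that share one decay constant. The sample sizes, the untruncated exponential on $[0, \infty)$, and the names `expo_nll`, `scan_sr`, `scan_cr`, `scan_sim` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
true_rate = 0.003  # decay constant per MeV, i.e. lambda = -0.003
x_sr = rng.exponential(scale=1.0 / true_rate, size=500)   # stand-in for the signal region
x_cr = rng.exponential(scale=1.0 / true_rate, size=2000)  # stand-in for the control region


def expo_nll(rate, x):
    """Unbinned NLL of the density rate * exp(-rate * x) on [0, inf)."""
    return -np.sum(np.log(rate) - rate * x)


rates = np.linspace(0.002, 0.004, 4001)
scan_sr = np.array([expo_nll(r, x_sr) for r in rates])
scan_cr = np.array([expo_nll(r, x_cr) for r in rates])
scan_sim = scan_sr + scan_cr  # shared parameter: simply add the two NLLs


def half_width(scan):
    """Half-width of the rate interval where the NLL is within 0.5 of its minimum."""
    inside = rates[scan - scan.min() <= 0.5]
    return 0.5 * (inside.max() - inside.min())


print(f"SR only: {rates[np.argmin(scan_sr)]:.5f} +/- {half_width(scan_sr):.5f}")
print(f"SR + CR: {rates[np.argmin(scan_sim)]:.5f} +/- {half_width(scan_sim):.5f}")
```

The summed NLL is narrower around its minimum than the SR-only NLL, which is the gain the control region provides for the shared slope.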

### 📝 Exercise 3: Shared Resolution
Imagine you have a $J/\psi \to \mu\mu$ calibration channel.
1.  Generate a calibration dataset (Pure Gaussian, mean=3100, sigma=25).
2.  Set up a simultaneous fit with your Signal Region.
3.  Share the `sigma` parameter between the Signal Region and the Calibration Channel.
4.  See how this constrains the width of your signal peak.

---

## 6. 2D Fits

Sometimes, Signal and Background overlap completely in Mass, but separate in another variable (e.g., Decay Time).

If variables are uncorrelated: $PDF(x, y) = PDF(x) \times PDF(y)$
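
The factorisation can be verified numerically. In this NumPy-only sketch (illustrative, not zfit code), a 2D density is built as the outer product of two normalised 1D densities on a grid; the grids, parameter values, and the helper `trapezoid_weights` are assumptions for this example.

```python
import numpy as np


def trapezoid_weights(grid):
    """Quadrature weights such that sum(weights * y) is the trapezoidal integral."""
    weights = np.zeros_like(grid)
    steps = np.diff(grid)
    weights[:-1] += 0.5 * steps
    weights[1:] += 0.5 * steps
    return weights


m_grid = np.linspace(5000.0, 6000.0, 401)  # mass axis in MeV
t_grid = np.linspace(0.0, 2.0, 201)        # decay-time axis
w_m, w_t = trapezoid_weights(m_grid), trapezoid_weights(t_grid)

# Normalised 1D densities on each axis
shape_m = np.exp(-0.5 * ((m_grid - 5280.0) / 25.0) ** 2)
shape_t = np.exp(-t_grid / 1.5)
pdf_m = shape_m / np.sum(w_m * shape_m)
pdf_t = shape_t / np.sum(w_t * shape_t)

# Outer product: pdf_2d[i, j] = pdf_m[i] * pdf_t[j]
pdf_2d = np.outer(pdf_m, pdf_t)

total_integral = w_m @ pdf_2d @ w_t  # integrate over both axes
marginal_m = pdf_2d @ w_t            # integrate out the time axis

print(f"2D integral = {total_integral:.6f}")
print(f"marginal equals 1D mass PDF: {np.allclose(marginal_m, pdf_m)}")
```

The product is normalised in 2D, and integrating out one variable returns the 1D density of the other. This no longer holds once the variables are correlated.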

In [None]:
# 1. Define Spaces
mass = zfit.Space("mass", limits=(5000, 6000))
time = zfit.Space("time", limits=(0, 2))

# 2. Parameters
tau_sig = zfit.Parameter("tau_sig", 1.5, 0.1, 3.0)
tau_bkg = zfit.Parameter("tau_bkg", 0.4, 0.01, 1.0)

# 3. PDFs
# Mass
sig_m = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=mass)
bkg_m = zfit.pdf.Exponential(lambda_=lam, obs=mass)

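# Time
# Note: `-1/tau_sig` would evaluate to a constant tensor, so tau would not be fitted.
# A ComposedParameter keeps the dependence on tau (dict-style `params`, as in recent zfit releases).
lam_sig_t = zfit.ComposedParameter("lam_sig_t", lambda p: -1 / p["tau"], params={"tau": tau_sig})
lam_bkg_t = zfit.ComposedParameter("lam_bkg_t", lambda p: -1 / p["tau"], params={"tau": tau_bkg})
sig_t = zfit.pdf.Exponential(lambda_=lam_sig_t, obs=time)
bkg_t = zfit.pdf.Exponential(lambda_=lam_bkg_t, obs=time)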

# 4. Product PDFs (2D)
sig_2d = zfit.pdf.ProductPDF([sig_m, sig_t])
bkg_2d = zfit.pdf.ProductPDF([bkg_m, bkg_t])

# 5. Total Model (the signal fraction is a constant 0.5 here, so it is not fitted)
model_2d = zfit.pdf.SumPDF([sig_2d, bkg_2d], fracs=0.5)

# 6. Generate 2D Data
data_2d = model_2d.sample(10000)

# 7. Fit
nll_2d = zfit.loss.UnbinnedNLL(model_2d, data_2d)
result_2d = minimizer.minimize(nll_2d)
print(result_2d.params)

In [None]:
# Quick visualization of 2D data
data_np_2d = data_2d.numpy()
plt.figure(figsize=(6, 5))
plt.hist2d(data_np_2d[:,0], data_np_2d[:,1], bins=30, cmap='viridis')
plt.xlabel("Mass")
plt.ylabel("Time")
plt.title("2D Data Distribution")
plt.colorbar()
plt.show()

## Conclusion & Further Steps

In this session, we moved from the statistical theory of the previous lectures to a full HEP-style fitting workflow.

**Summary of Tools:**
* `zfit.Space`: Define your physics bounds.
* `zfit.Parameter`: Handle what floats and what is fixed.
* `ExtendedUnbinnedNLL`: For fitting yields.
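* `zfit.constraint.GaussianConstraint`: Include external knowledge about a parameter.
* Simultaneous fit: Combine SR and CR by adding their losses (`nll_sr + nll_cr`).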

**Challenge:**
Try to implement a **Conditional PDF** where the width $\sigma$ of each event depends on its per-event uncertainty: $\sigma = \sigma_{event} \times S_{scale}$.
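
As a starting point, the likelihood behind this challenge can be written down in NumPy before moving to zfit. This is a sketch under stated assumptions, not a zfit implementation: the toy per-event errors, the true scale of 1.2, the fixed mean, and the name `per_event_nll` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n_toy = 5000
sigma_event = rng.uniform(10.0, 40.0, size=n_toy)  # hypothetical per-event errors in MeV
true_scale = 1.2                                    # hypothetical S_scale used for generation
mass_toy = rng.normal(5280.0, true_scale * sigma_event)


def per_event_nll(scale, x, sigma_evt, mu=5280.0):
    """Gaussian NLL where every event has its own width scale * sigma_evt."""
    sigma = scale * sigma_evt
    # The constant 0.5 * ln(2 * pi) per event is dropped
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))


# Scan the scale factor with the mean held at its true value
scales = np.linspace(0.8, 1.6, 801)
scan = np.array([per_event_nll(s, mass_toy, sigma_event) for s in scales])
scale_hat = scales[np.argmin(scan)]

print(f"fitted S_scale = {scale_hat:.3f} (generated with {true_scale})")
```

The width enters the density event by event, so the per-event error acts as a second input to the PDF rather than as a fit parameter; only $S_{scale}$ is fitted.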

Good luck with your physics analysis!