# A proof-of-principle Jupyter Notebook using ATLAS dijet data

This notebook introduces the reader to the interpretation of the ATLAS search for new dijet resonances beyond the Standard Model (BSM), performed using proton-proton data taken at the LHC in 2015 and 2016 [Phys. Rev. D 96 (2017) 052004](https://arxiv.org/abs/1703.09127). 
After introducing the basic theory and the experimental setup, this Jupyter notebook presents different methods of comparing the data with a predicted background and guides the reader through making figures that illustrate them, as in [arXiv:1111.2062](https://arxiv.org/abs/1111.2062). The notebook is written in ```Python3``` but can also be run on ```Python2```. 

This notebook was originally created by Bachelor student Leonie Hermann, supervised by Caterina Doglioni (Lund University, who also adapted it for the COMPUTE Jupyter course) and Markus Schumacher (University of Freiburg). The thesis can be found here: https://terascale.physik.uni-freiburg.de/Publikationen/abschlussarbeiten/bachelorarbeiten/bachelorLeonieHermann/view

## Theory ##

The following part covers the basics that are necessary to understand the physics behind the dijet resonance search. It starts with the main parts of the Standard Model (SM). After the experimental tools, jet production and the search for new particles are introduced. 


### The Standard Model  ###

The Standard Model (SM) was developed in the 1960s and 1970s and describes the fundamental particles and their interactions through the fundamental forces. It classifies the elementary particles into matter particles called fermions, which come in three generations, and interaction particles called bosons. The fermions are divided into quarks and leptons. The quarks build up composite particles such as baryons and mesons, collectively called hadrons. There are three pairs of quarks: up and down, charm and strange, top and bottom. The three lepton generations include the electron, the muon and the tau, their respective neutrinos, and the corresponding antiparticles.
The particles mediating the interactions are called gauge bosons or force carriers. The photon $\gamma$ mediates the electromagnetic interaction, the $W^\pm$ and $Z$ bosons mediate the weak interaction and the gluon mediates the strong interaction. The Higgs boson, discovered in 2012, explains how most SM particles acquire mass. The SM particles are shown in the figure below. 

<img src="https://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg" width="400" />

By MissMJ [CC BY 3.0 (https://creativecommons.org/licenses/by/3.0) or Public domain], via Wikimedia Commons


## Experimental Setup ##

### The Large Hadron Collider ###

The Large Hadron Collider (LHC) is currently the largest and most powerful particle accelerator in the world. It is located at the border between France and Switzerland near Geneva and is part of CERN. The $26.7\,\text{km}$ long accelerator has four collision points where experiments and detectors are placed. One of the largest experiments is ATLAS (A Toroidal LHC ApparatuS). [Note: the following cell shows how to embed a video.]

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('UDoIzvKumGI')

### The ATLAS experiment ###

Protons collide with a total center-of-mass energy of $13\, \text{TeV}$ about $1$ billion times per second at the location of the ATLAS detector and of the other experiments. The particles created in the collisions are measured by the different layers of the ATLAS detector, which is shown in the figure below.
* Close to the collision point, the inner detector is a pixel/semiconductor detector responsible for measuring the tracks of charged particles. A superconducting solenoid magnet bends these tracks, which is necessary to measure the momentum of the particles. 
* Following the tracking detector, the electromagnetic calorimeter absorbs the particles which interact electromagnetically. The momentum and energy of those particles are measured by collecting the energy from electromagnetic showers.
* The next layer is the hadronic calorimeter. Particles that interact via the strong force are absorbed there and lose their energy. An active material measures the energy from the so-called hadronic showers. 
* The outer layer is the muon spectrometer, outside toroidal magnets. The muon spectrometer measures the energy and momentum of the muons. 

<img src="http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2008/0803012/0803012_01/0803012_01-A5-at-72-dpi.jpg" width="600" />

Optional material: How the ATLAS experiment detects particles is described in the video below. 

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('iYRQpcJVQx8')

## What are we looking for, in this notebook? ## 

### Jet Production in Proton-Proton Collisions ###

When protons collide with each other in a _collision event_, one of the most frequent processes seen in the detector is the production of two particles originating from the initial protons. These particles are quarks and gluons. We cannot measure those particles directly as they soon turn into _cones_ of other particles in a process called hadronization. These cones are called jets. They consist of particles that interact via the strong force and carry a lot of the energy of the initial collision. A jet is a collimated flow of particles moving essentially in the same direction. A _dijet event_ is the signature in the detector of two separate jets, generally back-to-back. The distribution of the invariant mass of the two-jet system, $m_{jj}$, is predicted by the Standard Model to be a smooth, monotonically decreasing function. 


### New Physics beyond the Standard Model ###
The SM cannot explain all phenomena that have been observed. Those include, for example, dark matter (DM) and dark energy, which compose most of the universe. Theories beyond the Standard Model predict mediator particles between the SM and DM. A new heavy particle which decays into dijets could be such a DM mediator, and its discovery would be huge progress in the search for dark matter.  


### Resonance search ### 

When considering theories beyond the Standard Model with new particle states decaying into dijets, an excess in this $m_{jj}$ distribution could emerge on top of the smooth distribution of dijet events from QCD. The main part of this notebook focuses on the methods to visualize such an excess, once we have a prediction of the background from the Standard Model.

## Dataset and Background Prediction ##


With the ATLAS detector at the Large Hadron Collider (LHC) at CERN in Switzerland and France, proton-proton collision events at a center of mass energy of $13\,\text{TeV}$ are investigated for the presence of new resonances decaying into dijets. The search includes data from the 2015 and 2016 runs (with an integrated luminosity of $37\,\text{fb}^{-1}$). 

[Advanced] The reconstruction and calibration of the jets are described in the thesis and summarized here. For the data analysis, dijet events with the following criteria are chosen:
* a leading jet with $p_T > 440\,$GeV
* a sub-leading jet with $p_T > 60\,$GeV
* $m_{jj} > 1\,$TeV

Since the background is smooth, it can be estimated using a fit to the data as long as this fit does not accommodate local fluctuations. In this search, the background is estimated by a fitting method called *Sliding Window Fits* (SWiFt). A three-parameter fit function $f(x) = p_1 (1-x)^{p_2} x^{p_3}$, with $x = m_{jj}/\sqrt{s}$, is used to fit the data with a smooth function that does not allow local excesses. The technique fits the data in smaller ranges called windows, which slide in overlapping steps across the spectrum. The fitted value of the bin in the center of the window is then the background prediction for that bin. For the bins near the edges, the window is reduced to 60% of the nominal window size and all values of the fit are taken for the background prediction. The background prediction has two types of uncertainties, which are depicted later in the notebook. One comes from the choice of the fit function (e.g. taking a four-parameter fit function instead) and the other is due to the uncertainty in determining the fit parameters. 
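
The shape of the three-parameter function can be explored with a few lines of Python. The sketch below only evaluates the function; it does not implement the SWiFt procedure. It assumes $x = m_{jj}/\sqrt{s}$ with $\sqrt{s} = 13\,$TeV, and the name `dijet_fit` as well as the parameter values are hypothetical choices made for illustration, not the fitted values of the search.

```python
import numpy as np

SQRT_S_TEV = 13.0  # assumed centre-of-mass energy in TeV

def dijet_fit(mjj_tev, p1, p2, p3):
    """Three-parameter function f(x) = p1 * (1 - x)**p2 * x**p3 with x = mjj / sqrt(s)."""
    x = np.asarray(mjj_tev, dtype=float) / SQRT_S_TEV
    return p1 * (1.0 - x) ** p2 * x ** p3

# Hypothetical parameters: p2 > 0 and p3 < 0 give a smoothly falling spectrum
mjj = np.array([1.1, 2.0, 4.0, 8.0])
print(dijet_fit(mjj, p1=1.0, p2=10.0, p3=-5.0))
```

With $p_2 > 0$ and $p_3 < 0$ the function falls monotonically for $0 < x < 1$, which is the behaviour expected for the QCD background.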

## The  Analysis ## 

In the analysis, the background prediction is compared to the measured data in order to find localized excesses appearing as a bump on top of the smooth QCD distribution. Firstly, data and background prediction are compared bin by bin using hypothesis tests and then visually investigated using the metric in [arXiv:1111.2062](https://arxiv.org/abs/1111.2062). This is described in *Methods to compare Data and Background Bin-by-Bin* and in the written elaboration. Secondly, the hypertests BumpHunter and TailHunter are applied to the dataset.  

The main parts of the analysis (statistical significance and BumpHunter algorithm) were already performed by the ATLAS Collaboration. 
The comparison of the data and the background prediction from [this search](http://inspirehep.net/record/1519428/files/fig1a.png) is shown in the figure below. No significant local excesses are observed. 

<img src="http://inspirehep.net/record/1519428/files/fig1a.png" width="700" />

## Code for the Statistical Analysis ## 

### Introduction ###
This is the main part of the *Proof of Principle Notebook*. The data and the background prediction from this search are made available [on the HEPData platform](https://www.hepdata.net/record/77265). Ensure that Table 1 is saved in the same directory as this notebook for error-free operation. Each code cell marked with [N] should be executed in order by pressing shift-enter.


To run this notebook in JupyterLab, ```%matplotlib nbagg``` must be replaced by ```%matplotlib inline``` so that the figures appear below the code cells. ```%matplotlib nbagg``` creates interactive figures in which the range of the spectrum and the zoom can easily be adapted. 


With ```pandas```, the data is read and presented; the figures are made with ```matplotlib.pyplot```; and to calculate the significance, ```scipy``` and a formula from ```statsmodels.api``` are used. These are the main packages used in this code and are therefore imported at the beginning. 

In [None]:
%matplotlib nbagg    
# %matplotlib inline # For JupyterLab usage
import matplotlib
matplotlib.rcParams.update({'errorbar.capsize': 2})     # Make nicer errorbars
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.special     # Import the submodule explicitly; it provides gammainc and erfinv used below
import math

### Reading the Data ###

The following part imports the data from the .csv file called ```HEPData-ins1519428-v2-Table_1.csv``` into the data frames called ```data``` and ```temp_data```. First, the measured data is presented using ```pandas``` tables. The second table shows the background prediction and the uncertainties. 

In  ```names=[...]```, the title of each column is defined. First, the binning parameters are given. The bin center  ```$m_{jj}$ [TEV]``` is followed by the left bin edge   ```$m_{jj}$ [TEV] LOW``` and the right bin edge ```$m_{jj}$ [TEV] HIGH``` which determine together the bin width. The last column gives the number of events ```Data [ Events/Bin ]``` for the corresponding bin. The second table presents again the binning which is followed by the background prediction ```Background [ Events/Bin ]``` and two different types of uncertainties  ```Sys Fit function``` and ```Sys Fit parameters```. 

In [None]:
data = pd.read_csv("HEPData-ins1519428-v2-Table_1.csv",  names=["$m_{jj}$ [TEV]", "$m_{jj}$ [TEV] LOW", "$m_{jj}$ [TEV] HIGH", "Data [ Events/Bin ]"])
#this is a vector of strings so far
temp_data = pd.concat([data])  

#you only want the rows with the data, so print it out to check which ones you want
#print(temp_data)

#now we turn the columns we want into numbers
data_table = temp_data[10:102].astype('float64')

#and then we print them
data_table.style

#### Formatting the Data ####

This high energy physicist does not approve of your significant digits. How can you fix them? 

In [None]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "http://gate.hep.anl.gov/lecompte/Bio/koala.jpg")

As a first port of call, we can use the pandas "round" method:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.round.html

In [None]:
data_table.round(2)

However, this does not follow the rules & regulations for significant digits that you may have learned in various labs. A personal take (some people get rather religious about them) is to only use the significant digits that carry meaning to the precision of your experiment. So if you have an error on your measurement, you truncate the measurement according to the error. However, you also know the error with a limited precision (this can also be measured!), so where do you stop? For this course we'll stop with a convention: truncate the error to two digits, and round the measurement to a similar number of significant digits. 

The data that we have in those tables follows a probability distribution function that is "Poisson" (for a primer: https://www.umass.edu/wsp/resources/poisson/) and the statistical error on a number of counts N is sqrt(N). So, let's round the number of counts in the "Data" column separately, using the names in the pandas table above. The first three columns are simply the edges of the bins, while the fourth column will use information about itself to round. If you want to exercise some pandas/lambda wisdom, you can add a cell here and make the table look even nicer. 
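
One possible way to apply this convention to a single count is sketched below. The helper `round_to_error` is hypothetical (it is not part of pandas) and assumes the statistical error is $\sqrt{N}$: the error is rounded to two significant digits and the count is rounded to the same decimal place.

```python
import math

def round_to_error(value, error, sig_digits=2):
    """Round `error` to `sig_digits` significant digits and `value` to the same decimal place."""
    if error <= 0:
        return value, error          # nothing to round against
    # Decimal position of the last significant digit kept in the error
    decimals = sig_digits - 1 - math.floor(math.log10(error))
    return round(value, decimals), round(error, decimals)

counts = 123456.0                    # hypothetical number of events in one bin
print(round_to_error(counts, math.sqrt(counts)))
```

A negative value of `decimals` rounds to tens, hundreds and so on, which is what is needed for highly populated bins.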

In [None]:
background = pd.read_csv("HEPData-ins1519428-v2-Table_1.csv",  names=["$m_{jj}$ [TEV]", "$m_{jj}$ [TEV] LOW", "$m_{jj}$ [TEV] HIGH", "Background [ Events/Bin ]", "Sys Fit function + [ Events/Bin ]", "Sys Fit function - [ Events/Bin ]", "Sys Fit parameters + [ Events/Bin ]",  "Sys Fit parameters - [ Events/Bin ]" ])
temp_back = pd.concat([background])  
back_table = temp_back[105:197].astype('float64')


back_table.style

**Step 1: Preparing data and the binning** The columns of the tables above need to be turned into lists. Let's start again from the csv output. The first table with the measured data is imported into the following lists:
```python
x=[]   # Bin center of data
y=[]   # Measured data

# Bin edges
widthlow = []   # Left edge
widthhigh = []  # Right edge
```

In [None]:
#Reading the data in the tables above in the following steps:
# Bin center (mjj in TeV)
x = data[ "$m_{jj}$ [TEV]"][10:102]      # Taking only the corresponding rows in the file
x = pd.Series(x).values.tolist()         # Converting to a list (of strings)
x = [float(i) for i in x]                # Converting the entries to floats
print(x)

In [None]:
# Measured Events/Bin
y = [float(i) for i in pd.Series(data["Data [ Events/Bin ]"][10:102]).values.tolist()]
print(y)
# Bin edges
widthlow = [float(i) for i in pd.Series(data["$m_{jj}$ [TEV] LOW"][10:102]).values.tolist()]
widthhigh =  [float(i) for i in pd.Series(data["$m_{jj}$ [TEV] HIGH"][10:102]).values.tolist()]

 **Step 2: Preparing background and systematic uncertainties**
In the second table, the background prediction is given with uncertainties. The bin center and edges are the same as in the first table. Therefore, they are not imported again.
```python
yb=[]   # Background prediction 

# Uncertainties
# Systematic uncertainties due to the fit function
fplus = []
fminus =[]
# Systematic uncertainties due to the choice of the parameters
pplus = []
pminus =[]
```

In [None]:
# Background prediction
yb = [float(i) for i in pd.Series(background["Background [ Events/Bin ]"][105:197]).values.tolist()]

# Systematic uncertainties due to the fit function
fplus = [float(i) for i in pd.Series(background["Sys Fit function + [ Events/Bin ]"][105:197]).values.tolist()]
fminus = [float(i) for i in pd.Series(background["Sys Fit function - [ Events/Bin ]"][105:197]).values.tolist()]

# Systematic uncertainties due to the choice of the parameters
pplus = [float(i) for i in pd.Series(background["Sys Fit parameters + [ Events/Bin ]"][105:197]).values.tolist()]
pminus = [float(i) for i in pd.Series(background["Sys Fit parameters - [ Events/Bin ]" ][105:197]).values.tolist()]

# Creating an additional list containing only zeros, one entry per bin.
null = [0.0] * len(x)

 **Step 3: Preparing statistical uncertainties**
A Poisson distribution is assumed for the measured data, so the statistical uncertainty is $\sqrt{N}$, with $N$ the number of events. 
The data uncertainty is stored in the list
```python
yderr=[]
```

In [None]:
yderr = [math.sqrt(n) for n in y]     # Poisson uncertainty sqrt(N) of the measured data
#if you want to see the errors:
print(yderr)

### Creation of Figures containing Data and Background Prediction ###

Now, the data and the background prediction can be depicted including uncertainties. A proper figure includes

* labeled axes 

* an appropriate scale (e.g. logarithmic)

* a grid

* a legend.

How to add these details is shown in the following coding cells. First, only the background is shown with the two systematic uncertainties. Then, the data and the background prediction are shown in one figure. 

***1. Depicting background with the systematic uncertainties***

In [None]:
# Fit of the background with uncertainties

plt.rcParams.update({'errorbar.capsize': 1.5})
backgroundonly = plt.figure()
ax1 = plt.subplot(111)

In [1]:
# Label the axis
plt.xlabel('$m_{jj}$ in TeV')
plt.ylabel('Events / Bin')
# Adding a grid (optional)
plt.grid()

# Plot the background prediction in a step function
plt.step(x, yb, 'r', linewidth= 0.5, data=None, where='mid', label='Background fit')

In [None]:
# Add the two types of systematic uncertainties to the step function

# Detail: the uncertainty due to the choice of the fit function goes only in one direction (same sign in the table)
ax1.errorbar(x,yb, yerr=(fplus, null), fmt='None',  ecolor='b', label='Uncertainty due to \n  choice of background parametrization  ', markersize=2.1, elinewidth=0.8)
# Detail: the uncertainty due to the choice of the parameters goes in both directions (different signs in the table)
ax1.errorbar(x,yb, yerr=(pplus), fmt='None', ecolor='k',label='Uncertainties due to  \n values of parameters', markersize=2.1, elinewidth=0.8)

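# Make the y-axis in logarithmic scale
plt.yscale('log', nonpositive='clip')   # Matplotlib older than 3.3 uses the keyword nonposy instead

# Display the legend
plt.legend(loc=1)

# Save the figure (before plt.show(), otherwise the saved file can be empty)
plt.savefig('Backgroundonly.png')

# Show the plot
plt.show()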

The systematic uncertainties on the background are very small. Therefore, in the following figures, these uncertainties are neglected. 

***2. Depicting data and background prediction***

In [None]:
# Data and background prediction in one figure
databackground = plt.figure()
ax = plt.subplot(111)

# Label the axis
plt.xlabel('$m_{jj}$ in TeV')
plt.ylabel('Events / Bin')

# Adding a grid (optional)
plt.grid()

# Plot the data with the statistical uncertainty
ax.errorbar(x,y, yerr=yderr, fmt='ko', label='Data', markersize=2.1, elinewidth=0.8)

# Plot the background prediction in a step function
plt.step(x, yb, 'r', linewidth= 0.5, data=None, where='mid', label='Background fit')

#(Optional) Add text with further information about the plot
#ax.text(1,10**0+1,'Center of mass energy :$\sqrt{s} = 13\,$TeV')
#ax.text(1,10**(-1)+ 10**(-0.5),'Integrated luminosity 37.0 fb$^{-1}$')
#ax.text(1,10**(-1),'Run in 2015 and 2016')

# Make the y-axis in logarithmic scale
plt.yscale('log', nonpositive='clip')  # 'clip' avoids artefacts when the error band extends to negative values (Matplotlib older than 3.3: nonposy='clip')

# (Optional) x-axis in logarithmic scale (two different ways)
#plt.semilogx()
#plt.xscale('log')

# Make the legend
plt.legend(loc=1)

# Save the figure (before plt.show(), otherwise the saved file can be empty)
plt.savefig('HEPDataandbkg.png')

# Show the plot
plt.show()

### Methods to compare Data and Background Bin-by-Bin
There are different ways to investigate the differences between data and expectations as presented in [arXiv:1111.2062](https://arxiv.org/abs/1111.2062). 

Before the histogram can be plotted, some preparations must be made. One important part is to determine the width of each bin individually, since every bin in the histogram has a different width. The bin edges are given in the second and third columns of the tables shown in *Reading the Data*. Their difference determines the bin width ```width=[]```: ```width[i] = widthhigh[i]-widthlow[i]```.

In [None]:
# Calculate the width for each bin
width=[]
i=0
while i < len(widthlow):                     # Calculate the width for all entries 
    width.append(widthhigh[i]-widthlow[i])   # The width is the difference of the high and the low edge
    i=i+1 

The following methods are applied on the data and the background prediction: 
* Absolute difference
* Relative difference
* $\frac{D-B}{\sqrt{B}}$-approximation of the significance
* Statistical significance

### 1.  Absolute and relative difference of data and background

#### Absolute difference between data and background

The first method to investigate the differences between data and background prediction is to take the absolute difference ```absolute=```$\Delta E(i)$ 
\begin{equation}
    \Delta E(i)=D(i)-B(i)
\end{equation}
at each bin $i$. The number of events in the data set is ```y=```$D(i)$ and in the background model ```yb=```$B(i)$. An excess has a positive and a deficit a negative value.

In [None]:
# Determine the values of the absolute difference
absolute=[]
i=0
while i < len(x):                # Calculate the abs. difference for all bins
    absolute.append(y[i]-yb[i])  # Subtract the background from the measured data value
    i=i+1

# Plot the absolute difference in  a histogram
figabsdiff = plt.figure()
ax = plt.subplot(111)

# Label the axis 
plt.xlabel('$m_{jj}$ in TeV')
plt.ylabel(r'$\Delta$ Events / Bin')   # Raw string so that the backslash is passed on to the math renderer

# Make a grid 
plt.grid()

# Plot the bars in the histogram
ax.bar(x, absolute, width=width, color='red', edgecolor=['crimson']*len(x))

# Save the figure
plt.savefig('Absolutediff.png')

The absolute difference is usually large for highly populated bins and small for bins with very low population. This method therefore gives little useful information about how well the data set agrees with the background model: for data counts that span several orders of magnitude, it does not reveal whether a deviation is significant. 

#### Result: Relative Difference between Data and Background
Another way to investigate the differences is to take the ratio of data over expectation
$$ \frac{\text{Data}}{\text{Expected}} = \frac{D}{B}$$
and calculate the relative difference ```rel```
$$ \frac{D}{B}- 1 = \frac{D-B}{B}.$$
$\Rightarrow$ ``` rel[i] = (y[i]-yb[i])/yb[i]```

In [None]:
# Calculation of the relative difference for all bins 
rel = []  
i=0
while i < len(y):
    rel.append((y[i]-yb[i])/yb[i])    # Adding the relative difference of each bin to the list 'rel'.
    i=i+1

The relative difference for low values of $m_{jj}$ is almost zero. For $m_{jj}> 4\,$TeV, the absolute value of the relative difference rises, and deviations in the form of both excesses and deficits appear. The highest relative difference is present at $m_{jj} = 8\,$TeV for two adjacent bins with excesses of over 600% and 800%.

This way of plotting does not show how significant the deviations are with respect to the relative uncertainty. For highly populated bins, significant discrepancies can be hidden since the relative difference becomes smaller for a large background, while the fluctuations of the relative difference grow for low-populated bins, so most fluctuations appear there. 
This is the weakness of the method, and it has a big impact if the data set spans several orders of magnitude.
This method does not statistically quantify the agreement between data and background prediction.
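
A small numerical illustration of this point, using made-up bin contents rather than values from the dataset: two bins with the same relative difference of 10% can correspond to very different numbers of standard deviations when judged with the simple $(D-B)/\sqrt{B}$ estimate listed among the methods above. The function name `relative_and_significance` is hypothetical.

```python
import math

def relative_and_significance(D, B):
    """Return the relative difference (D-B)/B and the estimate (D-B)/sqrt(B)."""
    return (D - B) / B, (D - B) / math.sqrt(B)

# Hypothetical bins with identical relative difference but different population
for D, B in [(110.0, 100.0), (11000.0, 10000.0)]:
    rel_diff, n_sigma = relative_and_significance(D, B)
    print(B, rel_diff, n_sigma)
```

The low-populated bin deviates by about one standard deviation, the highly populated one by about ten, although both show the same bar height in the relative-difference plot.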

In [None]:
# Plotting the relative difference
figrel = plt.figure()
ax = plt.subplot(111)

# Label the axis
plt.xlabel('$m_{jj}$ in TeV')
plt.ylabel('(D-B) / B')

# Add a grid
plt.grid()

# Plot the bars in the histogram
ax.bar(x, rel, width=width, color='orange', edgecolor=['gold']*len(x))

# Save the figure
plt.savefig('RelDifferencehisto.png')

### 2. The $\frac{D-B}{\sqrt{B}}$-approximation

Another possibility is to consider the $\frac{D-B}{\sqrt B}$-approximation. 
The Poisson distribution approaches a Gaussian distribution for a high number of events in each bin. The statistical significance (also called z-value, intuitively corresponding to the number of standard deviations by which the value of an observation is above the mean value of that observation) then becomes
\begin{equation}
    \text{zvalue}  = \frac{x-\mu}{\sigma} = \frac{x-\mu}{\sqrt{\mu}}
\end{equation}
with a standard deviation of $\sigma={\sqrt{\mu}}$. 
Applying this to the example with the measured data $D$ and the estimated background $B$, the significance  can be approximated by
\begin{align}
    \text{zvalue}= \frac{D-B}{\sqrt{B}}
\end{align}
 $\Rightarrow$ ```app[i] = (y[i]-yb[i])/(yb[i]**(0.5)) ```


In [None]:
# Calculate the Gaussian approximation of the significance (D-B)/(B)^(1/2)

# List for the values of the approximation
app=[]
i=0
while i < len(y):
    app.append((y[i]-yb[i])/(yb[i]**(0.5)))  # Calculation and adding the values to the list
    i=i+1

#### Result: $\frac{D-B}{\sqrt B}$ - approximation
A short piece of code separates the bins with an excess ```plus = []``` from the bins with a deficit ```minus = []```: all negative values of the approximation are added to the list ```minus``` and all positive values to ```plus```. Both lists are then used to create a figure in which the excesses are shown in green and the deficits in red. 

In [None]:
# All deficits are added to the list called 'minus'
minus=[]
i=0
while i < len(y):
    if app[i]<0:
        minus.append(app[i])      # Add all negative values to the list
    else:
        minus.append(0)           # Positive values are set to zero to keep the positions in the list
    i=i+1


# All excesses are added to the list 'plus'
plus=[]
i=0
while i < len(y):
    if app[i]>0:
        plus.append(app[i])     # Add all positive values to the list
    else:
        plus.append(0)          # Negative values are set to zero to keep the positions in the list   
    i=i+1   

In [None]:
# Plot the Gaussian approximation of the significance (D-B)/(B)^(1/2) and make the excesses in another color than the deficits
figapp = plt.figure()
ax = plt.subplot(111)

# Label the axis
plt.xlabel('$m_{jj}$ in TeV')
plt.ylabel(r'(D-B) / $\sqrt{B}$')   # Raw string so that the backslash is passed on to the math renderer

# Add a grid
plt.grid()

# Plot the deficits in red
ax.bar(x, minus, width=width, color='salmon', edgecolor=['r']*len(x))

# Plot the excesses in green
ax.bar(x, plus, width=width, color='lightgreen', edgecolor=['g']*len(x))

# Save the figure
plt.savefig('Approximation.png')

The highest deviation can be seen for $m_{jj}\approx 8 \,$TeV with almost 3 standard deviations. For all other bins, the deviation is below two standard deviations.

Since the Poisson distribution approaches a Gaussian distribution for highly populated bins, this way of plotting is valid for a large background $B$. However, it fails for bins with only a few entries, for which no reliable statement about the significance can be made. For bins with $m_{jj}>5.874\,$TeV the approximation is no longer valid. For data sets with only highly populated bins in the range of interest of the observable, this method is easy and efficient to use. 
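
How the approximation degrades for sparsely populated bins can be checked with a short sketch. It compares $(D-B)/\sqrt{B}$ with the z-value obtained from the exact Poisson tail probability, which the next section introduces. The bin contents are made up, and the function names are hypothetical; `scipy.stats.poisson.sf` and `scipy.stats.norm.isf` are used here instead of the functions defined later in this notebook.

```python
import math
import scipy.stats

def approx_significance(D, B):
    """Gaussian approximation (D - B) / sqrt(B)."""
    return (D - B) / math.sqrt(B)

def exact_excess_significance(D, B):
    """z-value equivalent to the Poisson p-value P(n >= D | B) of an excess."""
    p = scipy.stats.poisson.sf(D - 1, B)   # survival function: P(n > D-1) = P(n >= D)
    return scipy.stats.norm.isf(p)         # one-sided Gaussian tail -> z-value

# Hypothetical bins: both give (D - B)/sqrt(B) = 3
for D, B in [(130, 100.0), (4, 1.0)]:
    print(D, B, approx_significance(D, B), float(exact_excess_significance(D, B)))
```

For the bin with $B = 100$ the two numbers are close, while for $B = 1$ the exact significance is clearly lower than the approximation suggests.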


### 3. Statistical Significance

In order to plot the differences between the data and the expectations, the statistical significance is considered as a metric to estimate excesses and deficits. Now, the statistical significance is calculated for data that follow the Poisson distribution. A more detailed explanation of the calculation can be found in the written elaboration. 

Two important parameters are used: Poisson p-value and related z-value.

The **p-value** can be defined bin-by-bin as the probability of finding a deviation from the chosen background model that is at least as big as the one observed in data. In this case, the chosen background model is a Poisson distribution with a mean equal to the number of events estimated by the fit.
The Poisson p-value is given by
\begin{equation}\label{eq:poissonpvalue}
   \text{p-value} =
   \begin{cases}
     \sum\limits_{n=D}^{\infty} \frac{B^n}{n!} e^{-B} = 1-\sum\limits_{n= 0}^{D-1} \frac{B^n}{n!} e^{-B}   & \text{for } D > B \\
     \sum\limits_{n= 0}^{D} \frac{B^n}{n!} e^{-B}  & \text{for } D \leq B
   \end{cases}
\end{equation}
where due to the summation from $0$ to $D$ all possible outcomes are considered for deficits $D \leq B $. The sum for excesses $D > B$ counts until infinity since one bin could have any number of events. 

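In order to calculate the sums, the regularized incomplete Gamma functions are used. The upper regularized Gamma function is $Q(D,B) =  \Gamma(D,B)/ \Gamma(D)$  with
$$\Gamma(D,B)  =   \int_B^\infty t^{D-1} \mathrm{e}^{-t}\,\mathrm dt$$
and, for integer $D$, it equals the sum of the first $D$ Poisson terms, $\sum_{n=0}^{D-1} \frac{B^n}{n!} e^{-B} = Q(D,B)$. The p-value is therefore $1-Q(D,B)$ for an excess and $Q(D+1,B)$ for a deficit. Note that ```scipy.special.gammainc``` returns the lower regularized function $P = 1-Q$. 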
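
The link between the Poisson sums and the incomplete Gamma functions can be verified numerically. The sketch below is only a cross-check under made-up inputs ($B = 7.5$ is hypothetical); the function names are illustrative. In SciPy, `scipy.special.gammainc` is the lower regularized function $P$ and `scipy.special.gammaincc` is the upper regularized function $Q = 1 - P$.

```python
import scipy.special
import scipy.stats

def pvalue_from_gamma(D, B):
    """Poisson p-value via regularized incomplete gamma functions."""
    if D > B:
        return scipy.special.gammainc(D, B)        # P(D, B) = probability of n >= D
    return scipy.special.gammaincc(D + 1, B)       # Q(D+1, B) = probability of n <= D

def pvalue_from_sum(D, B):
    """The same p-value taken from the Poisson distribution itself."""
    if D > B:
        return scipy.stats.poisson.sf(D - 1, B)    # P(n > D-1) = P(n >= D)
    return scipy.stats.poisson.cdf(D, B)           # P(n <= D)

B = 7.5  # hypothetical background expectation
for D in (12, 3):
    print(D, pvalue_from_gamma(D, B), pvalue_from_sum(D, B))
```

Both routes should agree to numerical precision, for the excess ($D = 12$) as well as for the deficit ($D = 3$).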


The **z-value** is the number of standard deviations to the right of the mean of a standard Gaussian distribution beyond which the tail probability equals the p-value. It is directly related to the p-value and is calculated with 
$$ \text{p-value} = \int_{\text{z-value}}^\infty \frac{1}{\sqrt{2\pi}} \cdot e^{- \frac{x^2}{2}}\, \text{d}x \;\; \leftrightarrow \;\; \text{z-value} = \sqrt{2} \cdot \operatorname{erf}^{-1}(1-2\cdot \text{p-value})$$
with the inverse error function $\operatorname{erf}^{-1}$.

The relationship is depicted below. For p-value $>0.5$ the relation yields a negative z-value, which cannot be interpreted as the significance of the observed deviation. Therefore, only bins with p-value $ \leq 0.5$ are considered in the following hypothesis tests. A p-value of $0.5$ corresponds to z-value $=0$.
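
A few conversions make the relation concrete. This is a minimal sketch: `z_from_p` is a hypothetical helper implementing the formula above, and `scipy.stats.norm.isf` (the inverse of the one-sided Gaussian tail probability) is used only as an independent cross-check.

```python
import math
import scipy.special
import scipy.stats

def z_from_p(p):
    """z-value = sqrt(2) * erfinv(1 - 2 p)."""
    return math.sqrt(2.0) * scipy.special.erfinv(1.0 - 2.0 * p)

# Example p-values, from no deviation down to the discovery threshold
for p in (0.5, 0.1587, 1.35e-3, 2.87e-7):
    print(p, float(z_from_p(p)), float(scipy.stats.norm.isf(p)))
```

The two columns of z-values should coincide, and p-values above $0.5$ would give negative results.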

In [None]:
import numpy as np
import scipy.special

errfunction = plt.figure()

p = np.arange(0.01, 0.5, 0.01)     # p-values below 0.5 (positive z-value)
l = np.arange(0.5, 1.0, 0.01)      # p-values from 0.5 up to 1 (negative z-value)
plt.grid(True, which="both")
z = 2**(1./2)*scipy.special.erfinv(1-2.*p)
z2 = 2**(1./2)*scipy.special.erfinv(1-2.*l)
plt.plot(p, z, label='z-value for p-value < 0.5')
plt.plot(l, z2, label=r'z-value for p-value $\geq$ 0.5')
plt.axvline(0.5, color='k')
plt.hlines(0, 10**(-3), 1.2, color='k')
plt.xscale('log')
plt.xlabel('p-value')
plt.ylabel('z-value')
plt.xlim((10**(-2), 1))
plt.legend()
plt.savefig('errorfunction.png')   # Save before plt.show(), otherwise the saved file can be empty
plt.show()

The significance measures the magnitude of deviations between the results and the model. For an excess, the significance is defined as positive and for a deficit as negative.  
In particle physics, evidence is claimed for a deviation of z-value $\geq 3$. A discovery can be claimed if the deviation is significant with z-value $ \geq 5$. Further interpretations are shown in the following table.

| z-value | $\geq 0$ | $< 0$ | $5$ | $1$ |
|------|------|------|------|------|
| p-value | $\leq 0.5$ | $> 0.5$ | $2.87\cdot 10^{-7}$ | $0.16$ |
| Interpretation | deviations | no deviations | discovery threshold | 1-sigma statistical fluctuation |

The code calculating the p- and z-value has been translated into python from the supporting documentation of [arXiv:1111.2062](https://arxiv.org/abs/1111.2062).

In [None]:
# Function calculating the Poisson p-value with the regularized incomplete gamma function
def pvalue(D,B):
    if D>B :                                  # For an excess: probability of n >= D
        p = scipy.special.gammainc(D, B) 
    else :                                    # For a deficit: probability of n <= D
        p= 1-scipy.special.gammainc(D+1, B)   
    return p

# Function calculating the significance (z-value) with the inverse error function
# Note: the sign is taken from the global lists y, yb and the global loop index i,
# so zvalue must be called inside a loop over i, as in the cells below.
def zvalue(p):
    if y[i]>yb[i]:                                  # For an excess
        z=2**(1./2)*scipy.special.erfinv(1-2.*p)
    else:                                           # For a deficit, the negative significance is calculated  
        z= -2**(1./2)*scipy.special.erfinv(1-2.*p)
    return z

In [None]:
sig=[]           # List for significance
for i in range(len(y)):
    sig.append(zvalue(pvalue(y[i], yb[i])))  # Calculate the z-value from the p-value of y[i] and yb[i] and add it to the list

#### Result: Statistical Significance

First, a figure including all bins is produced. In the second figure, the bins with ```p-value``` $>0.5$ are excluded.

The deviations of the data from the background model lie between approximately $-2.4$ and $+1.5$ standard deviations.  
A deviation of 3 standard deviations is regarded as evidence, and a discovery is claimed for a significance above 5 standard deviations. In the bin-by-bin analysis, no significant excess or deficit is therefore observed.

In [None]:
# Plot all significance values
figsig = plt.figure()
ax = plt.subplot(111)

# Make the labels
plt.xlabel('$m_ {jj}$ in TeV')
plt.ylabel('Significance')

# Make a grid
plt.grid()

# Plot the bars in the histogram
ax.bar(x, sig, width=width, color='plum', edgecolor=['m']*len(x))

# Save the figure
plt.savefig('Significanceall.png')

Now, the cut p-value $<0.5$ is applied.
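
To see why the cut is useful, consider bins in which the data lie very close to the expectation. The sketch below is an illustration under stated assumptions: the helper names `poisson_pvalue` and `signed_z` are hypothetical and only mirror the p-value and sign conventions used in this notebook, and the two (D, B) pairs are made-up numbers.

```python
import math
from scipy import special

def poisson_pvalue(D, B):
    """Poisson tail probability (hypothetical helper, same convention as pvalue)."""
    if D > B:
        return special.gammainc(D, B)            # excess: P(n >= D | B)
    return 1.0 - special.gammainc(D + 1, B)      # deficit: P(n <= D | B)

def signed_z(D, B):
    """z-value, intended to be positive for an excess and negative for a deficit."""
    z = math.sqrt(2.0) * special.erfinv(1.0 - 2.0 * poisson_pvalue(D, B))
    return z if D > B else -z

# Hypothetical bins in which data and expectation almost coincide
for D, B in [(6, 5.9), (5, 5.2)]:
    kind = "excess" if D > B else "deficit"
    print("%s: D=%d, B=%.1f, p=%.3f, z=%+.3f"
          % (kind, D, B, poisson_pvalue(D, B), signed_z(D, B)))
```

In both cases the p-value is larger than 0.5, so the inverse error function changes sign: the small excess gets a negative z-value and the small deficit a positive one. Setting the significance of such bins to zero avoids this misleading sign.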

In [None]:
# Perform the cut pvalue<0.5

# Keep the significance only for bins with pvalue<0.5
sigp=[]
for i in range(len(y)):
    if pvalue(y[i], yb[i])<0.5:                    # For pvalues<0.5 ...
        sigp.append(zvalue(pvalue(y[i], yb[i])))   # ... add the z-value
    else:                                          # For pvalues>=0.5 ...
        sigp.append(0)                             # ... add a significance of 0

#Make the plot
figsigp = plt.figure()
ax = plt.subplot(111)

# Label the axis
plt.xlabel('$m_ {jj}$ in TeV')
plt.ylabel('Significance')

# Make a grid
plt.grid()

# Plot the bars in the histogram 
ax.bar(x, sigp, width=width, color='c', edgecolor=['darkcyan']*len(x))

# Save the figure
plt.savefig('Significance.png')

### 4. Comparison of the Bin-by-Bin Analyses
The differences between the ways of depicting the deviations are summarized in the following part.

The absolute and relative differences cannot be compared directly with the approximation and the significance, since they are not expressed in units of standard deviations.  
For highly populated bins, the absolute difference is very large, and it drops to nearly zero for sparsely populated bins. The relative difference shows the opposite behavior: highly populated bins have a relative difference of almost zero, and the relative difference increases rapidly for sparsely populated bins. Comparing these characteristics with the results of the other techniques makes clear that neither method can reliably quantify excesses or deficits.
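
This behavior can be illustrated with a toy example. The sketch below assumes four hypothetical bins whose data lie exactly one standard deviation $\sqrt{B}$ above the expectation; the array names are chosen for this example only and do not refer to the ATLAS data.

```python
import numpy as np

# Hypothetical background expectations, from highly to sparsely populated bins
B_toy = np.array([1.0e6, 1.0e4, 100.0, 4.0])

# Hypothetical data: each bin fluctuates upwards by exactly one standard deviation
D_toy = B_toy + np.sqrt(B_toy)

absolute_diff = D_toy - B_toy                      # shrinks with the bin population
relative_diff = (D_toy - B_toy) / B_toy            # grows as the bin population drops
gauss_signif = (D_toy - B_toy) / np.sqrt(B_toy)    # identical in every bin

for b, a, r, g in zip(B_toy, absolute_diff, relative_diff, gauss_signif):
    print("B = %9.0f   D-B = %6.0f   (D-B)/B = %.3f   (D-B)/sqrt(B) = %.1f" % (b, a, r, g))
```

Although every toy bin deviates by the same number of standard deviations, the absolute difference falls from 1000 to 2 and the relative difference rises from 0.001 to 0.5.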

The differences between the Gaussian approximation of the significance $(D-B) / \sqrt{B}$ and the significance with and without the cut p-value $<0.5$ are shown below. 


In [None]:
# Plotting the different ways in one graph in order to compare the techniques

figall = plt.figure()
ax = plt.subplot(111)

# Label the axis
plt.xlabel('$m_ {jj}$ in TeV')
plt.ylabel('Significance')

# Make a grid
plt.grid()


# Add the Gaussian approximation of the significance
ax.bar(x, app, width=width, alpha = 0.3, ls='dotted',color='salmon', edgecolor=['r']*len(x), label=r'Approximation (D-B) / $\sqrt{B}$')

# Add the significance
ax.bar(x, sig, width=width, alpha = 0.3,  color='plum', edgecolor=['m']*len(x), label='Significance')

# Add the significance with pvalue < 0.5
ax.bar(x, sigp, width=width, alpha = 0.3, ls='-.', color='c', edgecolor=['darkcyan']*len(x), label='Significance with p<0.5')

# (Optional) Add the relative difference
# ax.bar(x, rel, width=width, alpha = 0.3, ls='dashed', color='yellow', edgecolor=['gold']*len(x), label='Relative Difference (D-B) / B')

# (Optional) Add the absolute difference (consider scaling)
# ax.bar(x, absolute, width=width, alpha = 0.3, color='springgreen', edgecolor=['g']*len(x), label='Absolute difference D-B')


# Include the legend
plt.legend()

# Save the figure
plt.savefig('DifferencePlots.png')

The figure above confirms that $\frac{D-B}{\sqrt{B}}$ is a good approximation of the significance for highly populated bins, here in the mass range of $1.1$ to $5.874\,$TeV. For less populated bins, which here correspond to higher invariant dijet masses $m_{jj}$, the approximation breaks down.  
For p-value $>0.5$, the relationship between the p-value and the z-value breaks down and would give a z-value with the opposite sign.  
Thus, the significance with the cut p-value $<0.5$ gives the most accurate measure of the deviations in the bin-by-bin analysis.
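
The breakdown of the Gaussian approximation can also be seen in a small numerical comparison. This is a minimal sketch with two hypothetical bins, one highly and one sparsely populated; the helper names `exact_significance` and `gaussian_approximation` are assumptions made for this example.

```python
import math
from scipy import special

def exact_significance(D, B):
    """z-value from the Poisson p-value (hypothetical helper)."""
    if D > B:                                     # excess: p = P(n >= D | B)
        p = special.gammainc(D, B)
        return math.sqrt(2.0) * special.erfinv(1.0 - 2.0 * p)
    p = 1.0 - special.gammainc(D + 1, B)          # deficit: p = P(n <= D | B)
    return -math.sqrt(2.0) * special.erfinv(1.0 - 2.0 * p)

def gaussian_approximation(D, B):
    """Gaussian approximation (D - B) / sqrt(B) of the significance."""
    return (D - B) / math.sqrt(B)

# Hypothetical bins: a highly populated one and a sparsely populated one
for D, B in [(10100, 10000.0), (6, 2.0)]:
    print("D=%d, B=%.0f: approximation = %.2f, Poisson significance = %.2f"
          % (D, B, gaussian_approximation(D, B), exact_significance(D, B)))
```

For the highly populated bin the two numbers nearly coincide, whereas for the sparsely populated bin the approximation clearly overestimates the significance.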

### 5. The final figure including data, background and histograms ###
The final figure shows data and expectation in one graph. The different methods of determining the differences can be presented in subplots below it. The significance and the relative difference are already included. Other methods, such as the Gaussian approximation, can easily be added.

In [None]:
# Make a final plot with all important comparisons

# Set up the axes with gridspec
result = plt.figure()
grid = plt.GridSpec(12, 4, hspace=0.2, wspace=0.2)   # A grid 12x4 for subplots

# Main plot
main_ax = result.add_subplot(grid[:-4, :], xticklabels=[] )   # Main plot

#Properties of the main plot
plt.grid()
plt.ylabel('Events/Bin')
#plt.xlabel('$m_ {jj}$ in TeV')

## Histogram plots:

#Significance
x_hist = result.add_subplot(grid[-4, :], xticklabels=[1,2,3,4,5,6,7,8,9],sharex=main_ax)
# Properties of the significance plot
plt.grid()
plt.ylim((-2.5,2.5))
plt.ylabel('Sign.')

# Relative Difference
x_hist2 = result.add_subplot(grid[-3, :], xticklabels=[1,2,3,4,5,6,7,8,9],sharex=main_ax)

# Properties of the relative difference plot
x_hist2.yaxis.set_label_position("right")       # Label on the right side
x_hist2.yaxis.tick_right()                      # Ticks on the right side
plt.grid()                                      # Make a grid
plt.ylabel('(D-B)/B')                           # Label the yaxis


# (Optional1) Gaussian approximation of the significance
#x_hist3 = result.add_subplot(grid[-2, :], xticklabels=[1,2,3,4,5,6,7,8,9],sharex=main_ax)
# Properties of the Gaussian approximation plot
#plt.grid()
#plt.ylabel('(D-B)/sqrt(B)')

# (Optional2) Absolute difference
#x_hist4 = result.add_subplot(grid[-1, :], xticklabels=[1,2,3,4,5,6,7,8,9],sharex=main_ax)
# Properties of the absolute difference plot
#x_hist4.yaxis.set_label_position("right")
#x_hist4.yaxis.tick_right()
#plt.grid()
#plt.ylabel('(D-B)')


# x label for the last histogram plot
plt.xlabel('$m_ {jj}$ in TeV')

### Filling with data

# Main plot: Data and background
main_ax.errorbar(x,y, yerr=yderr, fmt='ko', label='Data', markersize=2.1, elinewidth=0.8)
main_ax.step(x, yb, 'r', linewidth= 0.5, data=None, where='mid', label='Background fit')

#ax.text(4,10**5+10**4,'$\sqrt{s} = 13\,$TeV, 37.0 fb$^{-1}$')
#main_ax.set_xscale("log", nonposx='clip')
# Note: in Matplotlib >= 3.3 the keyword nonposy is called nonpositive
main_ax.set_yscale("log", nonposy='clip')

## Filling the histograms:

# Significance
x_hist.bar(x, sigp, width=width, alpha = 1, ls='-', color='r', edgecolor=['k']*len(x), label='Significance with p<0.5')

# Relative difference
x_hist2.bar(x, rel, width=width, alpha = 1, ls='-', color='yellow', edgecolor=['gold']*len(x), label='Relative Difference')

# (Optional1) Gaussian approximation of the significance
#x_hist3.bar(x, app, width=width, alpha = 1, ls='-', color='plum', edgecolor=['m']*len(x), label='Approximation')

# (Optional2) Absolute difference
#x_hist4.bar(x, absolute, width=width, alpha = 1, ls='-', color='lightgreen', edgecolor=['green']*len(x), label='Absolute difference')


# Save the figure
plt.savefig('Results.png')


# Conclusions

The invariant mass $m_{jj}$ of dijet events in a data set recorded in 2015 and 2016 with the ATLAS detector in proton-proton collisions at $\sqrt{s} = 13 \,$TeV, corresponding to an integrated luminosity of $37\,\text{fb}^{-1}$, was investigated. The Jupyter notebook comparing the data with the background prediction was validated against the results of the ATLAS collaboration.

No significant local excess of the measured data over the predicted background is observed. The deviations of all individual bins lie within about 2.4 standard deviations. A deviation of three standard deviations would have been regarded as evidence.

The focus of this notebook is the statistical analysis of comparing a data set with a background prediction. Besides bin-by-bin analysis techniques, hypothesis hypertests are introduced to obtain a global analysis of the spectrum. 

In the first part of the analysis, different ways of bin-by-bin comparison are applied and compared. It was shown that methods such as depicting the absolute or the relative difference between data and background are not comparable with the significance. Additionally, it was shown that the Gaussian approximation of the significance for a Poisson-distributed data set is only suitable for highly populated bins with at least 10 entries. Consequently, the most informative way to illustrate differences is to calculate the significance via the p-value. 

# Credits

The notebook was written by Leonie Hermann as part of a Bachelor's project during an Erasmus exchange at Lund University, under the supervision of Caterina Doglioni (Lund) and Markus Schumacher (Freiburg). Support and troubleshooting were provided by Florido Paganelli (Lund), Matteo Bauce, the Anaconda community and the ROOT notebook community.