Basic theory of likelihood 
===================

This Jupyter notebook gives a basic overview of likelihood theory and maximum likelihood estimators as applied to *Fermi* LAT data.

In introducing some of the probabilistic concepts below, I was blatantly inspired by the excellent [Bayesian methods in astronomy](https://github.com/jakevdp/BayesianAstronomy) tutorial. In fact, I am copying parts of Jake VanderPlas's tutorial below. You should go and read it if you have the chance.

Let's begin with some Python imports:

In [1]:
%pylab inline
import seaborn # for plot formatting

Populating the interactive namespace from numpy and matplotlib


# Fundamental questions of statistics

There are two fundamental types of statistical questions we want to answer:

**1. Model Fitting:** *Given this Model, what parameters best fit my data?*

Examples:

- What are the slope and intercept of a line of best-fit?
- What is the frequency, amplitude, and phase of a sinusoidal fit?

**2. Model Selection:** *Given two potential Models, which better describes my data?*

Examples:

- Does a linear or quadratic fit describe our data better?

Often one of the two models is a *null hypothesis*, or a baseline model in which the effect you're interested in is not observed.

# Likelihood definition

Here, we will focus on frequentist *maximum likelihood* approaches as a way of performing both model fitting and model selection. Bayesian methods are an alternative approach, but given our time constraints we will not cover them.

The starting point of maximum likelihood methods is to define the probability of seeing our data given the model—the likelihood:

$$ P(\mathrm{data} ~|~ \mathrm{scientific\ model}) $$

Let's define some symbols that will let us express this more easily:

$$
P(D ~|~ \theta)
$$

- $\theta$ represents the "science": the set of parameters that we are interested in constraining
- $D$ represents the "observed data"

It makes sense that the best-fit parameters are those that maximize the likelihood defined above, i.e. those under which the observed data are most probable. As far as likelihood methods are concerned, all we need to do now is compute the likelihood and maximize it. This gives us a point estimate of the model parameters that best describe the data.

# Simple example of a statistical model

Since we want to maximize the likelihood, we need an expression to compute $P(D ~|~ \theta)$ for our data as a function of the parameters $\theta$.



If we were given:

- Data points $(x_i, y_i)$ with simple error bars, meaning that the probability distribution of any *single* measured value $y_i$ is a normal distribution centered on the true value
- Model $y_M(x; \theta)$ providing expected values

then

$$
y_i \sim \mathcal{N}(y_M(x_i;\theta), \varepsilon_i^2)
$$

and the likelihood of a single data point would be

$$
P(x_i,y_i\mid\theta) = \frac{1}{\sqrt{2\pi\varepsilon_i^2}} \exp\left(\frac{-\left[y_i - y_M(x_i;\theta)\right]^2}{2\varepsilon_i^2}\right)
$$

where $\varepsilon_i$ are the (known) measurement errors indicated by the error bars.

Assuming all the points are independent, we can find the *full likelihood by multiplying the individual likelihoods together*:

$$
P(D\mid\theta) = \prod_{i=1}^N P(x_i,y_i\mid\theta)
$$

which is a function of the model parameters and the data. From now on, we will refer to the likelihood function as $\mathcal{L}$:

$$ \mathcal{L} \equiv  P(D\mid\theta). $$

For convenience (and also for numerical accuracy) this is often expressed in terms of the *log-likelihood*, which for our simple example is:

$$
\log \mathcal{L} = \log P(D\mid\theta) = -\frac{1}{2}\sum_{i=1}^N\left(\log(2\pi\varepsilon_i^2) + \frac{\left[y_i - y_M(x_i;\theta)\right]^2}{\varepsilon_i^2}\right).
$$
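The remark about numerical accuracy can be illustrated with a minimal sketch. It assumes a hypothetical data set of 2000 points scattered around a constant model $y_M = 1$ with unit errors (these numbers, and all variable names, are chosen only for illustration). The direct product of the individual likelihoods underflows to zero in double precision, while the sum of the logarithms stays finite and agrees with the formula above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Hypothetical setup (illustration only): N points around a constant model
N = 2000
eps = np.ones(N)              # known measurement errors, epsilon_i
y_model = np.ones(N)          # model prediction y_M(x_i; theta) at each point
y = rng.normal(y_model, eps)  # simulated measurements y_i

# Direct product of the individual likelihoods: every factor is below 0.4,
# so the product of 2000 of them is too small for a double and becomes 0.0
likelihood = np.prod(norm.pdf(y, loc=y_model, scale=eps))

# Sum of the individual log-likelihoods: numerically well behaved
log_likelihood = np.sum(norm.logpdf(y, loc=y_model, scale=eps))

# The same quantity written out with the log-likelihood formula above
log_likelihood_formula = -0.5 * np.sum(
    np.log(2 * np.pi * eps**2) + (y - y_model) ** 2 / eps**2
)

print("product of likelihoods:", likelihood)
print("sum of log-likelihoods:", log_likelihood)
print("formula               :", log_likelihood_formula)
```

With smaller errors the individual densities can exceed 1, and the product may overflow instead; working with the logarithm avoids both problems.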

## Exercises

1. Write a Python function that creates some mock data with error bars, given a known line (i.e. known slope and intercept). We will fit this data below in order to recover the line parameters.

2. Write a Python function that computes the log-likelihood given a parameter vector $\theta$, an array of errors $\varepsilon$, and arrays of $x$ and $y$ values.

3. Use tools in [`scipy.optimize`](http://docs.scipy.org/doc/scipy/reference/optimize.html) to maximize this likelihood (i.e. minimize the negative log-likelihood). How close is this result to the input ``theta_true`` that you provided in exercise 1?

## Useful references

- Bevington, *Data Reduction and Error Analysis for the Physical Sciences*
- Lyons, *Statistics for Nuclear and Particle Physicists*

# The case of *Fermi* LAT data

In our case, the input model is the distribution of gamma-ray sources on the sky, including their intensities and spectra. We maximize $\mathcal{L}$ to find the model parameters that best match the data. The data can be binned in several dimensions (energy, sky pixel, time, etc.).

The observed number of counts in bin $i$ follows a [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution). $\mathcal{L}$ is the product, over all bins, of the probability of observing $n_i$ counts when the model predicts $m_i$ counts:

$$
\mathcal{L} = \prod_i \mathcal{L}_i = \prod_i \frac{m_i^{n_i} e^{-m_i}}{n_i !}
$$

Because a product of exponentials is the exponential of the sum of their exponents, $\mathcal{L}$ can be written in a slightly more convenient way:

$$\mathcal{L} = \prod_i e^{-m_i} \prod_i \frac{m_i^{n_i}}{n_i !} = e^{-N_{\rm pred}} \prod_i \frac{m_i^{n_i}}{n_i !}  $$

where $N_{\rm pred} = \sum_i m_i$ is the total number of counts predicted by the model.
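Taking the logarithm of this expression gives the binned Poisson log-likelihood:

$$
\log \mathcal{L} = \sum_i n_i \log m_i - N_{\rm pred} - \sum_i \log n_i !
$$

The last term does not depend on the model, so it can be dropped when maximizing. A minimal sketch of how this could be evaluated and maximized is shown below. It is not the *Fermi* LAT analysis itself: the six-bin `template`, the single free normalization `norm_true`, and the function name `poisson_loglike` are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln


def poisson_loglike(n, m):
    """Binned Poisson log-likelihood: sum_i [n_i log m_i - m_i - log(n_i!)]."""
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    # gammaln(n + 1) equals log(n!) and stays finite for large counts
    return np.sum(n * np.log(m) - m - gammaln(n + 1.0))


rng = np.random.default_rng(0)

# Hypothetical model (illustration only): a fixed shape across six bins,
# scaled by one free normalization parameter
template = np.array([5.0, 12.0, 30.0, 18.0, 7.0, 2.0])
norm_true = 3.0
counts = rng.poisson(norm_true * template)  # simulated observed counts n_i

# Maximize log L by minimizing -log L over the normalization
result = minimize_scalar(
    lambda a: -poisson_loglike(counts, a * template),
    bounds=(1e-3, 100.0),
    method="bounded",
)
norm_fit = result.x

# For this one-parameter model the maximum is known in closed form:
# d(log L)/da = 0 gives a = sum(n_i) / sum(template_i)
norm_analytic = counts.sum() / template.sum()

print("fitted normalization  :", norm_fit)
print("analytic normalization:", norm_analytic)
```

The numerical and analytic estimates should agree to within the optimizer tolerance. Both will differ from `norm_true` by a statistical fluctuation, since the counts are a random Poisson realization.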

MLE for Fermi 
writing the likelihood function
follow Julie’s notes

How to do it in practice: tutorial 

[Solutions to exercises](fermi_likelihood_lecture-solutions.ipynb).