# Approaches to inference

Goals:

* Explore the relationship between "characterizing the posterior PDF" and "fitting a model to data."

* Be able to compare, contrast and appreciate the Bayesian and Frequentist approaches to statistics

## Further reading

* Ivezic 3.2.2, 4.1 and 4.2

* [Vanderplas (2014), "Frequentism and Bayesianism: A Python-driven Primer"](https://arxiv.org/abs/1411.5018)

* [Hogg, Bovy & Lang (2010), "Data analysis recipes: Fitting a model to data"](https://arxiv.org/abs/1008.4686)

## The period-magnitude relation in Cepheid stars

* The brightness of a Cepheid star oscillates with a stable period that appears to be strongly correlated with its mean luminosity (or absolute magnitude).

* In the ["cepheids"](cepheids.ipynb) notebook we looked at some Cepheid measurements reported by [Riess et al (2011)](https://arxiv.org/abs/1103.2976), and in the ["straight line"](straight_line.ipynb) notebook we inferred the parameters of a simple relationship between Cepheid period and apparent magnitude.

* Let's revisit some of the ingredients of that inference, and look at some alternative approaches to investigating our model using the data.

<img src="../graphics/cepheid_data.png" width=100%>

## The model, and the data

* Let's assume that Cepheid stars' luminosities are related to their oscillation periods by a power law, such that their apparent magnitude and log period follow the straight line relation

$\;\;\;\;\;\;\;m = a\;\log_{10} P + b$

* The data consist of *observed magnitudes with quoted uncertainties*, such as: 

$\;\;\;\;\;\;\;m^{\rm obs} = 24.51 \pm 0.31$ at $\log_{10} P = \log_{10} (13.0/{\rm days})$

## The Bayesian solution

* Compute the posterior PDF for the parameters $a$ and $b$ given the data and the assumed model: 

### $\;\;\;\;\;{\rm Pr}(a,b|\boldsymbol{m}^{\rm obs},H) \propto {\rm Pr}(\boldsymbol{m}^{\rm obs}|a,b,H)\;{\rm Pr}(a,b|H)$

* We evaluated the unnormalized posterior PDF on a grid, renormalized it numerically, and then visualized and summarized the resulting 2D function.
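
The grid evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the mock data, the "true" parameter values, the grid limits and the uniform prior are all hypothetical choices made here, and none of it is the Riess et al data or the course's `cepheids.py` code.

```python
import numpy as np

# Hypothetical mock data (assumed values, NOT the Riess et al measurements).
rng = np.random.default_rng(42)
a_true, b_true = -3.0, 26.0            # assumed slope and intercept
x = rng.uniform(0.5, 2.0, size=30)     # plays the role of log10(P / days)
sigma = np.full(x.size, 0.3)           # quoted magnitude uncertainties
y = a_true * x + b_true + sigma * rng.normal(size=x.size)

# Parameter grid; its limits define an assumed uniform prior.
a_grid = np.linspace(-5.0, -1.0, 201)
b_grid = np.linspace(23.0, 29.0, 201)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

# Gaussian log likelihood at every grid point (constant terms dropped).
resid = y[None, None, :] - A[..., None] * x - B[..., None]
log_like = -0.5 * np.sum((resid / sigma) ** 2, axis=-1)

# Uniform prior: the unnormalized posterior is proportional to the likelihood.
post = np.exp(log_like - log_like.max())   # subtract the max to avoid underflow

# Renormalize numerically so the posterior integrates to 1 over the grid.
da = a_grid[1] - a_grid[0]
db = b_grid[1] - b_grid[0]
post /= post.sum() * da * db

# Summarize: 1D marginal PDFs and their means.
p_a = post.sum(axis=1) * db
p_b = post.sum(axis=0) * da
a_mean = np.sum(a_grid * p_a) * da
b_mean = np.sum(b_grid * p_b) * db
print(f"posterior mean a = {a_mean:.2f}, b = {b_mean:.2f}")
```

Working with the log of the posterior and subtracting its maximum before exponentiating is a common guard against numerical underflow; the normalization step then absorbs the dropped constants.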

## Fitting the data

* The Bayesian solution is not a single set of "best-fit" parameters. 

* We can think of the posterior PDF as providing us with a continuous distribution of model fits that are _plausible_ given the data and our assumptions.

* There are other ways of defining the parameters that _best fit_ the data: the primary one is "the method of Maximum Likelihood"

## Maximum likelihood

* Instead of asking for the posterior probability for the parameters given the data, ${\rm Pr}(a,b|\boldsymbol{m}^{\rm obs},H)$, we could find the parameters that maximize the probability of getting the data: ${\rm Pr}(\boldsymbol{m}^{\rm obs}|a,b,H)$

  > In astronomy, "best fit" often (but not always) means "maximum likelihood"

* Where does the emphasis on the likelihood, rather than the posterior, come from?

## Frequentism

* In the frequentist school of statistics, parameters do not have probability distributions. Probability can only be used to describe _frequencies_, not _degrees of belief_ (or odds). 

* In the frequentist view, it's only the data that can be modeled as having been drawn from a probability distribution, because we can imagine doing the experiment or observation multiple times, and building up a _frequency_ distribution of results.

* This PDF is the sampling distribution, e.g. ${\rm Pr}(\boldsymbol{m}^{\rm obs}|a,b,H)$

## Frequentism

* Given an assumed model, the frequentist view is that there is only one set of parameters, the true ones, and our job is to _estimate_ them.

* Derivation of good estimators is a major activity in frequentist statistics, and has led to some powerful mathematical results and fast computational shortcuts.

## The Likelihood Principle

* The _likelihood principle_ holds that all of the information in the data that is relevant to the model parameters is contained in the likelihood function $\mathcal{L}(a,b) = {\rm Pr}(\boldsymbol{m}^{\rm obs}|a,b,H)$

> This was evident in our Bayesian treatment, PGMs etc too: Frequentists and Bayesians are in full agreement about the importance of the likelihood function!

* As a result of this focus, Maximum Likelihood estimators (MLEs) have some good properties

## Maximum likelihood estimators

* Consistency: as more data are taken, the MLE tends towards the true parameter value if the model is correct. 

> MLEs can be "biased" but this bias goes to zero as $N_{\rm data} \rightarrow \infty$ 

* Efficiency: in the large dataset limit, MLEs have the minimum variance among unbiased estimators when sampled over datasets

* Asymptotic Normality: as the dataset size increases, the distribution of MLEs over datasets tends to a Gaussian centred at the true parameter value.

> The covariance of this ultimate Gaussian distribution is the inverse of the "Fisher information matrix"

## Maximum likelihood

* In our Cepheid straight line fit, we can derive MLEs for our parameters by finding the maximum (log) likelihood parameters analytically

$\;\;\;\;\;\;\;\log L(a,b) = \log {\rm Pr}(\boldsymbol{m}^{\rm obs}|a,b,H) = -\frac{1}{2}\sum_k \log{2\pi\sigma_k^2} - \frac{1}{2} \sum_k \frac{(m^{\rm obs}_k - a\log_{10}{P_k} - b)^2}{\sigma_k^2}$

* That is, we need to find the parameters $(\hat{a}, \hat{b})$ that give

$\;\;\;\;\;\;\; -2 \nabla \log L(a,b) = \nabla\,\chi^2 = 0$

> NB: Maximizing a Gaussian likelihood is equivalent to minimizing $\chi^2$ - and gives a "weighted least squares" fit
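
As a quick numerical check of this equivalence, the sketch below evaluates $-2\log L$ and $\chi^2$ at a few parameter values and shows that they differ only by a constant that does not depend on $(a, b)$. The mock data and the function names `log_likelihood` and `chi_squared` are hypothetical, introduced here for illustration only.

```python
import numpy as np

# Hypothetical mock data with unequal uncertainties (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=20)
sigma = rng.uniform(0.1, 0.4, size=x.size)
y = -3.0 * x + 26.0 + sigma * rng.normal(size=x.size)

def log_likelihood(a, b):
    """Gaussian log likelihood of the straight line y = a*x + b."""
    resid = y - a * x - b
    norm = -0.5 * np.sum(np.log(2.0 * np.pi * sigma**2))
    return norm - 0.5 * np.sum((resid / sigma) ** 2)

def chi_squared(a, b):
    """Sum of squared, uncertainty-weighted residuals."""
    return np.sum(((y - a * x - b) / sigma) ** 2)

# -2 log L minus chi^2 should be the same number at every (a, b).
trial_points = [(-3.0, 26.0), (-2.5, 25.0), (0.0, 20.0)]
offsets = [-2.0 * log_likelihood(a, b) - chi_squared(a, b) for a, b in trial_points]
print(offsets)
```

Because the offset is independent of the parameters, the location of the likelihood maximum and of the $\chi^2$ minimum coincide.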

## Maximum likelihood

The result in this case is a pair of equations that we can solve for the best-fit parameters $(\hat{a}, \hat{b})$, which give the smallest misfit between observed and model-predicted data.

Writing $x = \log_{10}{P}$ and $y = m^{\rm obs}$, we have

$\frac{\partial \log L}{\partial a}\Bigr|_{\hat{a},\hat{b}} =  \sum_k \frac{x_k(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow   \hat{a} \sum_k \frac{x_k^2}{\sigma_k^2} + \hat{b} \sum_k \frac{x_k}{\sigma_k^2} = \sum_k \frac{x_k y_k}{\sigma_k^2}$

$\frac{\partial \log L}{\partial b}\Bigr|_{\hat{a},\hat{b}} =  \sum_k \frac{(y_k - \hat{a}x_k - \hat{b})}{\sigma_k^2} = 0 \longrightarrow \hat{a} \sum_k \frac{x_k}{\sigma_k^2} + \hat{b} \sum_k \frac{1}{\sigma_k^2} = \sum_k \frac{y_k}{\sigma_k^2}$

## Maximum likelihood

This set of linear equations can be solved straightforwardly to find the **_estimators_** $\hat{a}$ and $\hat{b}$:

$\mathcal{S} \boldsymbol{\hat{\theta}} = \boldsymbol{v} \longrightarrow \begin{pmatrix} S_{xx} & S_{x} \\ S_x & S_0 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} S_{xy} \\ S_{y} \end{pmatrix}$

> All the information in the data that is needed to find the best-fit parameters $\boldsymbol{\hat{\theta}}$ is contained in a set of so-called **sufficient statistics** packaged in $\mathcal{S}$ and $\boldsymbol{v}$. This is a common feature of maximum likelihood estimators
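
The sufficient statistics and the solution of the linear system can be sketched in a few lines of NumPy. The helper below is a hypothetical stand-in written for this illustration (it is not the `sufficient_statistics` method of the course's `Cepheids` class), and the data are mock values. The result is cross-checked against `np.polyfit`, which performs the same weighted least squares fit when given weights `1/sigma`.

```python
import numpy as np

def sufficient_statistics(x, y, sigma):
    """Return the matrix S and vector v of the normal equations S theta = v."""
    w = 1.0 / sigma**2
    S = np.array([[np.sum(w * x * x), np.sum(w * x)],
                  [np.sum(w * x),     np.sum(w)]])
    v = np.array([np.sum(w * x * y), np.sum(w * y)])
    return S, v

# Hypothetical mock data (assumed values).
rng = np.random.default_rng(3)
x = rng.uniform(0.5, 2.0, size=40)
sigma = rng.uniform(0.1, 0.4, size=x.size)
y = -3.0 * x + 26.0 + sigma * rng.normal(size=x.size)

S, v = sufficient_statistics(x, y, sigma)
a_hat, b_hat = np.linalg.solve(S, v)

# Cross-check with NumPy's weighted polynomial fit (degree 1).
a_ref, b_ref = np.polyfit(x, y, 1, w=1.0 / sigma)
print(a_hat, b_hat, a_ref, b_ref)
```

Using `np.linalg.solve` rather than explicitly inverting $\mathcal{S}$ is the usual choice when only the estimators are needed.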

## Maximum likelihood

Let's find the maximum likelihood parameters in the Cepheid problem.

In [None]:
exec(open('../code/cepheids.py').read())
%matplotlib inline
plt.rcParams['figure.figsize'] = (15.0, 8.0)

data = Cepheids('../examples/Cepheids/R11ceph.dat')

In [None]:
data.select(4258)
M, v = data.sufficient_statistics()
a, b = np.linalg.solve(M, v)
# Raw strings keep the backslash in '\hat' from being read as an escape sequence
print(r'\hat{a} = %.2f' % a)
print(r'\hat{b} = %.2f' % b)

## Visualizing the best fit model

In [None]:
data.plot(4258)

data.overlay_straight_line_with(a=a, b=b, label='Maximum Likelihood fit')

data.add_legend()

## Uncertainties in the estimators

* In frequentism, we think of the estimators having distributions, since each dataset that we imagine being drawn from the sampling distribution will produce one estimator. An ensemble of (hypothetical) datasets leads to a (hypothetical) distribution of estimators

* One straightforward approximate way to estimate these distributions is to use the asymptotic normality property of MLEs, and associate a _Gaussian approximation to the likelihood_ with the Gaussian distribution for the MLEs we expect to see when averaging over datasets
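
The hypothetical ensemble of datasets can be simulated directly. The sketch below draws many mock datasets from an assumed sampling distribution, computes the maximum likelihood estimators for each one, and compares the scatter of the estimators with the covariance predicted by the Gaussian approximation (the inverse of the matrix of sufficient statistics). All numerical values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true = -3.0, 26.0            # assumed true parameters
x = np.linspace(0.5, 2.0, 25)          # fixed log-periods
sigma = np.full(x.size, 0.3)           # fixed uncertainties
w = 1.0 / sigma**2

# The matrix S depends only on x and sigma, so it is shared by all datasets.
S = np.array([[np.sum(w * x * x), np.sum(w * x)],
              [np.sum(w * x),     np.sum(w)]])
predicted_cov = np.linalg.inv(S)

# Draw an ensemble of datasets from the sampling distribution.
n_datasets = 5000
Y = a_true * x + b_true + sigma * rng.normal(size=(n_datasets, x.size))

# One (a_hat, b_hat) pair per dataset: solve S theta = v for every v at once.
V = np.stack([Y @ (w * x), Y @ w], axis=0)     # shape (2, n_datasets)
estimates = np.linalg.solve(S, V).T            # shape (n_datasets, 2)

empirical_cov = np.cov(estimates, rowvar=False)
print("predicted covariance:\n", predicted_cov)
print("empirical covariance:\n", empirical_cov)
```

For this linear Gaussian model the agreement is limited only by the finite number of simulated datasets; for nonlinear models it would hold only approximately, in the large dataset limit.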

## Uncertainties in the estimators

* The distribution of the log likelihood itself over the hypothetical ensemble of datasets provides a route to a confidence interval.

* In our simple Gaussian likelihood example, and also in the large dataset limit, the difference $\Delta\chi^2 = \chi^2 - \chi^2_{\rm min}$ between the $\chi^2$ at the true parameter values and at its minimum follows a $\chi^2$ distribution with the same number of degrees of freedom as the dimensionality of the parameter space, $D$. Integrating this distribution from 0 to some boundary $\Delta \chi^2_{D}$ defines a confidence region.

## Uncertainties in the estimators

* For example: the 68% confidence region is bounded by the contour at $\chi^2_{\rm min} + \Delta \chi^2_{D}$, where $\Delta \chi^2_{D} = 1$ in 1D and $\Delta \chi^2_{D} = 2.30$ in 2D.

* In the 1D case, the boundary of the 68% confidence interval lies 1 standard deviation (or "1-sigma") from the mean.

> In general, $\Delta \chi^2_{D}$ can be computed from the $\chi^2$ distribution "inverse survival function", e.g.  `scipy.stats.chi2.isf(1.0-0.683, D)`
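
A small sketch of that calculation, wrapping the `scipy.stats.chi2.isf` call quoted above in a hypothetical helper named `delta_chi2` (the name and the choice of confidence levels are assumptions made here):

```python
from scipy.stats import chi2

def delta_chi2(level, n_dim):
    """Delta chi^2 bounding the confidence region of a given level in n_dim dimensions."""
    return chi2.isf(1.0 - level, n_dim)

# Tabulate the boundary for a few dimensionalities and confidence levels.
for n_dim in (1, 2, 3):
    values = [round(float(delta_chi2(level, n_dim)), 2) for level in (0.683, 0.954)]
    print(n_dim, values)
```

The 68.3% entries for one and two dimensions should reproduce, to rounding, the values of 1 and 2.30 quoted above.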

## Uncertainties in the estimators

* In our _linear_ example, the likelihood is Gaussian in the estimators $\boldsymbol{\hat{\theta}}$: the exponent is:

### $\;\;\;\;\;\frac{\chi^2}{2} = \frac{1}{2}(\boldsymbol{\hat{\theta}} - \boldsymbol{\theta})^T (\mathcal{M}^T C^{-1} \mathcal{M}) (\boldsymbol{\hat{\theta}} - \boldsymbol{\theta})$, 

$\;\;\;\;\;$where $C$ is the covariance matrix of the data (i.e. $C$ is diagonal, with elements equal to the squared uncertainties on each datapoint) and $\mathcal{M}$ is the Nx2 _design matrix_ that predicts data given parameters via $\mathcal{M}\boldsymbol{\theta} = \boldsymbol{m}$. The covariance matrix of the estimators is then $(\mathcal{M}^T C^{-1} \mathcal{M})^{-1}$.

* This $\chi^2$ function can be computed on a grid, and visualized as a contour plot: the contour at $\chi^2_{\rm min} + 2.30$ will enclose the 68% confidence region.
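
The matrix form can be checked numerically. The sketch below builds a design matrix for the straight line model on mock data, computes the estimators and their covariance, and verifies that $\chi^2$ rises above its minimum by exactly the quadratic form in $\mathcal{M}^T C^{-1} \mathcal{M}$. The data values, the trial offset and the variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical mock data (assumed values).
rng = np.random.default_rng(5)
x = rng.uniform(0.5, 2.0, size=30)
sigma = rng.uniform(0.1, 0.4, size=x.size)
y = -3.0 * x + 26.0 + sigma * rng.normal(size=x.size)

# Design matrix: one row per datapoint, one column per parameter (a, b).
M = np.column_stack([x, np.ones_like(x)])
C_inv = np.diag(1.0 / sigma**2)

# Inverse covariance of the estimators, and the estimators themselves.
F = M.T @ C_inv @ M
theta_hat = np.linalg.solve(F, M.T @ C_inv @ y)
cov = np.linalg.inv(F)

def chi_squared(theta):
    """Weighted sum of squared residuals for parameters theta = (a, b)."""
    r = y - M @ theta
    return r @ C_inv @ r

# chi^2 at a displaced point exceeds the minimum by the quadratic form.
d = np.array([0.1, -0.2])
delta = chi_squared(theta_hat + d) - chi_squared(theta_hat)
quadratic_form = d @ F @ d
print(delta, quadratic_form)
print("1-sigma uncertainties:", np.sqrt(np.diag(cov)))
```

Evaluating `chi_squared` over a grid of `theta` values and contouring it at the minimum plus 2.30 would trace the same ellipse that `cov` describes.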

## Uncertainties in the estimators

* In general, the covariance matrix of a _Gaussian approximation to the likelihood_ can be calculated by taking second derivatives of the log likelihood at the peak, and inverting the resulting _Hessian_ matrix. 

* This gives a lower limit to the covariance of a set of estimators: 

## $\;\;\;\;\;V \geq \mathcal{H}^{-1}, \;\;\; \mathcal{H}_{ij} = -\frac{\partial^2 \log{L}}{\partial\theta_i\partial\theta_j} \biggr|_{\boldsymbol{\hat{\theta}}}$

> An approximation to the inverse Hessian $\mathcal{H}^{-1}$ is what [`scipy.optimize.minimize`](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.minimize.html) returns in its `hess_inv` field (for methods such as BFGS) if you pass it the negative log likelihood.

In [None]:
# Generalized maximum likelihood approach:

import scipy.optimize

pars = np.array([0.0, 20])
result = scipy.optimize.minimize(data.negative_log_likelihood, pars, method='BFGS', tol=0.001)

print(result)

In [None]:
C = result.hess_inv
np.sqrt(C[0,0]), np.sqrt(C[1,1])

## Wording

* Frequentist confidence intervals are different from Bayesian credible regions:

> "68% of datasets would give a 68% frequentist confidence interval that contains the true parameter value"

> "The probability of the true parameter value lying within the 68% Bayesian credible region is 68%"

* The difference in wording comes from the two different definitions of probability

## Uncertainties in the estimators

* The covariance matrix of a Gaussian approximation to the likelihood defines a 1-sigma, 2D, elliptical, _frequentist confidence interval_

* Since this came from transforming the sampling distribution, which is a PDF over datasets, the confidence interval enables conclusions in terms of fractions of an ensemble of datasets

* The 68% confidence interval is the region that we expect to contain the true parameter value 68% of the time
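
The "68% of the time" statement can be tested by simulation. The sketch below repeats a mock observation many times, builds the 1-sigma interval for the slope from each dataset, and counts how often it contains the assumed true value. It checks the 1D interval for a single parameter rather than the 2D ellipse, and every numerical value in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
a_true, b_true = -3.0, 26.0            # assumed true parameters
x = np.linspace(0.5, 2.0, 25)
sigma = np.full(x.size, 0.3)
w = 1.0 / sigma**2

# Covariance of the estimators from the sufficient-statistics matrix.
S = np.array([[np.sum(w * x * x), np.sum(w * x)],
              [np.sum(w * x),     np.sum(w)]])
sigma_a = np.sqrt(np.linalg.inv(S)[0, 0])

# Repeat the observation many times and refit the slope each time.
n_datasets = 4000
Y = a_true * x + b_true + sigma * rng.normal(size=(n_datasets, x.size))
V = np.stack([Y @ (w * x), Y @ w], axis=0)
a_hat = np.linalg.solve(S, V)[0]

# Fraction of datasets whose interval a_hat +/- sigma_a contains a_true.
covered = np.abs(a_hat - a_true) <= sigma_a
coverage = covered.mean()
print(f"coverage = {coverage:.3f}")
```

The interval moves from dataset to dataset while the true value stays fixed, which is the frequentist reading of the wording above.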

## Frequentism and Bayesianism

* In frequentism, the data are considered to be random variables (in large sets of hypothetical trials described with probability distributions) while parameters are considered fixed (and to be estimated)

* In Bayesianism, the data are considered to be fixed (as constants, in datafiles) while parameters are considered random variables (to be inferred, with uncertainty described by probability distributions)

## Frequentism and Bayesianism

Given an assumed model:

* Frequentists seek to _transform_ the frequency distribution of the data into a frequency distribution of their estimators, and hence quantify their uncertainty in terms of _what they expect would happen if the observation were to be repeated_  
  
* Bayesians seek to _update their beliefs_ about their model parameters, and hence quantify their uncertainty in terms of _what might have been had the observation been different_, and _what they knew before the data were taken_

#### Q: Which approach is better matched to science?

## Frequentism vs. Bayesianism


* "You have to assume a prior" cf. "You get to assume a prior"

* "Your calculations are computationally expensive"

* "How do you account for your nuisance parameters?"

* "Your conclusions are not relevant"

## Things to remember

* The most important thing is to _know what you are doing_, and to _communicate that clearly to others_

* Both approaches involve generative models and their assumptions, which must be recorded and tested

* The Bayesian approach provides a logical framework for combining datasets and additional information, and provides answers in terms of the probability distribution for the model parameters

* The Frequentist approach provides a way of studying the model independent of additional information, and provides answers in terms of the probability of getting the data

## Endnote

* The astronomy literature contains a mixture of frequentist and Bayesian analysis, sometimes within the same paper

* Frequentist estimators often make good _summary statistics_ with well understood sampling distributions: astronomical catalogs are full of them

* In most of this course we follow the Bayesian approach: Bayes' Theorem gives you a framework for deriving the solution to _any_ inference problem you encounter.  Having said that, we'll keep our eyes open.