**Tutorial 9 - Bayesian Model Checking**

In this tutorial we will do some Bayesian model checking on the type Ia supernovae data that we used in tutorial 8.

The object of this exercise is to see whether the data is consistent with the cosmological and statistical model we have been using to find the best-fit parameters.  The assumptions behind that model could be wrong in a number of ways.  For example, the errors in the distance moduli might not be Gaussian or might not be measured correctly, the cosmological model we are assuming might be incorrect, or the observed redshifts, which we have treated as an independent variable, might have significant errors in them.  We might also wonder whether the cosmological model we have assumed, which is flat (no curvature) and has a cosmological constant, is too simple.  Might there be evidence that the cosmological constant is evolving?

In [None]:
import pandas as pa
import numpy as np
import matplotlib.pyplot as plt

Read in the MCMC chain of parameters that you calculated in tutorial 8.  Put it into a numpy array so that `mcmc_chain[0]` is the first pair of parameters, with the `M` parameter first.
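
As a hedged illustration of the conversion, the sketch below reads a tiny in-memory CSV and turns it into an array with the offset in the first column. The column names `M` and `omega_m` and all the numbers are assumptions made for this example; the header of your own `mcmc.csv` depends on how you saved the chain in tutorial 8.

```python
import io
import pandas as pa

# Hypothetical stand-in for mcmc.csv; real column names and values may differ.
# The columns are deliberately stored in the "wrong" order here.
toy_csv = io.StringIO("omega_m,M\n0.28,0.10\n0.30,0.12\n0.27,0.09\n")
toy_df = pa.read_csv(toy_csv)

# Select the columns explicitly so the offset comes first, then convert.
toy_chain = toy_df[["M", "omega_m"]].to_numpy()

print(toy_chain[0])     # first (M, omega_m) pair
print(toy_chain.shape)  # (number of chain points, 2)
```

Selecting the columns by name rather than relying on the file order guards against a chain that was written with $\Omega_m$ first.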

In [None]:
df = pa.read_csv('mcmc.csv')
mcmc_chain = _______
# this should print the first point in the chain
print(mcmc_chain[0])

Read in the supernova data again and store the redshifts, distance moduli and their errors in the arrays named below.

In [None]:
data = pa.read_csv("SCPUnion2.1_mu_vs_z.txt",sep='\t',comment='#')

redshifts = ________
magnitudes = __________
errors = ________


1) Copy your definition of `mu_model()` from tutorial 8.  Modify it so that it doesn't take the redshifts as a parameter, but instead uses the array `redshifts`, which is already defined and accessible from inside the function.  Make sure that `params[0]` is the magnitude offset and `params[1]` is $\Omega_m$.

In [None]:
#from astropy.cosmology import FRW
import astropy.cosmology as cosmo
def mu_model(params):
    .
    .
    .

2) We will need to generate new mock data sets given a set of parameters.

Make a function that takes the parameters ($M$ and $\Omega_m$) and generates a new data set with the same size and redshifts as the observed one.  Inside the function it should use `mu_model()` and the distance modulus errors.  Assume the errors are normally distributed in magnitudes.  No looping should be necessary.  (hint: Use `numpy.random.normal()` )
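
The reason no loop is needed is that the normal sampler accepts arrays for both the mean and the standard deviation and draws one value per element. A minimal sketch with made-up numbers (the arrays `toy_model` and `toy_sigma` are placeholders, not the supernova data) might look like this:

```python
import numpy as np

# Seeded only so that the sketch is reproducible.
rng = np.random.default_rng(0)

# Toy stand-ins for the model prediction and the per-point errors.
toy_model = np.array([35.0, 38.0, 40.0, 42.0])
toy_sigma = np.array([0.1, 0.2, 0.15, 0.3])

# One call draws every point: loc and scale are matched element by element.
toy_mock = rng.normal(loc=toy_model, scale=toy_sigma)

print(toy_mock)
```

The older function `numpy.random.normal(loc, scale)` mentioned in the hint handles array arguments in the same way.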

In [None]:
def data_generator(params) :
    .
    .
    .
    

Let's do a simple test to see that things are working so far :

In [None]:
x = data_generator([12,0.7])
print(x[0:10])

3) Now we need to define the $\chi^2$ function, which in this case is the same as the negative log-likelihood up to a constant factor and an additive constant.

Make a `chi2()` function that takes the parameters and the distance moduli (`mags`) and returns the $\chi^2$.  It should use the `errors` array and `mu_model()` inside it.
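
For independent Gaussian errors the statistic is $\chi^2 = \sum_i \left[ (\mu_i - \mu_{model}(\theta, z_i))/\sigma_i \right]^2$. The sketch below evaluates that sum on toy arrays; `toy_chi2` and its inputs are illustrative names and numbers, and your own `chi2()` should use `mu_model()` and `errors` as described above.

```python
import numpy as np

def toy_chi2(model, data, sigma):
    """Sum of squared residuals, each scaled by its own error."""
    return np.sum(((data - model) / sigma) ** 2)

toy_data = np.array([1.0, 2.0, 3.0])
toy_model = np.array([1.1, 1.8, 3.0])
toy_sigma = np.array([0.1, 0.1, 0.2])

# Scaled residuals are about -1, 2 and 0, so the sum is about 5.
print(toy_chi2(toy_model, toy_data, toy_sigma))
```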

In [None]:

def chi2(params,mags) :
    .
    .


4) Use the function `scipy.optimize.minimize` to minimize `chi2()` to find the best fit parameters for the observed data.  Note that the object returned by `scipy.optimize.minimize` has both the parameter values at the minimum (`result.x`) and the value of the function at the minimum (`result.fun`).  Store these values for later use.
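
To show how `args`, `bounds` and the returned object fit together, here is a small sketch that fits a straight line to noise-free toy data. The model, the data and the names `toy_x`, `toy_y` and `toy_chi2` are assumptions made for this example only.

```python
import numpy as np
from scipy.optimize import minimize

toy_x = np.linspace(0.0, 1.0, 50)
toy_sigma = np.full(toy_x.size, 0.1)
toy_y = 2.0 + 0.3 * toy_x   # noise-free toy data, true parameters (2.0, 0.3)

def toy_chi2(params, y):
    # params[0] is an offset, params[1] a slope restricted to [0, 1] below
    model = params[0] + params[1] * toy_x
    return np.sum(((y - model) / toy_sigma) ** 2)

# Extra arguments of the function are passed through args;
# bounds holds one (low, high) pair per parameter.
toy_fit = minimize(toy_chi2, [0.0, 0.5], args=(toy_y,),
                   bounds=((None, None), (0, 1)))

print(toy_fit.x, toy_fit.fun)
```

Because the data are passed through `args`, the same function can later be minimized for each simulated data set without redefining it.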

In [None]:
from scipy.optimize import minimize

param_init = _________
bestfit_observed = minimize(chi2,param_init,args=(_______),bounds=((None, None), (0, 1)))

print("best fit model = ",bestfit_observed.x," Its chi^2 is ",bestfit_observed.fun)

5) Besides $\chi^2$, we could also think of other statistics.  One that we will use is the maximum absolute residual with respect to the best-fit model prediction, in units of the error on each point:

${\rm max}_i\left| \frac{\mu_i - \mu_{model}(\hat{\theta},z_i)}{\sigma_i}  \right|$

If the errors are not Gaussian, in the sense that there are catastrophic outliers, we would expect this value to be higher than the model predicts.

Find the maximum absolute residual for the data set and store it in `max_res_observed`.
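
The statistic can be computed without a loop. The toy sketch below uses invented arrays; in the exercise the model values would come from `mu_model()` evaluated at the best-fit parameters.

```python
import numpy as np

toy_data = np.array([1.0, 2.0, 3.0, 4.0])
toy_model = np.array([1.1, 1.9, 3.0, 5.0])
toy_sigma = np.array([0.1, 0.2, 0.1, 0.25])

# Residuals in units of the error on each point.
toy_resid = (toy_data - toy_model) / toy_sigma

# The last point is 4 sigma away from the model, so it sets the maximum.
toy_max_res = np.max(np.abs(toy_resid))
print(toy_max_res)
```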

In [None]:
.
.
.
print("maximum absolute residual for data = ",max_res_observed)

6) Now we have all the tools to create a sample drawn from the distribution

$p(T) = \int d\theta ~p(\theta | D )\, p(T | \theta) = \int d\theta ~p(\theta | D ) \int_{V(T)}dx~ p(x | \theta)$

where $T$ is a statistic and $V(T)$ is the volume in data-space where $T(x)=T$.  For $T$ we will choose the minimum $\chi^2$ and the maximum absolute residual.  Fill in the missing lines and store the values of these statistics for each iteration.  Make at least 1000 simulated data sets.
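
The integral is sampled in two steps: draw $\theta$ from the posterior (here, a random point of the chain), then draw a data set $x$ from $p(x|\theta)$ and evaluate $T(x)$. The sketch below does this for a deliberately simple toy problem, a constant model with one parameter, where the $\chi^2$ minimum is the weighted mean and no optimizer is needed. The chain, the errors and all names are invented for illustration and are not the supernova setup.

```python
import numpy as np

rng = np.random.default_rng(1)

n_points = 20
toy_sigma = np.full(n_points, 0.5)

# Toy "posterior chain" for a single parameter (the mean); purely illustrative.
toy_chain = rng.normal(loc=10.0, scale=0.1, size=5000)

toy_stat = []
for _ in range(1000):
    # draw theta from p(theta | D): a random point of the chain
    theta = toy_chain[rng.integers(len(toy_chain))]
    # draw a data set x from p(x | theta)
    x = rng.normal(loc=theta, scale=toy_sigma)
    # refit: for a constant model the weighted mean minimizes chi^2
    theta_hat = np.average(x, weights=1.0 / toy_sigma**2)
    # T(x) = minimum chi^2 of the simulated data set
    toy_stat.append(np.sum(((x - theta_hat) / toy_sigma) ** 2))

toy_stat = np.array(toy_stat)
print(toy_stat.mean())
```

With one fitted parameter the mean of these values should lie near `n_points - 1`, which is a quick sanity check that the refit step is doing its job.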

In [None]:

from numpy.random import randint
from scipy.optimize import minimize


chi2array = []
max_res = []

nmcmc = len(mcmc_chain)
for i in range(1000) :
    # take random set of parameters from the Markov Chain 
    ________________
    # generate a new random data set from the model and this set of parameters 
    ________________
    # find the maximum likelihood parameters for the new dataset
    ________________
    
    chi2array.append(_______)
    
    max_res.append(_______)


7) Make a histogram of the maximum absolute residuals (MAR).  Then make a plot of the cumulative distribution of them.  Mark on this last plot the observed MAR with a vertical line.

8) Make a histogram of the minimum $\chi^2$s. Then make a plot of the cumulative distribution of them. Mark on this last plot the observed minimum $\chi^2$ with a vertical line.
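
One way to build a cumulative distribution from a list of simulated values is to sort them and assign each the fraction of the sample at or below it. The sketch below does this for stand-in values drawn from a $\chi^2$ distribution; `toy_stat` and `toy_observed` are hypothetical and take the place of your simulated statistics and observed value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the simulated statistics; not real results.
toy_stat = rng.chisquare(df=19, size=1000)
toy_observed = 25.0   # hypothetical observed value

# Empirical cumulative distribution: sorted values against rank / N.
xs = np.sort(toy_stat)
cdf = np.arange(1, xs.size + 1) / xs.size

# Fraction of simulations at or below the observed value.
frac_below = np.searchsorted(xs, toy_observed, side="right") / xs.size
print(frac_below)
```

For the plots, `plt.hist(toy_stat, bins=30)` gives the histogram, `plt.step(xs, cdf, where="post")` draws the cumulative curve and `plt.axvline(toy_observed)` marks the observed value.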

9) Calculate the right-hand, one-sided p-values for the MAR and the minimum $\chi^2$.  Is this model for cosmology and errors consistent with the data in terms of these two statistics?
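
A right-hand, one-sided p-value can be estimated as the fraction of simulated statistics that are at least as large as the observed one. The helper name `right_tail_p_value` and the numbers below are made up for this sketch.

```python
import numpy as np

def right_tail_p_value(simulated, observed):
    """Fraction of simulated values greater than or equal to the observed one."""
    simulated = np.asarray(simulated)
    return np.mean(simulated >= observed)

toy_sim = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# 2 of the 5 toy values are >= 4.0, so the estimate is 0.4.
print(right_tail_p_value(toy_sim, 4.0))
```

A very small value means the observed statistic is rarely reached by data simulated from the model, which counts as evidence against the model in terms of that statistic.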