$\newcommand{\pr}{\textrm{Pr}}$
$\newcommand{\l}{\left}$
$\newcommand{\r}{\right}$
$\newcommand\given[1][]{\:#1\vert\:}$
$\newcommand{\var}{\textrm{Var}}$
$\newcommand{\mc}{\mathcal}$
$\newcommand{\lp}{\left(}$
$\newcommand{\rp}{\right)}$
$\newcommand{\lb}{\left\{}$
$\newcommand{\rb}{\right\}}$
$\newcommand{\iid}{\textrm{i.i.d. }}$
$\newcommand{\ev}{\textrm{E}}$
$\newcommand{\odds}{\textrm{odds}}$
$\newcommand{\normal}{\textrm{normal}}$
$\newcommand{\gamma}{\textrm{gamma}}$
$\newcommand{\mode}{\textrm{mode}}$
$\newcommand{\MSE}{\textrm{MSE}}$

# 5.1 The normal model

A random variable $Y$ is said to be normally distributed with mean $\theta$ and 
variance $\sigma^2 > 0$ if the density of $Y$ is given by
$$p\lp y \given \theta, \sigma^2 \rp = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\lp\frac{y-\theta}{\sigma}\rp^2} $$
for $-\infty < y < \infty$. Some important things about this distribution:

* the distribution is symmetric around $\theta$, and the mode, median, and mean are all equal to $\theta$
* about 95% of the population lies within two standard deviations of the mean (more precisely, 1.96 standard deviations)
* if $X \sim \normal\lp \mu, \tau^2\rp$, $Y \sim \normal\lp\theta,\sigma^2\rp$, and $X$ and $Y$ are independent, 
then $aX + bY \sim \normal\lp a\mu + b\theta, a^2\tau^2 + b^2\sigma^2\rp$
* the R commands for working with the normal distribution take the standard deviation $\sigma$, not the variance
$\sigma^2$, as input
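
A minimal Python sketch of the last three bullets. It uses the standard library's `statistics.NormalDist` as a stand-in for the R commands; like them, it takes the standard deviation rather than the variance. The parameter values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameter values (assumptions, not from the text).
theta, sigma2 = 1.0, 4.0   # mean and variance of Y
mu, tau2 = -2.0, 9.0       # mean and variance of X

# NormalDist expects the standard deviation, so pass sqrt(variance).
Y = NormalDist(mu=theta, sigma=sigma2 ** 0.5)
X = NormalDist(mu=mu, sigma=tau2 ** 0.5)

# Probability mass within 1.96 standard deviations of the mean.
coverage = Y.cdf(theta + 1.96 * Y.stdev) - Y.cdf(theta - 1.96 * Y.stdev)

# Linear combination of independent normals: aX + bY.
a, b = 2.0, 3.0
Z = a * X + b * Y

print(coverage)            # close to 0.95
print(Z.mean, Z.variance)  # a*mu + b*theta and a^2*tau2 + b^2*sigma2
```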

The normal distribution is particularly important because the central limit theorem says that 
under very general conditions, the sum (or mean) of a set of random variables is approximately
normally distributed. This means the normal model will be appropriate for data that results
from the additive effects of a large number of factors.
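
As a rough illustration of the central limit theorem, the sketch below averages draws from a strongly skewed distribution and checks how often the sample mean falls within 1.96 standard errors of the true mean. The exponential distribution, the sample size, and the seed are arbitrary choices made for this example.

```python
import random
from statistics import fmean

rng = random.Random(42)
n, reps = 50, 20_000

# Exponential(1) has mean 1 and variance 1 and is far from normal.
means = [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

# Standard error of the mean of n draws with variance 1.
se = (1.0 / n) ** 0.5

# If the sample mean were exactly normal, this would be 0.95.
inside = sum(abs(m - 1.0) <= 1.96 * se for m in means) / reps
print(inside)
```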

# 5.2 Inference for the mean, conditional on the variance

Suppose our model is $\lb Y_1, \cdots, Y_n \given \theta, \sigma^2\rb \sim \iid \normal\lp\theta, \sigma^2\rp$.

$\lb \sum y_i^2, \sum y_i \rb$ make up a two-dimensional (vector) sufficient statistic for the 
normal model. Knowing these values
is the same as knowing the sample mean $\bar{y} = \sum y_i /n$ and the sample variance
$s^2 = \sum \lp y_i - \bar{y} \rp^2 / \lp n-1\rp$, so $\lb \bar{y}, s^2 \rb$ is also a sufficient statistic.
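
To make the equivalence concrete, here is a small sketch that recovers $\bar{y}$ and $s^2$ from $\sum y_i$ and $\sum y_i^2$ alone. The function name and the data are hypothetical.

```python
from statistics import fmean, variance

def mean_and_variance_from_sums(sum_y, sum_y2, n):
    """Recover the sample mean and sample variance from the two sums."""
    ybar = sum_y / n
    # sum (y_i - ybar)^2 = sum y_i^2 - n * ybar^2
    s2 = (sum_y2 - n * ybar ** 2) / (n - 1)
    return ybar, s2

y = [1.6, 1.7, 1.8, 1.9, 2.0]  # made-up observations
ybar, s2 = mean_and_variance_from_sums(sum(y), sum(v * v for v in y), len(y))
print(ybar, s2)
print(fmean(y), variance(y))   # direct computation for comparison
```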

A class of prior distributions is conjugate for a sampling model if the resulting posterior
distribution is in the same class. If a (conditional) prior distribution $p \lp \theta \given \sigma^2 \rp$
is normal and $y_1, \cdots, y_n \sim \iid \normal \lp\theta,\sigma^2\rp$, the posterior distribution 
$p\lp\theta\given y_1, \cdots, y_n, \sigma^2 \rp$ will also be normally distributed. If
$\theta \sim \normal\lp\mu_0,\tau^2_0\rp$, then the posterior variance is
$$\tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$
and the posterior mean is
$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}$$

$\tau_0^2$ is the prior variance and $\sigma^2$ is the actual sampling variance of the data (how close the $y_i$'s are to $\theta$). $\tau_n^2$ is the posterior variance of our estimate of $\theta$. See next section for more.
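
The two formulas translate directly into code. This is a sketch with a hypothetical function name and made-up inputs, assuming $\sigma^2$ is known.

```python
def posterior_given_variance(mu0, tau0_sq, sigma_sq, ybar, n):
    """Posterior mean and variance of theta when sigma^2 is known."""
    prior_precision = 1.0 / tau0_sq
    data_precision = n / sigma_sq
    tau_n_sq = 1.0 / (prior_precision + data_precision)
    mu_n = tau_n_sq * (prior_precision * mu0 + data_precision * ybar)
    return mu_n, tau_n_sq

# Made-up inputs: prior N(1.9, 0.04), known sigma^2 = 0.02, five observations.
mu_n, tau_n_sq = posterior_given_variance(
    mu0=1.9, tau0_sq=0.04, sigma_sq=0.02, ybar=1.8, n=5
)
print(mu_n, tau_n_sq)
```

With these inputs the data precision (250) dominates the prior precision (25), so $\mu_n$ lands much closer to $\bar{y}$ than to $\mu_0$.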

## Combining information

The (conditional) posterior parameters $\tau_n^2$ and $\mu_n$ combine the prior parameters $\tau_0^2$
and $\mu_0$ with terms from the data.

### Posterior variance

The formula for $1/\tau_n^2$ is 
$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$
so the prior inverse variance $1/\tau_0^2$ is added to the inverse variance of the sample mean, $n/\sigma^2$.
The inverse variance is often referred to as the **precision**. For the normal model, let

* $\tilde{\sigma}^2 = 1 / \sigma^2 = $ sampling precision (how close the $y_i$'s are to $\theta$)
* $\tilde{\tau}_0^2 = 1 / \tau_0^2 = $ prior precision 
* $\tilde{\tau}_n^2 = 1 / \tau_n^2 = $ posterior precision

Precision is a quantity that works on an additive scale:
$$\tilde{\tau}_n^2 = \tilde{\tau}_0^2 + n\tilde{\sigma}^2$$
In other words, posterior information = prior information + data information.
Here, our posterior precision about $\theta$ is the sum of our prior precision
and the precision contributed by the $n$ observations. We assume for now that we know the sampling variance $\sigma^2$,
but we will jointly estimate it in the next section.

If this is confusing, remember that we are only doing inference for $\theta$ here, so we will
come up with a posterior distribution for $\theta$. That posterior distribution is normal
with mean $\mu_n$ and variance $\tau_n^2$. We need these two parameters to define the posterior
normal distribution for $\theta$. In the next section, when we do joint inference 
for both $\theta$ and $\sigma^2$, we will still get a mean and variance for $\theta$. We will
also get one or more parameters that define the posterior distribution of $\sigma^2$, depending
on what the posterior distribution is.

### Posterior mean

Notice that 
$$\mu_n = \frac{\tilde{\tau}_0^2}{\tilde{\tau}_0^2 + n\tilde{\sigma}^2}\mu_0 + 
\frac{n\tilde{\sigma}^2}{\tilde{\tau}_0^2 + n\tilde{\sigma}^2}\bar{y}$$
so the posterior mean is a weighted average of the prior mean and the sample mean.
The weight on the sample mean is $n / \sigma^2$ (ignoring the common denominator), the sampling
precision of the sample mean. The weight on the prior mean is $1/\tau_0^2$, the prior
precision. If the prior mean was based on $\kappa_0$ prior observations from the same 
or a similar population as $Y_1, \cdots, Y_n$, then we can set $\tau_0^2 = \sigma^2 / \kappa_0$,
the variance of the mean of the prior observations. The formula for the posterior mean
then reduces to 
$$\mu_n = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{y}$$
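
A quick numerical check that the precision-weighted form and the $\kappa_0$ form agree once $\tau_0^2 = \sigma^2/\kappa_0$. Function names and input values are hypothetical.

```python
def posterior_mean_precision_form(mu0, tau0_sq, sigma_sq, ybar, n):
    """Weighted average with weights 1/tau0^2 and n/sigma^2."""
    prior_precision = 1.0 / tau0_sq
    data_precision = n / sigma_sq
    return (prior_precision * mu0 + data_precision * ybar) / (
        prior_precision + data_precision
    )

def posterior_mean_kappa_form(mu0, kappa0, ybar, n):
    """Weighted average with weights kappa0 and n (prior and data sample sizes)."""
    return kappa0 / (kappa0 + n) * mu0 + n / (kappa0 + n) * ybar

sigma_sq, kappa0 = 0.02, 2.0  # made-up values
general = posterior_mean_precision_form(1.9, sigma_sq / kappa0, sigma_sq, 1.8, 5)
reduced = posterior_mean_kappa_form(1.9, kappa0, 1.8, 5)
print(general, reduced)
```

Note that $\sigma^2$ cancels in the reduced form, so the posterior mean depends on the prior only through $\mu_0$ and the prior sample size $\kappa_0$.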

### Prediction

If we want to predict a new sample $\tilde{Y}$ from the population after having observed
$Y_1=y_1, \cdots, Y_n=y_n$, we can use the fact that 
$$\lb \tilde{Y} \given \theta, \sigma^2 \rb \sim \normal\lp\theta, \sigma^2\rp \iff
\tilde{Y} = \theta + \tilde{\epsilon}, \lb\tilde{\epsilon} \given \theta, \sigma^2\rb \sim 
\normal\lp 0, \sigma^2\rp$$
In other words, since $\tilde{Y}$ is normally distributed, it can be represented as its 
mean plus some normally distributed noise $\tilde{\epsilon}$. We can easily compute the 
predictive mean and variance and find that
$$\tilde{Y} \given \sigma^2, y_1, \cdots, y_n \sim \normal\lp\mu_n,\tau_n^2 + \sigma^2\rp$$
$\tau_n^2$ is our uncertainty in our estimate for the mean and $\sigma^2$ is the sampling 
variability. As our data increases, $\tau_n^2$ will decrease but the sampling variability
will always remain.
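
The predictive variance can be checked by simulation: draw $\theta$ from its posterior, then add sampling noise. The posterior parameters below are made-up numbers, and the seed and number of draws are arbitrary.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(0)

# Hypothetical posterior parameters and known sampling variance.
mu_n, tau_n_sq, sigma_sq = 1.81, 0.004, 0.02

y_tilde = []
for _ in range(50_000):
    theta = rng.gauss(mu_n, tau_n_sq ** 0.5)           # uncertainty about the mean
    y_tilde.append(rng.gauss(theta, sigma_sq ** 0.5))  # plus sampling variability

pred_mean = fmean(y_tilde)
pred_var = pvariance(y_tilde)
print(pred_mean, pred_var)  # compare with mu_n and tau_n_sq + sigma_sq
```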

# 5.3 Joint inference for the mean and variance

We perform joint inference in a similar way as for one parameter. Starting with Bayes' rule,
\begin{align}
p\lp\theta, \sigma^2\given y_1,\cdots,y_n\rp &= \frac{p\lp y_1,\cdots,y_n \given \theta,\sigma^2\rp
p\lp\theta,\sigma^2\rp}{p\lp y_1,\cdots,y_n \rp} \\
&= \frac{p\lp y_1,\cdots,y_n \given \theta,\sigma^2\rp p\lp\theta \given \sigma^2\rp p\lp \sigma^2\rp}
{p\lp y_1,\cdots,y_n \rp}
\end{align}
We already have a conjugate prior for the mean given the variance so we just need one for 
the variance $p\lp \sigma^2 \rp$. Let's consider the 
particular case where $\tau_0^2 = \sigma^2 / \kappa_0$. In other words, the prior variance is 
equal to the actual sampling variance scaled by the inverse of $\kappa_0$, our prior sample size. 
So far we have
$$ p\lp\theta \given \sigma^2\rp p\lp \sigma^2\rp = \textrm{dnorm}\lp\theta,\mu_0,\tau_0 = \sigma / \sqrt{\kappa_0}\rp
\times p\lp \sigma^2\rp$$
We need a prior that has support on $\lp0, \infty\rp$. One such family is the gamma family, but
it turns out that the gamma family is conjugate for the precision $1/\sigma^2$ rather than for $\sigma^2$ itself. When using such a prior 
distribution, we say that $\sigma^2$ has an **inverse-gamma** distribution:

* precision = $1/\sigma^2 \sim \textrm{gamma}\lp a,b\rp$
* variance = $\sigma^2 \sim \textrm{inverse-gamma}\lp a,b\rp$

We will reparameterize this as 
$$1/\sigma^2 \sim \gamma\lp \frac{\nu_0}{2}, \frac{\nu_0}{2}\sigma_0^2\rp$$
Under this parameterization,
* $\ev\l[\sigma^2\r] = \sigma_0^2\frac{\nu_0 / 2}{\nu_0 / 2 - 1}$
* $\mode\l[\sigma^2\r] = \sigma_0^2\frac{\nu_0 / 2}{\nu_0 / 2 + 1}$, so 
$\mode\l[\sigma^2\r] < \sigma_0^2 < \ev\l[\sigma^2\r]$
* $\var\l[\sigma^2\r]$ is decreasing in $\nu_0$
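
A sketch of sampling from this prior and checking the stated expectation by Monte Carlo. It assumes the second gamma parameter is a rate; Python's `random.gammavariate` takes a scale, so the rate is inverted. The values of $\nu_0$ and $\sigma_0^2$ are made up.

```python
import random
from statistics import fmean

rng = random.Random(1)

nu0, sigma0_sq = 10.0, 2.0  # hypothetical prior parameters
shape = nu0 / 2
rate = nu0 * sigma0_sq / 2

# gammavariate(alpha, beta) uses shape alpha and SCALE beta = 1 / rate.
precisions = [rng.gammavariate(shape, 1.0 / rate) for _ in range(100_000)]
variances = [1.0 / p for p in precisions]  # inverse-gamma draws of sigma^2

mc_mean = fmean(variances)
exact_mean = sigma0_sq * (nu0 / 2) / (nu0 / 2 - 1)
exact_mode = sigma0_sq * (nu0 / 2) / (nu0 / 2 + 1)
print(mc_mean, exact_mean, exact_mode)
```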

## Posterior inference

In the above section, we decomposed the prior distribution
$$p\lp\theta,\sigma^2\rp = p\lp\theta\given\sigma^2\rp p\lp\sigma^2\rp$$
Now we will decompose the posterior in the same way
$$p\lp\theta,\sigma^2 \given y_1,\cdots,y_n\rp = p\lp \theta \given \sigma^2, y_1,\cdots, y_n\rp
p\lp \sigma^2 \given y_1,\cdots,y_n \rp$$
We've already calculated the conditional distribution of $\theta$ given $\sigma^2$ and the data
above:
$$\lb \theta \given y_1,\cdots,y_n,\sigma^2\rb \sim \normal\lp\mu_n,\sigma^2/\kappa_n\rp$$
where
$$\kappa_n = \kappa_0 + n$$
and 
$$\mu_n = \frac{\lp\kappa_0 / \sigma^2\rp\mu_0 + \lp n/\sigma^2\rp\bar{y}}{\kappa_0/\sigma^2 + n/\sigma^2} = 
\frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_n}$$

The posterior distribution of $\sigma^2$ can be obtained by integrating over the unknown value of
$\theta$ which gives
$$\lb 1/\sigma^2\given y_1,\cdots, y_n\rb\sim\gamma\lp\nu_n/2,\nu_n\sigma^2_n/2\rp$$
where
$$\nu_n = \nu_0 + n$$
and
$$\sigma^2_n = \frac{1}{\nu_n}\l[\nu_0 \sigma^2_0 + \lp n-1 \rp s^2 + \frac{\kappa_0 n}{\kappa_n}
\lp \bar{y} - \mu_0\rp^2\r]$$
These formulae suggest an interpretation of $\nu_0$ as a prior sample size, from which a prior
sample variance of $\sigma^2_0$ has been obtained. $s^2$ is the sample variance and $\lp n-1\rp s^2$
is the sum of squared deviations of the observations from the sample mean (the sum of squares). We can think of
$\nu_0\sigma^2_0$ and $\nu_n\sigma^2_n$ as the prior and posterior sum of squares, respectively
(I suppose because they have the same "(sample size) $\times$ (variance)" form as $\lp n-1 \rp s^2$). Multiplying the
expression for $\sigma_n^2$ by $\nu_n$ almost gives us "posterior sum of squares equals
prior sum of squares plus data sum of squares." However, there is a third term 
$\frac{\kappa_0 n}{\kappa_n} \lp \bar{y} - \mu_0\rp^2$. This term says that a large value of $\lp \bar{y} - \mu_0\rp^2$
increases the posterior probability of a large $\sigma^2$. This makes sense for our particular
joint prior distribution for $\theta$ and $\sigma^2$: if we want to think of $\mu_0$ as the sample mean
of $\kappa_0$ prior observations with variance $\sigma^2$, then $\frac{\kappa_0 n}{\kappa_n} \lp \bar{y} - \mu_0\rp^2$
is an estimate of $\sigma^2$ and so we want to use the information this term provides. We will develop
an alternative prior distribution in the following section for situations where $\mu_0$ is not the
mean of prior observations.
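
The updating formulas for $\kappa_n$, $\mu_n$, $\nu_n$, and $\sigma^2_n$ can be collected into one function. This is a minimal sketch; the function name, the data, and the prior settings are hypothetical.

```python
from statistics import fmean, variance

def update_normal_inverse_gamma(y, mu0, kappa0, nu0, sigma0_sq):
    """Return (mu_n, kappa_n, nu_n, sigma_n_sq) for the conjugate prior."""
    n = len(y)
    ybar = fmean(y)
    s2 = variance(y)  # sample variance with the n - 1 denominator
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    sigma_n_sq = (
        nu0 * sigma0_sq                               # prior sum of squares
        + (n - 1) * s2                                # data sum of squares
        + kappa0 * n / kappa_n * (ybar - mu0) ** 2    # prior/data mean disagreement
    ) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq

y = [1.6, 1.7, 1.8, 1.9, 2.0]  # made-up observations
mu_n, kappa_n, nu_n, sigma_n_sq = update_normal_inverse_gamma(
    y, mu0=1.9, kappa0=1, nu0=1, sigma0_sq=0.01
)
print(mu_n, kappa_n, nu_n, sigma_n_sq)
```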

## Monte Carlo sampling

Often we are interested in the population mean $\theta$ and we 
just want to calculate quantities like $\ev\l[\theta\given y_1,\cdots,y_n\r]$, $\textrm{sd}\l[\theta\given y_1,\cdots,y_n\r]$,
$\pr\l[\theta_1 < \theta_2 \given y_{1,1},\cdots,y_{n_2,2}\r]$, etc. As discussed in the last chapter,
we can obtain these quantities by sampling $\theta$ from the **marginal posterior distribution** of $\theta$
given the data, $p\lp\theta\given y_1,\cdots, y_n\rp$.

Why is this a marginal distribution? It is marginal because $\sigma^2$ has been integrated out of the joint posterior:
$p\lp\theta\given y_1,\cdots,y_n\rp = \int p\lp\theta,\sigma^2\given y_1,\cdots,y_n\rp d\sigma^2$.
This interpretation agrees with section 4.4. Note that this is not the posterior
predictive distribution, which is a distribution over a new observation $\tilde{Y}$ rather than over $\theta$.

So far, we have the conditional distribution of $\theta$ given $\sigma^2$ and the data. We can generate samples of
$\theta$ from the joint distribution of $\theta$ and $\sigma^2$ by first sampling $\sigma^2$ from 
its inverse gamma and then using the sampled $\sigma^2$ to sample $\theta$:
\begin{align}
\sigma^{2\lp 1\rp} \sim \textrm{inverse-gamma}\lp\nu_n/2, \sigma^2_n\nu_n/2\rp,&\quad
\theta^{\lp 1 \rp} \sim \normal\lp\mu_n, \sigma^{2\lp 1\rp} / \kappa_n\rp, \\
&\;\;\vdots \\
\sigma^{2\lp S\rp} \sim \textrm{inverse-gamma}\lp\nu_n/2, \sigma^2_n\nu_n/2\rp,&\quad
\theta^{\lp S \rp} \sim \normal\lp\mu_n, \sigma^{2\lp S\rp} / \kappa_n\rp
\end{align}

A sequence of pairs $\lb\lp\sigma^{2\lp 1\rp}, \theta^{\lp 1\rp}\rp, \cdots, \lp\sigma^{2\lp S\rp}, \theta^{\lp S\rp}\rp\rb$ simulated in this way are independent samples from the joint posterior distribution 
$p\lp \theta, \sigma^2 \given y_1,\cdots,y_n\rp$. Additionally, the sequence $\lb \theta^{\lp 1\rp}, \cdots, \theta^{\lp S\rp}\rb$ can be seen as independent samples from the marginal posterior distribution 
$p\lp \theta \given y_1,\cdots,y_n\rp$, since simply ignoring the sampled $\sigma^2$ values marginalizes $\sigma^2$ out. 
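
A sketch of this two-step sampler using only Python's standard library. It draws the precision from the posterior $\gamma\lp\nu_n/2, \nu_n\sigma^2_n/2\rp$ (treating the second parameter as a rate), inverts it to get $\sigma^2$, and then draws $\theta$. The function name and the posterior parameter values are hypothetical.

```python
import random
from statistics import fmean

def sample_joint_posterior(mu_n, kappa_n, nu_n, sigma_n_sq, num_draws, seed=0):
    """Draw (sigma^2, theta) pairs from the joint posterior by composition."""
    rng = random.Random(seed)
    draws = []
    for _ in range(num_draws):
        # 1/sigma^2 ~ gamma(shape nu_n/2, rate nu_n*sigma_n^2/2);
        # gammavariate expects a scale, which is 1/rate.
        precision = rng.gammavariate(nu_n / 2, 2.0 / (nu_n * sigma_n_sq))
        sigma_sq = 1.0 / precision
        theta = rng.gauss(mu_n, (sigma_sq / kappa_n) ** 0.5)
        draws.append((sigma_sq, theta))
    return draws

draws = sample_joint_posterior(
    mu_n=1.8, kappa_n=6, nu_n=6, sigma_n_sq=0.02, num_draws=20_000
)

# Keeping only the theta component gives draws from the marginal posterior.
thetas = sorted(theta for _, theta in draws)
posterior_mean = fmean(thetas)
interval_95 = (thetas[int(0.025 * len(thetas))], thetas[int(0.975 * len(thetas))])
print(posterior_mean, interval_95)
```

The sorted $\theta$ draws give Monte Carlo estimates of the posterior mean and of a 95% quantile-based interval without ever writing down the marginal density.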

It turns out that the marginal posterior distribution of 
$$t\lp\theta\rp = \frac{\lp\theta - \mu_n\rp}{\sigma_n / \sqrt{\kappa_n}}$$
is $t$-distributed with $\nu_0 + n$
degrees of freedom. If $\kappa_0$ and $\nu_0$ are small, the posterior distribution
of $t\lp\theta\rp$ will be very close to the $t_{n-1}$ distribution.

### Improper priors

You may want to use a Bayesian approach while avoiding prior information, so as
not to appear biased. We can let $\kappa_0$ and $\nu_0$ go to zero to understand what would
happen with no prior information. As $\kappa_0, \nu_0 \rightarrow 0$, 
\begin{align}
\mu_n &\rightarrow \bar{y} \\
\sigma^2_n &\rightarrow \frac{1}{n}\sum \lp y_i-\bar{y}\rp^2
\end{align}
This leads to the following posterior distributions:
\begin{align}
\lb 1/\sigma^2 \given y_1, \cdots, y_n \rb &\sim \gamma\lp \frac{n}{2}, \frac{n}{2}\frac{1}{n}\sum \lp y_i-\bar{y}\rp^2\rp \\
\lb \theta \given \sigma^2, y_1, \cdots, y_n \rb &\sim \normal\lp\bar{y},\frac{\sigma^2}{n}\rp
\end{align}
You can show that 
$$\frac{\theta - \bar{y}}{s/\sqrt{n}}\given y_1, \cdots, y_n \sim t_{n-1}$$

This can be compared to the sampling distribution of the $t$ statistic, conditional on $\theta$
but not on the data
$$\frac{\bar{Y} - \theta}{s/\sqrt{n}}\given\theta\sim t_{n-1}$$

The second statement says that the deviation of the estimate $\bar{Y}$ from the true population 
mean $\theta$ (scaled by the denominator) follows a $t_{n-1}$ distribution before the data are sampled. The first
statement says that after you sample your data, your uncertainty about $\theta$ is still represented by a $t_{n-1}$
distribution. 

Since there is no proper prior distribution that leads to the $t_{n-1}$ posterior for $\theta$,
inference based on this posterior is not formally Bayesian. Sometimes taking limits like this
can lead to reasonable answers, however.

# 5.4 Bias, variance, and mean squared error

A **point estimator** of an unknown parameter $\theta$ is a function that converts data into a single
element of the parameter space $\Theta$. In the case of a normal sampling model and conjugate prior
distribution of the last section, the posterior mean estimator of $\theta$ is
$$\hat{\theta}_b \lp y_1,\cdots,y_n\rp = \ev\l[\theta\given y_1,\cdots,y_n\r] = 
\frac{n}{\kappa_0 + n}\bar{y} + \frac{\kappa_0}{\kappa_0 + n}\mu_0 = w\bar{y} + \lp 1-w\rp\mu_0$$

The sampling properties of an estimator such as $\hat{\theta}_b$ refer to its behavior under hypothetically
repeatable surveys or experiments. Let's compare the sampling properties of $\hat{\theta}_b$ to 
$\hat{\theta}_e\lp y_1,\cdots,y_n\rp = \bar{y}$, the sample mean, when the true value of the population 
mean is $\theta_0$:
\begin{align}
\ev\l[\hat{\theta}_e\given\theta=\theta_0\r] &= \theta_0, \\
\ev\l[\hat{\theta}_b\given\theta=\theta_0\r] &= w\theta_0 + \lp 1-w\rp\mu_0
\end{align}
We say that $\hat{\theta}_e$ is unbiased because its expected value equals the true population mean.
We say that $\hat{\theta}_b$ is biased because its expected value differs from $\theta_0$ unless $\mu_0 = \theta_0$.

Bias refers to how close the center of mass of the sampling distribution is to the true value. Bias doesn't
tell us how far away an estimate from the sampling distribution might be from the true value, however. We
can look at the mean squared error (MSE) to evaluate how close an estimator $\hat{\theta}$ is likely to be 
to the true value $\theta_0$. Letting $m=\ev\l[\hat{\theta}\given\theta_0\r]$, the MSE is
$$\MSE\l[\hat{\theta}\given\theta_0\r] = \ev\l[\lp\hat{\theta} - \theta_0\rp^2\given\theta_0\r] = \var\l[\hat{\theta}\given\theta_0\r] + 
\textrm{Bias}^2\l[\hat{\theta}\given\theta_0\r]$$
where $\textrm{Bias}\l[\hat{\theta}\given\theta_0\r] = m - \theta_0$.
This means that before the data are gathered, the expected squared distance from the estimator to the true value
depends on how close $\theta_0$ is to the center of the distribution of $\hat{\theta}$ (the bias) and how
spread out the distribution is (the variance). While the bias of $\hat{\theta}_e$ is zero, it turns out
that $\var\l[\hat{\theta}_b\r] < \var\l[\hat{\theta}_e\r]$ and that the MSE of $\hat{\theta}_b$ 
is less than the MSE of $\hat{\theta}_e$ if 
\begin{align}
\lp \mu_0 - \theta_0\rp^2 &< \frac{\sigma^2}{n}\frac{1+w}{1-w} \\
&=\sigma^2\lp\frac{1}{n}+\frac{2}{\kappa_0}\rp
\end{align}
If you know even a little bit about the population you are about to sample from, you should be able to find
values of $\mu_0$ and $\kappa_0$ such that this inequality holds. In this case, you can construct
a Bayesian estimator that will have lower average squared distance to the truth than does the sample mean.
For example, if you are pretty sure that your best prior guess $\mu_0$ is within $\sqrt{2} \approx 1.4$ standard deviations
of the true population mean, then picking $\kappa_0 = 1$ gives $\lp\mu_0 - \theta_0\rp^2 < 2\sigma^2 < \sigma^2\lp 1/n + 2\rp$,
so you can be pretty sure the Bayesian estimator has a lower MSE.
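
The comparison can be worked through numerically. The sketch below uses $\var\l[\hat{\theta}_e\r] = \sigma^2/n$, $\var\l[\hat{\theta}_b\r] = w^2\sigma^2/n$, and bias $\lp 1-w\rp\lp\mu_0 - \theta_0\rp$ for $\hat{\theta}_b$, which follow from $\hat{\theta}_b = w\bar{y} + \lp 1-w\rp\mu_0$. Function names and input values are hypothetical.

```python
def mse_sample_mean(sigma_sq, n):
    """MSE of the sample mean: unbiased, so MSE equals its variance."""
    return sigma_sq / n

def mse_posterior_mean(sigma_sq, n, kappa0, mu0, theta0):
    """MSE of w*ybar + (1 - w)*mu0 with w = n / (kappa0 + n)."""
    w = n / (kappa0 + n)
    var = w ** 2 * sigma_sq / n
    bias = (1 - w) * (mu0 - theta0)
    return var + bias ** 2

def mse_threshold(sigma_sq, n, kappa0):
    """Upper bound on (mu0 - theta0)^2 for the posterior mean to win."""
    return sigma_sq * (1 / n + 2 / kappa0)

sigma_sq, n, kappa0, theta0 = 1.0, 10, 1, 0.0  # made-up values

mse_e = mse_sample_mean(sigma_sq, n)
mse_b_near = mse_posterior_mean(sigma_sq, n, kappa0, mu0=1.0, theta0=theta0)
mse_b_far = mse_posterior_mean(sigma_sq, n, kappa0, mu0=3.0, theta0=theta0)
print(mse_e, mse_b_near, mse_b_far, mse_threshold(sigma_sq, n, kappa0))
```

With a prior guess one standard deviation from the truth the Bayesian estimator has the lower MSE; with a guess three standard deviations away the squared bias dominates and the sample mean does better.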

# 5.5 Prior specification based on expectations

A $p$ dimensional exponential family model is a model whose densities can be written as
$p\lp y\given\phi\rp = h\lp y\rp c\lp\phi\rp\exp\lb\phi^T \textbf{t}\lp y\rp\rb$, where
$\phi$ is the parameter to be estimated and $\textbf{t}\lp y\rp = \lb t_1\lp y\rp, \cdots, t_p\lp y\rp\rb$
is the sufficient statistic. The normal model is a two-dimensional exponential family
model with
* $\textbf{t}\lp y\rp = \lp y,y^2\rp$
* $\phi = \lp \theta/\sigma^2, -\lp 2\sigma^2\rp^{-1}\rp$
* $c\lp\phi\rp = \left|\phi_2\right|^{1/2}\exp\lb\phi_1^2/\lp 4\phi_2\rp\rb$, since $\phi_1^2/\lp 4\phi_2\rp = -\theta^2/\lp 2\sigma^2\rp$

A conjugate prior can be written in terms of $\phi$, giving
$p\lp\phi\given n_0, \textbf{t}_0\rp \propto c\lp\phi\rp^{n_0}\exp\lp n_0 \textbf{t}^T_0\phi\rp$, where
$\textbf{t}_0 = \lp t_{01},t_{02}\rp = \lp \ev\l[Y\r], \ev\l[Y^2\r]\rp$, the prior expectations
of $Y$ and $Y^2$. We can reparameterize in terms of $\theta, \sigma^2$ and obtain an expression for
$p\lp\theta,\sigma^2\given n_0, \textbf{t}_0\rp$ that is proportional to a $\normal\lp t_{01}, \sigma^2/n_0\rp$
density times an $\textrm{inverse-gamma}\lp\lp n_0 + 3\rp/2, n_0\lp t_{02} - t_{01}^2\rp / 2\rp$ density.

How do we interpret the prior parameters $t_{01}$ and $t_{02}$? Consider the case where we have a prior
expectation $\mu_0$ for the population mean and a prior expectation $\sigma_0^2$ for the population variance.
Our joint distribution for $\lp\theta,\sigma^2\rp$ is then
\begin{align}
\theta \given \sigma^2 &\sim \normal\lp\mu_0, \sigma^2 / n_0\rp \\
\sigma^2 &\sim \textrm{inverse-gamma}\lp\lp n_0 + 3\rp/2, \lp n_0+ 1\rp\sigma^2_0/2\rp
\end{align}

If our prior information is weak, we might set $n_0=1$.
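
As a sanity check on this parameterization, recall that an $\textrm{inverse-gamma}\lp a, b\rp$ distribution has mean $b/\lp a-1\rp$ for $a > 1$. The sketch below confirms that the prior above has $\ev\l[\sigma^2\r] = \sigma_0^2$ for any $n_0$. The function names and the numeric values are hypothetical.

```python
def prior_from_expectations(mu0, sigma0_sq, n0):
    """Shape and rate of the inverse-gamma prior on sigma^2, plus the
    mean and variance scale of the normal prior on theta given sigma^2."""
    return {
        "theta_mean": mu0,
        "theta_var_over_sigma_sq": 1.0 / n0,  # Var[theta | sigma^2] = sigma^2 / n0
        "shape": (n0 + 3) / 2,
        "rate": (n0 + 1) * sigma0_sq / 2,
    }

def inverse_gamma_mean(shape, rate):
    """Mean of an inverse-gamma(shape, rate) distribution; needs shape > 1."""
    return rate / (shape - 1)

sigma0_sq = 0.25  # made-up prior expectation of the variance
prior_means = {}
for n0 in (1, 2, 5, 20):
    prior = prior_from_expectations(mu0=1.9, sigma0_sq=sigma0_sq, n0=n0)
    prior_means[n0] = inverse_gamma_mean(prior["shape"], prior["rate"])
print(prior_means)  # every entry should equal sigma0_sq
```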

# 5.6 The normal model for non-normal data

People use the normal model even for non-normal data because the sampling distribution
of the sample mean is generally close to normal. In general, using the normal model
for non-normal data is reasonable if we are only interested in obtaining a posterior
distribution for the population mean. For other population quantities the normal model
can provide misleading results.

# 5.7 Discussion and further references

A characterizing feature of the normal distribution is that the sample mean and the sample variance
are independent. From a subjective probability standpoint, this suggests that if your
beliefs about the sample mean are independent of those about the sample variance,
then a normal model is appropriate. 

Among all distributions with a given mean $\theta$ and variance $\sigma^2$, the normal
distribution is the most diffuse in terms of entropy (it has the maximum entropy).





