## 13. Statistical Decision Theory

### 13.1 Preliminaries

**Decision theory** is a formal theory for comparing statistical procedures.

In the language of decision theory, an estimator is sometimes called a **decision rule** and the possible values of the decision rule are called **actions**.

We shall measure the discrepancy between $\theta$ and $\hat{\theta}$ using a **loss function** $L(\theta, \hat{\theta})$.  Formally, $L$ maps $\Theta \times \Theta$ into $\mathbb{R}$.

The **risk** of an estimator $\hat{\theta}$ is

$$ R(\theta, \hat{\theta}) = \mathbb{E}_\theta \left( L(\theta, \hat{\theta}) \right)
= \int L(\theta, \hat{\theta}(x)) f(x; \theta) dx$$

When the loss function is squared error, then the risk is just the mean squared error:

$$R(\theta, \hat{\theta}) = \mathbb{E}_\theta(\hat{\theta} - \theta)^2 = \text{MSE} = \mathbb{V}_\theta(\hat{\theta}) + \text{bias}_\theta^2(\hat{\theta})$$
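The decomposition can be checked exactly in a small discrete model. The sketch below assumes $X \sim \text{Binomial}(n, p)$, a model chosen only for illustration, and computes the risk, variance and bias of an estimator by summing over the binomial probabilities. The names `exact_moments` and `shrink` are hypothetical helpers, not notation from this chapter.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def exact_moments(estimator, n, p):
    """Return (risk, variance, bias) of estimator(x, n) under squared error loss."""
    xs = range(n + 1)
    mean = sum(estimator(x, n) * binom_pmf(x, n, p) for x in xs)
    risk = sum((estimator(x, n) - p) ** 2 * binom_pmf(x, n, p) for x in xs)
    var = sum((estimator(x, n) - mean) ** 2 * binom_pmf(x, n, p) for x in xs)
    return risk, var, mean - p

def shrink(x, n):
    # Illustrative shrinkage estimator (an assumption, not from the text).
    return (x + 1) / (n + 2)

risk, var, bias = exact_moments(shrink, n=20, p=0.3)
print(risk, var + bias**2)  # the two numbers should agree up to rounding
```

For the unbiased estimator $x/n$ the same function should return a bias of zero and a risk of $p(1-p)/n$.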

In the rest of the chapter, if the loss function is not specified, assume it is squared error loss.

### 13.2 Comparing Risk Functions

The **maximum risk** is

$$ \overline{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta})$$

and the **Bayes risk** is

$$ r(\pi, \hat{\theta}) = \int R(\theta, \hat{\theta}) \pi(\theta) d\theta$$

where $\pi(\theta)$ is a prior for $\theta$.

An estimator that minimizes the maximum risk is called a **minimax rule**. Formally, $\hat{\theta}$ is minimax if

$$\sup_\theta R(\theta, \hat{\theta}) = \inf_{\overline{\theta}} \sup_\theta R(\theta, \overline{\theta})$$

where the infimum is over all estimators $\overline{\theta}$.

A decision rule that minimizes the Bayes risk is called a **Bayes rule**. Formally, $\hat{\theta}$ is a Bayes rule for prior $\pi$ if

$$r(\pi, \hat{\theta}) = \inf_{\overline{\theta}} r(\pi, \overline{\theta})$$

where the infimum is over all estimators $\overline{\theta}$.
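To make the two summaries concrete, here is a minimal sketch that computes the maximum risk and the Bayes risk of two estimators. It assumes $X \sim \text{Binomial}(n, p)$ with a uniform prior on $p$, neither of which is fixed by the text, and it approximates the supremum and the prior integral on a grid of midpoints. The function names are hypothetical.

```python
from math import comb

def risk(estimator, n, p):
    """Exact squared-error risk of estimator(x, n) when X ~ Binomial(n, p)."""
    return sum(
        (estimator(x, n) - p) ** 2 * comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )

def max_risk(estimator, n, grid):
    """Grid approximation to the supremum of the risk over p."""
    return max(risk(estimator, n, p) for p in grid)

def bayes_risk_uniform(estimator, n, grid):
    """Midpoint-rule approximation to the Bayes risk under a uniform prior."""
    return sum(risk(estimator, n, p) for p in grid) / len(grid)

def mle(x, n):
    return x / n

def uniform_posterior_mean(x, n):
    # Posterior mean of p under a uniform prior.
    return (x + 1) / (n + 2)

n = 10
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 cells in (0, 1)

for name, est in [("mle", mle), ("posterior mean", uniform_posterior_mean)]:
    print(name, max_risk(est, n, grid), bayes_risk_uniform(est, n, grid))
```

In this setup the MLE has risk $p(1-p)/n$, so its maximum risk is $1/(4n)$ and its Bayes risk under the uniform prior is $1/(6n)$. The posterior mean should come out lower on both summaries here, although its risk is higher than the MLE's for $p$ near 0 or 1, so neither risk function lies below the other everywhere.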

### 13.3 Bayes Estimators

Let $\pi$ be a prior.  From Bayes' theorem, the posterior density is

$$f(\theta | x) = \frac{f(x | \theta) \pi(\theta)}{m(x)} = \frac{f(x | \theta) \pi(\theta)}{\int f(x | \theta) \pi(\theta) d\theta} $$

where $m(x) = \int f(x, \theta) d\theta = \int f(x | \theta) \pi(\theta) d\theta$ is the **marginal distribution** of $X$.  Define the **posterior risk** of an estimator $\hat{\theta}(x)$ by

$$r(\hat{\theta} | x) = \int L(\theta, \hat{\theta}(x)) f(\theta | x) d\theta$$

**Theorem 13.8**.  The Bayes risk $r(\pi, \hat{\theta})$ satisfies

$$r(\pi, \hat{\theta}) = \int r(\hat{\theta} | x) m(x) dx$$

Let $\hat{\theta}(x)$ be the value that minimizes the posterior risk $r(\hat{\theta} | x)$ for each $x$.  Then $\hat{\theta}$ is the Bayes estimator.

**Proof**.  We can rewrite the Bayes risk as:

$$
\begin{align}
r(\pi, \hat{\theta}) &= \int R(\theta, \hat{\theta}) \pi(\theta) d\theta = \int \left( \int L(\theta, \hat{\theta}(x)) f(x | \theta) dx \right) \pi(\theta) d\theta \\
&= \int \int L(\theta, \hat{\theta}(x)) f(x, \theta) dx d\theta = \int \int L(\theta, \hat{\theta}(x)) f(\theta | x) m(x) dx d\theta \\
&= \int \left(\int L(\theta, \hat{\theta}(x)) f(\theta | x) d\theta \right) m(x) dx = \int r(\hat{\theta} | x) m(x) dx
\end{align}
$$

If $\hat{\theta}(x)$ is chosen to minimize $r(\hat{\theta} | x)$ for each $x$, then, since $m(x) \geq 0$, the integrand is minimized at every $x$, and so the integral $\int r(\hat{\theta} | x)m(x) dx$ is minimized as well.
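The identity in Theorem 13.8 can be verified exactly in a conjugate example. The sketch below assumes $X \sim \text{Binomial}(n, p)$ with a uniform prior on $p$ (an illustrative choice, not from the text). Under that assumption the marginal is $m(x) = 1/(n+1)$ and the posterior is $\text{Beta}(x+1, n-x+1)$, so the posterior risk under squared error loss is the posterior variance plus the squared distance from the posterior mean to the estimate. Exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction

def posterior_risk(x, n, estimate):
    """Posterior expected squared error of `estimate` under a Beta(x+1, n-x+1) posterior."""
    a, b = x + 1, n - x + 1
    mean = Fraction(a, a + b)
    var = Fraction(a * b, (a + b) ** 2 * (a + b + 1))
    return var + (mean - estimate) ** 2

def bayes_risk_via_posterior(n, estimator):
    """Right-hand side of Theorem 13.8: posterior risk averaged over m(x) = 1/(n+1)."""
    m = Fraction(1, n + 1)
    return sum(posterior_risk(x, n, estimator(x, n)) * m for x in range(n + 1))

def mle(x, n):
    return Fraction(x, n)

n = 12
lhs = bayes_risk_via_posterior(n, mle)
rhs = Fraction(1, 6 * n)  # integral of the risk p(1-p)/n against the uniform prior
print(lhs, rhs)
```

Replacing `mle` with the posterior mean `(x + 1) / (n + 2)` removes the squared-distance term at every $x$, which is the sense in which minimizing the posterior risk pointwise minimizes the Bayes risk.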

**Theorem 13.9**.  If $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$ then the Bayes estimator is

$$\hat{\theta}(x) = \int \theta f(\theta | x) d\theta = \mathbb{E}(\theta | X = x)$$

If $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$ then the Bayes estimator is the median of the posterior $f(\theta | x)$.  If $L(\theta, \hat{\theta})$ is zero-one loss, then the Bayes estimator is the mode of the posterior $f(\theta | x)$.

**Proof**.  We will prove the theorem for squared error loss.  The Bayes rule $\hat{\theta}(x)$ minimizes $r(\hat{\theta} | x) = \int (\theta - \hat{\theta}(x))^2 f(\theta | x) d\theta$. Taking the derivative of $r(\hat{\theta} | x)$ with respect to $\hat{\theta}(x)$ and setting it to 0 yields the equation $2 \int (\theta - \hat{\theta}(x)) f(\theta | x) d\theta = 0$.  Solving for $\hat{\theta}(x)$, and using $\int f(\theta | x) d\theta = 1$, we get $\hat{\theta}(x) = \int \theta f(\theta | x) d\theta$.
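The three cases of Theorem 13.9 can also be checked numerically. The sketch below discretizes a posterior on a grid, with weights proportional to a $\text{Beta}(3, 9)$ density (an arbitrary skewed shape chosen so that the mean, median and mode differ), and finds the action minimizing the posterior risk by brute-force search. On a grid, zero-one loss is taken to be 0 when the action equals $\theta$ and 1 otherwise. All names are hypothetical.

```python
from itertools import accumulate

# Discretized posterior on [0, 1] with weights proportional to t^2 (1 - t)^8.
grid = [i / 400 for i in range(401)]
raw = [t**2 * (1 - t) ** 8 for t in grid]
total = sum(raw)
post = [w / total for w in raw]

def posterior_risk(action, loss):
    """Posterior expected loss of a single action."""
    return sum(loss(t, action) * w for t, w in zip(grid, post))

def argmin_action(loss):
    """Grid point with the smallest posterior risk."""
    return min(grid, key=lambda a: posterior_risk(a, loss))

def squared(t, a):
    return (t - a) ** 2

def absolute(t, a):
    return abs(t - a)

def zero_one(t, a):
    return 0.0 if t == a else 1.0

post_mean = sum(t * w for t, w in zip(grid, post))
cdf = list(accumulate(post))
post_median = next(t for t, c in zip(grid, cdf) if c >= 0.5)
post_mode = max(zip(post, grid))[1]

print(argmin_action(squared), post_mean)
print(argmin_action(absolute), post_median)
print(argmin_action(zero_one), post_mode)
```

Each minimizer should match the corresponding posterior summary up to the grid spacing.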

### 13.4 Minimax Rules

The theory of minimax rules is complicated, and we will not attempt complete coverage here; we state only a few key results.  The main takeaway from this section:  Bayes estimators with a constant risk function are minimax.

**Theorem 13.11**.  Let $\hat{\theta}^\pi$ be the Bayes rule for some prior $\pi$:

$$r(\pi, \hat{\theta}^\pi) = \inf_{\hat{\theta}} r(\pi, \hat{\theta})$$

Suppose that

$$R(\theta, \hat{\theta}^\pi) \leq r(\pi, \hat{\theta}^\pi) \;\text{for all } \theta$$

Then $\hat{\theta}^\pi$ is minimax and $\pi$ is called a **least favorable prior**.

**Proof**.  Suppose that $\hat{\theta}^\pi$ is not minimax. Then there is another rule $\hat{\theta}_0$ such that $\sup_\theta R(\theta, \hat{\theta}_0) < \sup_\theta R(\theta, \hat{\theta}^\pi)$.  Since the average of a function is always less than or equal to its supremum, we have that $r(\pi, \hat{\theta}_0) \leq \sup_\theta R(\theta, \hat{\theta}_0)$.  Hence,

$$r(\pi, \hat{\theta}_0) \leq \sup_\theta R(\theta, \hat{\theta}_0) < \sup_\theta R(\theta, \hat{\theta}^\pi) \leq r(\pi, \hat{\theta}^\pi)$$

which contradicts $r(\pi, \hat{\theta}^\pi) = \inf_{\hat{\theta}} r(\pi, \hat{\theta})$.

**Theorem 13.12**.  Suppose that $\hat{\theta}$ is the Bayes rule with respect to some prior $\pi$.  Suppose further that $\hat{\theta}$ has constant risk: $R(\theta, \hat{\theta}) = c$ for some $c$.  Then $\hat{\theta}$ is minimax.

**Proof**.  The Bayes risk is $r(\pi, \hat{\theta}) = \int R(\theta, \hat{\theta}) \pi(\theta) d\theta = c$ and hence $R(\theta, \hat{\theta}) \leq r(\pi, \hat{\theta})$ for all $\theta$.  Now apply Theorem 13.11.
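A standard illustration of Theorem 13.12, sketched here under the assumption that $X \sim \text{Binomial}(n, p)$: the posterior mean under a $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior is $(x + \sqrt{n}/2)/(n + \sqrt{n})$, and its squared-error risk works out to the constant $n / (4(n + \sqrt{n})^2)$. The code checks the constancy numerically and compares it with the maximum risk $1/(4n)$ of the MLE. The helper names are hypothetical.

```python
from math import comb, sqrt

def risk(estimator, n, p):
    """Exact squared-error risk of estimator(x, n) when X ~ Binomial(n, p)."""
    return sum(
        (estimator(x, n) - p) ** 2 * comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )

def constant_risk_rule(x, n):
    # Posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior.
    a = sqrt(n) / 2
    return (x + a) / (n + 2 * a)

n = 16
grid = [i / 50 for i in range(51)]
risks = [risk(constant_risk_rule, n, p) for p in grid]
constant = n / (4 * (n + sqrt(n)) ** 2)

print(min(risks), max(risks), constant)  # all three should agree up to rounding
print(1 / (4 * n))                       # maximum risk of the MLE x / n, for comparison
```

Because this rule is a Bayes rule with constant risk, Theorem 13.12 implies it is minimax in this model, and its maximum risk is indeed below that of the MLE.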

**Theorem 13.15**.  Let $X_1, \dots, X_n \sim N(\theta, 1)$ and let $\hat{\theta} = \overline{X}$. Then $\hat{\theta}$ is minimax with respect to any well-behaved loss function.  It is the only estimator with this property.

*Well-behaved means that the level sets must be convex and symmetric about the origin.  The result holds up to sets of measure 0.*

### 13.5  Maximum Likelihood, Minimax and Bayes

For parametric models that satisfy weak regularity conditions, the MLE is approximately minimax.  Consider squared error loss, under which the risk is squared bias plus variance.  In parametric models with large samples, it can be shown that the variance term dominates the squared bias, so the risk of the MLE $\hat{\theta}$ roughly equals the variance:

$$R(\theta, \hat{\theta}) = \mathbb{V}_\theta(\hat{\theta}) + \text{bias}^2 \approx \mathbb{V}_\theta(\hat{\theta})$$

*Typically, the squared bias is of order $O(n^{-2})$ while the variance is of order $O(n^{-1})$.*

As seen in the chapter on parametric models, the variance is approximately:

$$\mathbb{V}(\hat{\theta}) \approx \frac{1}{nI(\theta)}$$

where $I(\theta)$ is the Fisher information.  Hence,

$$ n R(\theta, \hat{\theta}) \approx \frac{1}{I(\theta)}$$
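A case where these orders can be seen exactly, offered as an illustrative assumption rather than an example from the text: for $X_1, \dots, X_n \sim \text{Exponential}$ with rate $\lambda$, the MLE is $\hat{\lambda} = n / \sum_i X_i$ and the Fisher information is $I(\lambda) = 1/\lambda^2$. Since $\sum_i X_i$ has a Gamma distribution, the first two moments of $\hat{\lambda}$ are available in closed form for $n > 2$, and the sketch below uses them to track $nR$ and the ratio of squared bias to variance.

```python
def mle_rate_moments(n, lam):
    """Exact (bias, variance) of the MLE n / sum(X_i), X_i ~ Exponential(rate=lam), n > 2."""
    mean = n * lam / (n - 1)                          # E[lambda_hat]
    second = n**2 * lam**2 / ((n - 1) * (n - 2))      # E[lambda_hat^2]
    return mean - lam, second - mean**2

lam = 2.0
for n in (10, 100, 1000, 10000):
    bias, var = mle_rate_moments(n, lam)
    risk = var + bias**2
    # n * risk should approach 1 / I(lam) = lam^2; bias^2 / var should shrink like 1 / n.
    print(n, n * risk, bias**2 / var)
```

With $\lambda = 2$ the scaled risk $nR$ should settle near $\lambda^2 = 4$ as $n$ grows, while the squared bias becomes negligible relative to the variance.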

For any other estimator $\tilde{\theta}$, it can be shown that, for large $n$, $R(\theta, \tilde{\theta}) \geq R(\theta, \hat{\theta})$.  More precisely,

$$ \lim_{\epsilon \rightarrow 0} \limsup_{n \rightarrow \infty} \sup_{|\theta - \theta'| < \epsilon} n R(\theta', \tilde{\theta}) \geq \frac{1}{I(\theta)} $$

where the supremum is over parameter values $\theta'$ within $\epsilon$ of $\theta$.

This says that, in a local, large sample sense, the MLE is minimax.  It can also be shown that the MLE is approximately the Bayes rule.

In summary, in parametric models with large samples, the MLE is approximately minimax and Bayes.  There is a caveat:  these results break down when the number of parameters is large.

### 13.6  Admissibility

An estimator $\hat{\theta}$ is **inadmissible** if there exists another rule $\hat{\theta}'$ such that

$$
\begin{align}
R(\theta, \hat{\theta}') \leq R(\theta, \hat{\theta}) & \quad \text{for all } \theta \text{ and} \\
R(\theta, \hat{\theta}') < R(\theta, \hat{\theta}) & \quad \text{for at least one } \theta
\end{align}
$$
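The definition can be turned into a simple dominance check over a grid of parameter values. The sketch below assumes $X_1, \dots, X_n \sim N(\theta, 1)$ and uses closed-form risks under squared error loss: $1/n$ for $\overline{X}$, $1$ for the single observation $X_1$, and $c^2/n + (1 - c)^2 \theta^2$ for the shrunken mean $c\overline{X}$ with $c = 1/2$. The function `dominates` and the choice of estimators are hypothetical, and a grid check is only numerical evidence, not a proof.

```python
def dominates(risk_a, risk_b, thetas, tol=1e-12):
    """True if risk_a <= risk_b at every grid point and strictly smaller at some point."""
    never_worse = all(risk_a(t) <= risk_b(t) + tol for t in thetas)
    sometimes_better = any(risk_a(t) < risk_b(t) - tol for t in thetas)
    return never_worse and sometimes_better

n = 5

def risk_mean(theta):
    return 1 / n                        # risk of the sample mean

def risk_first(theta):
    return 1.0                          # risk of using X_1 alone

def risk_shrunk(theta):
    return 0.25 / n + 0.25 * theta**2   # risk of 0.5 * sample mean

thetas = [i / 10 for i in range(-50, 51)]

print(dominates(risk_mean, risk_first, thetas))   # sample mean vs. X_1
print(dominates(risk_mean, risk_shrunk, thetas))  # sample mean vs. shrunken mean
print(dominates(risk_shrunk, risk_mean, thetas))  # shrunken mean vs. sample mean
```

Here $X_1$ is inadmissible because $\overline{X}$ has strictly smaller risk for every $\theta$ when $n > 1$. The shrunken mean and $\overline{X}$ do not dominate each other: shrinkage wins near $\theta = 0$ and loses for large $|\theta|$.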

A prior $\pi$ has **full support** if for every $\theta$ and every $\epsilon > 0$, $\int_{\theta - \epsilon}^{\theta + \epsilon} \pi(t) dt > 0$.