## Decision Theory
Suppose we have a probability model. It tells us that the probabilities of the three classes are (0.25, 0.4, 0.35). What label should we assign to this sample? Or suppose the model tells us, based on environmental factors, that the level of mercury in this county is exponentially distributed with mean 0.03: what single value or interval should we report? And moreover, what action should we take?

Decision theory deals with these questions of mapping a probability distribution down to an action, where an action might be "evacuate the county" or might even be "report 0.35" or "report the interval [0.2, 0.4]".

#### Idea
The basic idea behind decision theory is this: we should quantify the utility or, equivalently, the loss of each possible action and decide based on those. That is, don't just look at P(cancer|data) and P(not cancer|data); factor in the costs, benefits, and risks of each possible action.

Indeed, one can consider a prediction itself as an action to be undertaken with respect to a particular utility.

#### Specifics
Remember that there are two key distributions arising from the Bayesian scenario: the posterior $p(\theta \vert D)$ and the posterior-predictive $p(y^* \vert D) = \int d\theta\, p(y^* \vert \theta) p(\theta \vert D)$. Either of these can be used to make decisions: it depends upon what we have information about. If we have information about the true values of parameters, the posterior might be fine to use. But typically, we don't have information at that level and are more interested in predictions from the model, so we formulate the problem in terms of the posterior predictive.
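
The integral defining the posterior predictive can be approximated by sampling in two stages: draw $\theta$ from the posterior, then draw $y^*$ from the likelihood at that $\theta$. The following is a minimal sketch, assuming a hypothetical Beta-Binomial model that is not part of the text above (the posterior parameters and the number of future trials are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: theta | D ~ Beta(8, 4),
# e.g. a flat Beta(1, 1) prior updated with 7 successes in 10 trials.
alpha_post, beta_post = 8.0, 4.0
n_future = 10  # number of future trials (assumed)

# Stage 1: draw parameter values from the posterior p(theta | D).
theta_draws = rng.beta(alpha_post, beta_post, size=50_000)

# Stage 2: for each theta, draw one future outcome from p(y* | theta).
y_star_draws = rng.binomial(n_future, theta_draws)

# The y* draws are samples from the posterior predictive p(y* | D).
pred_mean = y_star_draws.mean()
exact_mean = n_future * alpha_post / (alpha_post + beta_post)
print(pred_mean, exact_mean)
```

The draws in `y_star_draws` carry both the uncertainty in $\theta$ and the sampling noise in $y^*$, which is why decisions about future outcomes are usually phrased in terms of this distribution.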

Either way, the components of the decision problem are

1. $a \in A$, available actions for the decision problem
2. $\omega \in \Omega$, a state in the set of states of the world. If $\Omega$ is the set of all future observed outcomes $y$, then $\omega = y^*$, a future $y$. If $\Omega$ is the set of possible true parameter values, then $\omega = \theta$ is a value of the parameter(s). **Note** $y^*$ isn't some outcome with special properties, just whatever outcome happened to occur. We could really just as well write it without the star.
3. $p(\omega \vert D)$, which tells us our current beliefs/knowledge about the world. This is either the posterior distribution (for $\theta$) or the posterior predictive distribution (for $y^*$).
4. A utility function $u(a, \omega): A \times \Omega \rightarrow \mathbb{R}$ that awards a score/utility/profit to each action $a$ when the true state of the universe is $\omega$. Any utility function can be restated as a loss, for example $l(a, \omega) = -u(a, \omega)$, so maximizing expected utility is the same as minimizing expected loss (the risk).

The game then is to maximize the expected utility under this distribution over all possible actions $a$ (or, equivalently, to minimize the risk).

In other words, we first define the distribution-averaged utility, i.e. the average utility of action $a$ under our posterior beliefs:

$$\bar{u}(a) = \int u(a, \omega) \, p(\omega \vert D) d\omega$$
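
When draws from $p(\omega \vert D)$ are available, this integral can be approximated by a simple average of the utility over the draws. Below is a rough sketch using the mercury example (exponential with mean 0.03). The squared-error utility and the candidate report `a = 0.05` are assumptions chosen for illustration; they are not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

mean_level = 0.03  # mean of the exponential belief distribution (from the text)
omega = rng.exponential(scale=mean_level, size=200_000)  # draws from p(omega | D)

def utility(a, omega):
    """Hypothetical utility: negative squared error of reporting the value a."""
    return -(a - omega) ** 2

def expected_utility(a, draws):
    """Monte Carlo estimate of u_bar(a): the average of u(a, omega) over the draws."""
    return utility(a, draws).mean()

a = 0.05  # one candidate action (assumed)
u_bar_mc = expected_utility(a, omega)

# For this particular utility there is a closed form to compare against:
# u_bar(a) = -(Var[omega] + (a - E[omega])^2), and Var = mean^2 for an exponential.
u_bar_exact = -(mean_level**2 + (a - mean_level) ** 2)
print(u_bar_mc, u_bar_exact)
```

The closed form is available only because of the convenient utility chosen here; the Monte Carlo average works the same way for utilities with no analytic integral.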

We then find the $a$ that maximizes this utility and call it $\hat{a}$ (even if easier said than calculated):

$$ \hat{a} = \arg\max_a \bar{u}(a)$$

This best-expected-utility action is called the **Bayes action**.
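
For a discrete problem the integral becomes a sum, and the whole procedure is a matrix-vector product followed by an argmax. The sketch below uses the three class probabilities from the opening paragraph. Both utility matrices are hypothetical and chosen only to show that the best action need not be the most probable class.

```python
import numpy as np

# p(omega | D): the three class probabilities from the opening example.
p = np.array([0.25, 0.40, 0.35])

# Utility matrices U[a, omega]: rows are actions (assigned label),
# columns are true states. Both matrices are assumptions for illustration.

# 0-1 utility: reward 1 for the correct label, 0 otherwise.
U_01 = np.eye(3)
u_bar_01 = U_01 @ p                    # expected utility of each action
bayes_01 = int(np.argmax(u_bar_01))    # picks the most probable class

# Asymmetric utility: failing to flag class 2 is heavily penalized.
U_asym = np.array([
    [ 1.0,  0.0, -5.0],
    [ 0.0,  1.0, -5.0],
    [-1.0, -1.0,  1.0],
])
u_bar_asym = U_asym @ p                # approximately [-1.5, -1.35, -0.3]
bayes_asym = int(np.argmax(u_bar_asym))

print(bayes_01, bayes_asym)
```

Under the 0-1 utility the Bayes action is label 1, the most probable class. Under the asymmetric utility it switches to label 2, even though class 2 has probability 0.35, because the penalty for missing it dominates the average.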

#### Another view
Instead of directly searching for the Bayes (best-expected-utility) action, we can define a "distance" from this action and minimize that.

The expected utility of the bayes action is given by the weighted average of the utility gained at each outcome, assuming we take action $\hat{a}$:

$$\bar{u}(\hat{a}, p) = \bar{u}(\hat{a}) = \int \, u(\hat{a}, \omega) \, p(\omega \vert D) d\omega $$

This maximized utility, which we also write as $\bar{u}(p, p) \equiv \bar{u}(\hat{a}, p)$, is sometimes referred to as the entropy function, and an associated **divergence** can be defined:

$$ d(a,p) = \bar{u}(p, p) - \bar{u}(a, p)$$

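Because $\hat{a}$ maximizes the expected utility, $d(a,p) \ge 0$ for every action, with equality at $a = \hat{a}$. One can therefore think of minimizing $d(a,p)$ with respect to $a$ to get $\hat{a}$, so that this discrepancy can be thought of as a loss function.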
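
The equivalence between maximizing $\bar{u}(a, p)$ and minimizing $d(a, p)$ can be checked numerically on a small discrete problem. This is a minimal sketch; the utility matrix is hypothetical, and the probabilities are those of the three-class example at the top.

```python
import numpy as np

p = np.array([0.25, 0.40, 0.35])  # p(omega | D) from the opening example

# Hypothetical utility matrix U[a, omega] (an assumption for illustration).
U = np.array([
    [ 1.0,  0.0, -5.0],
    [ 0.0,  1.0, -5.0],
    [-1.0, -1.0,  1.0],
])

u_bar = U @ p                 # u_bar(a, p) for each action a
u_best = u_bar.max()          # u_bar(p, p): expected utility of the best action

# Divergence of each action from the best one: d(a, p) = u_bar(p, p) - u_bar(a, p)
d = u_best - u_bar

a_hat_max = int(np.argmax(u_bar))  # maximize expected utility
a_hat_min = int(np.argmin(d))      # minimize the divergence
print(d, a_hat_max, a_hat_min)
```

Both searches return the same action, and `d` is zero there and positive elsewhere, which is what lets it play the role of a loss.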

### Example

To make this concrete, consider the problem in which $\omega$ is a future observation $y^*$, e.g. the actual number of manuscripts a monastery will produce next month. We then get a posterior predictive distribution with respect to some model $M$, which we add to the conditioned-upon variables. (The reason to do this is that we will later consider averaging with respect to sufficiently expressive or true distributions, rather than any particular posterior predictive.)

With this in hand, we can write the utility of action $a$ as the weighted average utility of $a$ across the distribution of possible outcomes. In the monastery example, the model tells us that the number of manuscripts produced ($y^*$) should follow some distribution. We find the average value of action $a$ across the possible $y^*$ values:

$$\bar{u}(a) = \int \, u(a, y^*) \, p(y^* \vert D, M) dy^*$$

We could then search for the action with the best average utility over possible outcomes.
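
Such a search might look like the sketch below. Everything specific in it is an assumption for illustration: the Gamma-Poisson model standing in for $M$, its parameters, the restriction of actions to "report an integer count", and the two utilities (negative squared error and negative absolute error).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical posterior predictive for next month's manuscript count:
# rate ~ Gamma(shape=60, rate=5) (posterior, assumed), y* | rate ~ Poisson(rate).
rate_draws = rng.gamma(shape=60.0, scale=1.0 / 5.0, size=20_000)
y_star = rng.poisson(rate_draws)  # draws from p(y* | D, M)

# Candidate actions: report an integer count between 0 and the largest draw.
actions = np.arange(0, y_star.max() + 1)

# Expected utility of every action, averaging over the predictive draws.
diff = actions[:, None] - y_star[None, :]
u_bar_sq = -(diff ** 2).mean(axis=1)      # negative squared error
u_bar_abs = -np.abs(diff).mean(axis=1)    # negative absolute error

best_sq = actions[np.argmax(u_bar_sq)]
best_abs = actions[np.argmax(u_bar_abs)]

print(best_sq, y_star.mean())      # best_sq is the integer closest to the mean
print(best_abs, np.median(y_star)) # best_abs sits at the median of the draws
```

The choice of utility decides which summary of the predictive distribution gets reported: squared error pulls the action toward the predictive mean, absolute error toward the predictive median.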

#### Notation for regression
Everything we have said above works for linear regression and other non-generative models. However, the action $a$ can now look at and use the observed covariate values $x^*$ to decide what to do, so it is written $a(x^*)$, and the outcome $y^*$ is always conditioned on $x^*$ (in addition to the data $D$ the model was trained on).

$$\bar{u}(a(x^*)) = \int u(a(x^*), y^*) \, p(y^* \vert x^*, D, M) dy^* $$

$$ \hat{a}(x^*) = \arg\max_a \bar{u}(a(x^*))$$
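
In code, the only change from the unconditional case is that the predictive draws, and hence the best action, are recomputed for each $x^*$. The sketch below assumes a hypothetical simple linear model with made-up posterior draws for the intercept and slope, a known noise scale, and a squared-error utility; none of these come from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior draws for the model y = b0 + b1 * x + Normal(0, sigma).
n_draws = 20_000
b0 = rng.normal(1.0, 0.10, size=n_draws)
b1 = rng.normal(2.0, 0.05, size=n_draws)
sigma = 0.5  # noise standard deviation, treated as known (assumption)

def predictive_draws(x_star):
    """Draws from p(y* | x*, D, M): one y* per posterior draw of (b0, b1)."""
    return b0 + b1 * x_star + rng.normal(0.0, sigma, size=n_draws)

def bayes_action(x_star, grid):
    """Grid search for the report a(x*) maximizing the expected -(a - y*)^2."""
    y = predictive_draws(x_star)
    u_bar = np.array([-np.mean((a - y) ** 2) for a in grid])
    return grid[np.argmax(u_bar)], y.mean()

grid = np.linspace(0.0, 12.0, 241)  # candidate reports, spacing 0.05
for x_star in (0.5, 2.0, 4.0):
    a_hat, pred_mean = bayes_action(x_star, grid)
    print(x_star, a_hat, pred_mean)
```

With this utility the selected action tracks the predictive mean at each $x^*$ to within the grid spacing, so $\hat{a}(x^*)$ is a function of the covariates rather than a single number.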