In [1]:
import numpy as np
import matplotlib.pyplot as plt

Toy example:

Suppose that we have the following model

$z_n \sim Categorical(\pi)$

$x_n | z_n = k \sim N(\mu_k, \sigma_k)$

$y_n | z_n = k \sim Bernoulli(p_k)$

and suppose that we have learned the model parameters $\Theta = \{ \pi, \mu_k, \sigma_k, p_k \}$.
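
To make the generative story concrete, here is a minimal sampling sketch. The function name `sample_model`, the parameter values in the usage line, and the reading of $\sigma_k$ as a standard deviation (rather than a variance) are assumptions made for illustration; they are not specified by the model above.

```python
import numpy as np

def sample_model(n, pi, mu, sigma, p, seed=0):
    """Draw n samples (z, x, y) from the mixture model (hypothetical helper).

    pi    : mixture weights, summing to 1
    mu    : per-component means of x
    sigma : per-component standard deviations of x (assumed, not variances)
    p     : per-component Bernoulli parameters, p[k] = P(y = 1 | z = k)
    """
    rng = np.random.default_rng(seed)
    pi, mu, sigma, p = (np.asarray(a, dtype=float) for a in (pi, mu, sigma, p))
    z = rng.choice(len(pi), size=n, p=pi)   # z_n ~ Categorical(pi)
    x = rng.normal(mu[z], sigma[z])         # x_n | z_n ~ N(mu_k, sigma_k)
    y = rng.binomial(1, p[z])               # y_n | z_n ~ Bernoulli(p_k)
    return z, x, y

# Illustrative (made-up) parameters for a two-component model
z, x, y = sample_model(1000, pi=[0.5, 0.5], mu=[-1.0, 1.0],
                       sigma=[1.0, 1.0], p=[0.75, 0.5])
```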

Having learned the parameters, given a test sample $x^*$ without the corresponding label $y^*$, we must make a decision---a hard assignment---of what $y^*$ might be. Associated with the decision is a cost, encoded as a loss function $L(\hat{y}, y)$, where $\hat{y}$ is the guess and $y$ the truth. For example, we might have:

$$L(\hat{y}, y)  = 
\begin{cases}
100 & \text{if } \hat{y} = 0, y = 1 \\
10  & \text{if } \hat{y} = 1, y = 0 \\
0   & \text{if } \hat{y} = y 
\end{cases}$$

In other words, there is a greater penalty for incorrectly guessing that $y = 0$. As a mnemonic, call $y = 0$ "green" and $y = 1$ "red": there is a higher penalty for guessing that the light is green when it really is red. For general loss values, we write $L_{10} = L_{RG}$ for the loss of guessing red when the truth is green; $L_{01} = L_{GR}$ is defined similarly.

#### Baseline (the plugin estimator)
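Given $x^*$, the posterior probability $p(y^*|x^*)$ is proportional to the joint $p(x^*, y^*)$, which we obtain by summing over the latent $z^*$ (two components here, $z^* \in \{0, 1\}$):

$$\begin{align*}
p(y^*|x^*) \propto p(x^*, y^*) &= \sum_{z^*} p(z^*|\pi) p(x^*|z^*) p(y^*|z^*) \\
&= p(z^* = 0|\pi) p(x^*|z^* = 0) p(y^*|z^* = 0 ) + p(z^* = 1|\pi) p(x^*|z^* = 1) p(y^*|z^* = 1) \\
&= \pi_0 N(x^* | \mu_0, \sigma_0) Bernoulli(y^*|p_0) + \pi_1 N(x^* | \mu_1, \sigma_1) Bernoulli(y^*|p_1)
\end{align*}$$

Dividing by $p(x^*) = \sum_{y^*} p(x^*, y^*)$ gives the normalized posterior.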

And we will choose as our decision $\hat{y}$ whichever value of $y$ has the higher posterior probability. That is,

$$\hat{y} = \text{argmax}_{y^*} \; p(y^*|x^*) $$
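
A minimal sketch of the plugin estimator follows. It sums over $z^*$ to get the joint $p(x^*, y^*)$ and then normalizes over $y^*$. The helper names (`normal_pdf`, `posterior_red`, `plugin_decision`), the treatment of $\sigma_k$ as a standard deviation, and the tie-breaking toward $y = 0$ are assumptions for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma taken as a standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior_red(x_star, pi, mu, sigma, p):
    """Return p(y* = 1 | x*), where p[k] = P(y = 1 | z = k)."""
    pi, mu, sigma, p = (np.asarray(a, dtype=float) for a in (pi, mu, sigma, p))
    w = pi * normal_pdf(x_star, mu, sigma)   # pi_k * N(x* | mu_k, sigma_k)
    joint_red = np.sum(w * p)                # p(x*, y* = 1)
    joint_green = np.sum(w * (1.0 - p))      # p(x*, y* = 0)
    return joint_red / (joint_red + joint_green)

def plugin_decision(x_star, pi, mu, sigma, p):
    """Pick the more probable label; ties go to y = 0 (an arbitrary choice)."""
    return 1 if posterior_red(x_star, pi, mu, sigma, p) > 0.5 else 0

# Illustrative (made-up) parameters; x* = 0 is equidistant from both means
params = dict(pi=[0.5, 0.5], mu=[-1.0, 1.0], sigma=[1.0, 1.0], p=[0.75, 0.5])
p_red_mid = posterior_red(0.0, **params)
y_hat_mid = plugin_decision(0.0, **params)
```

At $x^* = 0$ both components are equally responsible under these made-up parameters, so the posterior probability of red is the average of the two Bernoulli parameters.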

#### The decision-aware way
Rather than the above decision, we may choose the value of $y$ that minimizes the expected loss or, equivalently, maximizes the expected utility $U(y, y^*) = -L(y, y^*)$ under the posterior $p(y^*|x^*)$. Below, we frame the problem as maximizing expected utility:

$$
\begin{align*}
\hat{y} &= \text{argmax}_{y} \; \sum_{y^*} U(y, y^*) p(y^*|x^*) \\
&= \text{argmax}_{y} \; [ U(y, R) p(R|x^*) + U(y, G) p(G|x^*) ]
\end{align*}
$$ 

where $p(R|x^*)$ means $p(y^* = R|x^*)$. We simply plug in each $y \in \{ G, R \}$ and compare the expected utilities of the two decisions, written in terms of the general loss values:

Decision $y = G$: 

$$
\begin{align*}
U(y=G, R) p(R|x^*) + U(y=G, G) p(G|x^*) &= -L_{GR} \, p(R|x^*) - L_{GG} \, p(G|x^*)
\end{align*}
$$

Here $p(R|x^*)$ depends on the Bernoulli parameters $p_k$: for instance, $p_0$ is the Bernoulli parameter of the $z=0$ component, and thus the probability of $y = 1$ (red) given $z=0$.

Similarly, the expected utility of the decision $y = R$ is

$$
\begin{align*}
U(y=R, R) p(R|x^*) + U(y=R, G) p(G|x^*) &= -L_{RR} \, p(R|x^*) - L_{RG} \, p(G|x^*)
\end{align*}
$$

Setting the expected utilities of the two decisions equal to each other gives the decision boundary as a function of $x^*$. Writing $U_{ab} = U(y=a, b) = -L_{ab}$, we choose $\hat{y} = G$ if

$$
\frac{p(G|x^*)}{p(R|x^*)} \geq \frac{U_{RR} - U_{GR}}{U_{GG} - U_{RG}}
$$

For this threshold to be meaningful (i.e., for the decision to depend on the probability $p(y^*|x^*)$), we require that $U_{RR} > U_{GR}$ and $U_{GG} > U_{RG}$, so that the threshold $\frac{U_{RR} - U_{GR}}{U_{GG} - U_{RG}} > 0$.
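
The algebra behind this threshold can be spelled out as follows, using the shorthand $U_{ab} = U(y=a, b)$ for the utility of deciding $a$ when the truth is $b$, and assuming $U_{GG} > U_{RG}$ and $p(R|x^*) > 0$ so that the final division preserves the inequality. We choose $G$ when its expected utility is at least that of $R$:

$$
\begin{align*}
U_{GR} \, p(R|x^*) + U_{GG} \, p(G|x^*) &\geq U_{RR} \, p(R|x^*) + U_{RG} \, p(G|x^*) \\
(U_{GG} - U_{RG}) \, p(G|x^*) &\geq (U_{RR} - U_{GR}) \, p(R|x^*) \\
\frac{p(G|x^*)}{p(R|x^*)} &\geq \frac{U_{RR} - U_{GR}}{U_{GG} - U_{RG}}
\end{align*}
$$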

The optimal decision $h$ is then 

$$
h(x^*) =
\begin{cases}
G & \text{if } D(x^*) \geq T \\
R & \text{otherwise}
\end{cases}
$$

where $D(x^*) = \frac{p(G|x^*)}{p(R|x^*)}$ and $T = \frac{U_{RR} - U_{GR}}{U_{GG} - U_{RG}}$. $T$ is a threshold determined solely by the utility function, and $D(x^*)$ is the probabilistic quantity that depends on the observed test data $x^*$.

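Let's interpret this result. The threshold $T$ is a ratio of utility gaps: the numerator $U_{RR} - U_{GR}$ is how much better the correct action is than the wrong one when the light is truly red, and the denominator $U_{GG} - U_{RG}$ is the same gap when the light is truly green.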

For example, we might set $U_{RR} = U_{GG} = 101$: we're happy when we make the correct decision. $U_{RG} = 51$ seems reasonable, as stopping at the light when it's green diminishes the pleasure, but we're still safe. $U_{GR} = 1$ is reasonable for the risky behavior of running the red light. The threshold then equals $T = (101 - 1)/(101 - 51) = 2$: we choose green only if green is at least twice as probable as red, i.e. $p(G|x^*) \geq 2/3$.
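
The sketch below checks, for these example utilities, that maximizing expected utility directly and comparing the odds $D$ against the threshold $T$ lead to the same decision. The matrix layout (rows index the decision, columns the truth, with index 0 for green and 1 for red), the function names, and the two test probabilities are assumptions chosen for illustration.

```python
import numpy as np

# U[a, t] = utility of deciding a when the truth is t; index 0 = G, 1 = R
U = np.array([[101.0,   1.0],    # decide G: U_GG, U_GR
              [ 51.0, 101.0]])   # decide R: U_RG, U_RR

def expected_utilities(p_red, U):
    """Expected utility of each decision under the posterior (p(G), p(R))."""
    posterior = np.array([1.0 - p_red, p_red])
    return U @ posterior

def utility_decision(p_red, U):
    """Decision maximizing expected utility (0 = G, 1 = R)."""
    return int(np.argmax(expected_utilities(p_red, U)))

def threshold_decision(p_red, U):
    """Same decision via the odds D = p(G)/p(R) and the threshold T."""
    T = (U[1, 1] - U[0, 1]) / (U[0, 0] - U[1, 0])
    D = (1.0 - p_red) / p_red
    return 0 if D >= T else 1

T = (U[1, 1] - U[0, 1]) / (U[0, 0] - U[1, 0])    # 100 / 50
decisions = {p: (utility_decision(p, U), threshold_decision(p, U))
             for p in (0.3, 0.4)}                 # hypothetical p(R|x*) values
```

With $p(R|x^*) = 0.3$ the odds are about $2.33$, above the threshold, so both rules choose green; with $p(R|x^*) = 0.4$ the odds are $1.5$ and both choose red.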

Suppose that if $z = 0$, the probability of red is $p_0 = 0.75$, and the corresponding probability of red for $z = 1$ is $p_1 = 0.5$. In other words, if $z = 0$, we are fairly confident that the light is red, whereas if $z = 1$ we have no information about the color.
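
Because $y^*$ depends on $x^*$ only through $z^*$, the posterior probability of red is a mixture $p(R|x^*) = p(z^*=0|x^*) \, p_0 + p(z^*=1|x^*) \, p_1$. As a quick numerical check, the sketch below sweeps the responsibility $p(z^*=0|x^*)$ over a grid and computes the resulting odds $D$. The grid itself is an arbitrary illustration, and the threshold $T = 2$ is the one from the utility example above.

```python
import numpy as np

p0, p1 = 0.75, 0.5    # P(red | z = 0), P(red | z = 1)
T = 2.0               # threshold from the example utilities

# Hypothetical grid over the responsibility r = p(z* = 0 | x*)
r = np.linspace(0.0, 1.0, 101)

p_red = r * p0 + (1.0 - r) * p1    # p(R | x*), between 0.5 and 0.75
D = (1.0 - p_red) / p_red          # odds of green versus red

always_red = bool(np.all(D < T))
```

With these values $p(R|x^*)$ stays between $0.5$ and $0.75$, so the odds $D$ stay between $1/3$ and $1$ and never reach $T = 2$: the decision-aware rule chooses red for every $x^*$.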