# Lecture 17: Bayesian Decision Theory

```python
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()
```

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany
such decisions. It makes the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known.

---
## Definitions



<img src="figures/lecture-17/dataset-example.png" width="400" />











### Probability Functions
- Probability mass function, denoted $P(\cdot)$, used for discrete variables
- Probability density function, denoted $p(\cdot)$, used for continuous variables. Density functions are normalized, so the area under each curve is 1.0.


--- 
### A priori probability
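A priori probability (prior probability) captures an expert's beliefs about something before looking at new data. When no such expert knowledge is available, it can be estimated from relative frequencies. Suppose we have a dataset of 10 samples categorised into two classes $c_1$ and $c_2$. The a priori probability of a class $c_k$, and the probability of observing a value $x$ regardless of its class, can then be computed as follows:

$$
\begin{aligned}
P(c_k) &= \frac{\text{number of samples in } c_k}{\text{total number of samples}} \\
p(x) &= \frac{\text{number of samples with value } x}{\text{total number of samples}}
\end{aligned}
$$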


For our toy dataset, we can compute the a priori probabilities for $c_k$ as follows:

$$
P(c_1) = \frac{5}{10}, \quad
P(c_2) = \frac{5}{10}
$$

and for $x$:
\begin{align}
p(x=29) = \frac{1}{10} \\
p(x=30) = \frac{1}{10} \\
p(x=31) = \frac{3}{10} \\
p(x=49) = \frac{1}{10} \\
p(x=50) = \frac{1}{10} \\
p(x=51) = \frac{3}{10}
\end{align}

The a priori values must sum to 1 because they are probability values!
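
The counting above can be reproduced in a few lines of Python. This is a minimal sketch: the figure with the toy dataset is not reproduced here, so the assignment of values to classes in `samples` is an assumption chosen to match the counts listed above (five samples per class, with 31 and 51 each occurring three times).

```python
from collections import Counter
from fractions import Fraction

# Assumed toy dataset: 10 (x, class) pairs consistent with the counts in the text.
# The class labels attached to each value are hypothetical.
samples = [(29, "c1"), (30, "c1"), (31, "c1"), (31, "c1"), (31, "c1"),
           (49, "c2"), (50, "c2"), (51, "c2"), (51, "c2"), (51, "c2")]

n = len(samples)
class_counts = Counter(label for _, label in samples)
value_counts = Counter(x for x, _ in samples)

# Relative frequencies, kept as exact fractions
prior = {c: Fraction(k, n) for c, k in class_counts.items()}
p_x = {x: Fraction(k, n) for x, k in value_counts.items()}

print(prior)
print(p_x)
# Both sets of probabilities must sum to 1
print(sum(prior.values()), sum(p_x.values()))
```

Using `Fraction` keeps the results exact, so the sums come out as exactly 1 rather than a floating-point approximation.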

Making a decision based solely on the a priori probability is like deciding whether a picture shows $c_1$ or $c_2$ without actually seeing the picture first. If we are forced to make such a decision with so little information, we could use the following decision rule:

**Classify as $c_1$ if $P(c_1) > P(c_2)$; otherwise classify as $c_2$**

This decision rule, based solely on the a priori probabilities, assigns every image to the same class, because a priori probabilities do not take the sample itself into account.

---
### Conditional Probability

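We need to define a way to obtain a probability given the sample. The conditional probability, denoted $P(c_k \mid x)$, expresses the probability of class $c_k$ given a sample $x$. It is also called the **posterior probability**, because it is the probability of the class after the sample has been observed.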



---
### Class-conditional Probability

We denote $p(x \mid c_k)$ as the **class-conditional probability**. This expresses the probability of
observing $x$ given that the sample it corresponds to belongs to class $c_k$.

The class-conditional probability is also called the **likelihood** of $c_k$ with respect to $x$. This term is 
chosen to indicate that, other things being equal, the category $c_k$ for which $p(x \mid c_k)$
is large is more "likely" to be the true category.


Two hypothetical class-conditional probability density functions show the
probability density of measuring a particular feature value $x$ given the pattern is
in class $c_k$:

<img src="figures/lecture-17/probability-density-functions.png" width="500" />











---
### Joint Probability

We denote $p(c_k, x)$ as the **joint probability** of $c_k$ and $x$ and define it as:

$$
p(c_k, x) = P(c_k \mid x) p(x) = p(x\mid c_k) P(c_k)
$$
where
$$
p(x) = \sum_{k=1}^K{ p(x \mid c_k) P(c_k)}
$$

---
### Bayes' formula

Bayes' formula gives us a way to compute the conditional probability by rewriting the joint probability:

\begin{align}
p(c_k, x) = P(c_k \mid x) p(x) \Longleftrightarrow
\frac{p(c_k, x)}{p(x)} = P(c_k \mid x)
\end{align}

and since $p(c_k, x) = p(x\mid c_k) P(c_k)$, we can obtain Bayes' formula:

\begin{align}
P(c_k \mid x) = \frac{p(x\mid c_k) P(c_k)}{p(x)}
\end{align}

Bayes' formula can be expressed informally in English by saying that:

$$
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
$$

It is the product of the likelihood and the prior probability that is most important in determining the posterior
probability. The evidence factor, $p(x)$, can be viewed as merely a scale factor that guarantees that the posterior probabilities sum to one.

---
### Examples: Using Bayes' formula

$$
P(c_1 \mid x=30) = \frac{ p(x=30 \mid c_1) P(c_1) }{ p(x=30) }
$$
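
To make the formula above concrete, the sketch below evaluates it by counting. The class membership of each value is an assumption (the dataset figure is not reproduced here): class $c_1$ is taken to hold the values 29, 30, 31, 31, 31 and class $c_2$ the values 49, 50, 51, 51, 51, which matches the counts given earlier. The helper names `prior`, `likelihood`, `evidence` and `posterior` are illustrative only.

```python
from fractions import Fraction

# Assumed toy dataset; the split of values into classes is hypothetical.
data = {"c1": [29, 30, 31, 31, 31], "c2": [49, 50, 51, 51, 51]}
total = sum(len(xs) for xs in data.values())

def prior(c):
    """P(c): fraction of all samples that belong to class c."""
    return Fraction(len(data[c]), total)

def likelihood(x, c):
    """p(x | c): fraction of the samples in class c that have value x."""
    return Fraction(data[c].count(x), len(data[c]))

def evidence(x):
    """p(x) = sum over classes of p(x | c) P(c)."""
    return sum(likelihood(x, c) * prior(c) for c in data)

def posterior(c, x):
    """P(c | x) from Bayes' formula."""
    return likelihood(x, c) * prior(c) / evidence(x)

print(likelihood(30, "c1"), prior("c1"), evidence(30))
print(posterior("c1", 30), posterior("c2", 30))
```

Under this assumed split, $p(x=30 \mid c_1) = 1/5$, $P(c_1) = 1/2$ and $p(x=30) = 1/10$, so $P(c_1 \mid x=30) = 1$. Note that `posterior` raises `ZeroDivisionError` for a value that never occurs in the dataset, because its evidence is zero.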

---
## Bayes' Decision Rule

Suppose we can compute the conditional probabilities. How do we then classify an observation $x$? We can use the following rule, called Bayes' Decision Rule:

**Classify $x$ as $c_1$ if $P(c_1 \mid x) > P(c_2 \mid x)$; otherwise classify $x$ as $c_2$**

This form of the decision rule emphasizes the role of the posterior probabilities.

### Multiple Classes

The above decision rule can be extended to more than two classes in a straightforward manner. Given:
- a set of classes $\{ c_1, c_2, \ldots, c_K \}$,
- an observation $x$, and
- the corresponding conditional probabilities $P(c_k \mid x)$ where $k = 1, 2, \ldots, K$,

the decision rule is:

**Classify $x$ as $c_l$ for which $P(c_l \mid x) > P(c_i \mid x)$ for all $i \neq l$**
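
The multi-class rule amounts to taking the arg max over the posteriors. A minimal sketch, assuming the likelihoods $p(x \mid c_k)$ and priors $P(c_k)$ at a single observation are already known; the function name `bayes_decide` and the numbers are made up for illustration.

```python
import numpy as np

def bayes_decide(likelihoods, priors):
    """Return (index of the class with the largest posterior, all posteriors)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors          # p(x | c_k) P(c_k)
    posteriors = joint / joint.sum()      # divide by the evidence p(x)
    return int(np.argmax(posteriors)), posteriors

# Hypothetical values for K = 3 classes at one observation x
label, post = bayes_decide(likelihoods=[0.10, 0.30, 0.05],
                           priors=[0.5, 0.2, 0.3])
print(label, post)
```

Here the second class wins (index 1) even though the first class has the largest prior. When two posteriors tie, `np.argmax` returns the first of them, which is one possible tie-breaking convention rather than part of the rule itself.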


---
### Probability of Error

To justify Bayes' Decision Rule, let us calculate the probability of error whenever we make a decision.
Whenever we classify a particular $x$, we can define the probability of error as follows:

<img src="figures/lecture-17/error-probability.png" width="600" />



Equation (3.7) can be written as:
$$
P(error \mid x) = \min \left[ P \left( c_1 \mid x \right), P \left( c_2 \mid x \right) \right]
$$

Clearly, for a given $x$ we can minimize the probability of error by deciding $c_1$ if $P(c_1 \mid x) > P(c_2 \mid x)$ and $c_2$ otherwise. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

<img src="figures/lecture-17/average-error.png" width="600" />

and if for every $x$ we ensure that $P(error \mid x)$ is as small as possible, then the integral must be as small as possible. 

Thus we have justified Bayes' decision rule as the rule that minimizes the probability of error.
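
The average probability of error can be checked numerically. The sketch below assumes two unit-variance Gaussian class-conditional densities with equal priors (these are illustrative choices, not values from the lecture) and integrates $P(error \mid x)\, p(x) = \min\left[ p(x \mid c_1) P(c_1),\; p(x \mid c_2) P(c_2) \right]$ over a fine grid.

```python
import math
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed setup: equal priors, Gaussian likelihoods with a shared std
prior_1, prior_2 = 0.5, 0.5
mean_1, mean_2, std = 0.0, 2.0, 1.0

x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
joint_1 = gaussian_pdf(x, mean_1, std) * prior_1
joint_2 = gaussian_pdf(x, mean_2, std) * prior_2

# P(error | x) p(x) is the smaller of the two joint terms at each x
bayes_error = np.sum(np.minimum(joint_1, joint_2)) * dx

# Closed form for this symmetric case: Phi(-|mean_2 - mean_1| / (2 std))
d = abs(mean_2 - mean_1) / (2.0 * std)
closed_form = 0.5 * math.erfc(d / math.sqrt(2.0))
print(bayes_error, closed_form)
```

The two numbers agree closely. Any other decision rule replaces the minimum with the larger term for some values of $x$, which can only increase the integral.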

---
## Decision Functions

Bayes' Decision Rule is given as:

**Classify $x$ as $c_1$ if $P(c_1 \mid x) > P(c_2 \mid x)$; otherwise classify $x$ as $c_2$**


Thus, the optimal decision boundary is the value of $x$ at which the two posterior probabilities are equal:

$$
P(c_1 \mid x) = P(c_2 \mid x)
$$

<div class="warning">
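In the multivariate case, i.e., when $\mathbf{x} \in \mathbb{R}^D$, the decision boundary is a surface in $\mathbb{R}^D$. It is a hyperplane only in special cases, for example when the class-conditional densities are Gaussian and share the same covariance matrix.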
</div>

This is illustrated in Figure 3.2, where we assume that the conditional probability $P(c_k \mid x)$ can be approximated by a continuous function:

<img src="figures/lecture-17/figure-3.2.png" width="600" />
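
For a concrete case the crossing point can be computed in closed form. The sketch below assumes Gaussian class-conditional densities with a shared standard deviation (an assumption made for illustration; the lecture does not fix a model here). Setting $p(x \mid c_1) P(c_1) = p(x \mid c_2) P(c_2)$ and taking logarithms gives $x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_1 - \mu_2} \ln \frac{P(c_2)}{P(c_1)}$.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def decision_boundary(mean_1, mean_2, std, prior_1, prior_2):
    """x where p(x|c1) P(c1) = p(x|c2) P(c2), for equal-variance Gaussians."""
    midpoint = 0.5 * (mean_1 + mean_2)
    shift = std ** 2 * math.log(prior_2 / prior_1) / (mean_1 - mean_2)
    return midpoint + shift

# Hypothetical parameters: class means 30 and 50, std 5, unequal priors
x_star = decision_boundary(30.0, 50.0, 5.0, prior_1=0.7, prior_2=0.3)

# Sanity check: both joint terms should match at the boundary
lhs = gaussian_pdf(x_star, 30.0, 5.0) * 0.7
rhs = gaussian_pdf(x_star, 50.0, 5.0) * 0.3
print(x_star, lhs, rhs)
```

With equal priors the boundary sits at the midpoint between the means. Making $c_1$ more probable a priori moves the boundary towards the mean of $c_2$, so a larger range of $x$ is assigned to $c_1$.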













---
## Maximum Likelihood Classification

Bayes' Decision Rule gives us a way to classify a sample $x$ into a class $c_k$ based on the conditional probability $P(c_k \mid x)$. However, in the special case where the prior probabilities $P(c_k)$ are the same for all classes, the decision rule can be defined on the class-conditional probability $p(x \mid c_k)$ alone.

Why is that? First, recall Equation 3.2 and Eq. 3.3

<img src="figures/lecture-17/joint-probability.png" width="600" />










<img src="figures/lecture-17/eq-3.6.png" width="600" />





Given a sample $x$, the evidence $p(x)$ (given by Eq. 3.3) does not depend on the class $c_k$. It is a constant factor that scales the posterior probabilities $P(c_k \mid x)$ so that they lie in the range $[0, 1]$ and sum to one.

<img src="figures/lecture-17/eq-3.12.png" width="600" />





The scalar $\alpha$, which takes the same value for every class when the priors are equal, is:
$$
\alpha = \frac{P(c_k)}{p(x)}
$$

From Equation 3.12, we can see that the probability of a class given an observation $x$ is proportional to the sample's class-conditional probability or likelihood i.e. $p(x\mid c_k)$.

When the priors are equal, classifying a new sample by its highest conditional probability (using Bayes' Decision Rule) is therefore the same as classifying it by its maximum likelihood. This process is referred to as **Maximum Likelihood Classification**.

<div class="summary">
In summary, Maximum Likelihood is a method to classify a sample based on the sample's likelihood $p(x \mid c_k)$, which can be used when all classes have equal prior probability i.e. $P(c_k) = 1/K$.
</div>
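
The equivalence can be checked on a small example. This is a sketch under assumed Gaussian class-conditional models (the means, standard deviations and test points are made up): with equal priors, the arg max of the likelihood and the arg max of the posterior pick the same class for every sample.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution (broadcasts over arrays)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed class-conditional models for K = 2 classes with equal priors
means = np.array([30.0, 50.0])
stds = np.array([5.0, 5.0])
priors = np.array([0.5, 0.5])

x = np.array([28.0, 39.0, 41.0, 55.0])

# likelihoods[n, k] = p(x_n | c_k)
likelihoods = gaussian_pdf(x[:, None], means[None, :], stds[None, :])

# posteriors[n, k] = P(c_k | x_n), normalised by the evidence p(x_n)
posteriors = likelihoods * priors
posteriors /= posteriors.sum(axis=1, keepdims=True)

ml_labels = likelihoods.argmax(axis=1)    # maximum likelihood
map_labels = posteriors.argmax(axis=1)    # Bayes' decision rule
print(ml_labels, map_labels)
```

Changing `priors` to unequal values can make the two label arrays differ for samples near the boundary, which is why maximum likelihood classification is only justified when the priors are equal.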

---
## Multi-variate Bayes' formula

So far we have discussed the case where the decision function is a function of only one continuous variable $x$.

Let us now consider the general case where each observation consists of several measurements stored as a vector $\mathbf{x}$. Then Bayes' formula is

<img src="figures/lecture-17/bayes-formula-multi-d.png" width="600" />
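
The vector case works the same way as the scalar case once a density over $\mathbf{x}$ is available. The sketch below assumes multivariate Gaussian class-conditional densities in two dimensions; the model, the helper `mvn_pdf` and all parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density evaluated at a single vector x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.size
    diff = x - mean
    # Squared Mahalanobis distance: diff^T cov^{-1} diff
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

# Hypothetical two-class model in 2D with equal priors
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = np.array([0.5, 0.5])

x = np.array([0.5, 0.5])
# joint[k] = p(x | c_k) P(c_k); dividing by the sum applies the evidence p(x)
joint = np.array([mvn_pdf(x, m, c) * p for m, c, p in zip(means, covs, priors)])
posteriors = joint / joint.sum()
print(posteriors)
```

The observation lies closer to the first mean, so the first class receives the larger posterior.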





---
## Risk-based decision functions

Some classification errors are more important or have more impact than others. Suppose we want to classify whether a tumor is malignant or benign. The impact of misclassifying a malignant tumor is much higher than that of misclassifying a benign one. If our classifier determines that a tumor is benign when in fact it is not, the patient may die because of a lack of treatment. However, misclassifying a benign tumor as malignant does not risk the patient's life. In situations like this, we want to consider the risk or the loss involved in taking some action $\alpha_i$ based on the probability-based classification.

Suppose that we observe a particular $\mathbf{x}$ and that we contemplate taking action $\alpha_i$. If the true class is different, say $c_j$, then by definition we will incur some loss. We define the loss function $\lambda(\alpha_i \mid c_j)$ as the loss incurred by taking the action $\alpha_i$ given that the correct class is $c_j$.

<div class="insight">
The loss function is a $K\times K$ matrix given by the user.
</div>

Now, we can define the **risk** or the **expected loss** of taking action $\alpha_i$ given the observation $\mathbf{x}$  as follows:

<img src="figures/lecture-17/conditional-risk.png" width="600" />





Therefore, given an observation $\mathbf{x}$, the best action is the one minimizing the risk. This means that we calculate the risk $R(\alpha_i \mid \mathbf{x})$ for each action $\alpha_i$ and take the action $\alpha_l$ for which $R(\alpha_l \mid \mathbf{x})$ is lowest, i.e. the action with the smallest risk.
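
With the loss function stored as a matrix, the risk of every action is a single matrix-vector product, assuming the usual definition $R(\alpha_i \mid \mathbf{x}) = \sum_j \lambda(\alpha_i \mid c_j) P(c_j \mid \mathbf{x})$. The loss values and posteriors below are hypothetical numbers for the tumor example.

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = loss of action i when the true class is j.
# Classes and actions: 0 = benign, 1 = malignant.
loss = np.array([[0.0, 10.0],    # decide benign; costly if truly malignant
                 [1.0,  0.0]])   # decide malignant; mild cost if truly benign

# Assumed posteriors P(benign | x), P(malignant | x) for one observation
posteriors = np.array([0.8, 0.2])

# risk[i] = sum_j loss[i, j] * P(c_j | x)
risk = loss @ posteriors
best_action = int(np.argmin(risk))
print(risk, best_action)
```

Although the benign class has the higher posterior, deciding "malignant" has the lower risk here, because missing a malignant tumor is penalised ten times more heavily than a false alarm.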


The total risk of a decision function is given by:

$$
\sum_{\mathbf{x}} p(\mathbf{x}) R(\alpha(\mathbf{x}) \mid \mathbf{x}) 
$$

A decision function is optimal if it minimises the total risk. The minimum overall risk is called **Bayes risk** and it corresponds to the best performance that can be achieved.

---
### Two-class Classification

Let us consider these results when applied to the special case of two-class classification problems. Suppose the action $\alpha_1$ corresponds to deciding that the correct class is $c_1$ and the action $\alpha_2$ corresponds to
deciding that the correct class is $c_2$.

<img src="figures/lecture-17/eq-3.16.png" width="600" />





<img src="figures/lecture-17/likelihood-ratio.png" width="600" />
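
The two-class rule can be sketched in its usual likelihood-ratio form: decide $c_1$ if $\frac{p(\mathbf{x} \mid c_1)}{p(\mathbf{x} \mid c_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(c_2)}{P(c_1)}$, where $\lambda_{ij} = \lambda(\alpha_i \mid c_j)$ and $\lambda_{21} > \lambda_{11}$ is assumed. The function name and all numbers below are hypothetical, and the notation may differ slightly from the figures above.

```python
def decide_by_likelihood_ratio(lik_1, lik_2, prior_1, prior_2, loss):
    """Return 1 or 2.

    loss[i][j] is the loss of deciding class i+1 when the true class is j+1.
    Assumes loss[1][0] > loss[0][0], i.e. an error on c1 costs more than a hit.
    """
    threshold = ((loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0])) \
        * (prior_2 / prior_1)
    return 1 if lik_1 / lik_2 > threshold else 2

# Zero-one loss with equal priors: the threshold is 1, so the larger likelihood wins
zero_one = decide_by_likelihood_ratio(0.3, 0.2, 0.5, 0.5, [[0, 1], [1, 0]])

# Wrongly deciding c1 costs 10: the ratio 1.5 no longer clears the threshold of 10
costly = decide_by_likelihood_ratio(0.3, 0.2, 0.5, 0.5, [[0, 10], [1, 0]])
print(zero_one, costly)
```

The threshold depends only on the losses and the priors, not on the observation, so it can be computed once and reused for every sample.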



