# Probabilistic Inference

## Probability Theory
### Definitions

- $p(\mathbf{x})$ is the *marginal probability* of $\mathbf{x}$
- $p(\mathbf{x}, \mathbf{y})$ is the *joint* probability of $\mathbf{x}$ and $\mathbf{y}$
- $p(\mathbf{x}|\mathbf{y})$ is the *conditional* probability of $\mathbf{x}$ given $\mathbf{y}$.

### Rules of Probability
- $0 \leq p(\mathbf{x}) \leq 1$
- Probabilities must sum to 1: $\sum_{\mathbf{x}}p(\mathbf{x}) = 1$
- Product rule: $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}|\mathbf{y})p(\mathbf{y}) = p(\mathbf{y}|\mathbf{x})p(\mathbf{x})$
- Sum rule: $p(\mathbf{x}) = \sum_{\mathbf{y}}p(\mathbf{x},\mathbf{y})$

Bayes' rule is derived from the product rule:

$$
p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})p(\mathbf{x})}{p(\mathbf{y})}
$$
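As a small numerical illustration of these rules, the sketch below applies the sum rule, product rule, and Bayes' rule to a discrete joint distribution. The 2x2 table `joint` is made up for illustration and is not taken from any dataset.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)   # sum rule: p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)   # sum rule: p(y) = sum_x p(x, y)

# Product rule rearranged: p(y|x) = p(x, y) / p(x)
p_y_given_x = joint / p_x[:, None]

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print(p_x_given_y)
```

Each column of `p_x_given_y` is a distribution over $\mathbf{x}$ for a fixed $\mathbf{y}$, so each column sums to one.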

Suppose we have observed data $\mathcal{D} = \{x_1,\ldots,x_N\}$ generated from a model with parameters $\mathbf{w}$. We can capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p(\mathbf{w})$. The effect of the observed data is expressed through the conditional probability $p(\mathcal{D}|\mathbf{w})$.

Bayes' theorem, which takes the form 

$$
p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})p(\mathbf{w})}{p(\mathcal{D})}
$$

then allows us to evaluate the uncertainty in $\mathbf{w}$ *after* we have observed $\mathcal{D}$ in the form of the posterior probability $p(\mathbf{w}|\mathcal{D})$.

The quantity $p(\mathcal{D}|\mathbf{w})$ on the right-hand side of Bayes' theorem is evaluated for the observed dataset $\mathcal{D}$ and can be viewed as a function of the parameter vector $\mathbf{w}$, in which case it is called the *likelihood function*. It expresses how probable the observed dataset is for different settings of the parameter vector $\mathbf{w}$. Note that the likelihood is not a probability distribution over $\mathbf{w}$, and its integral with respect to $\mathbf{w}$ does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes' theorem in words as

$$
\mathrm{posterior}\propto\mathrm{likelihood}\times\mathrm{prior}
$$

where all of these quantities are viewed as functions of $\mathbf{w}$. The denominator in Bayes' theorem is the normalisation constant that ensures the posterior integrates to one. It can be expressed in terms of the prior distribution and the likelihood function:

$$
p(\mathcal{D}) = \int p(\mathcal{D}|\mathbf{w})p(\mathbf{w})d\mathbf{w}
$$
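To make the roles of prior, likelihood, and evidence concrete, here is a minimal sketch that evaluates Bayes' theorem on a grid for a single scalar parameter. It assumes a coin-flip (Bernoulli) model with a uniform prior; the data (7 heads in 10 flips) and the grid resolution are illustrative choices only.

```python
import numpy as np

# Hypothetical data: 7 heads in 10 coin flips; w is the probability of heads.
heads, flips = 7, 10

w = np.linspace(0.0, 1.0, 1001)     # grid over the parameter
dw = w[1] - w[0]

prior = np.ones_like(w)                              # uniform prior density on [0, 1]
likelihood = w**heads * (1 - w)**(flips - heads)     # p(D|w) viewed as a function of w

# Riemann-sum approximation of p(D) = integral of p(D|w) p(w) dw
evidence = np.sum(likelihood * prior) * dw

posterior = likelihood * prior / evidence            # Bayes' theorem on the grid
w_map = w[np.argmax(posterior)]                      # most probable value of w

print(evidence, w_map)
```

The posterior integrates to one over the grid by construction, whereas the likelihood does not: with a uniform prior its integral is exactly the evidence, which here is far smaller than one.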

## Probabilistic (Supervised) Learning

We have a dataset consisting of input/output pairs:

$$
\begin{array}{lll}
\mathcal{D} & = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n & \\
\mathbf{X} & = [\mathbf{x}_1,\ldots,\mathbf{x}_n]^T & \\
\mathbf{y} & = [y_1,\ldots,y_n]^T & \mathrm{binary/regression} \\
\mathbf{Y} & = [\mathbf{y}_1^T,\ldots,\mathbf{y}^T_n] & \mathrm{multiclass}
\end{array}
$$

$$
\begin{array}{lll}
\mathbf{w} &= [\mathbf{w}_1,\ldots,\mathbf{w}_C]^T & \mathrm{parameters\, (weights)} \\
\sigma & = [\sigma_1,\ldots,\sigma_q]^T & \mathrm{likelihood\, hyperparameters} \\
\theta & = [\theta_1,\ldots,\theta_p]^T & \mathrm{prior\, hyperparameters}
\end{array}
$$

To define a probabilistic model, we start by choosing the **likelihood** function, which describes how the data were produced:

$$
p(\mathrm{data}|\mathrm{parameters}) = p(\mathbf{y}|\mathbf{w},\mathbf{X}, \sigma)
$$

There are many possible choices, depending on our problem, e.g. if we are performing regression or classification.

We also specify our **prior** beliefs about the weight vector

$$
p(\mathrm{parameters}|\mathrm{model}) = p(\mathbf{w}|\theta)
$$

You can think of this as similar to regularization in non-probabilistic approaches.

Inference then amounts to computing the posterior distribution using Bayes' rule:

<img src="prob_inference.png">

* The **posterior** gives a distribution for the weight vector $\mathbf{w}$ given the data. We can then use this to make predictions.
* The marginal likelihood enables us to perform model selection and choose the optimal values for the hyperparameters $\theta$, $\sigma$.

## Model Selection
The marginal likelihood (evidence) plays an important role in probabilistic modelling

$$
p(\mathbf{y}|\mathbf{X},\theta,\sigma) = \int p(\mathbf{y}|\mathbf{X},\mathbf{w}, \sigma)p(\mathbf{w}|\theta)d\mathbf{w}
$$

It embodies a tradeoff between data fit and model complexity and can be used for
* Deciding which of several competing models is most probable
* Automatic optimisation of hyperparameters $\theta, \sigma$ by **evidence maximisation**
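Evidence maximisation can be sketched for one case where the integral has a closed form. The example assumes a linear model with a Gaussian likelihood $\mathbf{y} \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$ and a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \theta^2\mathbf{I})$, so that integrating out $\mathbf{w}$ gives $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \theta^2\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I})$. The function name `log_evidence`, the synthetic data, and the candidate grid `sigmas` are all illustrative assumptions.

```python
import numpy as np

def log_evidence(X, y, theta, sigma):
    """log p(y | X, theta, sigma) for w ~ N(0, theta^2 I) and y ~ N(Xw, sigma^2 I)."""
    n = len(y)
    # Covariance of y after integrating out w
    K = theta**2 * X @ X.T + sigma**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

# Synthetic data generated with noise standard deviation 0.3 (an assumption of this sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)

# Evidence maximisation over a small grid of candidate noise levels
sigmas = [0.05, 0.1, 0.3, 1.0, 3.0]
scores = [log_evidence(X, y, theta=1.0, sigma=s) for s in sigmas]
best_sigma = sigmas[int(np.argmax(scores))]

print(best_sigma)
```

A grid search is used only to keep the sketch short; in practice the log evidence is usually optimised with a gradient-based method over all hyperparameters jointly.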

## Decision Theory

In probabilistic models, we commonly divide the learning process into two phases:
1. **Inference**: computing the posterior distributions
2. **Decision**: making a prediction / decision based on the posterior

* Decision theory concerns the second step (e.g. given the class probabilities, should we choose treatment $A$ or $B$?)
* This framework is highly flexible: e.g. we can accommodate asymmetric misclassification costs where a false negative may be more costly than a false positive (medical applications)
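The decision step with asymmetric costs can be sketched as choosing the action with the lowest expected loss under the class posterior. The loss matrix below, in which a false negative is ten times as costly as a false positive, is a hypothetical choice for illustration.

```python
import numpy as np

# Hypothetical loss matrix: rows are the true class, columns are the decision.
# Class 0 = healthy, class 1 = diseased.
loss = np.array([[0.0, 1.0],     # truth healthy:  correct, false positive
                 [10.0, 0.0]])   # truth diseased: false negative, correct

def decide(p_diseased):
    """Pick the decision with the lowest expected loss under the posterior."""
    posterior = np.array([1.0 - p_diseased, p_diseased])
    expected_loss = posterior @ loss   # expected loss of each possible decision
    return int(np.argmin(expected_loss))

print(decide(0.2))   # 1: the costly false negative makes "diseased" the safer call
print(decide(0.05))  # 0
```

With these costs the decision switches at a posterior probability of $1/11$ rather than $1/2$. Changing the costs changes the threshold without any need to redo inference.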

In contrast, many approaches combine these two phases and learn a function that directly maps inputs $\mathbf{x}$ onto class labels ($y$). This is called a *discriminant function* approach (e.g. SVM).

### Loss function and Risk
We can formalize the measurement of model performance using some "loss function" $\mathcal{L}(y, f(\mathbf{x}))$. There are many different loss functions for classification (e.g. classification error) and regression (e.g. MSE). 

The expected loss of $f$ under the true data distribution, which measures how well it generalises, is its "risk":

$$
\mathcal{R}[f] = \int \mathcal{L}(y, f(\mathbf{x}))p(y,\mathbf{x})dyd\mathbf{x}
$$

However, we usually don't know $p(y,\mathbf{x})$, so we approximate this by the "empirical risk", defined over the training set

$$
\mathcal{R}_{emp}[f] = \frac{1}{n}\sum_{i=1}^n\mathcal{L}(y_i, f(\mathbf{x}_i))
$$
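The empirical risk is simply the average per-sample loss over the training set. A minimal sketch for the two losses mentioned above follows; the helper names and the toy targets and predictions are hypothetical.

```python
import numpy as np

def empirical_risk(loss, y, y_pred):
    """Average of a per-sample loss over the n training pairs."""
    return float(np.mean(loss(y, y_pred)))

def squared_error(y, y_pred):
    """Per-sample squared error (its average is the MSE)."""
    return (y - y_pred) ** 2

def zero_one(y, y_pred):
    """Per-sample classification error: 1 if wrong, 0 if correct."""
    return (y != y_pred).astype(float)

# Regression: toy targets and predictions
mse = empirical_risk(squared_error,
                     np.array([1.0, 2.0, 3.0]),
                     np.array([1.0, 2.5, 2.0]))

# Classification: toy labels and predicted labels
error_rate = empirical_risk(zero_one,
                            np.array([0, 1, 1, 0]),
                            np.array([0, 1, 0, 0]))

print(mse, error_rate)
```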

### Minimising the empirical risk
Consider a linear model that aims to predict the output ($y$) using a weighted combination of the inputs $\mathbf{x}$

$$
f(\mathbf{x},\mathbf{w}) = \mathbf{x}^T\mathbf{w} + b
$$

To estimate the weights, we seek to minimise the empirical risk, which is penalised to restrict model flexibility

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}}\sum_{i=1}^n\mathcal{L}(y_i,f(\mathbf{x}_i,\mathbf{w}))+\lambda J(\mathbf{w})
$$

Many algorithms (e.g. SVM, Lasso, Ridge regression) are particular choices of $\mathcal{L}()$ and $J()$.
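As one concrete instance, ridge regression corresponds to a squared-error loss with $J(\mathbf{w}) = \|\mathbf{w}\|^2$, for which the minimiser has a closed form. The sketch below makes some simplifying assumptions: the bias $b$ is omitted, the data are synthetic, and the function names are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise sum_i (y_i - x_i^T w)^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def penalised_risk(w, X, y, lam):
    """Summed squared-error loss plus the ridge penalty J(w) = ||w||^2."""
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Synthetic regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=40)

w_small = ridge_fit(X, y, lam=0.1)     # weak penalty: flexible fit
w_large = ridge_fit(X, y, lam=100.0)   # strong penalty: weights shrunk towards zero

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Increasing $\lambda$ shrinks the weights, which is how the penalty restricts model flexibility.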

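Probabilistic models can be viewed from a similar perspective: we want to maximise the log posterior (equivalently, minimise its negative), in which the negative log likelihood plays the role of the loss and the negative log prior plays the role of the penalty

$$
\log p(\mathbf{w}|\mathbf{y}, \mathbf{X},\theta,\sigma) = \sum_{i=1}^n\log p(y_i|\mathbf{w},\mathbf{x}_i,\sigma) + \log p(\mathbf{w}|\theta) + \mathrm{const}
$$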
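This correspondence can be checked numerically in a simple case. Assuming a Gaussian likelihood with noise standard deviation $\sigma$ and a zero-mean Gaussian prior with standard deviation $\theta$ (choices made for this sketch, with no bias term), the negative log posterior is a ridge objective with $\lambda = \sigma^2/\theta^2$, so the most probable weights are given by the ridge formula.

```python
import numpy as np

def neg_log_posterior(w, X, y, sigma, theta):
    """-log p(w | y, X, theta, sigma), up to an additive constant."""
    log_lik = -0.5 * np.sum((y - X @ w) ** 2) / sigma**2   # Gaussian likelihood terms
    log_prior = -0.5 * np.sum(w ** 2) / theta**2           # Gaussian prior term
    return -(log_lik + log_prior)

def map_estimate(X, y, sigma, theta):
    """Posterior mode, computed as a ridge solution with lam = sigma^2 / theta^2."""
    lam = sigma**2 / theta**2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ rng.normal(size=4) + 0.5 * rng.normal(size=30)

w_map = map_estimate(X, y, sigma=0.5, theta=2.0)

# Random perturbations of w_map should never lower the negative log posterior
best = neg_log_posterior(w_map, X, y, 0.5, 2.0)
perturbed = [neg_log_posterior(w_map + 0.1 * rng.normal(size=4), X, y, 0.5, 2.0)
             for _ in range(20)]
print(best <= min(perturbed))
```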

## Probabilistic classification and regression
* The discriminant function approach is appealing and is often very efficient
* However, separating inference and decision also provides benefits, especially for classification

### Advantages of probabilistic classification
- Minimising risk (e.g. misclassification costs may change)
- Compensate for class priors (accommodate disease prevalence)
- Reject option (only make a decision if sufficiently confident)
- Combining classifiers
- Easily interpretable (predictive confidence)
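Compensating for class priors can be sketched as re-weighting the predicted class probabilities by the ratio of the new to the old class frequencies and renormalising. This assumes that only the class frequencies differ between training and deployment, not how each class looks. The function name and all numbers below are hypothetical.

```python
import numpy as np

def adjust_for_prior(posterior, train_prior, target_prior):
    """Re-weight class posteriors when deployment class frequencies differ from training."""
    unnormalised = posterior * target_prior / train_prior
    return unnormalised / unnormalised.sum(axis=-1, keepdims=True)

# Classifier trained on a balanced sample, deployed where prevalence is 1%
posterior = np.array([0.3, 0.7])        # [healthy, diseased] from the classifier
train_prior = np.array([0.5, 0.5])
target_prior = np.array([0.99, 0.01])

adjusted = adjust_for_prior(posterior, train_prior, target_prior)
print(adjusted)
```

A prediction that looked like a likely positive under balanced classes becomes a probable negative once the low prevalence is taken into account.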

Coherent handling of uncertainty is especially important in medicine.

### Sources of uncertainty in clinical applications
- Diagnostic uncertainty (class labels may be noisy)
- Heterogeneity in disease severity and course
- Individual variability in response to treatment

In such applications, predictive confidence is potentially highly informative about individual variability.

$p(y|\mathbf{x}) = 0.55$: ambiguous, $p(y|\mathbf{x})=0.99$: confident.
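Predictive confidence of this kind is what makes a reject option possible. In the minimal sketch below, a decision is made only when the most probable class exceeds a confidence threshold; the threshold of 0.9 and the use of `-1` as the "no decision" code are arbitrary choices for illustration.

```python
import numpy as np

def predict_with_reject(probs, threshold=0.9):
    """Return the most probable class, or -1 when its probability is below the threshold."""
    probs = np.asarray(probs)
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, labels, -1)

# One ambiguous and one confident prediction for a two-class problem
probs = np.array([[0.45, 0.55],
                  [0.01, 0.99]])

print(predict_with_reject(probs))  # [-1  1]
```

Rejected cases can then be referred for further tests or expert review instead of being classified automatically.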