# Modelling Approximation Uncertainty

Bayesian inference can be seen as the main representative of probabilistic methods and provides a coherent framework for statistical reasoning that is well-established in machine learning (and beyond). Version space learning can be seen as a “logical” (and in a sense simplified) counterpart of Bayesian inference, in which hypotheses and predictions are not assessed numerically in terms of probabilities, but only qualified (deterministically) as being possible or impossible. In spite of its limited practical usefulness, version space learning is interesting for various reasons. In particular, in light of our discussion about uncertainty, it constitutes an interesting case: By construction, version space learning is free of aleatoric uncertainty, i.e., all uncertainty is epistemic.

## Version Space Learning

In the idealized setting of version space learning, we assume a deterministic dependency $f^*:\, \mathcal{X} \longrightarrow \mathcal{Y}$, i.e., the conditional distribution of the outcome $y$ given a query instance $x_{q}$,

$$
P( y \mid x_{q}) = \frac{P(x_{q} , y)}{P(x_q)}
$$(ccp)

degenerates to

$$
P( y \mid x_{q}) = \left\{ \begin{array}{ll}
1 & \text{ if } y = f^*(x_{q}) \\
0 & \text{ if } y \neq f^*(x_{q}) \\
\end{array} \right.
$$(ccpvs)


Moreover, the training data 

$$
\mathcal{D} := \big\{ (x_1 , y_1 ), \ldots , (x_N , y_N ) \big\} \subset \mathcal{X} \times \mathcal{Y}
$$(vsdata)

is free of noise. Correspondingly, we also assume that classifiers produce deterministic predictions $h(x) \in \{ 0, 1 \}$, i.e., probabilities of exactly 0 or 1. Finally, we assume that $f^* \in \mathcal{H}$, and therefore $h^* = f^*$ (which means there is no model uncertainty).


Under these assumptions, a hypothesis $h \in \mathcal{H}$ can be eliminated as a candidate as soon as it makes at least one mistake on the training data: in that case, the risk of $h$ is necessarily higher than the risk of $h^*$ (which is 0). The idea of the candidate elimination algorithm ({cite:t}`mitc_vs77`) is to maintain the *version space* $\mathcal{V} \subseteq \mathcal{H}$ that consists of the set of all hypotheses consistent with the data seen so far:

$$
\mathcal{V} = \mathcal{V}(\mathcal{H} , \mathcal{D}) = \{ h \in \mathcal{H} \mid h(x_i) = y_i \text{ for } i = 1, \ldots , N \}
$$(vsdef)

Since every additional example can only eliminate hypotheses, never reinstate them, the version space shrinks (or stays the same) as the amount of training data increases, i.e., $\mathcal{V}(\mathcal{H} , \mathcal{D}') \subseteq \mathcal{V}(\mathcal{H} , \mathcal{D})$ for $\mathcal{D} \subseteq \mathcal{D}'$.
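The definition of the version space and its monotone shrinking can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the hypothesis space consists of integer threshold classifiers on one-dimensional inputs, and the data points are made up. The version space is computed by brute-force filtering, which is not the boundary-set representation used by the candidate elimination algorithm.

```python
from typing import List, Tuple

Example = Tuple[float, int]

def predict(t: int, x: float) -> int:
    """Hypothetical threshold hypothesis h_t: predict 1 iff x >= t."""
    return int(x >= t)

def version_space(hypotheses: List[int], data: List[Example]) -> List[int]:
    """Keep every hypothesis that makes no mistake on the data."""
    return [t for t in hypotheses
            if all(predict(t, x) == y for x, y in data)]

# Assumed finite hypothesis space: integer thresholds 0, 1, ..., 10.
H = list(range(11))

# Noise-free toy data, and a superset of it.
D = [(2.0, 0), (7.5, 1)]
D_more = D + [(4.2, 0)]

V = version_space(H, D)            # thresholds 3..7 survive
V_more = version_space(H, D_more)  # thresholds 5..7 survive

# More data can only remove hypotheses: V(H, D') is a subset of V(H, D).
print(V, V_more, set(V_more) <= set(V))
```

Brute-force filtering is only feasible for small, finite hypothesis spaces; it is used here because it mirrors the set definition directly.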


If a prediction $\hat{y}_{q}$ for a query instance $x_{q}$ is sought, this query is submitted to all members $h \in \mathcal{V}$ of the version space. A unique prediction can only be made if all members agree on the outcome for $x_{q}$. Otherwise, several outcomes $y \in \mathcal{Y}$ may still appear possible. Formally, mimicking the logical conjunction with the minimum operator and the existential quantification with a maximum, we can express the degree of possibility or plausibility of an outcome $y \in \mathcal{Y}$ as follows ($[\cdot]$ denotes the indicator function):

$$
\pi(y) := \max_{h \in \mathcal{H}} \min \left( [h \in \mathcal{V}], [h(x_q) = y] \right)
$$(ee1)

Thus, $\pi(y)=1$ if there exists a candidate hypothesis $h \in \mathcal{V}$ such that $h(x_{q}) = y$, and $\pi(y)=0$ otherwise. In other words, the prediction produced in version space learning is a subset

$$
Y = Y(x_q) := \{ h(x_q) \mid h \in \mathcal{V} \} = \{ y \mid \pi(y) = 1 \} \subseteq \mathcal{Y}
$$(vss)

See {numref}`fig:vs` for an illustration.
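The possibility degree $\pi(y)$ and the prediction set $Y(x_q)$ can be computed directly for a small finite hypothesis space. The following sketch assumes, purely for illustration, integer threshold classifiers and made-up training data; the function names are hypothetical.

```python
from typing import List, Set, Tuple

Example = Tuple[float, int]

def predict(t: int, x: float) -> int:
    """Hypothetical threshold hypothesis h_t: predict 1 iff x >= t."""
    return int(x >= t)

def version_space(hypotheses: List[int], data: List[Example]) -> List[int]:
    """All hypotheses consistent with every training example."""
    return [t for t in hypotheses
            if all(predict(t, x) == y for x, y in data)]

def possibility(y: int, hypotheses: List[int], V: List[int], x_q: float) -> int:
    """pi(y) = max over h in H of min([h in V], [h(x_q) = y])."""
    members = set(V)
    return max(min(int(t in members), int(predict(t, x_q) == y))
               for t in hypotheses)

def prediction_set(V: List[int], x_q: float) -> Set[int]:
    """Y(x_q) = { h(x_q) : h in V }."""
    return {predict(t, x_q) for t in V}

H = list(range(11))                # assumed hypothesis space
D = [(2.0, 0), (7.5, 1)]           # assumed noise-free data
V = version_space(H, D)            # thresholds 3..7

for x_q in (1.0, 5.5, 9.0):
    Y = prediction_set(V, x_q)
    # The set of outcomes with possibility 1 coincides with Y(x_q).
    assert Y == {y for y in (0, 1) if possibility(y, H, V, x_q) == 1}
    print(x_q, sorted(Y))
```

In this toy setting, the query $5.5$ lies between the two training points, so surviving hypotheses disagree and both outcomes remain possible, whereas the queries $1.0$ and $9.0$ receive a unique prediction.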


Note that the inference {eq}`ee1` can be seen as a kind of constraint propagation, in which the constraint $h \in \mathcal{V}$ on the hypothesis space $\mathcal{H}$ is propagated to a constraint on $\mathcal{Y}$, expressed in the form of the subset $Y = Y(x_q)$ of possible outcomes; or symbolically:

$$
\mathcal{H} , \mathcal{D} , x_{q}  \models Y
$$(vsmodels)

This view highlights the interaction between prior knowledge and data: what can be said about the possible outcomes $y_{q}$ depends not only on the data $\mathcal{D}$ but also on the hypothesis space $\mathcal{H}$, i.e., the *model assumptions* the learner starts with. The specification of $\mathcal{H}$ always comes with an *inductive bias*, which is indeed essential for learning from data ({cite:t}`mitc_tn80`). In general, both aleatoric and epistemic uncertainty (ignorance) depend on the way in which prior knowledge and data interact with each other. Roughly speaking, the stronger the knowledge the learning process starts with, the less data is needed to resolve uncertainty. In the extreme case, the true model is already known, and data is completely superfluous. Normally, however, prior knowledge is specified by assuming a certain type of model, for example a linear relationship between inputs $x$ and outputs $y$. Then, all else (namely the data) being equal, the degree of predictive uncertainty depends on how flexible the corresponding model class is. Informally speaking, the more restrictive the model assumptions are, the smaller the uncertainty will be. This is illustrated in {numref}`fig:vsli` for the case of binary classification.
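The effect of the model assumptions on predictive uncertainty can be made concrete with a small sketch. It compares two hypothetical hypothesis spaces on the same made-up data: a restrictive one (threshold classifiers) and a more flexible one that additionally contains interval classifiers. Both classes and all values are assumptions chosen for illustration.

```python
from typing import Callable, List, Set, Tuple

Hypothesis = Callable[[float], int]
Example = Tuple[float, int]

def threshold(t: int) -> Hypothesis:
    """Predict 1 iff x >= t."""
    return lambda x: int(x >= t)

def interval(a: int, b: int) -> Hypothesis:
    """Predict 1 iff a <= x <= b."""
    return lambda x: int(a <= x <= b)

def version_space(hypotheses: List[Hypothesis],
                  data: List[Example]) -> List[Hypothesis]:
    """All hypotheses consistent with every training example."""
    return [h for h in hypotheses if all(h(x) == y for x, y in data)]

def prediction_set(V: List[Hypothesis], x_q: float) -> Set[int]:
    """Y(x_q) = { h(x_q) : h in V }."""
    return {h(x_q) for h in V}

# Restrictive assumptions: thresholds only.
H_small = [threshold(t) for t in range(11)]
# Weaker assumptions: thresholds plus all integer intervals in [0, 10].
H_large = H_small + [interval(a, b)
                     for a in range(11) for b in range(a, 11)]

D = [(2.0, 0), (7.5, 1)]   # same data for both learners
x_q = 9.0

Y_small = prediction_set(version_space(H_small, D), x_q)
Y_large = prediction_set(version_space(H_large, D), x_q)
print(sorted(Y_small), sorted(Y_large))
```

Because `H_small` is contained in `H_large`, the prediction set under the restrictive assumptions is always a subset of the one under the flexible assumptions. Here the threshold learner commits to a single outcome for the query, while the richer class still admits an interval that ends before the query point and therefore leaves both outcomes possible.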


Coming back to our discussion about uncertainty, it is clear that version space learning as outlined above does not involve any kind of aleatoric uncertainty. Instead, the only source of uncertainty is a lack of knowledge about $h^*$, which is epistemic in nature. On the model level, the amount of uncertainty corresponds directly to the size of the version space $\mathcal{V}$ and decreases (or at least does not increase) as the sample size grows. Likewise, the predictive uncertainty could be measured in terms of the size of the set $Y(x_q)$ of candidate outcomes. This uncertainty may differ from instance to instance, or, stated differently, approximation uncertainty may translate into prediction uncertainty in different ways.
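As a rough illustration of these two size-based measures, the sketch below tracks $|\mathcal{V}|$ as examples arrive one by one and then reports $|Y(x_q)|$ for several queries. The threshold hypothesis space, the data stream and the query points are all assumptions made for the example; using plain cardinality as the measure is just one simple choice.

```python
from typing import List, Set, Tuple

Example = Tuple[float, int]

def predict(t: int, x: float) -> int:
    """Hypothetical threshold hypothesis h_t: predict 1 iff x >= t."""
    return int(x >= t)

def version_space(hypotheses: List[int], data: List[Example]) -> List[int]:
    """All hypotheses consistent with every training example."""
    return [t for t in hypotheses
            if all(predict(t, x) == y for x, y in data)]

def prediction_set(V: List[int], x_q: float) -> Set[int]:
    """Y(x_q) = { h(x_q) : h in V }."""
    return {predict(t, x_q) for t in V}

H = list(range(11))                                  # assumed hypothesis space
stream = [(2.0, 0), (7.5, 1), (4.2, 0), (6.1, 1)]    # assumed noise-free data

# Model-level uncertainty: size of the version space after n examples.
sizes = [len(version_space(H, stream[:n])) for n in range(len(stream) + 1)]
print(sizes)

# Prediction-level uncertainty: size of Y(x_q) for a fixed version space.
V = version_space(H, stream[:2])
set_sizes = {x_q: len(prediction_set(V, x_q)) for x_q in (1.0, 5.5, 9.0)}
print(set_sizes)
```

The same version space yields a singleton prediction set for some queries and a two-element set for others, which is the instance dependence described above.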


In version space learning, uncertainty is represented in a purely set-based manner: the version space $\mathcal{V}$ and prediction set $Y(x_q)$ are subsets of $\mathcal{H}$ and $\mathcal{Y}$, respectively. In other words, hypotheses $h \in \mathcal{H}$ and outcomes $y \in \mathcal{Y}$ are only qualified in terms of being possible or not. In the following, we discuss the Bayesian approach, in which hypotheses and predictions are qualified more gradually in terms of probabilities. 

## Bayesian Inference