# Bayesian Learning

## Probability Basics

**Definition 1**
A random experiment or random trial is a procedure that, at least theoretically, can be repeated infinitely many times. It is characterized as follows:

1. Configuration (A precisely specified system that can be reconstructed)
2. Procedure (An instruction on how to execute the experiment, based on the config)
3. Unpredictability of the outcome

A set $ \Omega = \{\omega_1, \omega_2, \dots, \omega_n\} $ is called the *sample space* of a random experiment if each experiment outcome is associated with at most one element $ \omega \in \Omega $. The elements in $ \Omega $ are called *outcomes*.

Let $ \Omega $ be a finite sample space. Each subset $ A \subseteq \Omega $ is called an event, which occurs iff the experiment outcome $ \omega $ is a member of $ A $. The set of all events, $ \mathcal{P}(\Omega) $, is called the event space or $ \sigma $-algebra.

Ex:
* Experiment: Rolling a die
* Sample Space: $ \Omega = \{1, 2, 3, 4, 5, 6\} $
* Some Event: $ A = \{2, 4, 6\} $

* Experiment: Rolling two dice at the same time
* Sample Space: $ \Omega = \{\{1, 1\}, \{1, 2\}, \dots, \{6, 6\}\} $
* Some Event: $ B = \{\{1, 2\}\} $

* Experiment: Rolling two dice in succession
* Sample Space: $ \Omega = \{(1, 1), (1, 2), \dots, (6, 6)\} $
* Some Event: $ C = \{(1, 2), (2, 1)\} $

### How to Capture the Nature of Probability

1. Classic, symmetry-based
2. Frequentist
3. Axiomatic
4. Subjectivist, Bayesian, prognostic

**Classical/Laplace Probability**

If each elementary event $ \{\omega\}, \omega \in \Omega $ gets assigned the same probability, then the probability $ P(A) $ of an event $ A $ is defined as follows:

$ P(A) = \frac{|A|}{|\Omega|} $
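
The Laplace definition can be illustrated with a short sketch. It assumes a fair six-sided die, so that all elementary events are equally likely; the function name `laplace_probability` is chosen here for illustration only.

```python
from fractions import Fraction

def laplace_probability(event, sample_space):
    """Return |A| / |Omega| for an event A of a finite sample space Omega."""
    event, sample_space = set(event), set(sample_space)
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

omega = {1, 2, 3, 4, 5, 6}   # rolling one fair die
A = {2, 4, 6}                # event "the result is even"

print(laplace_probability(A, omega))  # 1/2
```

Using `Fraction` keeps the result exact, which mirrors the counting argument behind the formula.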

**Frequentist**

Basis is the **empirical law of large numbers**:

In a random experiment, the average of the outcomes obtained from a large number of trials is close to the expected value, and it tends to come closer as more trials are performed.
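
A small simulation can make this statement tangible. The sketch below assumes a fair die (expected value 3.5) and uses a fixed seed so that runs are repeatable; the function name and the chosen trial counts are illustrative only.

```python
import random

def running_mean_of_die_rolls(num_trials, seed=0):
    """Simulate die rolls and return the running average after each trial."""
    rng = random.Random(seed)      # fixed seed: an assumption for repeatability
    total = 0
    means = []
    for n in range(1, num_trials + 1):
        total += rng.randint(1, 6)  # one roll of a fair die
        means.append(total / n)
    return means

means = running_mean_of_die_rolls(100_000)
for n in (10, 1_000, 100_000):
    # The running average should drift towards 3.5 as n grows.
    print(n, round(means[n - 1], 3))
```

A single run only illustrates the tendency; it is not a proof, and early averages may move away from 3.5 before they settle.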

**Axiomatic**

(a) Postulate a function $ P() $ (That assigns a probability to every event in $ \mathcal{P}(\Omega) $)

(b) Specify the required properties (of $ P() $ in the form of axioms)

**Subjectivist, Bayesian, Prognostic**

Consider (prior) knowledge about the hypotheses:

$ p(h \mid D) = \frac{p(D \mid h) \cdot p(h)}{p(D)} $
* Likelihood: how well does $ h $ explain (entail, induce, invoke) the data $ D $?
* Prior: how probable is the hypothesis $ h $ a priori (in principle)?

### Axiomatic Approach to Probability

**Probability Measure**
Let $ \Omega $ be a set, called the sample space, and let $ \mathcal{P}(\Omega) $ be an event space. A function $ P: \mathcal{P}(\Omega) \rightarrow \mathbb{R} $ that maps each event $ A \in \mathcal{P}(\Omega) $ onto a real number $ P(A) $ is called a probability measure if it has the following properties:

1. $ P(A) \geq 0 $ (Axiom I)
2. $ P(\Omega) = 1 $ (Axiom II)
3. $ A \cap B = \emptyset \Rightarrow P(A \cup B) = P(A) + P(B) $ (Axiom III)

**Probability Space**
Let $ \Omega $ be a sample space, $ \mathcal{P}(\Omega) $ be an event space, and $ P: \mathcal{P}(\Omega) \rightarrow \mathbb{R} $ be a probability measure. Then the tuple $ (\Omega, P) $ as well as the triple $ (\Omega, \mathcal{P}(\Omega), P) $ is called a probability space.

The Kolmogorov Axioms also imply:
1. $ P(A) + P(\overline{A}) = 1 $
2. $ P(\emptyset) = 0 $
3. $ A \subseteq B \Rightarrow P(A) \leq P(B) $
4. $ P(A \cup B) = P(A) + P(B) - P(A \cap B) $
5. Let $ A_1, A_2, \dots, A_n $ be mutually exclusive (incompatible) events. Then the following holds:
   - $ P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n) $
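
These consequences can be checked exhaustively on a small example. The sketch below assumes the Laplace measure on a single fair die and tests properties 1 to 4 for every event in $ \mathcal{P}(\Omega) $; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    """Laplace measure on a fair die (an assumption of this sketch)."""
    return Fraction(len(event), len(omega))

def power_set(s):
    """Return all subsets of s, i.e. the event space P(Omega)."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

events = power_set(omega)

assert P(frozenset()) == 0                       # property 2
for A in events:
    assert P(A) + P(omega - A) == 1              # property 1 (complement)
    for B in events:
        if A <= B:
            assert P(A) <= P(B)                  # property 3 (monotonicity)
        # property 4 (general addition rule)
        assert P(A.union(B)) == P(A) + P(B) - P(A.intersection(B))

print(len(events), "events checked")  # 64 events checked
```

Passing the checks on one measure does not prove the implications in general; it only shows them at work on a concrete probability space.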

### Conditional Probability

Let $ (\Omega, \mathcal{P}(\Omega), P) $ be a probability space and let $ A, B \in \mathcal{P}(\Omega) $ be two events. Then the probability of the occurrence of event $ A $ given that event $ B $ is known to have occurred is defined as follows:

$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $ if $ P(B) > 0 $

This is called "probability of A under condition B".

![img1](img/topic4img1.png)
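
As a minimal sketch, the conditional probability can be computed as the ratio $ P(A \cap B) / P(B) $ on a fair die. The events "even" and "greater than 3" are assumptions chosen for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Laplace measure on a fair die (an assumption of this sketch)."""
    return Fraction(len(event), len(omega))

def conditional(A, B):
    """Return P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    if P(B) == 0:
        raise ValueError("P(B) must be positive")
    return P(A & B) / P(B)

A = {2, 4, 6}   # "the result is even"
B = {4, 5, 6}   # "the result is greater than 3"

print(P(A))               # 1/2
print(conditional(A, B))  # 2/3
```

Knowing that $ B $ occurred restricts the sample space to three outcomes, two of which are even, which raises the probability of $ A $ from 1/2 to 2/3.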

### Total Probability

Let $ (\Omega, \mathcal{P}(\Omega), P) $ be a probability space and let $ A_1, A_2, \dots, A_n $ be mutually exclusive events with $ \Omega = A_1 \cup \dots \cup A_n, P(A_i) > 0, i = 1, \dots, n $. Then for each $ B \in \mathcal{P}(\Omega) $ holds:

$ P(B) = \sum\limits_{i = 1}^{n} P(A_i) \cdot P(B \mid A_i) $

![img2](img/topic4img2.png)
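
The following sketch checks the total probability formula on a fair die. The partition of $ \Omega $ into three pairs and the event $ B $ are assumptions made for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Laplace measure on a fair die (an assumption of this sketch)."""
    return Fraction(len(event), len(omega))

# Mutually exclusive events A_1, A_2, A_3 whose union is Omega.
partition = [{1, 2}, {3, 4}, {5, 6}]
B = {1, 2, 3}

# Sum over i of P(A_i) * P(B | A_i), with P(B | A_i) = P(B and A_i) / P(A_i).
total = sum(P(A_i) * (P(B & A_i) / P(A_i)) for A_i in partition)

print(total, P(B))  # 1/2 1/2
```

Here $ P(B \mid A_1) = 1 $, $ P(B \mid A_2) = 1/2 $ and $ P(B \mid A_3) = 0 $, so the weighted sum is $ 1/3 + 1/6 + 0 = 1/2 $.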

### Independence of Events

Let $ (\Omega, \mathcal{P}(\Omega), P) $ be a probability space and let $ A, B \in \mathcal{P}(\Omega) $ be two events. Then $ A $ and $ B $ are called statistically independent iff the following holds true:

$ P(A \cap B) = P(A) \cdot P(B) $ (multiplication rule)

If $ 0 < P(B) < 1 $, the multiplication rule is equivalent to each of the following:

$ P(A \mid B) = P(A \mid \overline{B}) $

$ P(A \mid B) = P(A) $

The statistical independence of $ k $ events can also be determined by checking whether the multiplication rule holds true for all subsets of the $ k $ events.
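
The multiplication rule for two events can be tested directly. The sketch below assumes two fair dice rolled in succession; the three events are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Ordered pairs (first roll, second roll) of two fair dice.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    """Laplace measure on the 36 ordered outcomes (an assumption of this sketch)."""
    return Fraction(len(event), len(omega))

def independent(A, B):
    """Check the multiplication rule P(A and B) = P(A) * P(B)."""
    return P(A & B) == P(A) * P(B)

first_even = {w for w in omega if w[0] % 2 == 0}
sum_is_7 = {w for w in omega if sum(w) == 7}
sum_is_8 = {w for w in omega if sum(w) == 8}

print(independent(first_even, sum_is_7))  # True
print(independent(first_even, sum_is_8))  # False
```

The sum 7 can be reached from every first roll with the same probability, whereas the sum 8 is impossible after a first roll of 1, so only the first pair of events is independent.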

## Bayes Classifier

### Generative Approach to Classification Problems

Setting:
* $ X $ is a multiset of feature vectors
* $ C $ is a set of classes
* $ D = \{(\mathbf{x}_1, c_1), \dots, (\mathbf{x}_n, c_n)\} \subseteq X \times C $ is a multiset of examples

Learning task: Fit $ D $ using joint probabilities $ p() $ between features and classes.

Let $ (\Omega, \mathcal{P}(\Omega), P) $ be a probability space and let $ A_1, \dots, A_k $ be mutually exclusive events with $ \Omega = A_1 \cup \dots \cup A_k, P(A_i) > 0, i = 1, \dots, k $. Then for an event $ B \in \mathcal{P}(\Omega) $ with $ P(B) > 0 $ holds:

$ P(A_i \mid B) = \frac{P(A_i) \cdot P(B \mid A_i)}{\sum\limits_{l = 1}^{k} P(A_l) \cdot P(B \mid A_l)} $

$ P(A_i) $ is called the a priori (prior) probability of $ A_i $.

$ P(A_i \mid B) $ is called the a posteriori (posterior) probability of $ A_i $.

### Example: Reasoning about a disease

1. 
   - $ A_1 $: HIV_pos with $ P(A_1) = 0.001 $ (prior knowledge about population)
   - $ A_2 $: HIV_neg with $ P(A_2) = 1 - P(A_1) = 0.999 $
   - $ B $: test_pos
2. $ B \mid A_1 $: test_pos | HIV_pos with $ P(B \mid A_1) = 0.98 $ (result from clinical trials)
3. $ B \mid A_2 $: test_pos | HIV_neg with $ P(B \mid A_2) = 0.03 $ (result from clinical trials)

Using the Theorem of Total Probability we can deduce:
$ \Rightarrow P(B) = \sum\limits_{i = 1}^{2} P(A_i) \cdot P(B \mid A_i) = 0.031 $

Now, we can use the simple Bayes formula to determine the probability of a patient having HIV under the condition that they have been tested positive:

$ P(\text{HIV}_{\text{pos}} \mid \text{test}_{\text{pos}}) = P(A_1 \mid B) = \frac{P(A_1) \cdot P(B \mid A_1)}{P(B)} = \frac{0.001 \cdot 0.98}{0.031} \approx 0.032 $, i.e. about 3.2 percent.

![img3](img/topic4img3.png)
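
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using the prior and the two test characteristics from the example; the function name `posterior` is hypothetical.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem to mutually exclusive, exhaustive events A_i.

    priors[i] = P(A_i), likelihoods[i] = P(B | A_i).
    Returns the list of P(A_i | B) and the evidence P(B).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)                  # theorem of total probability
    return [j / evidence for j in joint], evidence

priors = [0.001, 0.999]       # P(HIV_pos), P(HIV_neg)
likelihoods = [0.98, 0.03]    # P(test_pos | HIV_pos), P(test_pos | HIV_neg)

post, p_b = posterior(priors, likelihoods)
print(round(p_b, 3))      # 0.031
print(round(post[0], 3))  # 0.032
```

Despite the high sensitivity of the test, the posterior stays small because the prior $ P(A_1) $ is very low and false positives among the many healthy patients dominate $ P(B) $.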

### Combined Conditional Events

Let $ P(A_i \mid B_1, \dots, B_p) $ denote the probability of the occurrence of event $ A_i $ given that the events $ B_1, \dots, B_p $ are known to have occurred.

Applied to a classification problem:
* $ A_i $ corresponds to an event of the kind $ \boldsymbol{\mathsf{C}}=c_i $, the $ B_j, j = 1, \dots, p $ correspond to $ p $ events of the kind $ \boldsymbol{\mathsf{X}}_j = x_j $
* Observable relation (in the prevalent setting): $ B_1, \dots, B_p \mid A_i $
* Reversed relation (in a diagnosis setting): $ A_i \mid B_1, \dots, B_p $

If sufficient data for estimating $ P(A_i) $ and $ P(B_1, \dots, B_p \mid A_i) $ is provided, then $ P(A_i \mid B_1, \dots, B_p) $ can be computed with the Theorem of Bayes:

$ P(A_i \mid B_1, \dots, B_p) = \frac{P(A_i) \cdot P(B_1, \dots, B_p \mid A_i)}{P(B_1, \dots, B_p)} $

### Naive Bayes

The compilation of a database from which reliable values for the $ P(B_1, \dots, B_p \mid A_i) $ can be obtained is often infeasible. The way out:

(a) Naive Bayes Assumption: Given condition $ A_i $, the $ B_1, \dots, B_p $ are statistically independent.

Notation:

$ P(B_1, \dots, B_p \mid A_i) = \prod\limits_{j = 1}^{p} P(B_j \mid A_i) $

(b) Given a set $ \{A_1, \dots, A_k\} $ of alternative events (causes or classes), the most probable event under the Naive Bayes Assumption can be computed with the Theorem of Bayes:

$ \text{argmax}_{A_i \in \{A_1, \dots, A_k\}} \frac{P(A_i) \cdot P(B_1, \dots, B_p \mid A_i)}{P(B_1, \dots B_p)} $

$ = \text{argmax}_{A_i \in \{A_1, \dots, A_k\}} P(A_i) \cdot \prod\limits_{j = 1}^{p} P(B_j \mid A_i) = A_{NB} $

We can use the Naive Bayes Assumption in conjunction with a set of $ k $ mutually exclusive events $ A_i $:

$ P(B_1, \dots, B_p) = \sum\limits_{i = 1}^{k} P(A_i) \cdot \prod\limits_{j = 1}^{p} P(B_j \mid A_i) $

And with the Theorem of Bayes it now follows for the conditional probabilities:

$ P(A_i \mid B_1, \dots, B_p) = \frac{P(A_i) \cdot \prod_{j = 1}^{p} P(B_j \mid A_i)}{\sum_{l = 1}^{k} P(A_l) \cdot \prod_{j = 1}^{p} P(B_j \mid A_l)} $
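
The normalized posterior under the Naive Bayes Assumption can be sketched as follows. The probabilities below are hypothetical values chosen so that the arithmetic is easy to follow; they do not come from any data set in these notes. `math.prod` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import prod

def naive_bayes_posteriors(priors, conditionals):
    """Return P(A_i | B_1, ..., B_p) for all i under the Naive Bayes Assumption.

    priors[i] = P(A_i); conditionals[i][j] = P(B_j | A_i).
    """
    # Numerators: P(A_i) * product over j of P(B_j | A_i)
    scores = [p * prod(c) for p, c in zip(priors, conditionals)]
    evidence = sum(scores)   # P(B_1, ..., B_p) via total probability
    return [s / evidence for s in scores]

# Hypothetical values for k = 2 classes and p = 2 features.
priors = [Fraction(1, 2), Fraction(1, 2)]
conditionals = [
    [Fraction(1, 2), Fraction(1, 4)],   # P(B_1 | A_1), P(B_2 | A_1)
    [Fraction(1, 4), Fraction(1, 4)],   # P(B_1 | A_2), P(B_2 | A_2)
]

print(naive_bayes_posteriors(priors, conditionals))
# [Fraction(2, 3), Fraction(1, 3)]
```

The numerators are 1/16 and 1/32, the evidence is 3/32, and dividing gives posteriors that sum to 1.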

### Naive Bayes: Classifier Construction Summary

Let $ X $ be a multiset of feature vectors, $ C $ a set of $ k $ classes and $ D \subseteq X \times C $ a multiset of feature examples. Then the $ k $ classes correspond to the events $ A_1, \dots, A_k $ and the $ p $ feature values of some $ \mathbf{x} \in X $ correspond to the events $ B_{1=x_1}, \dots, B_{p=x_p} $.

Construction and application of a naive Bayes Classifier:
1. Using $ D $, estimate the $ P(A_i), A_i := \boldsymbol{\mathsf{C}}=c_i, i = 1, \dots, k $
2. Using $ D $, estimate the $ P(B_{j=x_j} \mid A_i), B_{j=x_j} := \boldsymbol{\mathsf{X}}_j=x_j, j = 1, \dots, p $
3. Classify the feature vector $ \mathbf{x} $ as $ A_{NB} $ iff:
   - $ A_{NB} = \text{argmax}_{A_i \in \{A_1, \dots, A_k\}} \hat{P}(A_i) \cdot \prod\limits_{x_j \in \mathbf{x}, j = 1, \dots, p} \hat{P}(B_{j=x_j} \mid A_i) $

In this case, $ \hat{P}() $ denotes the relative frequency, since the actual probabilities $ P() $ are unknown.
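
The three construction steps can be sketched in plain Python. The toy examples in `D` are hypothetical and only serve to exercise the code; the function names are illustrative, and no smoothing is applied, so an unseen feature value yields an estimate of 0.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(examples):
    """Count class and feature-value frequencies from (feature_tuple, class) pairs."""
    class_counts = Counter(c for _, c in examples)
    # value_counts[(j, c)][v] = number of class-c examples with feature j equal to v
    value_counts = defaultdict(Counter)
    for x, c in examples:
        for j, v in enumerate(x):
            value_counts[(j, c)][v] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Return the class maximizing P^(A_i) * prod_j P^(B_j = x_j | A_i)."""
    n = sum(class_counts.values())

    def score(c):
        s = class_counts[c] / n                               # step 1: P^(A_i)
        for j, v in enumerate(x):
            s *= value_counts[(j, c)][v] / class_counts[c]    # step 2: P^(B_j | A_i)
        return s

    return max(class_counts, key=score)                       # step 3: argmax

# Hypothetical toy data: (Outlook, Wind) -> class
D = [
    (("sunny", "strong"), "no"),
    (("sunny", "weak"), "no"),
    (("rainy", "weak"), "yes"),
    (("sunny", "weak"), "yes"),
    (("rainy", "strong"), "yes"),
]

model = fit_naive_bayes(D)
print(classify(("sunny", "strong"), *model))  # no
```

For the query `("sunny", "strong")` the score of class `no` is $ 2/5 \cdot 1 \cdot 1/2 = 0.2 $, while class `yes` scores $ 3/5 \cdot 1/3 \cdot 1/3 \approx 0.067 $.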

### Naive Bayes: Example

Compute the class $ c $ of a feature vector $ \mathbf{x} = (sunny, cold, high, strong) $ given the following multiset of examples $ D $:

![img4](img/topic4img4.png)

Let $ B_{j=x_j} $ denote the event that feature $ j $ has the value $ x_j $. Then, our feature vector $ \mathbf{x} $ gives rise to the following four events:

$ B_{1=x_1}: Outlook=sunny $

$ B_{2=x_2}: Temperature=cold $

$ B_{3=x_3}: Humidity=high $

$ B_{4=x_4}: Wind=strong $

Computation of $ A_{NB} $ for $ \mathbf{x} $ using the above formula:

$ A_{NB} = \text{argmax}_{A_i \in \{EnjoySurfing=yes, EnjoySurfing=no\}} \hat{P}(A_i) \cdot \hat{P}(Outlook=sunny \mid A_i) \cdot \hat{P}(Temperature=cold \mid A_i) \cdot \hat{P}(Humidity=high \mid A_i) \cdot \hat{P}(Wind=strong \mid A_i) $

* $ \hat{P}(EnjoySurfing=yes) = \frac{9}{14} = 0.64 $
* $ \hat{P}(EnjoySurfing=no) = \frac{5}{14} = 0.36 $
* $ \hat{P}(Wind=strong \mid EnjoySurfing=yes) = \frac{3}{9} = 0.33 $
* ...

$ \Rightarrow $ Ranking:
1. $ \hat{P}(EnjoySurfing=no) \cdot \prod_{x_j \in \mathbf{x}} \hat{P}(B_{j=x_j} \mid EnjoySurfing=no) = 0.0206 $
2. $ \hat{P}(EnjoySurfing=yes) \cdot \prod_{x_j \in \mathbf{x}} \hat{P}(B_{j=x_j} \mid EnjoySurfing=yes) = 0.0053 $
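
The two ranking scores are unnormalized. As a sketch, dividing each score by their sum gives estimated posterior probabilities under the Naive Bayes Assumption; the only inputs are the two rounded scores from the ranking.

```python
# Unnormalized Naive Bayes scores taken from the ranking (rounded values).
scores = {"EnjoySurfing=no": 0.0206, "EnjoySurfing=yes": 0.0053}

evidence = sum(scores.values())   # estimate of P(B_1, ..., B_4)
posteriors = {c: s / evidence for c, s in scores.items()}

for c, p in posteriors.items():
    print(c, round(p, 3))
# EnjoySurfing=no 0.795
# EnjoySurfing=yes 0.205

print(max(posteriors, key=posteriors.get))  # EnjoySurfing=no
```

Normalizing does not change the argmax, so the classification remains `EnjoySurfing=no`; it only makes the two values interpretable as probabilities that sum to 1.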

$ \Rightarrow $ See lecture notes for calculations of final probabilities.