### week 9

# Naive Bayes

- classification
- Bernoulli distribution
- naive Bayes classifier

```python
import pods
import notebook as nb
import mlai
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
%matplotlib inline
```

### Review and Preview

- last week we looked at the EM algorithm and its application to solving a maximum entropy model
- both an analytical and a numerical approach to the solution were introduced
- this week we go back to the main road
- the material concerns classification with Naive Bayes

### Classification

We are given a dataset containing 'inputs' $\mathbf{X}$ and the corresponding 'target values' $\mathbf{y}$.
- each point consists of an input vector $\mathbf{x}_i$ and a class label $y_i$
- $\mathbf{x}_i$ can be thought of as features

**Bernoulli trial** : a random experiment with exactly two possible outcomes
- $y_i$ is either $1$ (yes) or $0$ (no)

Wikipedia: [Bernoulli trial](https://en.wikipedia.org/wiki/Bernoulli_trial)

### Classification Examples

- classification of handwritten digits from binary images (eg, automatic postcode reading)
- face detection in images (eg, digital cameras)
- who a detected face belongs to (eg, Picasa, Facebook, DeepFace, GaussianFace)
- classification of cancer type given gene expression data
- categorisation of document types (eg, different types of news article on the internet)

### Reminder on the Term 'Bayesian'

We use Bayes' rule to invert probabilities in the Bayesian approach.

Bayesian is not named after Bayes' rule (common confusion).
The use of Bayes' rule does **not** imply you are being Bayesian.
It is just an application of the product rule of probability.

- the term 'Bayesian' refers to the treatment of the parameters as stochastic variables
- proposed by Laplace and Bayes independently
- very controversial for early statisticians (eg, Fisher et al)

### Discrete Probability Distribution

Discrete probability is characterised by a **probability mass function**.

(eg) Poisson distribution, Bernoulli distribution, binomial distribution

- **regression** : the prediction $f(\mathbf{x})$ is a real number or sometimes a real vector
- **classification** : we are given an input vector $\mathbf{x}$ and the associated label $y$ which takes value $0$ or $1$

Wikipedia: [Probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function)

### Bernoulli Distribution

The **Bernoulli distribution** represents any single experiment that asks a yes–no question.

Its probability mass function is given by
$$
  p(y) = \left\{ \begin{array}{ll} \pi & y = 1 \\ 1-\pi & y = 0 \end{array} \right.
$$

(eg) $\pi = 0.5$ for coin toss with a fair coin

Wikipedia: [Bernoulli distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution)

### Bernoulli Distribution

Mathematically we use $y$ as a **mathematical switch**:
$$
  p(y) = \pi^y (1-\pi)^{(1-y)}
$$

It is a clever trick for switching between the two probabilities. As code it would be
```python
def bernoulli(y_i, pi):
    if y_i == 1:
        return pi
    else:
        return 1-pi
```
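As a quick check that the switch form and the branching code agree, here is a minimal sketch using NumPy. The function names `bernoulli_switch` and `bernoulli_branch` are illustrative only and do not come from the course libraries.

```python
import numpy as np

def bernoulli_switch(y, pi):
    """Bernoulli pmf written as pi**y * (1 - pi)**(1 - y)."""
    y = np.asarray(y)
    return pi**y * (1 - pi)**(1 - y)

def bernoulli_branch(y, pi):
    """Same pmf with explicit branching: pi if y == 1, else 1 - pi."""
    y = np.asarray(y)
    return np.where(y == 1, pi, 1 - pi)

y = np.array([1, 0, 0, 1])
p_switch = bernoulli_switch(y, 0.3)
p_branch = bernoulli_branch(y, 0.3)
print(np.allclose(p_switch, p_branch))
```

The exponent $y$ selects $\pi$ when $y=1$ and the exponent $1-y$ selects $1-\pi$ when $y=0$, so no `if` statement is needed and the expression works on whole arrays.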

### Jacob Bernoulli's Bernoulli

Bernoulli described the Bernoulli distribution in terms of an 'urn' filled with balls.

- there are red and black balls, and there is a fixed number of balls in the urn
- the proportion of red balls is given by $\pi$
- Bernoulli considered there to be **epistemic uncertainty** about the distribution parameter

Wikipedia: [Jacob Bernoulli](https://en.wikipedia.org/wiki/Jacob_Bernoulli)

### Thomas Bayes's Bernoulli

Bayes described the Bernoulli distribution (he didn't call it that!) in terms of a table and two balls.

- each ball is rolled so it comes to rest at a position that is uniformly distributed across the table
- the first ball stops at a position that is $\pi$ times the table width
- after placing the first ball you consider whether the second would land to the left or the right
- there is **aleatoric uncertainty** about Bayes's distribution

Wikipedia: [Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes)

### Epistemic and Aleatoric Uncertainties

Epistemic uncertainty:
- this is uncertainty about something we could in principle know; we just have not observed enough yet
- (eg) the result of a football match **after** it's played

Aleatoric uncertainty:
- this is uncertainty we could not know even if we wanted to
- (eg) the result of a football match **before** it is played

### Maximum Likelihood in the Bernoulli (I)

We assume that
- data $\mathbf{y}$ is a binary sequence of length $n$
- each value was sampled independently from the Bernoulli distribution, given probability $\pi$

The likelihood for the dataset $\mathbf{y}$ is given by
$$
  p(\mathbf{y}|\pi) = \prod_{i=1}^n p(y_i | \pi) = \prod_{i=1}^n \pi^{y_i} (1-\pi)^{1-y_i}
$$
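The product above can be evaluated directly. The following is a small sketch, assuming a toy sequence `y` and helper names chosen here for illustration; it also computes the log likelihood, which is numerically safer for long sequences.

```python
import numpy as np

def bernoulli_likelihood(y, pi):
    """Product over i of pi**y_i * (1 - pi)**(1 - y_i)."""
    y = np.asarray(y)
    return np.prod(pi**y * (1 - pi)**(1 - y))

def bernoulli_log_likelihood(y, pi):
    """Sum over i of y_i log(pi) + (1 - y_i) log(1 - pi)."""
    y = np.asarray(y)
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

y = [1, 0, 1, 1]                      # toy data, made up for this example
lik = bernoulli_likelihood(y, 0.5)    # 0.5**4
log_lik = bernoulli_log_likelihood(y, 0.5)
print(lik, np.exp(log_lik))
```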

### Maximum Likelihood in the Bernoulli (II)

The objective function is the negative log likelihood:
$$
  E(\pi) \buildrel\triangle\over = -\log p(\mathbf{y} | \pi) = -\sum_{i=1}^n y_i \log \pi - \sum_{i=1}^n (1-y_i) \log(1-\pi)
$$
Gradient with respect to the parameter $\pi$:
$$
  \frac{\partial E(\pi)}{\partial\pi} = -\frac{\sum_{i=1}^n y_i}{\pi} + \frac{\sum_{i=1}^n (1-y_i)}{1-\pi}
$$
At a stationary point we set $\frac{\partial E(\pi)}{\partial\pi} = 0$ and get
$$
  \pi = \frac{1}{n}\sum_{i=1}^n y_i
$$
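The algebra between the gradient and the stationary point is not spelled out. One way to fill it in, writing $s = \sum_{i=1}^n y_i$ as shorthand (notation introduced here only), so that $\sum_{i=1}^n (1-y_i) = n - s$:
$$
  -\frac{s}{\pi} + \frac{n-s}{1-\pi} = 0
  \;\Rightarrow\; s(1-\pi) = (n-s)\pi
  \;\Rightarrow\; s = n\pi
  \;\Rightarrow\; \pi = \frac{s}{n}
$$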

### Maximum Likelihood in the Bernoulli (III)

The solution:
$$
  \pi = \frac{1}{n}\sum_{i=1}^n y_i
$$
makes intuitive sense, ie, we estimate the probability associated with the Bernoulli by setting it to the number of observed positives, divided by the total number of experiments.

(eg) if you get 47 heads from 100 tosses, what is your best guess for the probability of heads with this coin?
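The coin example can be checked numerically. This sketch builds a sequence with 47 ones out of 100 and compares the closed-form estimate with a brute-force search over a grid of candidate values; the grid and the function name `neg_log_likelihood` are choices made for this illustration.

```python
import numpy as np

# 100 tosses with 47 heads (order does not matter for the likelihood)
y = np.zeros(100, dtype=int)
y[:47] = 1

def neg_log_likelihood(pi, y):
    """E(pi) = -sum y_i log(pi) - sum (1 - y_i) log(1 - pi)."""
    return -np.sum(y * np.log(pi)) - np.sum((1 - y) * np.log(1 - pi))

# closed-form maximum likelihood estimate: the sample mean
pi_ml = y.mean()

# brute-force check: evaluate E(pi) on a grid and pick the minimiser
grid = np.linspace(0.01, 0.99, 99)
pi_grid = grid[np.argmin([neg_log_likelihood(p, y) for p in grid])]
print(pi_ml, pi_grid)
```

Both approaches should agree up to the grid resolution.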

### Bayes' Rule Reminder

Bayes' rule consists of four components: the likelihood, marginal likelihood, prior and posterior distributions, related by
$$
  \text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}}
$$
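A small numerical illustration of the relation, for a binary label. The prior and likelihood values below are made up for the example and are not taken from any dataset.

```python
import numpy as np

# assumed values, for illustration only
prior = np.array([0.7, 0.3])        # p(y=0), p(y=1)
likelihood = np.array([0.2, 0.9])   # p(x | y=0), p(x | y=1) for one observed x

# marginal likelihood: sum rule over the two values of y
marginal = np.sum(likelihood * prior)

# posterior: likelihood times prior, divided by the marginal likelihood
posterior = likelihood * prior / marginal
print(marginal, posterior)
```

The marginal likelihood only normalises the posterior so that it sums to one over the values of $y$.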

### Naive Bayes Classifiers

Concept for naive Bayes classifiers:

- reduce the number of parameters we need to optimise by computing the distribution using the product and sum rules

- given inputs $\mathbf{X}$ and the label data $\mathbf{y}$, specify a joint density $p(\mathbf{y}, \mathbf{X})$ for all potential values of $\mathbf{X}$ and $\mathbf{y}$

- given a test input $\mathbf{x}^*$, extend the density as $p(y^* | \mathbf{X}, \mathbf{y}, \mathbf{x}^*)$ to calculate the test label $y^*$

### Naive Bayes Assumptions

In **naive Bayes** we make certain simplifying assumptions that allow us to perform all of the above in practice.
- data conditional independence
- feature conditional independence
- marginal density for $y_i$

They are very strong (naive) assumptions about factorisations that are unlikely to be true in practice.

Wikipedia: [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

(note) **conditional independence**

Two events A and B are **conditionally independent** given a third event C if the occurrence of A and the occurrence of B are independent events in their conditional probability distribution given C.

- A and B are **conditionally independent** given C if and only if, given knowledge that C occurs, knowledge of whether A (or B) occurs provides no information on the likelihood of B (or A) occurring

Wikipedia: [Conditional independence](https://en.wikipedia.org/wiki/Conditional_independence)
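The definition can be illustrated with a small discrete example. The sketch below builds a joint table over three binary variables from made-up probabilities so that A and B are conditionally independent given C, then shows that they are nevertheless dependent once C is marginalised out.

```python
import numpy as np

# assumed probabilities, for illustration only
p_c = np.array([0.5, 0.5])                 # p(C)
p_a_given_c = np.array([[0.9, 0.1],        # rows index C, columns index A
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.8, 0.2],        # rows index C, columns index B
                        [0.3, 0.7]])

# joint[c, a, b] = p(c) p(a | c) p(b | c)
joint = p_c[:, None, None] * p_a_given_c[:, :, None] * p_b_given_c[:, None, :]

# conditional p(a, b | c) and its factorised form p(a | c) p(b | c)
p_ab_given_c = joint / joint.sum(axis=(1, 2), keepdims=True)
factorised = p_a_given_c[:, :, None] * p_b_given_c[:, None, :]
cond_indep = np.allclose(p_ab_given_c, factorised)

# marginal p(a, b) compared with p(a) p(b)
p_ab = joint.sum(axis=0)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
marg_indep = np.allclose(p_ab, np.outer(p_a, p_b))
print(cond_indep, marg_indep)
```

Conditional independence given C therefore does not imply that A and B are independent when C is unknown.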

### Data Conditional Independence

Given model parameters $\boldsymbol{\theta}$ we assume that all data points in the model are **conditionally independent**:
$$
  p(\mathbf{x}^*, y^*, \mathbf{X}, \mathbf{y} | \boldsymbol{\theta}) = p(\mathbf{x}^*, y^* | \boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i, y_i | \boldsymbol{\theta})
$$
- joint density of $\mathbf{X}$ and $\mathbf{y}$ is independent across the data given $\boldsymbol{\theta}$
- similar assumption for **regression**, where $\boldsymbol{\theta} = \{\mathbf{w},\sigma^2\}$

Under this assumption the posterior distribution becomes easier to compute; the resulting classifier is known as the **Bayes classifier**.

### Feature Conditional Independence

We now assume that features are also **conditionally independent** given parameters and the class label:
$$
  p(\mathbf{x}_i | y_i, \boldsymbol{\theta}) = \prod_{j=1}^p p(x_{i,j} | y_i, \boldsymbol{\theta})
$$
where $p$ is the dimensionality of the inputs.

- very strong assumption: particular to naive Bayes
- known as the **naive Bayes assumption**
- Bayes classifier + feature conditional independence = naive Bayes
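In log space the factorisation turns into a sum over features. The sketch below assumes Gaussian class-conditional densities with made-up means and standard deviations, and checks that summing per-feature log densities matches a multivariate Gaussian with diagonal covariance, which is what the naive assumption amounts to in that case.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# one input vector and assumed parameters for a single class (illustrative values)
x = np.array([1.0, -0.5, 2.0])
mu = np.array([0.5, 0.0, 1.5])
sigma = np.array([1.0, 2.0, 0.5])

# naive Bayes: log p(x | y) = sum_j log p(x_j | y)
log_p_naive = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# equivalent joint density: Gaussian with diagonal covariance
log_p_joint = multivariate_normal.logpdf(x, mean=mu, cov=np.diag(sigma**2))
print(log_p_naive, log_p_joint)
```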

### Marginal Density for $y_i$

To calculate the joint distribution we need the marginal for $p(y_i)$ :
$$
  p(x_{i,j}, y_i | \boldsymbol{\theta}) = p(x_{i,j} | y_i, \boldsymbol{\theta}) p(y_i)
$$
Because $y_i$ is binary, the **Bernoulli density** makes a suitable choice for our prior over $y_i$ :
$$
  p(y_i | \pi) = \pi^{y_i} (1-\pi)^{1-y_i}
$$
where $\pi$ now has the interpretation as being the **prior** probability that the classification should be positive.

### Maximum Likelihood for Naive Bayes (I)

The full joint density of the training data is
$$
  p(\mathbf{X}, \mathbf{y} | \boldsymbol{\theta}, \pi) = \prod_{i=1}^n p(y_i | \pi) \prod_{j=1}^p p(x_{i,j} | y_i, \boldsymbol{\theta})
$$
The objective function is the negative log likelihood:
$$
  E(\boldsymbol{\theta}, \pi) \buildrel\triangle\over = -\log p(\mathbf{X}, \mathbf{y} | \boldsymbol{\theta}, \pi) = -\sum_{i=1}^n \sum_{j=1}^p \log p(x_{i,j} | y_i, \boldsymbol{\theta}) - \sum_{i=1}^n \log p(y_i | \pi)
$$
which we decompose into two objective functions, one which is dependent on $\pi$ alone and one which is dependent on $\boldsymbol{\theta}$ alone:
$$
  E(\boldsymbol{\theta}, \pi) = E(\boldsymbol{\theta}) + E(\pi)
$$

### Maximum Likelihood for Naive Bayes (II)

The term that depends on the prior parameter $\pi$:
$$
  E(\pi) = -\sum_{i=1}^n \log p(y_i | \pi)
$$
is identical to the objective function from the Bernoulli; so we have
$$
  \pi = \frac{1}{n} \sum_{i=1}^n y_i
$$

### Maximum Likelihood for Naive Bayes (III)

To minimise the term that depends on the conditional distribution:
$$
  E(\boldsymbol{\theta}) = -\sum_{i=1}^n \sum_{j=1}^p \log p(x_{i,j} | y_i, \boldsymbol{\theta})
$$
we make an assumption about its form.

- the right assumption will depend on the data
- (eg) we may use a Gaussian for real-valued data:
$$
  p(x_{i,j} | y_i, \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma_{y_i,j}^2}} \exp\left( -\frac{(x_{i,j} - \mu_{y_i,j})^2}{2\sigma_{y_i,j}^2} \right)
$$
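With a Gaussian form, minimising $E(\boldsymbol{\theta})$ gives the per-class, per-feature sample mean and (biased) sample variance, a standard result for Gaussian maximum likelihood. A minimal sketch of the fit, using a hypothetical helper `fit_gaussian_naive_bayes` and toy data made up for the example:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """Maximum likelihood fit for binary labels.

    Returns the class prior pi = p(y=1) and arrays mu, var of shape
    (2, p) holding the mean and variance of each feature within each class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pi = y.mean()
    mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    var = np.stack([X[y == c].var(axis=0) for c in (0, 1)])  # divides by class count
    return pi, mu, var

# toy data: four points, two features
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 0.0],
              [5.0, 2.0]])
y = np.array([0, 0, 1, 1])

pi, mu, var = fit_gaussian_naive_bayes(X, y)
print(pi, mu, var, sep="\n")
```

In practice a class with a single point, or a constant feature, gives zero variance, so real implementations usually add a small constant to `var`.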

### Making Predictions (I)

Finally we look at how to evaluate $p(y^* | \mathbf{X}, \mathbf{y}, \mathbf{x}^*, \boldsymbol{\theta})$ .

Using the product rule:
$$
  p(y^* | \mathbf{X}, \mathbf{y}, \mathbf{x}^*, \boldsymbol{\theta}) p(\mathbf{X}, \mathbf{y}, \mathbf{x}^* | \boldsymbol{\theta})
  = p(\mathbf{X}, \mathbf{y}, \mathbf{x}^*, y^* | \boldsymbol{\theta})
$$
This implies
$$
  p(y^* | \mathbf{X}, \mathbf{y}, \mathbf{x}^*, \boldsymbol{\theta})
  = \frac{p(\mathbf{X}, \mathbf{y}, \mathbf{x}^*, y^* | \boldsymbol{\theta})}{p(\mathbf{X}, \mathbf{y}, \mathbf{x}^* | \boldsymbol{\theta})}
  = \frac{p(\mathbf{X}, \mathbf{y}, \mathbf{x}^*, y^* | \boldsymbol{\theta})}{\sum_{y^*=0,1} p(\mathbf{X}, \mathbf{y}, \mathbf{x}^*, y^* | \boldsymbol{\theta})}
$$

### Making Predictions (II)

Here we use **data conditional independence** :
$$
  p(\mathbf{x}^*, y^*, \mathbf{X}, \mathbf{y} | \boldsymbol{\theta})
  = p(\mathbf{x}^*, y^* | \boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i, y_i | \boldsymbol{\theta})
$$
which gives a solution for making predictions:
$$
  p(y^* | \mathbf{X}, \mathbf{y}, \mathbf{x}^*, \boldsymbol{\theta})
  = \frac{p(\mathbf{x}^*, y^* | \boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i, y_i | \boldsymbol{\theta})}{\sum_{y^*=0,1} p(\mathbf{x}^*, y^* | \boldsymbol{\theta}) \prod_{i=1}^n p(\mathbf{x}_i, y_i | \boldsymbol{\theta})}
  = \frac{p(\mathbf{x}^*, y^* | \boldsymbol{\theta})}{\sum_{y^*=0,1} p(\mathbf{x}^*, y^* | \boldsymbol{\theta})}
  = \frac{p(y^* | \pi) \prod_{j=1}^p p(x_j^* | y^*, \boldsymbol{\theta})}{\sum_{y^*=0,1} p(y^* | \pi) \prod_{j=1}^p p(x_j^* | y^*, \boldsymbol{\theta})}
$$

- recall **Bernoulli density** for calculating $p(y^* | \pi)$
- naive Bayes has derived the class conditional densities, $p(\mathbf{x}_i | y_i, \boldsymbol{\theta})$, which are then used to evaluate $p(x_j^* | y^*, \boldsymbol{\theta})$
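The prediction formula can be sketched in code, assuming Gaussian class-conditional densities and parameters that have already been fitted. The function name `predict_proba` and the parameter values below are illustrative only. The computation is done in log space and normalised at the end, which avoids underflow when there are many features.

```python
import numpy as np

def predict_proba(x_star, pi, mu, var):
    """Return p(y*=1 | x*) for a Gaussian naive Bayes model.

    pi is the prior p(y=1); mu and var have shape (2, p), row c holding
    the per-feature mean and variance for class c.
    """
    x_star = np.asarray(x_star, dtype=float)
    # log p(x_j^* | y^*) for each class (rows) and feature (columns)
    log_lik = -0.5 * np.log(2 * np.pi * var) - (x_star - mu) ** 2 / (2 * var)
    # log p(y^* | pi) from the Bernoulli prior
    log_prior = np.log(np.array([1 - pi, pi]))
    # log numerator for y^* = 0 and y^* = 1
    log_joint = log_lik.sum(axis=1) + log_prior
    # normalise over the two classes (subtract the max for stability)
    joint = np.exp(log_joint - log_joint.max())
    return joint[1] / joint.sum()

# assumed parameters, for illustration only
pi = 0.5
mu = np.array([[0.0, 0.0],
               [2.0, 2.0]])
var = np.ones((2, 2))

p_mid = predict_proba([1.0, 1.0], pi, mu, var)    # equidistant from both class means
p_near1 = predict_proba([2.0, 2.0], pi, mu, var)  # at the class-1 mean
print(p_mid, p_near1)
```

A test point halfway between the two class means with equal priors and variances gets probability one half, and the probability moves towards one as the point approaches the class-1 mean.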