### Bayesian Naive Bayes

The data set $D$ consists of $n$ pairs of $x$ and $y$

$$ D = ((x^{(1)}, y_1), ..., (x^{(n)}, y_n)) $$

$x^{(i)}$ is the vector of word counts in invoice $i$. There are $d$ different words in the data set.

$$ x^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d) $$

$y_i$ is the sender of invoice $i$. There are $m$ different senders in the data set.

$$ y_i \in \{1,...,m\} $$

The sender $y$ follows a categorical distribution parameterized by $\theta = (\theta_1, ..., \theta_m)$, where $\theta_k$ is the probability that $y$ is sender $k$. Here $[\cdot]$ is the Iverson bracket: 1 if the condition holds, 0 otherwise

$$ 
p(y|\theta) =
Cat(y|\theta) =
\prod_{k=1}^m \theta_k^{[y=k]}
$$

The distribution for the counts of words $x$ in a document is the multinomial distribution parameterized by $\phi_{yj}$, the probability of word $j$ appearing in documents from sender $y$.

$$ 
p(x|y,\phi) =
Mult(x|y,\phi) \propto 
\prod_{j=1}^d \phi_{yj}^{x_j} 
$$

For reasons that are not at all obvious now, but will come in handy later, we'll re-write this as

$$ p(x|y,\phi) \propto
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{x_j[k=y]} 
$$
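The rewrite changes nothing numerically: for every $k \neq y$ the exponent is zero, so those factors equal 1. A minimal check, assuming made-up values for `phi`, `x` and `y` (none of them come from the data set, and labels are 0-based here):

```python
import numpy as np

# Hypothetical parameters: m = 2 senders, d = 3 words; each row of phi sums to 1.
phi = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.6, 0.3]])
x = np.array([2, 0, 1])  # word counts of one document
y = 1                    # its sender (0-based, unlike the 1-based math)

# Original form: product over words using only row y of phi.
direct = np.prod(phi[y] ** x)

# Rewritten form: product over all senders k with exponent x_j * [k = y].
indicator = (np.arange(phi.shape[0]) == y).astype(int)  # Iverson bracket [k = y]
rewritten = np.prod(phi ** (x[None, :] * indicator[:, None]))

print(direct, rewritten)
```

Both expressions evaluate the same unnormalized likelihood; the multinomial coefficient is omitted, as in the text.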

For the priors on $\theta$ and $\phi$ we'll use Dirichlet distributions, as the Dirichlet is the conjugate prior for both the categorical and the multinomial distribution. The priors are parameterized by the hyperparameters $\alpha$ and $\beta$.

$$ 
p(\theta) = 
Dir(\theta|\alpha) \propto
\prod_{k=1}^m \theta_k^{\alpha_k-1} 
$$

$$ 
p(\phi) =
\prod_{k=1}^m p(\phi_k) =
\prod_{k=1}^m Dir(\phi_k|\beta_k) \propto
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj}-1} 
$$
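Read as a generative process, the model first draws $\theta$ and $\phi$ from their priors, then a sender, then the word counts. The sketch below samples one invoice this way. The sizes, the symmetric hyperparameters and the fixed `words_per_invoice` are assumptions made for illustration; the model above does not specify a document length.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 3, 4            # hypothetical numbers of senders and words
alpha = np.ones(m)     # assumed symmetric hyperparameters
beta = np.ones((m, d))
words_per_invoice = 5  # assumption: document length is not part of the model

# Draw the parameters from their Dirichlet priors.
theta = rng.dirichlet(alpha)                                # shape (m,)
phi = np.array([rng.dirichlet(beta[k]) for k in range(m)])  # shape (m, d)

# Draw one invoice: first the sender, then its word counts.
y = rng.choice(m, p=theta)
x = rng.multinomial(words_per_invoice, phi[y])

print(y, x)
```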

We'll find the posterior of the parameters first

$$
p(\theta, \phi | D) \propto p(\theta)p(\phi)p(D|\theta, \phi)
$$

Assuming the data points are i.i.d.

$$
p(\theta, \phi | D) \propto p(\theta)p(\phi)\prod_{i=1}^n p(x^{(i)}|y_i, \phi)p(y_i|\theta)
$$

Re-arranging

$$
p(\theta, \phi | D) \propto 
\left[
p(\theta)
\prod_{i=1}^n 
p(y_i|\theta)
\right]
\left[
p(\phi)
\prod_{i=1}^n 
p(x^{(i)}|y_i, \phi)
\right]
$$

We'll handle the leftmost bracket first. Inserting the definitions.


$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m \theta_k^{\alpha_k-1}
\prod_{i=1}^n 
\prod_{k=1}^m \theta_k^{[y_i=k]}
\right]
\left[
p(\phi)
\prod_{i=1}^n 
p(x^{(i)}|y_i, \phi)
\right]
$$



Introducing $c_k = \sum_{i=1}^n[y_i=k]$, the count of sender $k$ in the data set

$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m \theta_k^{\alpha_k-1}
\prod_{k=1}^m \theta_k^{c_k}
\right]
\left[
p(\phi)
\prod_{i=1}^n 
p(x^{(i)}|y_i, \phi)
\right]
$$

Joining the products

$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k -1}
\right]
\left[
p(\phi)
\prod_{i=1}^n 
p(x^{(i)}|y_i, \phi)
\right]
$$
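The left bracket is an unnormalized $Dir(\theta|\alpha + c)$, so updating the prior only requires counting labels. A small sketch, assuming hypothetical 0-based labels and a symmetric prior:

```python
import numpy as np

# Hypothetical labels for n = 6 invoices from m = 3 senders (0-based).
y = np.array([0, 2, 2, 1, 2, 0])
m = 3
alpha = np.ones(m)  # assumed symmetric prior

# c_k = number of invoices from sender k.
c = np.bincount(y, minlength=m)

# Parameters of the Dirichlet posterior over theta.
alpha_post = alpha + c

print(c, alpha_post)
```

Passing `minlength=m` keeps a zero count for senders that never appear in the labels.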

And now the right hand bracket. Inserting the definitions

$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k -1}
\right]
\left[
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj}-1} 
\prod_{i=1}^n
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{x^{(i)}_j[k=y_i]} 
\right]
$$

Introducing $w_{kj} = \sum_{i=1}^n x^{(i)}_j[k=y_i]$, the total number of occurrences of word $j$ in documents from sender $k$

$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k -1}
\right]
\left[
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj}-1} 
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{w_{kj}} 
\right]
$$

Joining the products

$$
p(\theta, \phi | D) \propto 
\left[
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k -1}
\right]
\left[
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj} + w_{kj} - 1} 
\right]
$$
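The right bracket is a product of unnormalized $Dir(\phi_k|\beta_k + w_k)$ densities, one per sender. The counts $w_{kj}$ can be computed with a one-hot encoding of the labels. The data below are invented for illustration, and the labels are 0-based:

```python
import numpy as np

# Hypothetical data: n = 4 invoices, d = 3 words, m = 2 senders.
X = np.array([[1, 0, 2],
              [0, 3, 0],
              [2, 1, 0],
              [0, 0, 1]])
y = np.array([0, 1, 0, 1])
m = 2
beta = np.ones((m, X.shape[1]))  # assumed symmetric prior

# One-hot matrix with entries [k = y_i], shape (n, m).
onehot = (y[:, None] == np.arange(m)[None, :]).astype(int)

# w[k, j] = sum_i x^(i)_j [k = y_i]
w = onehot.T @ X

# Parameters of the Dirichlet posterior over each phi_k.
beta_post = beta + w

print(w)
```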

We're interested in the posterior predictive distribution, i.e. given a new input $\tilde{x}$ what is the distribution of senders $\tilde{y}$, conditioned on the data set $D$.

$$ p(\tilde{y} | \tilde{x}, D) \propto \int \int p(\tilde{x}, \tilde{y} | \theta, \phi, D) p(\theta, \phi | D) d\theta d\phi $$

Re-arranging

$$
p(\tilde{y} | \tilde{x}, D) \propto 
\int 
p(\tilde{y} | \theta)
p(\theta | D) 
d\theta 
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$

We'll handle the integral over $\theta$ first. Inserting the definitions

$$
p(\tilde{y} | \tilde{x}, D) \propto 
\int 
\prod_{k=1}^m \theta_k^{[\tilde{y}=k]}
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k -1}
d\theta 
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$

Joining the products

$$
p(\tilde{y} | \tilde{x}, D) \propto 
\int 
\prod_{k=1}^m 
\theta_k^{\alpha_k + c_k + [\tilde{y}=k] - 1}
d\theta 
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$

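This is the integral of an unnormalized $Dir(\theta|a)$ density with parameters $a_k = \alpha_k + c_k + [\tilde{y}=k]$.

Using $\int \frac{1}{Z}p(x)dx = 1 \implies \int p(x) dx = Z$, the integral equals the Dirichlet normalizing constant $Z = \prod_{k=1}^m \Gamma(a_k) / \Gamma \left( \sum_{k=1}^m a_k \right)$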

$$
p(\tilde{y} | \tilde{x}, D) \propto 
\frac
{\prod_{k=1}^m\Gamma(\alpha_k + c_k + [\tilde{y}=k])}
{\Gamma \left( \sum_{k=1}^m \alpha_k + c_k + [\tilde{y}=k] \right)}
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$

Using $\Gamma(x+1) = x\Gamma(x)$

$$
p(\tilde{y} | \tilde{x}, D) \propto 
\frac
{(\alpha_\tilde{y} + c_\tilde{y})}
{(\sum_{k=1}^m \alpha_k + c_k)}
\frac
{\prod_{k=1}^m\Gamma(\alpha_k + c_k)}
{\Gamma \left( \sum_{k=1}^m \alpha_k + c_k \right)}
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$

Dropping the terms that do not depend on $\tilde{y}$

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\int 
p(\tilde{x} | \tilde{y}, \phi)
p(\phi | D) 
d\phi
$$
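The simplification of the $\theta$ integral can be checked numerically: the ratio of the Dirichlet normalizers with and without the extra count for $\tilde{y}$ should equal $(\alpha_{\tilde{y}} + c_{\tilde{y}}) / \sum_k (\alpha_k + c_k)$. A sketch, assuming hypothetical posterior parameters `a` and using log-gamma from the standard library:

```python
import math

def log_dirichlet_norm(a):
    """Log of the Dirichlet normalizer: sum(lgamma(a_k)) - lgamma(sum(a_k))."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

# Hypothetical posterior parameters alpha_k + c_k for m = 3 senders.
a = [3.0, 2.0, 4.0]

ratios = []
for y_new in range(len(a)):
    # Add the Iverson bracket [y_new = k] to each parameter.
    bumped = [v + (k == y_new) for k, v in enumerate(a)]
    ratio = math.exp(log_dirichlet_norm(bumped) - log_dirichlet_norm(a))
    ratios.append(ratio)
    print(y_new, ratio, a[y_new] / sum(a))
```

The two printed values agree for every candidate sender, which is why only the factor $(\alpha_{\tilde{y}} + c_{\tilde{y}})$ survives once constants are dropped.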

Now for the integral over $\phi$. Inserting the definitions

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\int 
\left[
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\tilde{x}_j[k=\tilde{y}]} 
\right]
\left[
\prod_{k=1}^m
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj} + w_{kj} - 1} 
\right]
d\phi
$$

Joining the products, and moving the product over $k$ outside the integrals

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\prod_{k=1}^m
\int 
\left[
\prod_{j=1}^d 
\phi_{kj}^{\beta_{kj} + w_{kj} + \tilde{x}_j[k=\tilde{y}] - 1} 
\right]
d\phi_k
$$

Each factor is the integral of an unnormalized $Dir(\phi_k|\beta_{k} + w_{k} + \tilde{x}[k=\tilde{y}])$ density, which equals its normalizing constant

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\prod_{k=1}^m
\frac
{\prod_{j=1}^d\Gamma \left( \beta_{kj} + w_{kj} + \tilde{x}_j[k=\tilde{y}] \right)}
{\Gamma \left( \sum_{j=1}^d \beta_{kj} + w_{kj} + \tilde{x}_j[k=\tilde{y}] \right)}
$$

Splitting the sum in the denominator

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\prod_{k=1}^m
\frac
{\prod_{j=1}^d\Gamma \left( \beta_{kj} + w_{kj} + \tilde{x}_j[k=\tilde{y}] \right)}
{\Gamma \left( \sum_{j=1}^d (\beta_{kj} + w_{kj}) + \sum_{j=1}^d \tilde{x}_j[k=\tilde{y}] \right)}
$$

Using $\Gamma(x+n) = x^{(n)}\Gamma(x)$, where $x^{(n)} = x(x+1)\cdots(x+n-1)$ denotes the rising factorial

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\prod_{k=1}^m
\frac
{\prod_{j=1}^d (\beta_{kj} + w_{kj})^{(\tilde{x}_j[k=\tilde{y}])}}
{\left(\sum_{j=1}^d \beta_{kj} + w_{kj}\right)^{(\sum_{j=1}^d \tilde{x}_j[k=\tilde{y}])}}
\frac
{\prod_{j=1}^d \Gamma \left( \beta_{kj} + w_{kj} \right)}
{\Gamma \left( \sum_{j=1}^d \beta_{kj} + w_{kj} \right)}
$$
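The rising-factorial identity used in this step is easy to verify numerically. The helper `rising_factorial` below is written only for this illustration, and the values of `x` and `n` are arbitrary:

```python
import math

def rising_factorial(x, n):
    """x^(n) = x (x+1) ... (x+n-1); the empty product gives x^(0) = 1."""
    result = 1.0
    for i in range(n):
        result *= x + i
    return result

x, n = 2.5, 4  # arbitrary example values
lhs = math.gamma(x + n)
rhs = rising_factorial(x, n) * math.gamma(x)

print(lhs, rhs)
```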

Dropping the Gamma-function factor, which does not depend on $\tilde{y}$

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\prod_{k=1}^m
\frac
{\prod_{j=1}^d(\beta_{kj} + w_{kj})^{(\tilde{x}_j[k=\tilde{y}])}}
{\left(\sum_{j=1}^d \beta_{kj} + w_{kj}\right)^{(\sum_{j=1}^d \tilde{x}_j[k=\tilde{y}])}}
$$

Since $x^{(0)} = 1$ by definition, every factor with $k \neq \tilde{y}$ equals 1 and only the $k = \tilde{y}$ factor remains

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\frac
{\prod_{j=1}^d(\beta_{\tilde{y}j} + w_{\tilde{y}j})^{(\tilde{x}_j)}}
{\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right)^{(\sum_{j=1}^d\tilde{x}_j)}}
$$

Using $x^{(n)} = \frac{\Gamma(x+n)}{\Gamma(x)}$


$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\frac
{
  \prod_{j=1}^d
  \frac{\Gamma((\beta_{\tilde{y}j} + w_{\tilde{y}j}) + \tilde{x}_j)}{\Gamma((\beta_{\tilde{y}j} + w_{\tilde{y}j}))}
}
{
  \frac{\Gamma\left(\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right) + \sum_{j=1}^d\tilde{x}_j\right)}{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right)}
}
$$

Inverting the fraction in the denominator

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\left[
\prod_{j=1}^d
\frac{\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j} + \tilde{x}_j)}{\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j})}
\right]
\frac{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right)}{\Gamma(\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right) + \sum_{j=1}^d\tilde{x}_j)}
$$

Merging the two sums in the denominator

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\left[
\prod_{j=1}^d
\frac{\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j} + \tilde{x}_j)}{\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j})}
\right]
\frac{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right)}
{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j} + \tilde{x}_j\right)}
$$

Regrouping the factors

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_\tilde{y} + c_\tilde{y})
\frac
{\prod_{j=1}^d\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j} + \tilde{x}_j)}
{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j} + \tilde{x}_j\right)}
\frac
{\Gamma\left(\sum_{j=1}^d \beta_{\tilde{y}j} + w_{\tilde{y}j}\right)}
{\prod_{j=1}^d\Gamma(\beta_{\tilde{y}j} + w_{\tilde{y}j})}
$$

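Simplifying, where $B(a) = \prod_{j=1}^d \Gamma(a_j) / \Gamma \left( \sum_{j=1}^d a_j \right)$ is the multivariate beta function, and $\beta_{\tilde{y}}$, $w_{\tilde{y}}$ and $\tilde{x}$ are vectors of length $d$

$$
p(\tilde{y} | \tilde{x}, D) \propto 
(\alpha_{\tilde{y}} + c_{\tilde{y}})
\frac
{B\left( \beta_{\tilde{y}} + w_{\tilde{y}} + \tilde{x} \right)}
{B\left(\beta_{\tilde{y}} + w_{\tilde{y}}\right)}
$$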
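To make the final formula concrete, here is a hand-worked instance under assumed values: two words, $\beta_{\tilde{y}} + w_{\tilde{y}} = (1, 3)$ for the candidate sender, and $\tilde{x} = (1, 0)$. Writing $B(a) = \prod_j \Gamma(a_j) / \Gamma(\sum_j a_j)$ for the multivariate beta function:

$$
\frac{B(2, 3)}{B(1, 3)} =
\frac{\Gamma(2)\Gamma(3)/\Gamma(5)}{\Gamma(1)\Gamma(3)/\Gamma(4)} =
\frac{2/24}{2/6} =
\frac{1}{4}
$$

The rising-factorial form gives the same value: $1^{(1)} \cdot 3^{(0)} / 4^{(1)} = 1/4$.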

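```python
import numpy as np
from scipy.special import gamma

def beta(x):
    """Multivariate beta function B(x) = prod(gamma(x_j)) / gamma(sum(x_j))."""
    return np.prod(gamma(x)) / gamma(x.sum())

b = np.ones((2, 2))             # prior pseudo-counts beta_kj
w = np.array([[0, 2], [2, 0]])  # word counts w_kj per sender
x = np.array([1, 0])            # word counts of the new invoice

for y in range(2):
    # Final form: ratio of beta functions for candidate sender y.
    r1 = beta(b[y] + w[y] + x) / beta(b[y] + w[y])
    print("likelihood sender %d: %f" % (y, r1))

    # Earlier form: product over all senders of the Dirichlet normalizers.
    r2 = 1
    for i in range(2):
        r2 *= beta(b[i] + w[i] + x * (i == y))

    # The ratio is the same for every y, so the two forms are proportional.
    print("%.12f" % (r2 / r1))
```

```
likelihood sender 0: 0.250000
0.111111111111
likelihood sender 1: 0.750000
0.111111111111
```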
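For realistic word counts `gamma` overflows quickly, so a practical version would work in log space. The following is a minimal sketch rather than a definitive implementation: the function names `log_beta` and `predict_sender` are invented here, and `alpha` and `c` are assumed values, since the check above only exercises the likelihood term.

```python
import math
import numpy as np

def log_beta(a):
    """Log of the multivariate beta function, via log-gamma to avoid overflow."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(float(np.sum(a)))

def predict_sender(x_new, alpha, c, beta, w):
    """Normalized p(y | x_new, D) computed from the final formula above."""
    log_post = np.array([
        math.log(alpha[k] + c[k])
        + log_beta(beta[k] + w[k] + x_new)
        - log_beta(beta[k] + w[k])
        for k in range(len(alpha))
    ])
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Same beta, w and x as in the check above; alpha and c are assumed values.
alpha = np.ones(2)
c = np.array([1, 1])
beta = np.ones((2, 2))
w = np.array([[0, 2], [2, 0]])
x_new = np.array([1, 0])

posterior = predict_sender(x_new, alpha, c, beta, w)
print(posterior)
```

With equal prior weight on both senders, the normalized posterior matches the likelihoods printed above, 0.25 and 0.75.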
