# Bernoulli Distribution

---

## Table of contents
1. Definition
2. Proof that Bernoulli distribution is normalised
3. Expected value
4. Variance
5. Likelihood function
6. Maximum likelihood estimator
7. Conjugate prior
8. Posterior distribution
9. Predictive distribution

---

## 1. Definition

$ Bern(x|\mu)=\mu^x (1-\mu)^{1-x} $

where $x \in \{0,1\}$ and $\mu \in [0,1]$.

---


## 2. Proof that Bernoulli distribution is normalised

$\mu^0(1-\mu)^{1-0} + \mu^1(1-\mu)^{1-1} = 1-\mu+\mu = 1$
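
As a quick numerical check of the definition and the normalisation above, here is a minimal Python sketch. The function name `bern_pmf` and the chosen values of $\mu$ are illustrative assumptions, not part of the original derivation.

```python
def bern_pmf(x: int, mu: float) -> float:
    """Bernoulli probability mass function: mu**x * (1 - mu)**(1 - x)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu ** x * (1.0 - mu) ** (1 - x)


# The probabilities of x = 0 and x = 1 should sum to 1 for any valid mu.
for mu in (0.0, 0.3, 0.5, 1.0):
    total = bern_pmf(0, mu) + bern_pmf(1, mu)
    print(f"mu={mu}: P(0) + P(1) = {total}")
```

Python evaluates `0.0 ** 0` as `1.0`, so the boundary cases $\mu = 0$ and $\mu = 1$ need no special handling here.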

---

## 3. Expected value

$ E[x] = \sum^1_{x=0} x\mu^x(1-\mu)^{1-x}$

$ =0\cdot\mu^0(1-\mu)^{1-0} + 1\cdot\mu^1(1-\mu)^{1-1}$

$ =\mu$

---

## 4. Variance

$Var[x] = \sum^1_{x=0} (x-\mu)^2\mu^x(1-\mu)^{1-x}$

$=\mu^2(1-\mu) + (1-\mu)^2\mu$

$=\mu(1-\mu)$
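
The expected value and variance can also be checked by evaluating the two sums directly. The sketch below assumes a hypothetical helper `bern_moments` and an arbitrary value $\mu = 0.3$.

```python
def bern_moments(mu: float):
    """Return (mean, variance) by summing over the support {0, 1}."""
    pmf = {0: 1.0 - mu, 1: mu}  # Bern(x | mu) for x = 0 and x = 1
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


mu = 0.3  # arbitrary example value
mean, var = bern_moments(mu)
print(mean, mu)              # the mean should match mu
print(var, mu * (1.0 - mu))  # the variance should match mu * (1 - mu)
```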

---

## 5. Likelihood function
Assume we have a set of observations $D = \{x_1, x_2, ..., x_N\}$. Assuming that each $x_n$ is drawn independently from $Bern(x|\mu)$, the likelihood function is given below.

$P(D|\mu) = \prod^N_{n=1} Bern(x_n|\mu) = \prod^N_{n=1} \mu^{x_n}(1-\mu)^{1-x_n}$
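
A minimal sketch of this product in Python follows. The function name `likelihood` and the data set are hypothetical, chosen only to illustrate the formula.

```python
def likelihood(data, mu: float) -> float:
    """P(D | mu): product of Bernoulli probabilities over independent samples."""
    result = 1.0
    for x in data:
        result *= mu ** x * (1.0 - mu) ** (1 - x)
    return result


data = [1, 0, 1, 1, 0]  # hypothetical observations: three ones, two zeros
print(likelihood(data, 0.6))  # equals 0.6**3 * 0.4**2
```

Only the number of ones and zeros matters, not their order, because the product collapses to $\mu^{\sum x_n}(1-\mu)^{N-\sum x_n}$.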

---

## 6. Maximum likelihood estimator
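For computational simplicity, maximise the log-likelihood function instead of the likelihood function itself. Because the logarithm is monotonically increasing, both are maximised by the same value of $\mu$.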

$lnP(D|\mu) = ln\prod^N_{n=1} \mu^{x_n}(1-\mu)^{1-x_n}$

$=\sum^N_{n=1}ln\mu^{x_n}(1-\mu)^{1-x_n}$

$=\sum^N_{n=1}\{ x_nln\mu + (1-x_n)ln(1-\mu) \}$

$=Nln(1-\mu) + \{ln\mu-ln(1-\mu)\}\sum^N_{n=1}x_n$

The value of $\mu$ that maximises the likelihood (call it $\mu_{ML}$) is obtained by differentiating the log-likelihood function with respect to $\mu$ and setting the derivative to zero.

$\frac{\partial lnP(D|\mu)}{\partial \mu} = N\frac{1}{\mu-1}+(\frac{1}{\mu}-\frac{1}{\mu-1})\sum^N_{n=1}x_n = 0$

Multiplying both sides by $\mu-1$ gives

$N-\frac{1}{\mu}\sum^N_{n=1}x_n = 0$

$\mu_{ML} = \frac{1}{N}\sum^N_{n=1}x_n$
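
The closed-form estimator can be compared against a brute-force search over the log-likelihood. This is a sketch under assumptions: the function names, the data set, and the grid resolution of 0.001 are all illustrative.

```python
import math


def log_likelihood(data, mu: float) -> float:
    """ln P(D | mu) = N ln(1 - mu) + {ln mu - ln(1 - mu)} * sum(x_n)."""
    n = len(data)
    s = sum(data)
    return n * math.log(1.0 - mu) + (math.log(mu) - math.log(1.0 - mu)) * s


def mle(data) -> float:
    """Closed-form maximum likelihood estimate: the sample mean."""
    return sum(data) / len(data)


data = [1, 0, 1, 1, 0]  # hypothetical observations
mu_ml = mle(data)

# Brute-force check on a grid over the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda m: log_likelihood(data, m))
print(mu_ml, best)  # the grid maximiser should agree with the sample mean
```

The grid excludes 0 and 1 because the logarithm is undefined there.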

---

## 7. Conjugate prior
The conjugate prior for the parameter $\mu$ of the Bernoulli distribution is the [Beta distribution](https://github.com/shotahorii/math-for-ds/blob/master/content/statistics_and_probability/beta_distribution.ipynb).

---


## 8. Posterior distribution
Given a set of observations $D = \{x_1, x_2, ..., x_N\}$, the posterior distribution of $\mu$ follows from Bayes' theorem.

$P(\mu|D) = \frac{P(D|\mu)P(\mu)}{P(D)} = \frac{\{\prod^N_{n=1}P(x_n|\mu)\}P(\mu)}{P(D)}$

For computational simplicity, take the logarithm of both sides.

$lnP(\mu|D) = ln(\{\prod^N_{n=1}P(x_n|\mu)\}P(\mu)) - lnP(D)$

As we are interested in the posterior distribution of $\mu$, every term that does not depend on $\mu$ can be treated as $const$.

$lnP(\mu|D) = ln(\{\prod^N_{n=1}P(x_n|\mu)\}P(\mu)) + const$

$= \sum^N_{n=1}lnP(x_n|\mu) + lnP(\mu) + const$

$=\sum^N_{n=1}lnBern(x_n|\mu) + lnBeta(\mu|a,b) + const$

$=\sum^N_{n=1}ln\mu^{x_n}(1-\mu)^{1-x_n} + lnC_B(a,b)\mu^{a-1}(1-\mu)^{b-1} + const$

where $C_B(a,b)$ is the normalisation constant of the Beta distribution with parameters $a,b$. Note that it is independent of $\mu$, so it can be absorbed into $const$.

$=\sum^N_{n=1}\{x_nln\mu+(1-x_n)ln(1-\mu)\} + (a-1)ln\mu + (b-1)ln(1-\mu) + const$

$=\sum^N_{n=1}x_nln\mu + \sum^N_{n=1}(1-x_n)ln(1-\mu) + (a-1)ln\mu + (b-1)ln(1-\mu) + const$

$=(\sum^N_{n=1}x_n + a - 1)ln\mu + (N-\sum^N_{n=1}x_n+b-1)ln(1-\mu) + const$

$=ln\{C\,\mu^{\sum^N_{n=1}x_n+a-1}(1-\mu)^{N-\sum^N_{n=1}x_n+b-1}\}$

where $C = e^{const}$. This is the logarithm of a Beta distribution with $C$ as its normalisation constant. Hence the posterior is as below.

$P(\mu|D) = Beta(\mu|\sum^N_{n=1}x_n+a,N-\sum^N_{n=1}x_n+b)$
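
The update rule can be sketched in a few lines of Python. The function names, the prior parameters $a = b = 2$, and the data set are hypothetical. The second half checks numerically that likelihood times prior is proportional to the claimed Beta posterior, i.e. that their ratio does not depend on $\mu$.

```python
import math


def posterior_params(data, a: float, b: float):
    """Beta posterior parameters: (sum(x_n) + a, N - sum(x_n) + b)."""
    s = sum(data)
    n = len(data)
    return a + s, b + n - s


def beta_pdf(mu: float, a: float, b: float) -> float:
    """Beta density, with the normalisation constant computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(mu) + (b - 1) * math.log(1 - mu))


def unnormalised_posterior(mu: float, data, a: float, b: float) -> float:
    """Likelihood times prior, i.e. P(D | mu) * P(mu)."""
    lik = 1.0
    for x in data:
        lik *= mu ** x * (1.0 - mu) ** (1 - x)
    return lik * beta_pdf(mu, a, b)


data = [1, 0, 1, 1, 0]  # hypothetical observations
a, b = 2.0, 2.0         # hypothetical prior parameters
a_post, b_post = posterior_params(data, a, b)
print(a_post, b_post)

# The ratio below should be the same for every mu (it equals P(D)).
ratios = [
    unnormalised_posterior(m, data, a, b) / beta_pdf(m, a_post, b_post)
    for m in (0.2, 0.5, 0.8)
]
print(ratios)
```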

---


## 9. Predictive distribution
The predictive distribution of a new value $x_*$ generated from the Bernoulli distribution is obtained by marginalising the parameter $\mu$ out of the likelihood under the prior distribution, as below.

$P(x_*) = \int^1_0 P(x_*|\mu)P(\mu)d\mu$

$=\int^1_0 Bern(x_*|\mu)Beta(\mu|a,b)d\mu$

$=\int^1_0 \mu^{x_*}(1-\mu)^{1-x_*}C_B(a,b)\mu^{a-1}(1-\mu)^{b-1}d\mu$

$=C_B(a,b)\int^1_0 \mu^{x_*+a-1}(1-\mu)^{1-x_*+b-1}d\mu$

Note: $\int^1_0x^{A-1}(1-x)^{B-1}dx = 1/C_B(A,B)$ 

$=\frac{C_B(a,b)}{C_B(x_*+a,1-x_*+b)}$

$=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\frac{\Gamma(x_*+a)\Gamma(1-x_*+b)}{\Gamma(a+b+1)}$

Note: $\Gamma(z+1)=z\Gamma(z)$

$=\frac{\Gamma(x_*+a)\Gamma(1-x_*+b)}{(a+b)\Gamma(a)\Gamma(b)}$

As $x_* \in \{0,1\}$, we can check both cases as below.

$P(x_*=0)=\frac{\Gamma(a)\Gamma(b+1)}{(a+b)\Gamma(a)\Gamma(b)}=\frac{b}{a+b}$

$P(x_*=1)=\frac{\Gamma(a+1)\Gamma(b)}{(a+b)\Gamma(a)\Gamma(b)}=\frac{a}{a+b}$

Hence, 

$P(x_*) = (\frac{a}{a+b})^{x_*}(\frac{b}{a+b})^{1-x_*}$

$= (\frac{a}{a+b})^{x_*}(1-\frac{a}{a+b})^{1-x_*}$

$=Bern(x_*|\frac{a}{a+b})$
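
To confirm that the Gamma-function expression and the Bernoulli form agree, both can be evaluated side by side. This is a minimal sketch. The function names and the prior parameters $a = 2$, $b = 3$ are assumptions made for illustration.

```python
import math


def predictive_gamma(x_star: int, a: float, b: float) -> float:
    """Gamma(x* + a) Gamma(1 - x* + b) / ((a + b) Gamma(a) Gamma(b))."""
    num = math.gamma(x_star + a) * math.gamma(1 - x_star + b)
    return num / ((a + b) * math.gamma(a) * math.gamma(b))


def predictive_bern(x_star: int, a: float, b: float) -> float:
    """Bern(x* | a / (a + b))."""
    p = a / (a + b)
    return p ** x_star * (1.0 - p) ** (1 - x_star)


a, b = 2.0, 3.0  # hypothetical prior parameters
for x_star in (0, 1):
    print(x_star, predictive_gamma(x_star, a, b), predictive_bern(x_star, a, b))
```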

To use the posterior distribution instead of the prior distribution, simply replace $a,b$ with the parameters of the posterior distribution.

$Bern(x_*|\frac{\sum^N_{n=1}x_n+a}{\sum^N_{n=1}x_n+a+N-\sum^N_{n=1}x_n+b}) = Bern(x_*|\frac{\sum^N_{n=1}x_n+a}{a+N+b})$
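
The posterior predictive probability is therefore a one-line computation. The sketch below uses a hypothetical function name, data set, and prior parameters.

```python
def posterior_predictive_one(data, a: float, b: float) -> float:
    """P(x_* = 1 | D) = (sum(x_n) + a) / (a + N + b)."""
    return (sum(data) + a) / (a + len(data) + b)


data = [1, 0, 1, 1, 0]  # hypothetical observations
a, b = 2.0, 2.0         # hypothetical prior parameters

p_one = posterior_predictive_one(data, a, b)
prior_mean = a / (a + b)
mu_ml = sum(data) / len(data)
print(prior_mean, p_one, mu_ml)
```

With these example numbers the predictive probability lies between the prior mean $\frac{a}{a+b}$ and the maximum likelihood estimate, which illustrates how the prior pulls the estimate away from the sample mean when $N$ is small.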