# Binomial Distribution

---

## Table of contents
1. Definition
2. Proof that Binomial distribution is normalised
3. Expected value
4. Variance
5. Likelihood function
6. Maximum likelihood estimator
7. Conjugate prior
8. Posterior distribution
9. Predictive distribution

---

## 1. Definition

$ Bin(m|N,\mu)=\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m} $

where $N, m \in \mathbb{N}_0$, $m \le N$, and $\mu \in [0,1]$.

See [Binomial Coefficient](https://github.com/shotahorii/math-for-ml/blob/master/content/fundamental/binomial_coefficient.ipynb)
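As a minimal sketch of the definition above, the probability mass function can be evaluated directly with the standard library. The function name `binomial_pmf` is an illustrative choice, not something defined elsewhere in these notes.

```python
from math import comb


def binomial_pmf(m: int, N: int, mu: float) -> float:
    """Bin(m | N, mu): probability of m successes in N trials."""
    if not 0 <= m <= N:
        # Outside the support the probability is zero.
        return 0.0
    # Binomial coefficient times mu^m (1 - mu)^(N - m).
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)


# Example: 2 successes in 4 trials with mu = 0.5.
print(binomial_pmf(2, 4, 0.5))
```

For this example the value is $6 \cdot 0.5^4 = 0.375$.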

---

## 2. Proof that Binomial distribution is normalised

$\sum^N_{m=0}\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m} 
= \sum^N_{m=0}\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m \frac{(1-\mu)^N}{(1-\mu)^m}$

$=(1-\mu)^N\sum^N_{m=0}\left(\begin{array}{c}N\\m\\\end{array}\right) (\frac{\mu}{1-\mu})^m$

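Note that $(1+x)^N = \sum^N_{m=0}\left(\begin{array}{c}N\\m\\\end{array}\right) x^m$ by the binomial theorem. (The division by $(1-\mu)^m$ above assumes $\mu \ne 1$; when $\mu = 1$, only the $m=N$ term is non-zero and it equals 1.)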

$=(1-\mu)^N(1+\frac{\mu}{1-\mu})^N$

$=(1-\mu)^N(\frac{1}{1-\mu})^N = 1$
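The result can also be checked numerically. The sketch below sums the probability mass function over $m = 0, \dots, N$ for a few arbitrarily chosen $(N, \mu)$ pairs, including $\mu = 1$; the helper names are illustrative.

```python
from math import comb, isclose


def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) for 0 <= m <= N."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)


def total_probability(N, mu):
    """Sum of Bin(m | N, mu) over the whole support m = 0..N."""
    return sum(binomial_pmf(m, N, mu) for m in range(N + 1))


# Arbitrary test cases; each total should be 1 up to rounding error.
for N, mu in [(1, 0.3), (10, 0.5), (25, 0.9), (5, 1.0)]:
    print(N, mu, total_probability(N, mu))
```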

---

## 3. Expected value

$E[m]=\sum^N_{m=0}m\cdot Bin(m|N,\mu)$

Note that the $m=0$ term is $0\cdot Bin(0|N,\mu)=0$. Hence we can replace $\sum^N_{m=0}$ by $\sum^N_{m=1}$.

$=\sum^N_{m=1}m\cdot Bin(m|N,\mu)$

$=\sum^N_{m=1}m\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m} $

Note: $\left(\begin{array}{c}N\\m\\\end{array}\right) 
=\frac{N}{m}\left(\begin{array}{c}N-1\\m-1\\\end{array}\right)$

$=\sum^N_{m=1}m\frac{N}{m}\left(\begin{array}{c}N-1\\m-1\\\end{array}\right)\mu^m(1-\mu)^{N-m}$

$=N\mu\sum^N_{m=1}\left(\begin{array}{c}N-1\\m-1\\\end{array}\right)\mu^{m-1}(1-\mu)^{N-m}$

Let: $m'=m-1$

$=N\mu\sum^{N-1}_{m'=0}\left(\begin{array}{c}N-1\\m'\\\end{array}\right)\mu^{m'}(1-\mu)^{N-(m'+1)}$

Let: $N'=N-1$

$=N\mu\sum^{N'}_{m'=0}\left(\begin{array}{c}N'\\m'\\\end{array}\right)\mu^{m'}(1-\mu)^{N'-m'}$

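The remaining sum is $\sum^{N'}_{m'=0}Bin(m'|N',\mu)$, which equals 1 because the Binomial distribution is normalised (section 2).

$=N\mu$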
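A quick numerical sanity check of $E[m]=N\mu$, computing the expectation directly from its defining sum. The values $N=10$, $\mu=0.3$ are arbitrary, and the function names are illustrative.

```python
from math import comb, isclose


def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) for 0 <= m <= N."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)


def binomial_mean(N, mu):
    """E[m] computed as the sum of m * Bin(m | N, mu) over m = 0..N."""
    return sum(m * binomial_pmf(m, N, mu) for m in range(N + 1))


N, mu = 10, 0.3
# Compare the brute-force sum with the closed form N * mu.
print(binomial_mean(N, mu), N * mu)
```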

---

## 4. Variance

$E[m^2]=\sum^N_{m=0}m^2\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m}$

Note: $m^2 = m(m-1)+m$

$=\sum^N_{m=0}m(m-1)\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m}
+\sum^N_{m=0}m\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m}$

Note that the second part is $E[m]=N\mu$

$=\sum^N_{m=0}m(m-1)\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m} + N\mu$

The $m=0$ and $m=1$ terms are both 0 because of the factor $m(m-1)$. Hence we can replace $\sum^N_{m=0}$ by $\sum^N_{m=2}$.

$=\sum^N_{m=2}m(m-1)\left(\begin{array}{c}N\\m\\\end{array}\right)\mu^m (1-\mu)^{N-m} + N\mu$

Note: $\left(\begin{array}{c}N\\m\\\end{array}\right) 
=\frac{N}{m}\frac{N-1}{m-1}\left(\begin{array}{c}N-2\\m-2\\\end{array}\right)$

$=N(N-1)\mu^2\sum^N_{m=2}\left(\begin{array}{c}N-2\\m-2\\\end{array}\right)\mu^{m-2}(1-\mu)^{N-m} + N\mu$

Let: $m'=m-2$

$=N(N-1)\mu^2\sum^{N-2}_{m'=0}\left(\begin{array}{c}N-2\\m'\\\end{array}\right)\mu^{m'}(1-\mu)^{N-(m'+2)} + N\mu$

Let: $N'=N-2$

$=N(N-1)\mu^2\sum^{N'}_{m'=0}\left(\begin{array}{c}N'\\m'\\\end{array}\right)\mu^{m'}(1-\mu)^{N'-m'} + N\mu$

$=N(N-1)\mu^2 + N\mu$

Hence,

$Var[m] = E[m^2] - (E[m])^2$

$=N(N-1)\mu^2+N\mu - (N\mu)^2$

$=N\mu-N\mu^2 = N\mu(1-\mu)$
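The same kind of numerical check works for the variance: compute $E[m^2]$ and $E[m]$ from their defining sums and compare $E[m^2]-(E[m])^2$ with $N\mu(1-\mu)$. This is a sketch with arbitrary parameter values and illustrative function names.

```python
from math import comb, isclose


def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) for 0 <= m <= N."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)


def binomial_variance(N, mu):
    """Var[m] = E[m^2] - (E[m])^2, with both moments computed by direct summation."""
    mean = sum(m * binomial_pmf(m, N, mu) for m in range(N + 1))
    second_moment = sum(m * m * binomial_pmf(m, N, mu) for m in range(N + 1))
    return second_moment - mean**2


N, mu = 10, 0.3
# Compare the brute-force value with the closed form N * mu * (1 - mu).
print(binomial_variance(N, mu), N * mu * (1 - mu))
```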

---

## 5. Likelihood function
From here on, $M$ denotes the number of trials and $x$ the number of successes, so we write $\left(\begin{array}{c}M\\x\\\end{array}\right)$ 
instead of $\left(\begin{array}{c}N\\m\\\end{array}\right)$ above; $N$ now denotes the number of observations.
Assume we have a set of observed values of $x$, $D = \{x_1, x_2, ..., x_N\}$. Assuming that each $x_n$ is drawn independently from $Bin(x|M,\mu)$, the likelihood function is as below.

$P(D|M,\mu) = \prod^N_{n=1} Bin(x_n|M,\mu)$

$= \prod^N_{n=1}\left(\begin{array}{c}M\\x_n\\\end{array}\right) \mu^{x_n}(1-\mu)^{M-x_n}$
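A minimal sketch of this product in code, assuming the observations are given as a list of integer success counts. The function name `likelihood` and the sample data are hypothetical.

```python
from math import comb, prod


def likelihood(data, M, mu):
    """P(D | M, mu): product of Bin(x_n | M, mu) over all observations x_n."""
    return prod(comb(M, x) * mu**x * (1 - mu) ** (M - x) for x in data)


# Hypothetical data: two observations, each a success count out of M = 2 trials.
print(likelihood([1, 2], M=2, mu=0.5))
```

Here the two factors are $2 \cdot 0.5 \cdot 0.5 = 0.5$ and $0.5^2 = 0.25$, so the likelihood is $0.125$. For many observations the product becomes very small, which is one practical reason to work with the logarithm instead.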

---

## 6. Maximum likelihood estimator
For computational simplicity, maximise the log-likelihood function instead of the likelihood function itself. Because $ln$ is monotonically increasing, both have the same maximiser.

$lnP(D|M,\mu) = ln\prod^N_{n=1}\left(\begin{array}{c}M\\x_n\\\end{array}\right) \mu^{x_n}(1-\mu)^{M-x_n}$

$=\sum^N_{n=1}ln\left(\begin{array}{c}M\\x_n\\\end{array}\right) \mu^{x_n}(1-\mu)^{M-x_n}$

$=\sum^N_{n=1}\{ln\left(\begin{array}{c}M\\x_n\\\end{array}\right) + x_nln(\mu) + (M-x_n)ln(1-\mu)\}$

$=\sum^N_{n=1}ln\left(\begin{array}{c}M\\x_n\\\end{array}\right) + \sum^N_{n=1}x_nln(\mu) + (NM-\sum^N_{n=1}x_n)ln(1-\mu)$

The value of $\mu$ that maximises the likelihood (call it $\mu_{ML}$) can be obtained by taking the derivative of the log-likelihood function with respect to $\mu$ and setting it to zero.

$\frac{\partial lnP(D|M,\mu)}{\partial \mu} = \frac{1}{\mu}\sum^N_{n=1}x_n + \frac{1}{\mu-1}(NM-\sum^N_{n=1}x_n)= 0$

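Multiplying both sides by $\mu(\mu-1)$ gives $(\mu-1)\sum^N_{n=1}x_n + \mu(NM-\sum^N_{n=1}x_n) = 0$, i.e. $\mu NM = \sum^N_{n=1}x_n$. Hence,

$\mu_{ML} = \frac{1}{NM}\sum^N_{n=1}x_n$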
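The closed form can be compared against a brute-force search. The sketch below evaluates the log-likelihood on a grid over $(0,1)$ and checks that the grid maximiser agrees with $\mu_{ML}$; the data, grid resolution, and function names are all illustrative assumptions.

```python
from math import comb, log


def log_likelihood(data, M, mu):
    """ln P(D | M, mu); requires 0 < mu < 1 so that both logarithms are finite."""
    return sum(
        log(comb(M, x)) + x * log(mu) + (M - x) * log(1 - mu)
        for x in data
    )


def mle_mu(data, M):
    """Closed-form estimator: mu_ML = (1 / (N * M)) * sum of x_n."""
    return sum(data) / (len(data) * M)


# Hypothetical observations, each a success count out of M = 10 trials.
data = [3, 5, 4, 6, 2]
M = 10

mu_ml = mle_mu(data, M)

# Brute-force check: scan a grid over (0, 1) for the largest log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]
mu_grid = max(grid, key=lambda mu: log_likelihood(data, M, mu))

print(mu_ml, mu_grid)
```

For this data the closed form gives $20/50 = 0.4$, and the grid search should land on the same point up to the grid spacing.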

---

## 7. Conjugate prior
[Beta distribution](https://github.com/shotahorii/math-for-ml/blob/master/content/probability_distribution/beta_distribution.ipynb)

---

## 8. Posterior distribution
Given a set of observed values of $x$, $D = \{x_1, x_2, ..., x_N\}$, and the conjugate prior $P(\mu)=Beta(\mu|a,b)$, the posterior distribution of $\mu$ is as below.

$P(\mu|M,D) = \frac{P(D|M,\mu)P(\mu)}{P(D)} = \frac{\{\prod^N_{n=1}P(x_n|M,\mu)\}P(\mu)}{P(D)}$

Note that $P(D,M,\mu)=P(D|M,\mu)P(M,\mu)=P(D|M,\mu)P(M|\mu)P(\mu)=P(D|M,\mu)P(\mu)$ because $M$ is a known constant, so $P(M|\mu)=P(M)=1$.

Also, $P(D,M,\mu)=P(\mu|M,D)P(M,D)=P(\mu|M,D)P(M|D)P(D)=P(\mu|M,D)P(D)$ because, for the same reason, $P(M|D)=P(M)=1$.

For computational simplicity, take the logarithm.

$lnP(\mu|M,D) = ln(\{\prod^N_{n=1}P(x_n|M,\mu)\}P(\mu)) - lnP(D)$

Now, as we are interested in the posterior distribution of $\mu$, treat every term that does not depend on $\mu$ as $const$.

$lnP(\mu|M,D) = ln(\{\prod^N_{n=1}P(x_n|M,\mu)\}P(\mu)) + const$

$= \sum^N_{n=1}lnP(x_n|M,\mu) + lnP(\mu) + const$

$=\sum^N_{n=1}lnBin(x_n|M,\mu) + lnBeta(\mu|a,b) + const$

$=\sum^N_{n=1}ln\left(\begin{array}{c}M\\x_n\\\end{array}\right)\mu^{x_n}(1-\mu)^{M-x_n} + lnC_B(a,b)\mu^{a-1}(1-\mu)^{b-1} + const$

where $C_B(a,b)$ is the normalisation constant of the Beta distribution with parameters $a,b$. Note that this is independent of $\mu$.

$=\sum^N_{n=1}\{ln\left(\begin{array}{c}M\\x_n\\\end{array}\right)+x_nln\mu+(M-x_n)ln(1-\mu)\}
+lnC_B(a,b)+(a-1)ln\mu+(b-1)ln(1-\mu)+const$

$=\sum^N_{n=1}x_nln\mu+\sum^N_{n=1}(M-x_n)ln(1-\mu)+(a-1)ln\mu+(b-1)ln(1-\mu)+const$

$=(\sum^N_{n=1}x_n+a-1)ln\mu+(NM-\sum^N_{n=1}x_n+b-1)ln(1-\mu)+const$

$=ln\{\mu^{\sum^N_{n=1}x_n+a-1}(1-\mu)^{NM-\sum^N_{n=1}x_n+b-1}\}+const$

This is the form of a (log) Beta distribution, with $const$ absorbing the log of its normalisation constant. Hence the posterior is as below.

$P(\mu|M,D) = Beta(\mu|\sum^N_{n=1}x_n+a,NM-\sum^N_{n=1}x_n+b)$
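The update rule amounts to adding the total number of successes to $a$ and the total number of failures to $b$. The sketch below implements that update and checks, at a few values of $\mu$, that likelihood times prior kernel is proportional to the Beta kernel with the updated parameters. The data, the prior parameters $a=2, b=3$, and all function names are illustrative assumptions.

```python
from math import comb, isclose, prod


def beta_posterior_params(data, M, a, b):
    """Parameters of Beta(mu | sum(x_n) + a, N*M - sum(x_n) + b)."""
    total = sum(data)
    return total + a, len(data) * M - total + b


def beta_kernel(mu, a, b):
    """Unnormalised Beta density: mu^(a-1) * (1 - mu)^(b-1)."""
    return mu ** (a - 1) * (1 - mu) ** (b - 1)


def unnormalised_posterior(mu, data, M, a, b):
    """Likelihood P(D | M, mu) times the prior kernel, without normalisation."""
    lik = prod(comb(M, x) * mu**x * (1 - mu) ** (M - x) for x in data)
    return lik * beta_kernel(mu, a, b)


# Hypothetical observations (out of M = 10 trials each) and prior parameters.
data = [3, 5, 4, 6, 2]
M, a, b = 10, 2.0, 3.0

a_post, b_post = beta_posterior_params(data, M, a, b)

# If the posterior is Beta(a_post, b_post), this ratio must not depend on mu.
ratios = [
    unnormalised_posterior(mu, data, M, a, b) / beta_kernel(mu, a_post, b_post)
    for mu in (0.2, 0.5, 0.7)
]
print(a_post, b_post, ratios)
```

The ratio is constant because the only factors left over are the binomial coefficients, which do not depend on $\mu$.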

---

## 9. Predictive distribution
The predictive distribution of a new value $x_*$, generated from the Binomial distribution with parameter $\mu$, is obtained by integrating $\mu$ out of the likelihood weighted by the prior distribution, as below.

$P(x_*) = \int^1_0 P(x_*|\mu)P(\mu)d\mu$

$=\int^1_0 Bin(x_*|M,\mu)Beta(\mu|a,b)d\mu$

$=\int^1_0 \left(\begin{array}{c}M\\x_*\\\end{array}\right)\mu^{x_*}(1-\mu)^{M-x_*}C_B(a,b)\mu^{a-1}(1-\mu)^{b-1}d\mu$

$=C_B(a,b)\left(\begin{array}{c}M\\x_*\\\end{array}\right)\int^1_0 \mu^{x_*+a-1}(1-\mu)^{M-x_*+b-1}d\mu$

Note: $\int^1_0x^{A-1}(1-x)^{B-1}dx = 1/C_B(A,B)$ 

$=\left(\begin{array}{c}M\\x_*\\\end{array}\right)\frac{C_B(a,b)}{C_B(x_*+a,M-x_*+b)}$

$=\left(\begin{array}{c}M\\x_*\\\end{array}\right)\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\frac{\Gamma(x_*+a)\Gamma(M-x_*+b)}{\Gamma(a+b+M)}$

To use the posterior distribution instead of the prior distribution, we can just replace $a,b$ with the posterior distribution's parameters.

$P(x_*|D) = \left(\begin{array}{c}M\\x_*\\\end{array}\right)\frac{\Gamma(a+b+NM)}{\Gamma(\sum^N_{n=1}x_n+a)\Gamma(NM-\sum^N_{n=1}x_n+b)}\frac{\Gamma(x_*+\sum^N_{n=1}x_n+a)\Gamma(M-x_*+NM-\sum^N_{n=1}x_n+b)}{\Gamma(a+b+NM+M)}$
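A minimal sketch of evaluating the predictive distribution numerically. It works with log-Gamma values (`math.lgamma`) to avoid overflow in the Gamma functions, which is a choice made here for numerical stability rather than something the derivation requires. Function names and the example parameters are illustrative; passing the posterior parameters in place of $a,b$ gives the posterior predictive.

```python
from math import comb, exp, isclose, lgamma


def log_beta_norm(a, b):
    """ln C_B(a, b) = ln Gamma(a + b) - ln Gamma(a) - ln Gamma(b)."""
    return lgamma(a + b) - lgamma(a) - lgamma(b)


def predictive_pmf(x_star, M, a, b):
    """P(x_*) = C(M, x_*) * C_B(a, b) / C_B(x_* + a, M - x_* + b)."""
    log_ratio = log_beta_norm(a, b) - log_beta_norm(x_star + a, M - x_star + b)
    return comb(M, x_star) * exp(log_ratio)


M = 10

# With a uniform prior (a = b = 1), every outcome 0..M is equally likely.
print(predictive_pmf(3, M, 1, 1), 1 / (M + 1))

# Posterior predictive: plug in hypothetical posterior parameters (21, 31).
posterior_predictive = [predictive_pmf(x, M, 21, 31) for x in range(M + 1)]
print(sum(posterior_predictive))
```

With $a=b=1$ the formula reduces to $\frac{M!}{(M+1)!}=\frac{1}{M+1}$ for every $x_*$, which makes a convenient check, and the probabilities over $x_* = 0, \dots, M$ should sum to 1 for any valid $a,b$.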