# Multinomial Distribution


---

## Table of contents
1. Definition
2. Proof that Multinomial distribution is normalised
3. Expected value
4. Variance
5. Likelihood function
6. Maximum likelihood estimator
7. Conjugate prior
8. Posterior distribution
9. Predictive distribution

---

## 1. Definition
Suppose there is a deck of $K$ cards (each card has a different number $1,2,3,..., K$), and you draw a card from the deck $N$ times, putting the drawn card back after each draw. If $m_k$ denotes the number of times the $k$-th card is drawn in $N$ draws, the result of this experiment is described by the discrete random vector below.

${\bf m} = (m_1,m_2,...,m_K)^T$

$where \,\,\, m_k \in \mathbb{N}_0, \,\,\, \sum^K_{k=1}m_k=N$

For example, suppose there are 5 cards in the deck ($K=5$) and you draw 6 times ($N=6$) with the following result: the $1$-st card 3 times, the $2$-nd card once, and the $3$-rd card twice. Then ${\bf m}$ is as below.

${\bf m} = (3,1,2,0,0)^T$

The probability of each outcome ${\bf m}$ is given by the Multinomial distribution, defined as below.

$ Mult({\bf m}|{\bf \mu},N)=\frac{N!}{m_1!m_2!...m_K!}\prod^K_{k=1} \mu_k^{m_k}$

$where \,\,\, {\bf \mu} = (\mu_1,\mu_2,...,\mu_K)^T, \,\,\, \mu_k \in [0,1], \,\,\, \sum^K_{k=1}\mu_k=1$ 

**Example**

With $K=5$ and $N=3$, the probability of ${\bf m}_{example} = (2,1,0,0,0)^T$ is below.

$Mult({\bf m}_{example}|{\bf \mu},N=3)=\frac{3!}{2!1!0!0!0!}\mu_1^2\mu_2^1\mu_3^0\mu_4^0\mu_5^0 = 3\mu_1^2\mu_2$
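The definition can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `multinomial_pmf` and the choice of equally likely cards (${\mu}_k = 0.2$) are assumptions made for illustration.

```python
from math import factorial, prod

def multinomial_pmf(m, mu):
    """Mult(m | mu, N) with N = sum(m). Hypothetical helper for illustration."""
    if len(m) != len(mu):
        raise ValueError("m and mu must have the same length K")
    n = sum(m)
    # Multinomial coefficient N! / (m_1! m_2! ... m_K!)
    coeff = factorial(n)
    for m_k in m:
        coeff //= factorial(m_k)
    # Product of mu_k ** m_k (Python evaluates 0 ** 0 as 1)
    return coeff * prod(p ** c for p, c in zip(mu, m))

# Assumed setting: K = 5 equally likely cards, N = 3 draws, m = (2, 1, 0, 0, 0)
mu_uniform = [0.2] * 5
p_example = multinomial_pmf([2, 1, 0, 0, 0], mu_uniform)
print(p_example)  # 3 * 0.2**2 * 0.2, approximately 0.024
```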

---


## 2. Proof that Multinomial distribution is normalised

By the [multinomial theorem](https://github.com/shotahorii/math-for-ds/blob/master/content/others/binomial_and_multinomial_theorem.ipynb), the following holds.

$(\mu_1+\mu_2+...+\mu_K)^N = \sum_{m_1+m_2+...+m_K=N} \frac{N!}{m_1!m_2!...m_K!} \prod_{k=1}^K \mu_k^{m_k}$

$= \sum_{m_1+m_2+...+m_K=N} Mult({\bf m}|{\bf \mu},N)$

Since $\sum^K_{k=1}\mu_k=1$, the left-hand side equals $1^N = 1$, so the probabilities of all possible ${\bf m}$ sum to one.

$1^N = 1 = \sum_{m_1+m_2+...+m_K=N} Mult({\bf m}|{\bf \mu},N)$
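The normalisation can also be checked numerically for a small case by enumerating every ${\bf m}$ with $m_1+...+m_K=N$. This is only a sketch: the helper names `compositions` and `multinomial_pmf`, and the values of ${\bf \mu}$ and $N$, are assumptions chosen for illustration.

```python
from math import factorial, prod

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(m, mu):
    """Mult(m | mu, N) with N = sum(m)."""
    coeff = factorial(sum(m))
    for m_k in m:
        coeff //= factorial(m_k)
    return coeff * prod(p ** c for p, c in zip(mu, m))

mu = [0.5, 0.3, 0.2]  # illustrative probabilities, K = 3
N = 4
outcomes = list(compositions(N, len(mu)))
total = sum(multinomial_pmf(m, mu) for m in outcomes)
print(len(outcomes), total)  # number of outcomes and their total probability
```

The total should equal $1$ up to floating-point rounding.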

---





## 3. Expected value

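A multinomial count vector is the sum of $N$ independent single draws, ${\bf m} = \sum^N_{n=1}{\bf x}_n$, where each ${\bf x}_n$ is a one-hot vector following $Cat({\bf x}|{\bf \mu})$. Writing ${\bf x}_{(k)}$ for the one-hot vector whose $k$-th element is $1$, the expected value of a single draw is below.

$E[{\bf x}|{\bf \mu}] = \sum^K_{k=1}{\bf x}_{(k)}Cat({\bf x}_{(k)}|{\bf \mu})$ 

$= \sum^K_{k=1}\mu_k{\bf x}_{(k)}$ 

$=(\mu_1,0,...,0)^T+(0,\mu_2,...,0)^T+...+(0,0,...,\mu_K)^T$

$=(\mu_1,\mu_2,...,\mu_K)^T = {\bf \mu}$

By linearity of expectation, the expected value of ${\bf m}$ is below.

$E[{\bf m}|{\bf \mu},N] = \sum^N_{n=1}E[{\bf x}_n|{\bf \mu}] = N{\bf \mu}$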
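Because the count vector ${\bf m}$ is the sum of $N$ independent one-hot draws, each with mean ${\bf \mu}$, its mean is $N{\bf \mu}$. The sketch below checks this exactly for a small case by enumerating all $K^N$ sequences of draws; the values of ${\bf \mu}$ and $N$ are assumptions chosen for illustration, and `Fraction` is used so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product

mu = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative, K = 3
K, N = len(mu), 4

mean = [Fraction(0)] * K
for draws in product(range(K), repeat=N):  # every sequence of N card indices
    # Probability of this particular sequence of independent draws
    prob = Fraction(1)
    for k in draws:
        prob *= mu[k]
    # draws.count(k) is m_k, the number of times card k appears
    for k in range(K):
        mean[k] += prob * draws.count(k)

print(mean == [N * p for p in mu])  # True
```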


---

## 4. Variance
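Let ${\bf x} = (x_1,x_2,...,x_K)^T$ be the one-hot result of a single draw, a $K$-dimensional random vector following $Cat({\bf x}|{\bf \mu})$. The variance of its $k$-th element $x_k$ is the $k$-th diagonal element of the covariance matrix of ${\bf x}$. Since $x_k \in \{0,1\}$ with $P(x_k=1)=\mu_k$, we have $E[x_k^2] = 1^2 \cdot \mu_k + 0^2 \cdot (1-\mu_k)$.

$V[x_k] = E[(x_k-E[x_k])(x_k-E[x_k])] = E[x_k^2]-(E[x_k])^2$

$= 1 \cdot \mu_k + 0 \cdot (1-\mu_k) - \mu_k^2$

$= \mu_k(1-\mu_k)$

As $m_k = \sum^N_{n=1}x_k^{(n)}$ is a sum of $N$ independent draws, the variances add up.

$V[m_k] = N\mu_k(1-\mu_k)$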
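As a small check of the single-draw result, the covariance matrix of a one-hot vector ${\bf x}$ can be computed exactly by summing over its $K$ possible outcomes. This is a sketch under assumed values of ${\bf \mu}$; the helper `one_hot` is hypothetical.

```python
from fractions import Fraction

mu = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative, K = 3
K = len(mu)

def one_hot(k, size):
    """One-hot vector of the given size with a 1 at position k."""
    return [1 if i == k else 0 for i in range(size)]

outcomes = [one_hot(k, K) for k in range(K)]  # outcome k occurs with probability mu[k]

# E[x_i] and E[x_i x_j], summing over the K possible outcomes
first = [sum(mu[k] * outcomes[k][i] for k in range(K)) for i in range(K)]
second = [[sum(mu[k] * outcomes[k][i] * outcomes[k][j] for k in range(K))
           for j in range(K)] for i in range(K)]

# Covariance matrix: E[x_i x_j] - E[x_i] E[x_j]
cov = [[second[i][j] - first[i] * first[j] for j in range(K)] for i in range(K)]
variances = [cov[k][k] for k in range(K)]  # diagonal elements

print(variances == [p * (1 - p) for p in mu])  # True
```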

---

## 5. Likelihood function
Assume we have a data set $D = \{{\bf x}_1, {\bf x}_2, ..., {\bf x}_N\}$ of one-hot vectors. Assuming that each ${\bf x}_n$ is independently drawn from $Cat({\bf x}|{\bf \mu})$, the likelihood function is obtained as below.

$P(D|{\bf \mu}) = \prod^N_{n=1} Cat({\bf x}_n|{\bf \mu}) = \prod^N_{n=1} \prod^K_{k=1} \mu_k^{x_k^{(n)}}$

$= \prod^K_{k=1} \mu_k^{(\sum_{n=1}^Nx_k^{(n)})}$

Let: $m_k = \sum_{n=1}^Nx_k^{(n)}$, which is the number of data points with $x_k=1$

$= \prod^K_{k=1} \mu_k^{m_k}$
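The two forms of the likelihood, the product over data points and the product over counts, can be compared on a toy data set. The data and ${\bf \mu}$ below are made up for illustration.

```python
from math import isclose, prod

# Toy data set of N = 5 one-hot observations with K = 3 (illustrative values)
data = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]
mu = [0.5, 0.3, 0.2]

# Product over data points of Cat(x_n | mu) = prod_k mu_k ** x_k^(n)
lik_pointwise = prod(prod(p ** x for p, x in zip(mu, x_n)) for x_n in data)

# m_k: column sums of the one-hot data
counts = [sum(column) for column in zip(*data)]

# Product over categories of mu_k ** m_k
lik_counts = prod(p ** m for p, m in zip(mu, counts))

print(counts)  # [3, 1, 1]
print(isclose(lik_pointwise, lik_counts))  # True
```

Only the counts $m_k$ are needed to evaluate the likelihood, not the order of the observations.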

---

## 6. Maximum likelihood estimator
Maximise $P(D|{\bf \mu})$ under the constraint $\sum^K_{k=1}\mu_k=1$. For computational simplicity, maximise the log likelihood instead.

$lnP(D|{\bf \mu}) = ln\prod^K_{k=1} \mu_k^{m_k} = \sum_{k=1}^K m_k ln \mu_k$

Using the method of Lagrange multipliers,

$L({\bf \mu},\lambda) = \sum_{k=1}^K m_k ln \mu_k - \lambda (\sum^K_{k=1}\mu_k-1)$

$\frac{\partial L}{\partial \mu_1} = \frac{\partial L}{\partial \mu_2} = ... = \frac{\partial L}{\partial \mu_K} =\frac{\partial L}{\partial \lambda} = 0 $

First, set the partial derivative of $L$ with respect to $\mu_k$ to zero.

$\frac{\partial L}{\partial \mu_k} = \frac{m_k}{\mu_k} - \lambda = 0$

Hence, $\mu_k = \frac{m_k}{\lambda}$

Next, set the partial derivative of $L$ with respect to $\lambda$ to zero.

$\frac{\partial L}{\partial \lambda} = -(\sum^K_{k=1}\mu_k-1) = 0 $

$\sum^K_{k=1}\mu_k = 1$

$\sum^K_{k=1}\frac{m_k}{\lambda} = 1$

$\lambda = \sum^K_{k=1}m_k$

Hence,

$\mu_k = \frac{m_k}{\sum^K_{j=1}m_j} = \frac{m_k}{N}$
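The estimator is simply the vector of relative frequencies. The sketch below computes it for illustrative counts and compares its log likelihood with a few arbitrary alternative parameter vectors; the counts, the candidates, and the helper `log_likelihood` are assumptions, and a handful of candidates is a sanity check rather than a proof.

```python
from math import log

def log_likelihood(mu, counts):
    """sum_k m_k * ln(mu_k); terms with m_k = 0 contribute nothing."""
    return sum(m * log(p) for m, p in zip(counts, mu) if m > 0)

counts = [3, 1, 2]  # illustrative m_k
N = sum(counts)

# Maximum likelihood estimate: mu_k = m_k / N
mu_hat = [m / N for m in counts]

# Arbitrary alternatives that also sum to 1
candidates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.6, 0.1, 0.3],
    [0.4, 0.2, 0.4],
]

best = log_likelihood(mu_hat, counts)
print(mu_hat)
print(all(best > log_likelihood(c, counts) for c in candidates))  # True
```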

---

## 7. Conjugate prior
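The conjugate prior for the parameter ${\bf \mu}$ is the Dirichlet distribution.

$Dir({\bf \mu}|{\bf \alpha}) = \frac{1}{B({\bf \alpha})} \prod^K_{k=1} \mu_k^{\alpha_k-1}$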

---


## 8. Posterior distribution
Given a set of observations $D = \{{\bf x}_1, {\bf x}_2, ..., {\bf x}_N\}$, the posterior distribution of ${\bf \mu}$ is below.

$P({\bf \mu}|D) = \frac{P(D|{\bf \mu})P({\bf \mu})}{P(D)} = \frac{\{\prod^N_{n=1}P({\bf x}_n|{\bf \mu})\}P({\bf \mu})}{P(D)}$


For computational simplicity, take the logarithm.

$lnP({\bf \mu}|D) = ln\prod^N_{n=1}P({\bf x}_n|{\bf \mu}) + lnP({\bf \mu}) - lnP(D)$

Since we are interested in the posterior distribution of ${\bf \mu}$, treat every term that does not depend on ${\bf \mu}$ as $const$.

$= ln\prod^N_{n=1}P({\bf x}_n|{\bf \mu}) + lnP({\bf \mu}) + const$

$= ln\prod^N_{n=1}Cat({\bf x}_n|{\bf \mu}) + lnDir({\bf \mu}|{\bf \alpha}) + const$

$= ln\prod^N_{n=1}\prod^K_{k=1}\mu_k^{x_k^{(n)}} + ln \frac{1}{B({\bf \alpha})} \prod^K_{k=1} \mu_k^{\alpha_k-1} + const$

Let: $m_k = \sum_{n=1}^Nx_k^{(n)}$, and absorb $-lnB({\bf \alpha})$, which does not depend on ${\bf \mu}$, into $const$

$=ln\prod^K_{k=1} \mu_k^{m_k} + ln\prod^K_{k=1} \mu_k^{\alpha_k-1} + const$

$=ln\prod^K_{k=1} \mu_k^{m_k+\alpha_k-1} + const$

This has the form of a (log) Dirichlet distribution, with $const$ as the log of its normalisation constant. Hence the posterior is as below.

$P({\bf \mu}|D) = Dir({\bf \mu}|{\bf \hat{\alpha}})$

$where \,\,\, {\bf \hat{\alpha}} = (\hat{\alpha}_1,\hat{\alpha}_2,...,\hat{\alpha}_K)^T,\,\,\,
\hat{\alpha}_k = \alpha_k + m_k$
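In code, the posterior update amounts to adding the observed counts to the prior parameters. A minimal sketch, where the helper name `dirichlet_posterior`, the prior ${\bf \alpha}$, and the toy data are all assumptions for illustration:

```python
def dirichlet_posterior(alpha, data):
    """Return alpha_hat with alpha_hat_k = alpha_k + m_k (hypothetical helper)."""
    counts = [sum(column) for column in zip(*data)]  # m_k from one-hot rows
    return [a + m for a, m in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]  # illustrative prior parameters
data = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
]

alpha_hat = dirichlet_posterior(alpha, data)
print(alpha_hat)  # [4.0, 2.0, 2.0]
```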

---


## 9. Predictive distribution
The predictive distribution of a new value ${\bf x}_* = (x_1^*,x_2^*,...,x_K^*)^T$ generated from the Categorical distribution with parameter ${\bf \mu}$ is obtained by marginalising ${\bf \mu}$ out of the product of the likelihood and the prior distribution, as below.

$P({\bf x}_*) = \int P({\bf x}_*|{\bf \mu})P({\bf \mu})d{\bf \mu}$

$= \int Cat({\bf x}_*|{\bf \mu})Dir({\bf \mu}|{\bf \alpha})d{\bf \mu}$

$= \int \prod^K_{k=1} \mu_k^{x_k^{(*)}}  \frac{1}{B({\bf \alpha})}\prod^K_{k=1} \mu_k^{\alpha_k-1}d{\bf \mu}$

$= \frac{1}{B({\bf \alpha})}\int \prod^K_{k=1} \mu_k^{x_k^{(*)}+\alpha_k-1}d{\bf \mu}$

Note: $\int Dir({\bf \mu}|{\bf A})d{\bf \mu}=1 \Longleftrightarrow
\int \frac{1}{B({\bf A})} \prod^K_{k=1} \mu_k^{A_k-1}d{\bf \mu}=1 \Longleftrightarrow
\int \prod^K_{k=1} \mu_k^{A_k-1}d{\bf \mu}=B({\bf A})$

$= \frac{1}{B({\bf \alpha})}B({\bf \alpha}+{\bf x}_*)$

$= \frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K\Gamma(\alpha_k)}
\frac{\prod_{k=1}^K\Gamma(\alpha_k+x_k^{(*)})}{\Gamma(\sum_{k=1}^K(\alpha_k+x_k^{(*)}))}$

Note: $\sum_{k=1}^K x_k^{(*)} = 1$

$= \frac{\Gamma(\sum_{k=1}^K\alpha_k)}{\prod_{k=1}^K\Gamma(\alpha_k)}
\frac{\prod_{k=1}^K\Gamma(\alpha_k+x_k^{(*)})}{\Gamma(1+\sum_{k=1}^K\alpha_k)}$

Note: $\Gamma(1+a) = a\Gamma(a)$

$= \frac{1}{\sum_{k=1}^K\alpha_k}\frac{\prod_{k=1}^K\Gamma(\alpha_k+x_k^{(*)})}{\prod_{k=1}^K\Gamma(\alpha_k)}$

Consider the case where the $k'$-th element is $1$. For example, if $k'=2$ then ${\bf x}_*=(0,1,0,...,0)^T$. Every factor with $k \neq k'$ cancels because $x_k^{(*)}=0$.

$P(x_{k'}^{(*)} =1) = \frac{1}{\sum_{k=1}^K\alpha_k}\frac{\Gamma(\alpha_{k'}+1)}{\Gamma(\alpha_{k'})} 
=\frac{\alpha_{k'}}{\sum_{k=1}^K\alpha_k}$

Generalising this,

$P({\bf x}_*) = \prod_{k=1}^K \left(\frac{\alpha_k}{\sum_{i=1}^K\alpha_i}\right)^{x_k^{(*)}}$

$=Cat({\bf x}_*|{\bf \hat{\mu}})$

$where \,\,\, {\bf \hat{\mu}} = (\hat{\mu}_1,\hat{\mu}_2,...,\hat{\mu}_K)^T,\,\,\,
\hat{\mu}_k = \frac{\alpha_k}{\sum_{i=1}^K\alpha_i}$
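The closed form can be compared numerically with the ratio $B({\bf \alpha}+{\bf x}_*)/B({\bf \alpha})$ from the derivation. The sketch below uses `math.lgamma` from the standard library to evaluate the log of the Beta function; the helper names `log_beta` and `predictive`, and the value of ${\bf \alpha}$, are assumptions for illustration.

```python
from math import exp, isclose, lgamma

def log_beta(alpha):
    """ln B(alpha) = sum_k ln Gamma(alpha_k) - ln Gamma(sum_k alpha_k)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def predictive(alpha):
    """P(x*) = B(alpha + x*) / B(alpha) for each one-hot x*."""
    K = len(alpha)
    probs = []
    for k in range(K):
        x_star = [1 if i == k else 0 for i in range(K)]
        shifted = [a + x for a, x in zip(alpha, x_star)]
        probs.append(exp(log_beta(shifted) - log_beta(alpha)))
    return probs

alpha = [2.0, 1.0, 0.5]  # illustrative prior parameters
probs = predictive(alpha)
closed_form = [a / sum(alpha) for a in alpha]  # alpha_k / sum_i alpha_i

print(all(isclose(p, q, rel_tol=1e-9) for p, q in zip(probs, closed_form)))  # True
```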