## Exercise 3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model
In Equation 3.51, we gave the posterior predictive distribution for a single multinomial trial using a Dirichlet prior. Now consider predicting a batch of new data, $\tilde{D} = (X_1,\dots,X_m)$, consisting of $m$ single multinomial trials (think of predicting the next $m$ words in a sentence, assuming they are drawn iid). Derive an expression for $p(\tilde{D}|D, \alpha)$.

Your answer should be a function of $\alpha$, and the old and new counts (sufficient statistics), defined as 

\begin{aligned}
N^{\mathrm{old}}_k & = \sum_{i\in D}I(x_i=k)\\
N^{\mathrm{new}}_k & = \sum_{i\in\tilde{D}}I(x_i = k)
\end{aligned}

Hint: recall that, for a vector of counts, $N_{1:K}$, the marginal likelihood (evidence) is given by

$$
p(D|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(N+\alpha)}\prod_k\frac{\Gamma(N_k+\alpha_k)}{\Gamma(\alpha_k)}
$$
where $\alpha = \sum_k\alpha_k$ and $N =\sum_kN_k$.

### Solution
In this question, we compute the posterior predictive distribution for a batch of data under the Dirichlet-multinomial model. The derivation follows the same pattern as the earlier posterior predictives, but some routes to the result are easier than others. We present two short ones. Throughout, $N^{\mathrm{old}} = \sum_k N_k^{\mathrm{old}}$ and $N^{\mathrm{new}} = \sum_k N_k^{\mathrm{new}}$ denote the total old and new counts, and we define the combined counts:

\begin{aligned}
N & = N^{\mathrm{old}} + N^{\mathrm{new}} \\
N_j & = N_j^{\mathrm{old}} + N_j^{\mathrm{new}}
\end{aligned}

Let's start by writing down the posterior predictive for one single trial:

$$
p(X=j|D,\alpha) = \frac{\alpha_j+N_j^{\mathrm{old}}}{\alpha+N^{\mathrm{old}}}
$$

Now, we need to express the batch of data as a series of single trials:

$$
p(\tilde{D}|D, \alpha) = p(\tilde{x}_1|D, \alpha)\,p(\tilde{x}_2|D, \tilde{x}_1, \alpha)\,p(\tilde{x}_3|D, \tilde{x}_1, \tilde{x}_2, \alpha)\cdots p(\tilde{x}_m|D, \tilde{x}_1,\ldots,\tilde{x}_{m-1}, \alpha)
$$

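Now we substitute the single-trial predictive into each factor of this product. Each new trial increments the total count by one and the count of its own category by one, so the denominators run through $\alpha + N^{\mathrm{old}} + i$ for $i = 0,\dots,N^{\mathrm{new}}-1$, while the numerators for category $j$ run through $\alpha_j + N_j^{\mathrm{old}} + i$ for $i = 0,\dots,N_j^{\mathrm{new}}-1$. The order of the trials does not affect the product. Using the identity $\Gamma(x+n)/\Gamma(x) = \prod_{i=0}^{n-1}(x+i)$, we obtain: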

\begin{aligned}
p(\tilde{D}|D, \alpha) & = \frac{1}{\prod_{i=0}^{N^{\mathrm{new}}-1}(\alpha+N^{\mathrm{old}}+i)}\prod_{j=1}^K\prod_{i=0}^{N_j^{\mathrm{new}}-1}(\alpha_j + N_j^{\mathrm{old}}+i) \\
& = \frac{\Gamma(\alpha + N^{\mathrm{old}})}{\Gamma(\alpha+N)}\prod_{j=1}^K\frac{\Gamma(\alpha_j + N_j)}{\Gamma(\alpha_j + N_j^{\mathrm{old}})}
\end{aligned}
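
As a sanity check, the closed form can be compared numerically against the chain-rule product of single-trial predictives. The following is a minimal sketch under stated assumptions: the function names and the toy values (`alpha = [1, 1]`, old counts `[2, 1]`, new sequence `[0, 0, 1]`) are illustrative and not part of the exercise.

```python
from math import lgamma, exp, isclose

def log_batch_predictive(alpha, n_old, n_new):
    """Closed form: log p(D_new | D_old, alpha) via log-gamma functions."""
    a_post = [a + n for a, n in zip(alpha, n_old)]  # alpha_j + N_j^old
    total_post = sum(a_post)                        # alpha + N^old
    total_new = sum(n_new)                          # N^new
    log_p = lgamma(total_post) - lgamma(total_post + total_new)
    for a, n in zip(a_post, n_new):
        log_p += lgamma(a + n) - lgamma(a)
    return log_p

def sequential_predictive(alpha, n_old, new_sequence):
    """Chain rule: multiply single-trial predictives, updating counts each step."""
    counts = list(n_old)
    alpha_sum = sum(alpha)
    prob = 1.0
    for x in new_sequence:
        prob *= (alpha[x] + counts[x]) / (alpha_sum + sum(counts))
        counts[x] += 1  # the observed trial joins the conditioning set
    return prob

# Hypothetical toy values chosen only for illustration.
alpha = [1.0, 1.0]
n_old = [2, 1]
new_sequence = [0, 0, 1]
n_new = [new_sequence.count(k) for k in range(len(alpha))]

closed_form = exp(log_batch_predictive(alpha, n_old, n_new))
chain_rule = sequential_predictive(alpha, n_old, new_sequence)
print(closed_form, chain_rule)  # both should equal 4/35 for these toy values
```

Working in log space with `lgamma` avoids overflow when the counts are large; the exponentiation here is only for the comparison on small numbers.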

### Second solution
The posterior distribution given the original dataset $D$ is

$$
p(\theta|D, \alpha) = \mathrm{Dir}(\theta|\alpha_1 + N_1^{old},\ldots,\alpha_K + N_K^{old})
$$

which has the same form as the prior. So, we may think of the posterior as a prior with updated hyperparameters $\alpha_j^{new} = N_j^{old} + \alpha_j$, whose sum is $\alpha^{new} = \sum_j\alpha_j^{new} = N^{old} + \alpha$.

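Because the new data are conditionally independent of $D$ given $\theta$, the posterior predictive $p(\tilde{D}|D,\alpha) = \int p(\tilde{D}|\theta)\,p(\theta|D,\alpha)\,d\theta$ has exactly the form of a marginal likelihood, with the posterior playing the role of the prior. We can therefore use the marginal likelihood expression from the hint, with the new dataset $\tilde{D}$ as the data and the updated hyperparameters in place of the original ones: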

\begin{aligned}
p(\tilde{D}|D, \alpha) & = p(\tilde{D}|\alpha^{new}) \\
& = \frac{\Gamma(\alpha^{new})}{\Gamma(N^{new}+\alpha^{new})}\prod_j\frac{\Gamma(N_j^{new} + \alpha_j^{new})}{\Gamma(\alpha_j^{new})}\\
& = \frac{\Gamma(N^{old}+\alpha)}{\Gamma(N^{new} + N^{old} + \alpha)}\prod_j\frac{\Gamma(N_j^{new} + N_j^{old} + \alpha_j)}{\Gamma(N_j^{old}+\alpha_j)}\\
& = \frac{\Gamma(\alpha + N^{old})}{\Gamma(\alpha + N)}\prod_{j=1}^K\frac{\Gamma(\alpha_j + N_j)}{\Gamma(\alpha_j + N_j^{old})}
\end{aligned}
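
The same idea can be sketched in code: implement the hint's marginal likelihood once and call it with the posterior hyperparameters. The function name and toy values below are illustrative assumptions. As an extra check that is not part of the derivation above, the sketch also evaluates the ratio $p(D,\tilde{D}|\alpha)/p(D|\alpha)$, which by the definition of conditional probability should give the same number.

```python
from math import lgamma, exp, isclose

def log_marginal_likelihood(alpha, counts):
    """Hint formula: log p(D | alpha) for a sequence with the given counts."""
    alpha_sum = sum(alpha)
    log_p = lgamma(alpha_sum) - lgamma(alpha_sum + sum(counts))
    for a, n in zip(alpha, counts):
        log_p += lgamma(a + n) - lgamma(a)
    return log_p

# Hypothetical toy values chosen only for illustration.
alpha = [1.0, 1.0]
n_old = [2, 1]
n_new = [2, 1]

# Second solution: treat the posterior as a prior with updated hyperparameters.
alpha_post = [a + n for a, n in zip(alpha, n_old)]
via_posterior = log_marginal_likelihood(alpha_post, n_new)

# Extra check: p(D_new | D_old) = p(D_old, D_new) / p(D_old).
n_all = [o + n for o, n in zip(n_old, n_new)]
via_ratio = (log_marginal_likelihood(alpha, n_all)
             - log_marginal_likelihood(alpha, n_old))

print(exp(via_posterior), exp(via_ratio))  # both should equal 4/35 here
```

Reusing one function for both the prior and the posterior mirrors the argument in the text: conjugacy means only the hyperparameters change.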

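In this question we derived the posterior predictive of a batch of data under the Dirichlet-multinomial model in two ways. The first applied the chain rule to the single-trial posterior predictive, updating the counts after each trial. The second treated the posterior as a prior with updated hyperparameters and plugged it into the expression for the marginal likelihood. Both give the same result.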